From Source to AST: Lexer and Parser

2008-11-20

code

compiler

In this part I’ll explain how I’ve implemented the lexer and parser for Exoself in a simplified manner. I’ll again use the example of a mathematical expression parser to explain the basics and later describe some more advanced problems.

If you want to follow this in an interactive manner download ANTLRWorks. That’s an IDE for grammar development using ANTLR. After copy & pasting, click on ‘Grammar->Check Grammar’ in the menu bar, to see if everything is ok. Then go to ‘Debugger->Debug…’. Enter some input in the new dialog, select the start rule (‘start’) and click ‘ok’. Now you can control the debugger in the bottom half of the window.

Parsing simple mathematical expressions

For now our goal is to parse simple mathematical expressions like 9 – 3 * 3 + 84 / 2 and similar constructs. To make it more interesting I’ll also include parentheses. So our tokens will be

NUMBERS
PLUS
MINUS
STAR
SLASH
LPAREN
RPAREN

Using the syntax of ANTLR this results in the following code for the lexer:

// common header for lexer and parser
grammar math2ast;
options
{
    output = AST;
}
 
// lexer rules
NUMBER: ('0'..'9')+;
LPAREN: '(';
RPAREN: ')';
PLUS: '+';
MINUS: '-';
STAR: '*';
SLASH: '/';
WHITESPACE: (' ' | '\n' | '\r')+ {$channel=HIDDEN;}; // ignore white space

Lexer rules must start with a capital letter as you may have already guessed.

Now let’s write the parser. That’s a bit more complicated since we have to encode the operator precedence by hand. When ignoring multiplication, division and parentheses the parser would look like this:

start: expr EOF; // entry point followed by 'end of file'
 
expr: atom ((PLUS^ | MINUS^) atom)*;
atom: NUMBER;

Parser rules must start with a lower case letter to distinguish them from lexer rules as both are contained in the same grammar file.

The code above is essentially EBNF with a small addition: The ^ operator is used to define the structure of the resulting AST. Without further instructions the parser generates a flat list of AST nodes. Using the ‘^’ operator we can create a tree structure while parsing. Specifically the ^ operator makes the marked node the root node of the current rule.

So when the expression 2 + 3 is parsed, ANTLR starts with the expr rule and then tries to match the atom rule, which just means the token NUMBER. 2 is a NUMBER, so that works. Returning from recursion to the expr rule the parser tries to match either PLUS or MINUS. Here it will match PLUS and then the token for the 3.

Since the atom rule has no AST rewrite operation it yields just a flat list of exactly one token: NUMBER. Before leaving the ‘expr’ rule ANTLR transforms the flat list inside this rule in a way, so that the PLUS token becomes a parent node for the two atom rules.

The result looks like this:

(PLUS (NUMBER 2) (NUMBER 3))

Now we can extend the parser to recognize the other operations. Parenthesis have the highest precedence, then multiplication / division and at last addition / subtraction.

If we start at the rule expr the lowest binding operators come first. Why? Every binary arithmetic operation first recurses on the left side until we hit an atom. Then we go back up until we match an operator.

The result looks like this:

expr: term ((PLUS^ | MINUS^) term)*;
term: atom ((STAR^ | SLASH^) atom)*;
atom: NUMBER | LPAREN expr RPAREN;

This parser will create ASTs for valid inputs which we can evaluate with the technique I presented in my Compiler Structure post.

Extending the parser

Essentially the above code was also the starting point for my compiler. At first it’s really simple to extend this parser with new features but there are some more problematic cases.

Let’s see what happens when we add the power operator signified by **.

DOUBLESTAR: '**'; // must be inserted after STAR
 
term: power ((STAR^ | SLASH^) power)*;
power: atom (DOUBLESTAR power -> ^(DOUBLESTAR atom power) | /* nothing */ -> atom);

I’ll first describe the new syntax before explaining the curios way of implementing the power operator. -> is another AST rewrite operator. It is used when more powerful rewrites are necessary. On the right hand side of it you create an arbitrary AST consisting of the tokens of the current rule.

Here the rewrite operation creates an (DOUBLESTAR atom power) node if the power operator was matched, otherwise it just passes the result of atom back to our caller.

But why not just use the recipe of expr / term to implement power? Power is right associative instead of left associative. Calculate (2 ** 4) ** 2 and compare it to 2 ** (4 ** 2). That makes quite a difference.

Problems while implementing Exoself

Several times I hit the problem described in How to remove global backtracking from your grammar. This happens when infinite look ahead is not enough to determine an unambiguous path in the parse tree.

A good example was when I tried to implement assignment to arbitrary expressions combined with multi assign

f(x)[g(y)] = answer = 42

Starting from something like

simple_assign: (simple_assign_expr ASSIGN)+ expr;
simple_assign_expr: (variable_name)
    | (STAR variable_name)
    | (variable_name LBRACKET (variable_name | integer_constant) RBRACKET);

I constantly got errors when replacing the code between the LBRACKET and RBRACKET with an generic expression. Only being told that the code in the parenthesis of simple_assign is causing this behavior without any details wasn’t that helpful.

The solution came quite unexpected: Since I had implemented the parser step by step the – much later – obvious way to do it was far easier than expected:

simple_assign: expr (ASSIGN expr)+;

This was earlier not possible since both my type checker and code generator refused to work with code like this. But with pointer support this statement is very important.

Having solved this problem the next one was that I also allowed expressions as statements. ANTLR again told me to do the usual thing left refactoring / enabling backtracking / using syntactic predicates. But now it was quite obvious what to do: Simply using a common start rule for the expression statement and simple_assign.

The next step from source code to executable is AST post processing and that will be the topic of my next post.