Semantic Analysis

2008-11-24

code

compiler

In the previous stages the Lexer and Parser rejected invalid inputs. That’s also the job of this phase: We want to find out if the program adheres to certain semantic rules – the meaning of language constructs.

Let’s start simple. What is wrong with the following top level code?

def f(x as int32) as int32
{
    if x == 0
    {
        break;
    }
    return 42;
}

What does the break statement there? It’s not contained in a loop (or switch) so by using it in this context we are violating the rule that a break must be associated with a loop. But nevertheless this code will parse without any problems.

Another important job of this phase, at least in statically compiled languages, is to gather type information for the code generator. Also the type checking is implemented in this phase.

x as int32;
x = 4.0;

While traversing the AST we see that a symbol named x is defined as an int32. After inserting this information into a symbol table we continue with the next statement.

Now we should assign the value 4.0, which is of type float64, to the symbol x. After a lookup in the symbol table we know that x is of type int32. Now the type system has to decide if this is a valid language construct. In Exoself this construct is not valid: There are no implicit conversions from any floating point type to an integer type. As a result the type checker will emit an error.

So what we have done here is walking the AST in postorder: Before evaluating any node all its children must be evaluated. In Exoself after determining the type of an AST node I add the attribute esType to that node. So every parent node can easily access the type information.

Here’s the code I’m using for the while statement. A while statement consists of it’s own AST node (ast), the expression to evaluate (expression) and a block to execute. The expression must be of type bool or there must exist an implicit cast to bool.

def _onWhile(self, ast, expression, block):
    self._dispatch(expression)
 
    esType = expression.esType
    bool = self._findSymbol(name=u'bool', type_=ESType)
    if not esType.isEquivalentTo(bool, False):
        # types did not match, try implicit cast
        if not estypesystem.canImplicitlyCast(esType, bool):
            self._raiseException(RecoverableCompileError, tree=expression, inlineText='incompatible type, expected bool')

The first thing I’m doing is calling the correct handler for the expression using dispatch. Then I’m retrieving the type from the expression AST and look up the bool type in the symbol table.

The next step is more interesting: I’m trying to find out if the type of the expression is identical to bool. In this case I’m doing a name based comparison instead of a structure based comparison.

This check will very often fail since programmers are too lazy to cast the expression manually to bool in while statements. Therefore my code tries to insert an implicit cast to bool. If that fails we raise an exception which in turn will print an error message what exactly went wrong and why.

An example that is a bit more difficult is the assignment statement. Lets look at its code, too:

def _onAssignHelper(self, assigneeExpr, exprNode):
    self._dispatch(exprNode)
    esType = exprNode.esType
 
 
    # FIXME make assigneeExpr eval more general and move it to astwalker!
    if assigneeExpr.type == TreeType.VARIABLE:
        varNameNode = assigneeExpr.children[0]
        var = self._findSymbol(fromTree=varNameNode, type_=ESVariable, mayFail=True)
 
        if not var:
            # create new variable with type of expression
            var = ESVariable(varNameNode.text, esType)
            self._addSymbol(fromTree=varNameNode, symbol=var)
        else:
            if not var.esType.isEquivalentTo(esType, False):
                self._insertImplicitCastNode(exprNode, var.esType)
    elif assigneeExpr.type == TreeType.DEREFERENCE:
        self._dispatch(assigneeExpr)
 
        if not assigneeExpr.esType.isEquivalentTo(esType, False):
            self._insertImplicitCastNode(exprNode, assigneeExpr.esType)
    else:
        print assigneeExpr.text
        raise NotImplementedError('TODO')

As you can see, still work in progress. (_onAssign, the real handler for simple assignments, is just a thin wrapper around this function to avoid code duplication)

I start again by determining the type of the expression. Then we have to be careful: Variables can be defined by assignment in Exoself, but this makes no sense for pointer expressions.

If the assignee is a variable we try to find the associated symbol, but that may fail. If the variable does not yet exist we add an entry to the symbol table with the type of the expression. As you can see type inferencing is trivial on this level and more languages should provide it. Otherwise, if the variable was already defined check if types match and if not try to insert an implicit cast. In case of the pointer expression just dispatch it to determine its type and compare it in the same manner.

Type system requirements

You now know how to type check – at least in principle. But how can wee determine if two types are the same? There are two ways to type checking ignoring any details.

The first way is to use name based equivalence. The types of two structs are only considered equal in C if they have the same name. The contents are completely ignored:

struct X {int a};
struct Y {int b};
struct X x;
struct Y y;
x = y; // error, x not of type struct Y

The other way to compare data types is by their structure. Would C use structural equivalence for structs instead, the above assignment would be legal. That’s quite confusing since the type system would not prevent many errors. It’s important to understand that the names of variables do not matter for structural equivalence, only their types and order in structs.

All primitive data types of C are compared using structural equivalence. If you introduce a typedef for int called MyInt there’s no way to enforce anyone to use the type MyInt instead of a plain int. Let’s call this an alias.

Regarding the type system of Exoself, which is similar to that of D, I want to use name based equivalence by default but with the option to introduce an alias. The alias statement introduces another name which is equivalent to the old one (even with name based comparison) and the typedef statement introduces a new type which is only structurally equivalent to the old one.

Another requirement for the type system of Exoself was to be able to represent const / invariable data, even if I’m not using that in the near future. Pointers and structs make this again a bit complicated but there’s an elegant solution to this problem.

Type representation

The basic idea is to represent any type by a tree.

By introducing the part with the typedef I’m breaking structural equivalence to the original type.

Again the struct part breaks structural equivalence and order of the fields is of course relevant.

So we need to define a class which can store this relationship: On a first read you might want to skip over it.

class ESType(object):
    ''' represents types of data, not variables! '''
 
    def __init__(self, parents, payload):
        ''' do not call directly! use construction methods '''
        assert(isinstance(parents, list))
        for x in parents:
            assert(isinstance(x, ESType))
        self.parents = parents
        self.payload = payload
 
 
    def derivePointer(self):
        return ESType([self], ('pointer', None))
 
 
    def dereference(self):
        assert(self.isPointer())
 
        return self.parents[0]
 
    def deriveTypedef(self, name):
        # break structural equivalence
        return ESType([self], ('typedef', name))
 
 
    @staticmethod
    def createStruct(name, parts):
        return ESType(parts, ('struct', name))
 
 
    @staticmethod
    def createFunction(returnTypes, paramTypes):
        assert(len(returnTypes) >= 1)
        parts = []
        parts.extend(returnTypes)
        parts.extend(paramTypes)
        return ESType(parts, ('function', len(returnTypes)))
 
    @staticmethod
    def createSelfPointer():
        ''' only valid inside structs! '''
        return ESType([], ('selfpointer', None))
 
     
    def isEquivalentTo(self, other, structural):
        # structural equivalence: Skip any typedefs and ignore different struct names
 
        t1 = self
        t2 = other
 
        if structural:
            while t1.payload[0] == 'typedef':
                assert(len(t1.parents) == 1)
                t1 = t1.parents[0]
 
            while t2.payload[0] == 'typedef':
                assert(len(t2.parents) == 1)
                t2 = t2.parents[0]
 
 
        if structural and t1.payload[0] == 'struct' and t2.payload[0] == 'struct':
            pass
        elif t1.payload != t2.payload:
            return False
 
        if len(t1.parents) != len(t2.parents):
            return False
 
        for i in range(len(t1.parents)):
            if not t1.parents[i].isEquivalentTo(t2.parents[i], structural):
                return False
        return True

 
    def __eq__(self, other):
        raise NotImplementedError('use isEquivalentTo')
 
    def __ne__(self, other):
        raise NotImplementedError('use isEquivalentTo')
 
    def isPointer(self):
        p = self
        while p.payload[0] == 'typedef':
            p = p.parents[0]
 
        return p.payload[0] == 'pointer'

Essentially I’ve defined a class that has two members. A ordered list of parents and a payload. The payload is a tuple exactly as in the images above denoting the type of the current node. The parents member is a list containing references to other ESType instances to represent the tree.

The deriveX methods can create new types from existing ones. By calling derivePointer on an existing instance a new instance of ESType will be created with the payload (‘pointer’, None) and the parent of this new instance will be the current one. That way we can represent a pointer to an arbitrary type. That’s only almost correct as we’ll see in a moment.

Since structs need multiple parents I’ve added a static method to ESType to do this work for me. Just pass all elements in the order they should be defined. But what happens with structs containing pointers to themselves? That would create a loop in the tree and that’s not a good idea.

I’m using a trick here: I provide a special type for self pointers. These are valid only inside structs and get later translated to the real type.

Now checking if two types are equal is easy: Just compare the payloads and all parents recursively until something doesn’t match. Under certain circumstances you’ll still need structural equivalence. For example when assigning a variable of type byte a uint8 constant using an explicit cast. The cast will do nothing since the structures are equivalent. So while comparing just skip typedef nodes and anything containing names.

more AST annotation

Now we have an AST with type information but there are still some things missing. An important data structure that I already mentioned is the symbol table. In Exoself every pair of curly braces introduces a new scope. Variables defined inside a scope are not available outside a scope just as in C. Some other AST nodes like the module itself, function definitions and for loops also introduce a scope.

So all these AST nodes must have a symbol table. This could be as simple as a dict mapping symbol names to instances of variable / function classes. But the lookup of symbols is a bit more complicated. You have to search all symbol tables recursively.

My idea to solve this problem was to maintain a list of all parents of the current node. By traversing through this list starting at the end we move with every step towards the root node of the AST. Using the hasattr function in Python we can simply check if the current AST node has a symbol table or not. If there is one, search it for the requested symbol. If the symbol was found we can stop the search and return the result, otherwise continue.

Adding symbols is done by searching for the nearest symbol table. Additionally Exoself forbids shadowing of variables, so an additional search is done to check if that name is already used.

In the next post I’ll describe the code generation phase using LLVM and llvm-py.

Semantic Analysis

Type system requirements

Type representation

x as int32

x as int32*

alias int as int32

typedef byte as uint8

struct X {a as int32; b as float32}

more AST annotation