by Florian Nöding
For writing my compiler I’m using the following tools
Besides these some of my standard tools for writing code are a Linux environment (Kubuntu) with vim and git.
The choice of implementation language for my compiler was limited by the available bindings for my tools and my personal preferences. I like to work with both Python and C++, but prefer Python if raw execution speed is not necessary. Ease of development is far more important when writing a compiler, so I took Python. Luckily there are bindings available for all essential tools.
Another important question I thought about was whether to write the lexer and parser myself or use a parser generator. The problem when writing a lexer / parser for a new programming language is that the grammar definition can change often. Having an easy way to write, extend and maintain the lexer and parser makes things certainly simpler.
My main reason for choosing ANTLR was that I had already prior experience with it, there’s enough documentation available and it has support for Python. Some other convenient features of ANTLR are the simple grammar definition syntax which is similar to EBNF and error recovery during parsing. The generated error messages are also (most of the time) understandable, which is another plus.
Another important fact to keep in mind is that there are different parsing techniques. ANTLR generates LL(*) parsers which use a top down approach. I think those are easier to understand than bottom up parsers. But there are some limits in regard to what language constructs can be parsed efficiently by this type of parser.
In ambiguous cases the grammar definition can be extended with syntactic predicates or by using back tracking. This makes the generated parser slower but allows it to parse complexer grammars. Still there are grammars which can not be handled by ANTLR, but my language will have a grammar of LL(k) with k being a small integer to make parsing easier for me and potentially others.
If you want to read more about EBNF and parser types I suggest EBNF.
The Low Level Virtual Machine is a framework consisting of a compilation strategy and a virtual instruction set (LLVM IR). It simplifies the development of compiler backends since it provides very convenient classes to generate LLVM IR, which is a high level assembly language, and tools which work on code in this intermediate representation.
There are several tools contained in LLVM: The assembler / disassembler convert between the two on disk representations of LLVM IR. One is human readable while the other format (bitcode) can be parsed by a computer more efficiently.
The llvm linker can combine several bitcode files into a single one. Before and after this step the optimizer can apply transformations on the bitcode. Finally there is a bitcode interpreter for directly executing programs and a compiler that generates native code.
LLVM provides backends for several targets like x86, x86_64, PowerPC and some more. For some platforms even a just in time compiler exists that is used by the bitcode interpreter when available.
To access LLVM from Python I’m using llvmpy.
For debugging parser and AST transformation problems I’m using graphviz / dot to visualize the AST. That really makes it easier to understand certain problems.
Another very important tool is my test runner: I’m not doing test driven development as in “never write code that does not fix a failing test case”. It’s more like “have always a program that does it’s job correctly, even if the job is trivial”. So while writing new code I’m making sure that I get fast feedback about new bugs and can use it to verify that new features really work.
So, this was my tool set. If you are also working on a programming language or related topic please write a comment about which tools you are using.
In my next post I’ll write about some implementation details of my lexer and parser.