Compiler Testing

How do you make sure that a compiler really works? You write some kind of automated tests. But which kinds of tests are useful without being too much additional work?

The goal of testing is to make sure that the generated programs do what was specified in the source code. Here we need to make sure that we don’t introduce any errors at the various stages of compilation: lexing / parsing, desugaring, type checking / AST annotation and finally code generation.

The ideas presented here are generally applicable to non-interactive command line programs. Even if you are writing a GUI application it can be a good idea to first write a shared library with a command line interface. That makes testing much easier. Later you can still write a GUI that wraps the functionality in a simple way for the end user.

Test Driven Development

This post will refer to TDD as writing tests first and only then writing new code. Of course in practice there are situations where you won’t write tests, or only write them after you discover bugs. How strictly you follow this approach depends on your project and its need for correctness.

TDD is an iterative approach to writing code: First write some new tests which will fail due to missing functionality. Then write code until the new tests pass and finally check if all other tests still pass. To make this testing as simple as possible it’s important to have an automated way of running the tests.

In Exoself the tests are integrated into the build system. It’s trivially easy to either run the whole test suite or just run a few selected tests. The build system even knows when parts of the compiler implementation changed and automatically rebuilds the tests on demand.

Running only a subset of all tests gets more important when your project grows and the test suite takes too long to complete. Even one or two minutes might be too long to get feedback on changes. So often you’ll want to run the whole test suite on a dedicated machine after source code checkins. A tool like Buildbot can automate this.
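
As an illustration, a selective test runner could look something like the sketch below. The build/tests directory and the idea that every test is a standalone executable returning a non-zero exit code on failure are assumptions made for this example, not a description of the actual Exoself build system.

    import subprocess
    import sys
    from pathlib import Path

    # Usage: python run_tests.py [substring]
    # Runs every test executable under build/tests/ whose name contains the substring.
    pattern = sys.argv[1] if len(sys.argv) > 1 else ""
    tests = sorted(t for t in Path("build/tests").iterdir()
                   if t.is_file() and pattern in t.name)

    failed = [t.name for t in tests if subprocess.call([str(t.resolve())]) != 0]
    print(f"{len(tests) - len(failed)}/{len(tests)} tests passed")
    if failed:
        print("failed:", ", ".join(failed))
        sys.exit(1)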

Of course there are many ways to write tests. I’ll present two techniques here: unit testing and behavior testing.

Unit Testing

This method tries to make sure that individual units of your program work correctly. Often this means a unit test per class, source file or module. Ideally it’s a very small or even elementary part of your application that you test.

When testing classes you write test cases for the public interface of the class, not the implementation details. When writing tests first, this also defines the public interface of the class from the user’s perspective instead of the implementor’s perspective, which helps make sure that the interface is usable.
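
To make this concrete, here is a small hypothetical example in Python: a unit test that only exercises the public interface of a lexer-like class. The Lexer class and its tokens() method are invented for illustration and are in no way the actual Exoself lexer.

    import unittest

    class Lexer:
        """Toy stand-in for a real lexer, just enough to make the test runnable."""
        def __init__(self, source):
            self.source = source

        def tokens(self):
            return self.source.split()

    class LexerTest(unittest.TestCase):
        def test_tokenizes_simple_expression(self):
            # Only the public interface (constructor and tokens()) is exercised;
            # how the lexer splits the input internally is irrelevant to the test.
            lexer = Lexer("1 + 2")
            self.assertEqual(lexer.tokens(), ["1", "+", "2"])

    if __name__ == "__main__":
        unittest.main()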

Most classes depend on other classes that provide file / database / network access, which should not be exercised by the test. To avoid this dependence, mock objects are often introduced: they implement the same interface as the original class, but only check that methods are called in the right order and with the correct parameters.
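
A minimal sketch of what such a mock might look like, using Python’s standard unittest.mock module; the ModuleLoader class and its reader dependency are invented names for this example.

    from unittest import mock

    class ModuleLoader:
        """Invented example: loads a module by asking a reader object for its source."""
        def __init__(self, reader):
            self.reader = reader

        def load(self, name):
            return self.reader.read(name + ".exo")

    def test_loader_asks_reader_for_the_right_file():
        # The mock replaces real file access and records how it was called.
        reader = mock.Mock()
        reader.read.return_value = "func main() {}"
        loader = ModuleLoader(reader)
        assert loader.load("main") == "func main() {}"
        reader.read.assert_called_once_with("main.exo")

    if __name__ == "__main__":
        test_loader_asks_reader_for_the_right_file()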

I don’t like this way of writing tests: It’s too much additional work and too low level for me.

Behavior Testing

Testing the behavior means that I verify that the output of my program as a whole is correct, not that of any individual part or unit. In the case of a compiler that’s very simple: write code in the source language, compile the program and run it. At runtime, check with assert statements whether the calculations match the expected results. Instead of runtime checks you can of course also compare the output of your compiler / program with a reference file.
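
A driver for such tests might look roughly like this sketch; the exoselfc compiler name, the file extensions and the convention of a per-test .expected reference file are assumptions for the example, not the real Exoself setup.

    import subprocess
    from pathlib import Path

    COMPILER = "exoselfc"  # hypothetical name of the compiler executable

    def run_behavior_test(source: Path) -> bool:
        """Compile a test program, run it and compare its output to a reference file."""
        exe = source.with_suffix(".bin")
        if subprocess.call([COMPILER, str(source), "-o", str(exe)]) != 0:
            return False  # compilation itself failed
        result = subprocess.run([str(exe.resolve())], capture_output=True, text=True)
        if result.returncode != 0:
            return False  # an assert in the test program fired
        expected = source.with_suffix(".expected")
        if expected.exists():  # reference output is optional
            return result.stdout == expected.read_text()
        return True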

If you’ve participated in programming competitions like ACM or TopCoder you’ll already know how this works. In these competitions you write a small program and submit it to an automated judge. There it is compiled and run for every test case, which is a plain text input. The output of your program is then compared to a reference output and you either pass or fail. The major difference to our situation is that there the test inputs and expected outputs are unknown, as are the test cases that failed.

The behavior testing approach has several benefits compared to unit testing: the program as a whole is tested, not just individual units. Of course that means when a test fails it’s more difficult to find the bug in the code, but on the other hand you know that certain inputs work correctly. You don’t need to provide mock objects, and it’s often less work than unit testing: just writing programs in the source language of your compiler should be easy.

Making sure that compiled programs work is difficult with a new programming language: missing input / output and maybe even missing control flow or assert statements make it hard. But for the first tests you can always construct the AST of the source by hand and check that, or use the return value of the program.
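
For example, an early test can encode its expected result directly in the exit code of the generated program. The syntax of the little test program below is purely illustrative, and the exoselfc command is again just an assumed name; the Python check only relies on the process return code.

    import subprocess

    # Hypothetical test program in the source language: main() returns 2 + 3,
    # so the process exit code should be 5 if code generation works.
    test_program = """
    func main() -> int {
        return 2 + 3
    }
    """

    with open("return_test.exo", "w") as f:
        f.write(test_program)

    subprocess.check_call(["exoselfc", "return_test.exo", "-o", "return_test"])
    exit_code = subprocess.call(["./return_test"])
    assert exit_code == 5, f"expected exit code 5, got {exit_code}"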

Test Coverage

Especially with unit testing it can be tempting to aim for 100% test coverage. Coverage may be defined in several ways, for example function / statement / path coverage. Path coverage is the most interesting one to guarantee correctness but means you’ll need a really big test suite – in practice that’s either impossible or just too much work for any non-trivial program. And a compiler has the added problem that there are infinitely many valid inputs, which doesn’t make things any easier.

Behavior testing is easier: you have some kind of definition of the input language, which yields a finite number of basic tests covering the grammar and semantics. Adding some composite tests to the test suite should provide a good enough indicator of the correctness of the compiler.

Testing only provides a lower bound of correctness: everything which you checked works – everything else may or may not work. You can always fool your test suite by hard-coding the expected output for every test input: then all tests pass but nothing else works.

Automatic Testing

Instead of writing many behavior tests by hand, this task can also be automated to some degree. For example to test operator precedence you could write a program that generates random mathematical expressions. The generated source code consists of the expression and the expected result.

Of course, to compute the expected result you have to duplicate the language semantics in the expression generator. That’s easy if the expression generator is written in a language that has the same semantics for these expressions. Writing this generator in your new language is probably a bad idea, since you cannot trust the expected values computed that way.

As I’ll most probably use Python to write the random expression generator it’s not that trivial: even simple expressions behave differently, due to overflow of basic types like int32 or int64, which cannot occur in Python. I’ll probably make the generator generic, so that I can generate code in different languages. As I expect the C / C++ implementation to be correct, I’ll just use it to calculate the expected values. Exoself will often have the same semantics – at least with regard to mathematical expressions.
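
A sketch of that plan might look like the following: a tiny generator that builds a random expression and emits a C program which, compiled with a trusted C compiler, prints the expected int32 result. The operator set is deliberately small, and everything here is illustrative rather than the actual generator.

    import random

    OPERATORS = ["+", "-", "*"]

    def random_expression(depth=3):
        """Build a random arithmetic expression as a string, e.g. '(7 * (3 - 12))'."""
        if depth == 0:
            return str(random.randint(0, 100))
        left = random_expression(depth - 1)
        right = random_expression(depth - 1)
        return f"({left} {random.choice(OPERATORS)} {right})"

    def c_reference_program(expression):
        """Emit a C program that evaluates the expression with int32 semantics and
        prints the result; its output becomes the expected value of the test."""
        return (
            "#include <stdio.h>\n"
            "#include <stdint.h>\n"
            "int main(void) {\n"
            f"    int32_t result = {expression};\n"
            '    printf("%d\\n", result);\n'
            "    return 0;\n"
            "}\n"
        )

    if __name__ == "__main__":
        expr = random_expression()
        print(c_reference_program(expr))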

This idea can even be extended to whole programs including control flow: see the Volatile-Testing Tools, which generate random programs to check whether a C compiler handles volatiles correctly.

Conclusion

The Exoself test suite consists of many behavior tests, which constantly prove to be invaluable. It’s so much easier to make changes, small or large, and know that everything still works. The test suite also serves, at least for now, as a replacement for a language specification.

It will be interesting to see how I can improve my test suite by adding randomly generated programs in the future. If you know about any tools in this area, I’d be interested to hear about them.