How to TDD a compiler: learning to read

One of the first things I decided to dive deep into was how to get the textual representation of a program into a sensible representation in memory. This is known as parsing.

I got sidetracked a lot along the way and ended up writing my own general-purpose parser implementation based on the Marpa parsing algorithm.


During this experience, I implemented a scanner that was built around a Nondeterministic Finite Automaton (NFA) encoded in a Binary Decision Diagram (BDD). Using a BDD, I was able to transform an entire NFA frontier (a set of states) into the next frontier, given a newly scanned character, in constant time.

Finite Automata

Usually, a Deterministic Finite Automaton (DFA) is used to perform transitions given a newly scanned character. Typically this is generated from an NFA and minimised for performance reasons.

NFAs have a smaller memory footprint than their corresponding DFAs. This is because determinisation can blow up the number of states: the DFA needs a distinct state for every reachable combination of NFA states, whereas the NFA keeps those possibilities as separate concurrent paths until a word is fully recognised. The smaller memory footprint of an NFA makes CPU cache hits more likely, which can drastically speed up processing, since main memory accesses are avoided more often.

However, because of the multiple paths, an NFA needs to keep track of a set of states rather than the single state a DFA tracks. This is usually considered a problem: if the NFA has a large number of states, processing a single character can, in the worst case, take time quadratic in the number of nodes in the NFA.
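To make the set-of-states bookkeeping concrete, here is a minimal, hypothetical Java sketch of frontier simulation (illustrative only, not the BDD-backed scanner linked later). Note how each step loops over every live state's transitions, which is exactly the per-character cost that motivates a cleverer encoding:

```java
import java.util.*;

// A sketch of NFA frontier simulation: the set of live states is advanced
// one input character at a time. With many live states, a single step can
// touch a large number of transitions.
public class NfaFrontier {
    // transitions.get(state) maps an input character to the set of next states
    private final List<Map<Character, Set<Integer>>> transitions;
    private final Set<Integer> accepting;

    public NfaFrontier(List<Map<Character, Set<Integer>>> transitions,
                       Set<Integer> accepting) {
        this.transitions = transitions;
        this.accepting = accepting;
    }

    // Advance every state in the frontier on character c
    public Set<Integer> step(Set<Integer> frontier, char c) {
        Set<Integer> next = new HashSet<>();
        for (int state : frontier)
            next.addAll(transitions.get(state).getOrDefault(c, Set.of()));
        return next;
    }

    public boolean matches(String input) {
        Set<Integer> frontier = Set.of(0); // state 0 is the start state
        for (char c : input.toCharArray())
            frontier = step(frontier, c);
        return frontier.stream().anyMatch(accepting::contains);
    }
}
```

For example, an NFA where state 0 branches on 'a' into states 1 and 2 keeps both alternatives alive in the frontier until a later character disambiguates them.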

At least they are finite

BDD encoding

By using a BDD to encode the NFA, operations are performed on sets of nodes in constant time, and so the time to process a single character is constant.

You can have a look at the scanner implementation here, which I recently tidied up a bit. I made use of JDD, a Java BDD library.

The steps to build up the scanner went something like this:

  1. For each possible symbol type, define a regular expression.
  2. Convert the regular expressions to an NFA with nondeterministic “epsilon transitions”. This is a well-understood process called Thompson’s algorithm.
  3. Remove all the epsilon transitions from the NFA by allowing the NFA to encode sets of states rather than just individual states.
  4. Remove unreachable states.
  5. Relabel states so that the most frequent ones have the lowest identifier.
  6. Relabel transitions so that the most frequent ones have the lowest identifier.
  7. Encode the NFA using a BDD. The BDD variables are chosen to correspond with bit indexes of the binary representation of the state and transition identifiers.
  8. Order BDD variables so that the ones that provide the most entropy reduction have the lowest identifier.
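As an example of step 3, here is a hypothetical sketch of computing the epsilon-closure of a state set, which is the core operation behind removing epsilon transitions (the names are illustrative, not taken from the linked implementation):

```java
import java.util.*;

// Step 3 sketch: the epsilon-closure of a state set is every state reachable
// through epsilon transitions alone. Treating each closure as a single
// "set state" is what lets the epsilon transitions be removed.
public class EpsilonClosure {
    public static Set<Integer> closure(Map<Integer, Set<Integer>> epsilon,
                                       Set<Integer> states) {
        Set<Integer> result = new HashSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int state = work.poll();
            for (int next : epsilon.getOrDefault(state, Set.of()))
                if (result.add(next)) // only revisit newly discovered states
                    work.add(next);
        }
        return result;
    }
}
```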

Phew. Easy, right?

The relabelling steps (5, 6, 8) were an attempt to get the BDD variable representation of the transition table to be as compact as possible. This can be achieved by trying to get the representation to share as many nodes as possible.

Representing a row in the transition table, for example, requires a conjunction of every single BDD variable as either being present or not present. If the BDD variables corresponded directly to the states and transitions then this would be a very sparse representation, since each row represents only one transition and there would be very little node reuse.

This is why I encoded the BDD variables as bit indexes instead; a small number of bits can represent a lot of states. The representation of transition table rows is then less sparse, since the binary representation of an integer spans multiple bits. The lower-order bits are the most commonly populated, which is why the most frequent states were relabelled to have the lowest identifiers in steps 5 and 6. The downside is that the BDD variables no longer have any intuitive meaning; this is just a hack to keep the number of BDD variables small.
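The bit-index idea can be sketched in a few lines. This is an illustrative stand-in, not the JDD-backed scanner itself: a state identifier is split into its binary digits, and each bit position becomes one BDD variable.

```java
// Sketch of the bit-index encoding: each bit position of a state identifier
// corresponds to one BDD variable, so ~log2(n) variables cover n states.
public class BitEncoding {
    // Which bit-variables are "present" for a given state identifier?
    public static boolean[] encode(int stateId, int bitCount) {
        boolean[] vars = new boolean[bitCount];
        for (int bit = 0; bit < bitCount; bit++)
            vars[bit] = ((stateId >> bit) & 1) == 1;
        return vars;
    }

    // Bits needed to distinguish identifiers 0 .. stateCount-1
    public static int bitsNeeded(int stateCount) {
        return 32 - Integer.numberOfLeadingZeros(Math.max(stateCount - 1, 1));
    }
}
```

So 1000 states need only 10 bit-variables, and state 5 is the conjunction "bit 0 present, bit 1 absent, bit 2 present".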

The entropy reduction ordering of the BDD variables in step 8 is designed to place BDD variables that are infrequently used at the top of the BDD. These will typically be the higher order bits of the states, which we constructed to be the least frequently used. This means that the bulk of the BDD representations will say “not present” for the least frequently used variables, eliminating those variables from the graph in a small number of steps. The more frequently used states will then cluster after this, using a small number of variables to represent a large number of similar states.
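One simple way to realise the step-8 ordering is to count, across all transition-table rows, how often each bit-variable is set, and put the rarely-set bits first (at the top of the BDD). This is a hypothetical sketch of that heuristic, not the linked implementation:

```java
import java.util.*;

// Sketch of frequency-based variable ordering: bits that are rarely 1
// across the transition rows go to the top of the BDD, so most branches
// eliminate them in a small number of steps.
public class VariableOrdering {
    public static Integer[] order(List<boolean[]> rows, int bitCount) {
        int[] onesPerBit = new int[bitCount];
        for (boolean[] row : rows)
            for (int bit = 0; bit < bitCount; bit++)
                if (row[bit]) onesPerBit[bit]++;
        Integer[] bits = new Integer[bitCount];
        for (int i = 0; i < bitCount; i++) bits[i] = i;
        // least frequently set bits first (top of the BDD)
        Arrays.sort(bits, Comparator.comparingInt(b -> onesPerBit[b]));
        return bits;
    }
}
```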

Executing code by hand is not very fun

ZDD encoding

Digging deeper into how I could improve on the encoding, I discovered a slightly different data structure, called a Zero-suppressed Decision Diagram (ZDD). There is an excellent talk by Donald Knuth about these called “Fun with Zero-Suppressed Binary Decision Diagrams”. Unfortunately, I can’t seem to find an active link to it at the moment; it used to be here. They turn out to be excellent at efficiently representing sparse boolean functions, such as those that represent families of sets. Some good introductory notes on ZDDs can be found here.

Although ZDDs are extremely interesting in their own right, they turned out not to be suitable for the encoding we are using. They could work well if we kept one ZDD variable per state; the sparseness would then be well represented. We would also get some intuitive meaning back: the transition table would be a family of sets, with each set containing all the pieces needed for a transition (from, via, to) as states.
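To illustrate the one-variable-per-state view, each transition can be modelled as a small set, and the whole table as a family of sets, which is exactly the sparse shape ZDDs represent well. The tagging scheme here is hypothetical, purely to keep the "from", "via" and "to" roles on distinct ZDD variables:

```java
import java.util.*;

// Sketch of the family-of-sets view of a transition table. Each row
// {from, via, to} becomes one small set; the table is the family of
// all such sets.
public class TransitionFamily {
    // Hypothetical tagging: give each role its own ZDD variable per state
    static int from(int s) { return 3 * s; }
    static int via(int s)  { return 3 * s + 1; }
    static int to(int s)   { return 3 * s + 2; }

    public static Set<Set<Integer>> family(int[][] transitions) {
        Set<Set<Integer>> table = new HashSet<>();
        for (int[] t : transitions) // each row is {from, via, to}
            table.add(Set.of(from(t[0]), via(t[1]), to(t[2])));
        return table;
    }
}
```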

Unfortunately, I wasn’t able to figure out how to implement the frontier transition using the ZDD operations that JDD provides. So I had a go at implementing my own ZDD library with the operations I needed, which you can find here. I took the naive approach and coded it in an object-oriented style rather than the low-level approach JDD takes.

Everything was going well until I went down the rabbit hole of caching. Now everything is a big mess and I haven’t looked at it in a long time. If I get back to it, I will probably contribute the operations I needed back to JDD rather than rolling my own, since an object-oriented approach is very unlikely to compete with the performance of a low-level implementation.


Moving on from the scanning side of things, when I started looking into parsing, the Marpa algorithm caught my attention. Marpa, envisaged by Jeffrey Kegler, builds on the earlier work of Jay Earley, Joop Leo, John Aycock and R. Nigel Horspool. It parses a wide variety of grammars, specified in Backus–Naur Form (BNF), in linear time. The algorithm is designed to maintain lots of contextual information about the parse, for example, the rules and symbols that have been recognised or are expected next, along with their locations. This will come in handy for error reporting, which historically is very poor in many compilers.

At the time of writing, Marpa is only officially implemented in C and Perl. So you know what comes next… I had a go at writing an implementation in Java! You can have a look at the parser implementation here.

It took me lots of trial and error to wrap my head around the algorithm, but it was worth it in the end. The Marpa paper was a good guide; I was able to lift the core algorithm straight out of the pseudocode in the paper. Also, Earley Parsing Explained by Loup Vaillant was very useful in understanding Earley parsers in general.
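The core of an Earley-style recogniser is small enough to sketch. This is a minimal, hypothetical Java version of plain Earley recognition (no Leo optimisation, no nullable-rule handling, none of Marpa's extensions), not the implementation linked above:

```java
import java.util.*;

// Minimal Earley recogniser sketch. Grammar rules map a nonterminal to
// alternative right-hand sides; any symbol not in the grammar is a terminal
// matched against a single input character. Nullable rules are not handled.
public class Earley {
    record Item(String lhs, String[] rhs, int dot, int start) {
        boolean complete() { return dot == rhs.length; }
        String next() { return complete() ? null : rhs[dot]; }
        Item advance() { return new Item(lhs, rhs, dot + 1, start); }
    }

    public static boolean recognise(Map<String, List<String[]>> grammar,
                                    String startSymbol, String input) {
        List<LinkedHashSet<Item>> chart = new ArrayList<>();
        for (int i = 0; i <= input.length(); i++) chart.add(new LinkedHashSet<>());
        for (String[] rhs : grammar.get(startSymbol))
            chart.get(0).add(new Item(startSymbol, rhs, 0, 0));
        for (int i = 0; i <= input.length(); i++) {
            Deque<Item> work = new ArrayDeque<>(chart.get(i));
            while (!work.isEmpty()) {
                Item item = work.poll();
                if (item.complete()) {
                    // Completion: advance parent items waiting on this nonterminal
                    for (Item parent : List.copyOf(chart.get(item.start())))
                        if (item.lhs().equals(parent.next())
                                && chart.get(i).add(parent.advance()))
                            work.add(parent.advance());
                } else if (grammar.containsKey(item.next())) {
                    // Prediction: add every rule for the expected nonterminal
                    for (String[] rhs : grammar.get(item.next())) {
                        Item predicted = new Item(item.next(), rhs, 0, i);
                        if (chart.get(i).add(predicted)) work.add(predicted);
                    }
                } else if (i < input.length()
                        && item.next().equals(String.valueOf(input.charAt(i)))) {
                    // Scanning: the terminal matches the next input character
                    chart.get(i + 1).add(item.advance());
                }
            }
        }
        for (Item item : chart.get(input.length()))
            if (item.complete() && item.lhs().equals(startSymbol) && item.start() == 0)
                return true;
        return false;
    }
}
```

The chart sets here are exactly the contextual information mentioned above: at each position they record which rules are in progress, from where, and what is expected next.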

I’m not going to go into too much detail on this one; the implementation was relatively straightforward and the resources above are very detailed. I did have some fun running a profiling tool over some performance tests at this point. My takeaways from that experience were to avoid unnecessary object allocations and to precompute as much as possible when constructing the parser.

I got a bit carried away micro-optimising. For example, I found that Java 8 method references have a tendency to create a new anonymous object for each usage. I also found that I could beat the performance of the Stream API by handcrafting the equivalent imperative code, so I did that for a handful of the most frequently repeated operations. Finally, the Marpa algorithm naturally contained opportunities to use the flyweight pattern.
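The method-reference allocation point can be sketched like this. A method reference bound to an instance may allocate a fresh object each time the expression is evaluated (the exact behaviour varies by JVM), so hoisting it into a field lets one instance be shared, which is the same idea as the flyweight pattern; the class here is hypothetical:

```java
import java.util.function.IntSupplier;

// Sketch of caching a bound method reference instead of re-evaluating it.
public class Counter {
    private int count;

    int next() { return ++count; }

    // Evaluated once: a single IntSupplier instance shared across all calls
    private final IntSupplier cached = this::next;

    public int viaFreshReference() {
        IntSupplier s = this::next; // may allocate a new object per call
        return s.getAsInt();
    }

    public int viaCachedReference() {
        return cached.getAsInt();   // always reuses the same instance
    }
}
```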

It all made sense at the time

That’s all for now. You might have noticed that, so far, there hasn’t been much TDD going on! We will get there, I promise; for now, I’m just catching you up on the initial spikes and research I was doing. On the timeline, we are up to about November 2015, which is around the time I stopped working on the scanning and parsing experiments.

How to TDD a compiler: the story so far

In this series, I will be documenting my experiences and progress in writing a compiler for a yet-to-be-named programming language.

A while ago now, sometime in 2015, I started to toy with the idea of writing my own programming language. I like to code (a lot!) in my spare time; I find it a great way to meditate and I get a lot of satisfaction out of expressing my thoughts in code.

I think it is fair to say that our industry is still very much in its infancy. We are slowly converging upon some good concepts, such as Test Driven Development (TDD), but I think we can do a lot better. When writing code, the language itself has a huge impact on how the code is written, simply because the language determines how easy or hard it is to express our thoughts.

At one extreme, machine code is a set of instructions that literally tell a processor what to do. This is great if what you are trying to do is tell a processor what to do, but in reality, what we are really trying to do is express our thoughts in code to achieve some objective; the fact that the code will end up as a set of instructions is just an implementation detail.

At the other extreme, wouldn’t it be great if you could just code up what your intention is, rather than what the implementation of your intention is?

I dream of a world where code is intention, not implementation.

There have been some steps towards this in industry, with the mainstream acceptance of functional programming paradigms, such as the Java 8 Streams API.

Unfortunately, Murphy’s law still applies. It is not uncommon in functional programming (and perhaps even encouraged) to see problems solved using types like Tuple with ridiculous constructs such as the wonderful fst and snd in Haskell. These kinds of constructs are back down the path of implementation rather than intention.

I am a huge fan of Tiny Types. They are one step closer to the goal of expressing what our code is supposed to do. Unfortunately, in most languages, the boilerplate required to introduce a Tiny Type is so great that most people won’t bother to use them at all!
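To show the cost concretely, here is what a single Tiny Type typically looks like in classic Java; one wrapped String costs an entire class (the validation rule is just an illustrative choice):

```java
// A typical Tiny Type: a whole class of ceremony to wrap one String.
public final class FirstName {
    private final String value;

    public FirstName(String value) {
        if (value == null || value.isEmpty())
            throw new IllegalArgumentException("first name must not be empty");
        this.value = value;
    }

    public String value() { return value; }

    @Override public boolean equals(Object o) {
        return o instanceof FirstName other && value.equals(other.value);
    }

    @Override public int hashCode() { return value.hashCode(); }

    @Override public String toString() { return value; }
}
```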

This got me thinking: what if the cost of introducing a Tiny Type was zero? That led me to the following realisation: in code that expresses intent well using types, variable names are redundant.

This is a powerful idea. How many times have you seen code like FirstName firstName = new FirstName("name");?

Wouldn’t it be so much nicer to see code like FirstName = "name";?

I want to make this kind of thing a reality. Maybe I’ll find out why it hasn’t been done this way before, or maybe I’ll start a new programming paradigm, who knows. For me at least, it is all about the journey.

This seemingly small change to the way we code has some far reaching implications. Let’s explore some of them.

Types, like variable names in conventional programming, must have a unique name within the same scope. This means that when you see a type in a particular scope, you know it refers to precisely one concept. To some extent, you can think of it as shifting the responsibility of naming from variables to type introduction. You can imagine a syntax like FirstName: String being used to declare a new type in a method signature. This doesn’t introduce any problems in method or class scope, but things get interesting with global scope.

For example, any type that is part of a public API, say:

interface Farm {
    Integer countChickens();
    Integer countEggs();
}

is no longer possible, because there is a name clash; does Integer refer to the thing returned by Integer countChickens(); or the thing returned by Integer countEggs();?

To resolve this, a new type must be introduced in each case:

interface Farm {
    ChickenCount: Integer countChickens();
    EggCount: Integer countEggs();
}
It is also worth mentioning at this point that the idea with these kinds of type aliases is that an instance can act as any of the aliases in its alias chain.

So in our running example, if we have a method Sum: Integer sum(First: Integer, Second: Integer) then it is perfectly possible to invoke it like so: Sum = sum(ChickenCount, EggCount)

That’s all for now, I’ll explore more of the ideas I have in future posts.