One of the first things I decided to dive deep into was how to get the textual representation of a program into a sensible representation in memory. This is known as parsing.
I got side tracked a lot along the way and ended up writing my own general purpose parser implementation based on the Marpa parsing algorithm.
During this experience, I implemented a scanner that was built around a Nondeterministic Finite Automaton (NFA) encoded in a Binary Decision Diagram (BDD). Using a BDD, I was able to transform an entire set of NFA frontiers to a new set given a newly scanned character in constant time.
Usually, a Deterministic Finite Automaton (DFA) is used to perform transitions given a newly scanned character. Typically this is generated from an NFA and minimised for performance reasons.
NFAs have a smaller memory footprint than their corresponding DFAs. This is because the DFA must uniquely encode each possible word as a single path to an accepting state, whereas an NFA allows multiple paths up until a word is fully recognised. This smaller memory footprint of an NFA can make it more likely to result in CPU cache hits, drastically speeding up processing since main memory accesses are avoided more often.
However, because of the multiple paths in an NFA, this means it needs to keep track of sets of states rather than just one state as in a DFA. This is usually considered a problem since if the NFA has a large number of states, it could take a quadratic amount of time to process a single character in the worst case, in terms of the number of nodes in the NFA.
By using a BDD to encode the NFA, operations are performed on sets of nodes in constant time, and so the time to process a single character is constant.
The steps to build up the scanner went something like this:
- For each possible symbol type, define a regular expression.
- Convert the regular expressions to an NFA with nondeterministic “epsilon transitions”. This is a well-understood process called Thompson’s algorithm.
- Remove all the epsilon transitions from the NFA by allowing the NFA to encode sets of states rather than just individual states.
- Remove unreachable states.
- Relabel states so that the most frequent ones have the lowest identifier.
- Relabel transitions so that the most frequent ones have the lowest identifier.
- Encode the NFA using a BDD. The BDD variables are chosen to correspond with bit indexes of the binary representation of the state and transition identifiers.
- Order BDD variables so that the ones that provide the most entropy reduction have the lowest identifier.
Phew. Easy, right?
The relabelling steps (5, 6, 8) were an attempt to get the BDD variable representation of the transition table to be as compact as possible. This can be achieved by trying to get the representation to share as many nodes as possible.
Representing a row in the transition table, for example, requires a conjunction of every single BDD variable as either being present or not present. If the BDD variables corresponded directly to the states and transitions then this would be a very sparse representation, since each row represents only one transition and there would be very little node reuse.
This is why I encoded the BDD variables as bit indexes instead; a small number of bits can represent a lot of states. The representation of transition table rows will be less sparse since the binary representation of integers spans multiple bits. The lower order bits will be the most commonly populated, which is why the most frequent states were relabelled to have the lowest identifier in steps 5 and 6. The downside of this is that the BDD variables don’t have any intuitive meaning; this is just a hack to keep the number of BDD variables small.
The entropy reduction ordering of the BDD variables in step 8 is designed to place BDD variables that are infrequently used at the top of the BDD. These will typically be the higher order bits of the states, which we constructed to be the least frequently used. This means that the bulk of the BDD representations will say “not present” for the least frequently used variables, eliminating those variables from the graph in a small number of steps. The more frequently used states will then cluster after this, using a small number of variables to represent a large number of similar states.
Digging deeper into how I could improve on the encoding, I discovered a slightly different data structure, called a Zero-suppressed Decision Diagram (ZDD). There is an excellent talk by Donald Knuth about these called “Fun with Zero-Suppressed Binary Decision Diagrams”. Unfortunately, I can’t seem to find an active link to it at the moment; it used to be here. They turn out to be excellent at efficiently representing sparse boolean functions, such as those that represent families of sets. Some good introductory notes on ZDDs can be found here.
Although ZDDs are extremely interesting in their own right, they turned out to not be suitable for the encoding we are using. They could work well if we kept one ZDD variable per state; then the sparseness would be well represented. We would also have some intuitive meaning back; the transition table would be a family of sets that represent all the pieces needed for a transition (from, via, to) as sets of states.
Unfortunately, I wasn’t able to figure out how to implement the frontier transition using the ZDD operations that JDD provides. In response to that, I started to have a go at implementing my own ZDD base with the operations I needed, which you can find here. I took the naive approach and coded in an object-oriented way rather than the low-level approach JDD takes.
Everything was going well until I went down the rabbit hole of caching. Now everything is a big mess and I haven’t looked at it in a long time. If I get back to it, I will probably contribute the operations I needed back to JDD rather than rolling my own, since an object oriented approach is very unlikely to be able to compete with the performance of a low-level implementation.
Moving on from the scanning side of things, when I started looking into parsing, the Marpa algorithm caught my attention. Marpa, envisaged by Jeffrey Kegler, builds on the earlier work of Jay Earley, Joop Leo, John Aycock and R. Nigel Horspool. It parses a wide variety of grammars in linear time, generated from a Backus Normal Form (BNF). The algorithm is designed to maintain lots of contextual information about the parse. For example, rules and symbols that have been recognised or are expected next, with their locations. This will come in handy for error reporting, which historically is very poor in many compilers.
At the time of writing, Marpa is only officially implemented in C and Perl. So you know what comes next… I had a go at writing an implementation in Java! You can have a look at the parser implementation here.
It took me lots of trial and error to wrap my head around the algorithm, but it was worth it in the end. The Marpa paper was a good guide; I was able to lift the core algorithm straight out of the pseudo code in the paper. Also, Early Parsing Explained by Loup Vaillant was very useful in understanding Early Parsers in general.
I’m not going to go into too much detail on this one, the implementation
was relatively straightforward and the resources above are very detailed. I did have some fun running a profiling tool over some performance tests at this point. My takeaways from that experience were to avoid unnecessary object allocations and to precompute as much as possible when constructing the parser.
I got a bit carried away micro optimising, for example, I found that Java 8 method references had a tendency to create a new anonymous object for each usage of a method reference. I found that I could beat the performance of the Stream API by handcrafting the equivalent imperative code, so I did that for a handful of operations that were repeated most frequently. I also found that the Marpa algorithm naturally contained opportunities to use the flyweight pattern.
That’s all for now. I guess you might have noticed, so far there hasn’t been much TDD going on! We will get there I promise, for now, I’m just catching you up with the initial spikes and research that I was doing. On the timeline, we are up to about November 2015, which is around the time I stopped working on the scanning and parsing experiments.