2007-10-31
Javascript Packrat Parser
I've made some changes to my parser combinator library written in Javascript that I wrote about previously.
I was extending the example code from that post to parse more of the ECMAScript 3 grammar to test it out and was hitting performance issues. I ended up modifying the combinator library to be a Packrat style parser. The Parsing Expression Grammar Wikipedia page has a description of PEGs and what a Packrat parser does. Basically results of each parse step are memoized to prevent reparsing after backtracing, sacrificing memory for speed:
Any parsing expression grammar can be converted directly into a recursive descent parser[citation needed]. Due to the unlimited lookahead capability that the grammar formalism provides, however, the resulting parser could exhibit exponential time performance in the worst case.
By memoizing the results of intermediate parsing steps and ensuring that each parsing function is only invoked at most once at a given input position, however, it is possible to convert any parsing expression grammar into a packrat parser, which always runs in linear time at the cost of substantially greater storage space requirements.
A packrat parser[1] is a form of parser similar to a recursive descent parser in construction, except that during the parsing process it memoizes the intermediate results of all invocations of the mutually recursive parsing functions. Because of this memoization, a packrat parser has the ability to parse many context-free grammars and any parsing expression grammar (including some that do not represent context-free languages) in linear time.
I changed a few other things in the library to more closely map to the vocabulary of PEGs. 'alternative' is now called 'choice' for example. There are still quite a few loose ends to tidy up and documentation of course.
The updated library can be retrieved from my git repository:
git clone git://github.com/doublec/jsparse.git
It can run in a browser but I've mostly been testing it using Mozilla Rhino.
I've included three basic examples. They all operate on the example expression grammar from the wikipedia article:
Value := [0-9]+ / '(' Expr ')'
Product := Value (('*' / '/') Value)*
Sum := Product (('+' / '-') Product)*
Expr := Sum
The first, example1.js, is a direct translation of that grammer. It produces a pretty ugly default Abstract Syntax Tree however:
var Value = choice(repeat1(range('0','9')), Expr);
var Product = sequence(Value, repeat0(sequence(choice('*', '/'), Value)));
var Sum = sequence(Product, repeat0(sequence(choice('+', '-'), Product)));
var Expr = Sum;
The Expr parser can be called by passing it a string to be parsed wrapped in a 'ParserState' object. This object is used to keep track of the current parse position and the memoized results. A helper function, 'ps', can be used to construct it:
var result = Expr(ps("1+2*3"));
The second example, example2.js adds to this to produce a better AST. It also uses the 'chainl' parser combinator to handle grouping correctly. A quick online page demonstrating this example is here. Enter an expression matching the grammar (there is no error checking yet), and press the button to see the AST in JSON format.
The third example, example3.js, evaluates the expression as it parses instead of generating an AST. This is also available online to try.
I've also included in the repository the work in progress of the ECMAScript 3 grammar. It is not complete or correct yet but I use it for testing the library.
Based on what I've learnt from doing this I plan to revisit the way I did Parser Combinators in Factor.