2006-12-16
Parser Combinator enhancements
While working on the Factor to JavaScript compiler I ran into some performance issues with how I was using parser combinators.
The problem manifested when parsing word definitions. For example:
: foo ( a b -- c d ) one two ;
This would parse very quickly. But as the identifiers became longer, or more were added, parsing became much slower:
: foo ( a b -- c d ) oneoneoneoneone twotwotwotwotwo ;
The reason for this lies in how the 'identifier' parsing word was defined. The same problem can be demonstrated with a simple number parser:
[ digit? ] satisfy <+>
This is a parser that accepts one or more digits in a sequence. Like all parsers it returns a lazy list of successful parses. If a later parser fails to parse successfully, the lazy list is traversed, backtracking through each alternative parse in turn. For the digit parser you can see that it creates an item in the list for every possible prefix of the digits:
"1234567890" [ digit? ] satisfy <+> parse [ . ] leach
=> T{ parse-result f { 49 50 51 52 53 54 55 56 57 48 } { ... } }
T{ parse-result f { 49 50 51 52 53 54 55 56 57 } { ... } }
T{ parse-result f { 49 50 51 52 53 54 55 56 } { ... } }
T{ parse-result f { 49 50 51 52 53 54 55 } { ... } }
T{ parse-result f { 49 50 51 52 53 54 } { ... } }
T{ parse-result f { 49 50 51 52 53 } { ... } }
T{ parse-result f { 49 50 51 52 } { ... } }
T{ parse-result f { 49 50 51 } { ... } }
T{ parse-result f { 49 50 } { ... } }
T{ parse-result f { 49 } { ... } }
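The behaviour above can be modelled outside Factor. Here is a rough Python sketch (the names 'satisfy' and 'many1' are mine, chosen to mirror the Factor words) of how a one-or-more combinator lazily yields one result per non-empty prefix, longest match first, keeping every shorter alternative available for backtracking:

```python
def satisfy(pred):
    """Parser: consume a single character satisfying pred."""
    def parse(s):
        if s and pred(s[0]):
            yield (s[0], s[1:])   # (parsed value, remaining input)
    return parse

def many1(p):
    """<+>-style combinator: one or more of p."""
    def parse(s):
        for head, rest in p(s):
            for tail, rest2 in parse(rest):
                yield ([head] + tail, rest2)   # longer matches first
            yield ([head], rest)               # then the shorter alternative
    return parse

digits = many1(satisfy(str.isdigit))
results = list(digits("1234567890"))
print(len(results))            # 10 alternatives, one per prefix
print("".join(results[0][0]))  # longest match: 1234567890
```

Forcing the whole lazy stream, as a failed later parser would during backtracking, visits all ten alternatives.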
These are all valid parse results for the given input string. Most of the time when parsing a number you don't want all those extra results. You either want the entire number or nothing at all. There's no point having all those extra results sitting around to be backtracked through and taking time. By knowing in advance that we are only interested in the all-or-nothing result, we can use a different word in place of <+>. The document I based the Factor parser combinators on outlines the solution to this, and I've implemented it in Factor. First, a parser combinator called 'only-first' allows us to keep only the first successful result:
TUPLE: only-first-parser p1 ;

M: only-first-parser (parse) ( input parser -- list )
    #! Transform a parser into a parser that only yields
    #! the first possibility.
    only-first-parser-p1 parse 1 swap ltake ;
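The same idea can be sketched in Python (again a rough model, not Factor): 'only-first' simply truncates the lazy stream of successes to its first element, just as "1 swap ltake" truncates the lazy list above.

```python
from itertools import islice

def only_first(p):
    """Wrap parser p so it yields at most its first success --
    the analogue of truncating the lazy list with '1 swap ltake'."""
    def parse(s):
        return islice(p(s), 1)
    return parse

# A toy parser that succeeds once per non-empty prefix, longest first,
# mimicking the digit parser's list of results shown earlier.
def prefixes(s):
    for i in range(len(s), 0, -1):
        yield (s[:i], s[i:])

print(len(list(prefixes("1234"))))              # 4 results
print(len(list(only_first(prefixes)("1234"))))  # 1 result
```

Because the truncation is lazy, the shorter alternatives are never even generated, which is where the time savings come from.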
It takes an existing parser and uses 'ltake' to return only the first result in the list of successes. The existing combinators for matching sequences, <*> and <+>, can then be given alternatives that match only the longest sequence:
LAZY: <!*> ( parser -- parser ) <*> only-first ;
LAZY: <!+> ( parser -- parser ) <+> only-first ;
Using <!+> instead of <+> in the number parser is now more efficient:
"1234567890" [ digit? ] satisfy <!+> parse [ . ] leach
=> T{ parse-result f { 49 50 51 52 53 54 55 56 57 48 } { ... } }
"1234567890" [ digit? ] satisfy <+> parse list>array length .
=> 10
"1234567890" [ digit? ] satisfy <!+> parse list>array length .
=> 1
By making sure identifiers and numbers used these optimised parser combinators I was able to make parsing of longer word definitions much more efficient. The new words have been added to the 'parser-combinator' module.