Factor Parser Combinator example

2006-11-22

Factor Parser Combinator example

I was asked on IRC how to use parser combinators to write a simple translator from s-expressions to Factor quotations. The translation was quite simple - here are some examples of how it was supposed to work:

(set a 10) 
  => [ "set" "a" 10 ]
(set square (lambda (n) (* n n))) 
  => [ "set" "square" [ "lambda" [ "n" ] [ "*" "n" "n" ] ] ]

This looked like it should be fairly simple so I put together a quick implementation.

The parser combinators were recently rewritten so this code is likely to work best with the Darcs version of Factor. Instructions on how to get this going are at the Factor website. Basic instructions for an Intel Linux based machine would go something like:

darcs get http://www.factorcode.org/repos
cd repos/Factor
make linux-x86
wget http://www.factorcode.org/boot.image.x86
./f boot.image.x86
./f

To run the examples from within Factor you will need to load the parser-combinators module and use it:

"contrib/parser-combinators" require
USE: parser-combinators
USE: lazy-lists

A parser combinator is a tuple that has specialised the 'parse' generic word. This word has stack effect ( input parser -- result ). The input is a sequence of tokens (usually a string) and the result is what is known as a lazy list of successes.

A 'list of sucesses' is a list of all possible sucessful parse results. It being lazy, each result in the list is not computed (ie. the parse is not done) until that list element is requested. This is how parser combinators handle backtracking. If the first parse result in the list is not what is required, the second can be tried, etc.

Parsers can be written manually or existing parsers can be combined together using combinators. A number of existing parsers and combinators are provided in the 'parser-combinators' vocabulary to build simple parsers.

To start with the s-expression parser we can handle the '(' and ')' using the 'token' word. This word returns a parser that parses only the given string:

LAZY: 'lparens' ( -- parser )
  "(" token ;

LAZY: 'rparens' ( -- parser )
  ")" token ;

Notice these are defined with the 'LAZY:' word instead of ':'. This means the result of the word is a promise and needs to be forced to get the actual value. Although not strictly necessary with these simple parsers it is required for those that recursively call themselves so I generally always define parsers using it. Examples of usage:

"(" 'lparens' parse car . => T{ parse-result f "(" T{ slice f "(" 1 1 } }

The result of the parse is a lazy list. Returning the 'car' of that list shows the first parse result. The 'parse-result' tuple has two slots. The first slot is the parse tree returned by the parser. The second is the rest of the input string remaining to be parsed if any. Usually the latter is a slice for efficiency purposes. The parsers created with 'token' return the token itself as the parse tree.

The s-expression parser needs to parse numbers and identifiers. Numbers are digits repeated multiple times. A parser for digits could be:

LAZY: 'digit' ( -- parser )
  [ digit? ] satisfy ;

This uses the 'satisfy' parser generator word. Where 'token' parses an input string for a specified string, 'satisfy' will call a given quotation with the first character of the input string. If the quotation returns 'true' then the parse is succesful otherwise it fails. Examples of using this definition of 'digit' are:

"123" 'digit' parse car parse-result-parsed . => 49

The 'parse-result-parsed' word returns just the parse tree of the result from the 'parse-result' tuple. The result, '49', is the character code of the successfully parsed digit. Ideally I want the actual number returned rather than the character code.

The <@ word applies a transformation to the parse tree of a parser. It converts an existing parser to one that does what the original parser did but has the transformation applied. <@ takes a quotation on the stack with stack effect ( old-tree -- new-tree ). This quotation is called to perform the transformation. One way to remember the effect of <@ is to note that < sign points to the parser being transformed. Here is 'digit' converted to return a number:

LAZY: 'digit' ( -- parser )
  [ digit? ] satisfy [ digit> ] &lt;@ ;

"123" 'digit' parse car parse-result-parsed . => 1

digit> is a standard Factor word that coverts the character code of a digit into the digit itself. A number is one or more digits in a row. The <+> word takes a parser as input and returns a parser that parses one or more instances of the original. It has the same meaning as '+' in regular expressions. This allows a 'number' parser word to be written as:

LAZY: 'number' ( -- parser )
  'digit' &lt;+> ;

"123" 'number' parse car parse-result-parsed . => { 1 2 3 }

Notice here that the result is an array of the digits in the number. This is the parse tree that <+> generates - an array of the results of the original parser. By using <@ this can be converted into a numeric value using Factor's 'reduce' word (basically a left fold):

LAZY: 'number' ( -- parser )
  'digit' &lt;+> [ 0 [ swap 10 * + ] reduce ] &lt;@ ;

"123" 'number' parse car parse-result-parsed . => 123

Interestingly 'number' actually returns more than one parse result since "123" contains more than one occurrence of digits in a sequence:

"123" 'number' parse [ parse-result-parsed ] lmap [ . ] leach 
  => 123
     12
     1

It returns the longest match first which is what we usually want. As the list is lazy if we don't request anything beyond the first match then the remaining parse results aren't actually computed.

Next on the list is identifiers. These are any sequences of characters that are not numbers, spaces, or parenthesis. The 'satisfy' word can be used for this, combined with <+>:

LAZY: 'identifier' ( -- parser )
  [ 
    [ blank? not ] keep 
    [ digit? not ] keep 
    [ CHAR: ( = not ] keep 
    CHAR: ) = not 
    and and and    
  ] satisfy <+> [ >string ] <@ ;

"foo" 'identifier' parse car parse-result-parsed . => "foo"

The <@> word is used to transform the result into a string otherwise we'd get an array of the character codes in the identifier as a result.

Often it is desired to parse either a number or an identifier. The 'atom' word does this:

LAZY: 'atom' ( -- parser )
  'identifier' 'number' &lt;|> ;

"123" 'atom' parse car parse-result-parsed . => 123
"foo" 'atom' parse car parse-result-parsed . => "foo"

It uses <|>, known as the choice word. It takes two parsers on the stack, and generates a parser which when run will try the first parser on the input string. If it fails it will then try the second parser. It returns the list of successes resulting from both parses.

A first cut at parsing a single expression like '(set a 10)' requires a sequencing parser combinator. This is '<&>'. It takes two parsers and returns a resulting parser which when run will run the first parser on the input string, and then run the second parser on the remaining unparsed portion of the input string of each result from the first parser. A simple expression parser might look like:

LAZY: 'expr1' ( -- parser )
  'lparens' 
  'atom' &lt;&>
  'rparens' &lt;&> ;

"(123)" 'expr1' parse car parse-result-parsed . =>
  { { "(" 123 } ")" }

This results in a very ugly parse tree. The '<&>' word returns a parse tree which is an array of the two parsers it uses. Since we have two '<&>' calls we get nested results. This can be fixed using two variants of '<&>'. They are '<&' and '&>'. These words do the same as '<&>' but don't put the result of one of the parsers in the parse tree. The '>' or '<' point to the parser that will have the result returned. Here's a new version that has a nicer parse tree using these words:

LAZY: 'expr2' ( -- parser )
  'lparens' 
  'atom' &>
  'rparens' &lt;& ;

"(123)" 'expr2' parse car parse-result-parsed . => 123

Notice the variants of '<&>' used both select the result of the 'atom' parser to be included in the parse tree and not the results of the parenthesis parsers.

There is still a problem with the expression parser. It doesn't handle more than one 'atom' in the expression. Similar to '<+>' there is '<>' which takes a parser and returns a new parser which when called parses zero or more instances of the original parser. The parse tree for the result of '<>' is an array of the results of the original parser:

LAZY: 'expr3' ( -- parser )
  'lparens' 
  'atom' sp &lt;*> &>
  'rparens' &lt;& ;

"(set a 123)" 'expr3' parse car parse-result-parsed . => 
  { "set" "a" 123 }

This has one other change in that it needs to handle white space between atoms. The 'sp' word takes a parser (in this case 'atom') and returns a parser that first removes all white space from the start of the input string and then calls the original parser. The effect is to produce a parser that ignores white space.

This gets close to what out test case requires but it returns an array not a quotation. Using '<@' fixes this:

LAZY: 'expr4' ( -- parser )
  'lparens' 
  'atom' sp &lt;*> &>
  'rparens' &lt;& [ >quotation ] &lt;@ ;

"(set a 123)" 'expr4' parse car parse-result-parsed . => 
  [ "set" "a" 123 ]

Unfortunately this fails on our second test case which requires handling nested expressions like '(+ 1 (+ 2 3) 4)'. By recursively calling the 'expr' word this is easy to add:

LAZY: 'expr5' ( -- parser )
  'lparens' 
  'atom' 'expr5' &lt;|> sp &lt;*> &>
  'rparens' &lt;& [ >quotation ] &lt;@ ;

"(+ 10 (+ 20 30))" 'expr5' parse car parse-result-parsed . => 
  [ "+" 10 [ "+" 20 30 ] ]

This change was as simple as replacing the 'atom' parser used between the two parenthesis with a parser which checks for an atom or an expression. Note that this recursively calls itself, which is safe from infinite recursion due to the lazy evaluation. It doesn't actually get evaluated unless the 'atom' test fails first. One final change is to allow for only atoms as well as expressions and then our simple test cases work and it is completed:

LAZY: 'expr' ( -- parser )
  'atom'   
  'lparens' 
  'atom' 'expr' &lt;|> sp &lt;*> &>
  'rparens' &lt;& [ >quotation ] &lt;@ &lt;|> ;

"(set square (lambda (n) (* n n)))" 'expr' parse car parse-result-parsed . =>
 [ "set" "square" [ "lambda" [ "n" ] [ "*" "n" "n" ] ] ]

"123" 'expr' parse car parse-result-parsed . => 
  123

"hello" 'expr' parse car parse-result-parsed . => 
  "hello"

These examples should work using the parser combinators code in the current Factor darcs repository and in the upcoming 0.85 release. I'm currently writing more documentation for the parser combinators and it will be accessable using the standard Factor help system. The source for this example can be found here.

Bluish Coder

Factor Parser Combinator example

Tags