Compilers and Interpreters in Factor

2006-10-29

Compilers and Interpreters in Factor

Writing compilers and interpreters in Factor is quite easy. By writing the grammar using parser combinators, and having those combinators produce an abstract syntax tree (AST), it's easy to write a simple compiler or interpreter for the tree. I've already covered writing simple parser combinators so I'll concentrate on processing the AST for a simple language in this post. A later post might generate a parser for it or it could be an 'exercise for the reader'.

For an example to start with I took the evaluator defined in this Lambda the Ultimate posting and converted it to Factor. For the interpreter I tried two different approaches to compare them. The first was to use pattern matching. For the second I use generic functions and Factor's OO system.

For the compiler I show how to compile to Factor code, which can then be interpreted or compiled by Factor itself, and I also compile to Javascript. This javascript can be run in a standalone Javascript interpreter like Rhino, or in a web browser.

To more easily re-use the AST across the two examples I added the ability to pattern match on tuples to the 'match' contrib library. This can be obtained from the standard Factor darcs repository:

darcs get http://www.factorcode.org/repos

The factor 'latest' boot images can be used to bootstrap from this repository. The instructions on how to do this are on factorcode.org. For an Intel x86 linux machine the steps would be:

darcs get http://www.factorcode.org/repos
cd factor/Factor
make linux-x86   
wget http://www.factorcode.org/images/latest/boot.image.x86
./f boot.image.x86

The examples here should also work with the released Factor 0.85 as long as you replace 'contrib/match/match.factor' with the version from my repository. To load the 'match' library in Factor use:

"contrib/match" require 
USE: match

In the Factor code snippets below I use '=>' to show the result of running Factor words. You shouldn't actually type that. Instead the result will be shown on the Factor stack or you can use '.' to print it.

The code for the examples in this post can be downloaded from interpreter.factor. If you copy it into the 'contrib' directory of your Factor installation you can load it with:

"contrib/interpreter" require

The evaluator that is being implemented has the following description from the Lambda the Ultimate posting:

data Term a where
    Lit  :: Int -> Term Int
    Inc  :: Term Int -> Term Int
    IsZ  :: Term Int -> Term Bool
    If   :: Term Bool -> Term a -> Term a -> Term a
    Pair :: Term a -> Term b -> Term (a,b)
    Fst  :: Term (a,b) -> Term a
    Snd  :: Term (a,b) -> Term b 
  eval :: Term a -> a
eval (Lit i)    = i
eval (Inc t)    = eval t + 1
eval (IsZ t)    = eval t == 0
eval (If b t e) = if eval b then eval t else eval e
eval (Pair a b) = (eval a, eval b)
eval (Fst t)    = fst (eval t)
eval (Snd t)    = snd (eval t)

Basically we have literal values, which in this example are integers. These can be incremented and tested to see if they are zero. A 'pair' can be created and the first and second elements obtained from the pair. An 'if' expression is available for testing an expression and evaluating a true or false clause. These are represented using Factor tuples:

TUPLE: lit i ;
TUPLE: inc t ;
TUPLE: isz t ;
TUPLE: iff b t e ;
TUPLE: pair a b ;
TUPLE: fst t ;
TUPLE: snd t ;

I used the same names for the types and elements as the example but changed 'if' to 'iff' to prevent clashing with Factor's standard 'if' word. An example tree can be created that is the increment of the number 5 with:

5 <lit> <inc>

When evaluated this should produce the result '6'.

The interpreter implementation that uses pattern matching uses 'match-cond'. This requires match variables to be set up which can be used to destructure sequences and tuples. For example:

5 <lit> T{ lit f ?t } match [ ?t ] bind => 5

'T{ lit f 123 }' is the Factor syntax for a tuple. It creates a tuple at parse time and the actual object is stored in the word. This is different from 123 <lit> which will produce the literal object at runtime. '?t' is a matching variable and the 'match' and 'match-cond' words use these to destructure matching elements of tuples and sequences. The match variable can then be used to get the actual value obtained. For more on the Factor pattern matching system you can read the documentation from within Factor with:

\ match help

The Factor implementation of the interpreter using pattern matching is:

: eval1 ( a -- a )
  {
    { T{ lit  f ?i }       [ ?i ] }
    { T{ inc  f ?t }       [ ?t eval1 1+ ] }
    { T{ isz  f ?t }       [ ?t eval1 zero? ] }
    { T{ iff  f ?b ?t ?e } [ ?b eval1 [ ?t ] [ ?e ] if eval1 ] }
    { T{ pair f ?a ?b }    [ ?a eval1 ?b eval1 2array ] }
    { T{ fst  f ?t }       [ ?t eval1 first ] }
    { T{ snd  f ?t }       [ ?t eval1 second ] }
  } match-cond ;

It is pretty much a direct translation of the Haskell example. An example of using it:

5 <lit> <inc> eval1 => 6

The Lambda the Ultimate post gave a usage example:

stuff = Pair (If (IsZ (Inc (Inc (Lit 5))))
                 (Lit 6)
                 (Lit 7))
             (Fst (Snd (Pair (Lit 42)
                             (Pair (Lit 8) (Lit 9)))))

Translated to Factor it is:

: driver ( -- v )
  5 <lit> <inc> <inc> <isz> 
    6 <lit>
    7 <lit>
  <iff> 
  42 <lit> 8 <lit> 9 <lit> <pair> <pair> <snd> <fst> <pair> ;

Running the 'eval1' word on this produces the answer, { 7 8 }:

driver eval1 => { 7 8 }

Although 'eval1' is quite short and easy to understand it has a disadvantage in that to extend it with new AST types you need to modify the 'eval1' word. Using the Factor OO generic word system you extend the interpreter without having to modify existing code. Here's a generic word approach to the interpreter:

GENERIC: eval2 ( a -- a )
M: lit  eval2 ( a -- a ) lit-i ;
M: inc  eval2 ( a -- a ) inc-t eval2 1+ ;
M: isz  eval2 ( a -- a ) isz-t eval2 zero? ;
M: iff  eval2 ( a -- a ) dup iff-b eval2 [ iff-t ] [ iff-e ] if eval2 ;
M: pair eval2 ( a -- a ) dup pair-a eval2 swap pair-b eval2 2array ;
M: fst  eval2 ( a -- a ) fst-t eval2 first ;
M: snd  eval2 ( a -- a ) snd-t eval2 second ;

The 'M:' word is used to define a method for a generic word. This is similar to how generic functions and methods work in CLOS and Dylan. When a generic word is called the actual method invoked is based on the topmost item of the stack. The first symbol past the 'M:' is the type of that object that that method is for. The second is the name of the generic word the method will be added too. The rest of the definitions should be reasonably clear - they break apart the tuples and return results or call 'eval2' recursively. 'eval2' produces the same result as 'eval1':

driver eval2 => { 7 8 }

Which approach to use is really personal preference. Of the two, 'eval2' is the most efficient. This is because methods can be compiled in Factor whereas the current implementation of 'match-cond' cannot be. So 'eval1' will run in the Factor interpreter whereas 'eval2' will be compiled to native code. This can be seen by trying to compile the examples and running a timed test:

\ eval1 compile => "The word eval1 does not have a stack effect"
\ eval2 compile 
\ eval1 compiled? => f
\ eval2 compiled? => t
[ 1000 [ driver eval1 ] times ] time => 487ms run / 5 ms GC time
[ 1000 [ driver eval2 ] times ] time => 3ms run / 0 ms GC time

There is obviously room for improvement in the 'match' routines! I can say that because I wrote them :)

A compiler follows the same structure as the interpreter but instead of computing the result we generate code. This code is then later fed to the compiler. For the compiler examples I'm only going to show the pattern matching based example. The generic word implementation is very similar and would follow the relevant interpreter implementation.

Factor provides a 'make' word that can be used for dynamically appending to a new sequence. From within a 'make' scope you can use ',' to append an element and '%' to splice in a sequence to the constructed sequence. 'make' can be used for constructing arrays, strings, quotations, etc. The type of the constructed sequence is identified by an exemplar sequence passed to 'make'. For example:

[ 1 , 2 , { 3 4 } % { 5 } , ] { } make => { 1 2 3 4 { 5 } }

The top level 'compile' word will set up a 'make' scope for the AST compile routines to append to. Here's the compiler to Factor:

: (compile1) ( a -- )
  {
    { T{ lit  f ?i }       [ ?i , ] }
    { T{ inc  f ?t }       [ ?t (compile1) \ 1+ , ] }
    { T{ isz  f ?t }       [ ?t (compile1) \ zero? , ] }
    { T{ iff  f ?b ?t ?e } [ ?b (compile1) 
                             [ ?t (compile1) ] [ ] make , 
                             [ ?e (compile1) ] [ ] make , 
                             \ if , ] }
    { T{ pair f ?a ?b }    [ ?a (compile1) ?b (compile1) \ 2array , ] }
    { T{ fst  f ?t }       [ ?t (compile1) \ first , ] }
    { T{ snd  f ?t }       [ ?t (compile1) \ second , ] }
  } match-cond ;  

: compile1 ( a -- quot )
  [ (compile1) ] [ ] make ;

As you can see it is very similar to the interpreter. Instead of returning the result of the AST evaluation '(compile1)' appends the equivalent Factor code to the constructed quotation created in the 'compile1' word.

The implementation for handling 'lit' just appends the numeric value directly. For 'inc' it calls '(compile1)' on the term to be incremented so that gets appended to the quotation. It then appends the Factor '1+' word which will be used to increment it. The '\' is an escape mechanism to tell Factor to store the word itself rather than call it.

The main complication is in the implementation of 'iff'. This word must delay the computation of the two terms of the 'iff' statement. To to this it creates Factor quotations with 'make' and then recursively calls '(compile1)' for the terms. The results of the compilation for this recursive call will be stored in the most recently created quotation rather than the one in the top level 'compile1' word. That is, nested 'make' calls are dynamically scoped.

An example of usage of the compiler:

5 <lit> <inc> compile1 => [ 5 1 + ] 
[ 5 1 + ] call => 6

The 'driver' AST shown previously compiles to Factor correctly:

driver compile1 => [
  5 1+ 1+ zero? [
    6
  ] [ 
    7
  ] if 42 8 9 2array 2array second first 2array
]

This can be compiled to native code using Factor or run in the Factor interpreter:

driver compile1 compile-quot execute => { 7 8 } 
driver compile1 call => { 7 8 }

The final example is a compiler to Javascript. Again it follows the same basic outline of the interpreter. Instead of generating a Factor quotation it uses 'make' to generate a string:

: (compile2) ( a -- )
  {
    { T{ lit  f ?i }       [ ?i number>string % ] }
    { T{ inc  f ?t }       [ ?t (compile2) "+1" % ] }
    { T{ isz  f ?t }       [ ?t (compile2) "== 0" % ] }
    { T{ iff  f ?b ?t ?e } [ 
        "function() {if(" % 
        ?b (compile2) 
        ") {return " % 
        ?t (compile2)
        "} else { return " %
        ?e (compile2)
        "}}()" %
      ] 
    }
    { T{ pair f ?a ?b }    [ "{ first:" % ?a (compile2) ",second:" % ?b (compile2) " }" % ] }
    { T{ fst  f ?t }       [ ?t (compile2) ".first" % ] }
    { T{ snd  f ?t }       [ ?t (compile2) ".second" % ] }
  } match-cond ;  

: compile2 ( a -- quot )
  [ "(" % (compile2) ")" % ] "" make ;

The main complication with this compiler was handling 'if'. The Javascript 'if' has no result so cannot be assigned immediately. So I wrap the 'if' in a function which is called immediately. The two terms of the 'if' use 'return' to return the result from the function. A simple example:

5 <lit> <inc> compile2 => "(5+1)"

And 'driver' also compiles successfully:

driver compile2 => 
"({ 
    first:function() {
      if(5+1+1== 0) {
        return 6
      } else { 
        return 7
      }
    }(),
    second: { 
      first:42,
      second: { first:8,second:9 } 
    }.second.first 
})"

This result evaluates successfully in a browser.

Bluish Coder

Compilers and Interpreters in Factor

Tags