2007-10-03
Javascript Parser Combinators
Cleaning up my hard drive I came across some old libraries I'd written. One of them was a simple set of parser combinators written in Javascript. I put it in a git repository in case they prove useful to someone:
git clone git://github.com/doublec/jsparse.git
The library pulls ideas from parser combinators written in various languages and is pretty simple. But even though it is a small amount of code it works quite well. The repository includes an example of parsing numbers and strings based on the grammar in the EcmaScript specification.
A parser in this library is a function that takes an input string and returns a result object. The result object contains three fields. They are:
- remaining: the remaining part of the input string to be parsed
- matched: the part of the input string that was successfully parsed by the parser
- ast: The abstract syntax tree produced by the parsr
Parser combinators combine parsers together to enable parsing more complex grammars. A number of standard parsers and combinators are provided.
'token' is a combinator that takes a string and returns a parser that will successfully parse an instance of that string:
js> load("jsparse.js")
js> var p = token("begin")
js> uneval(p("begin ... end"))
({remaining:" ... end", matched:"begin", ast:"begin"})
The AST produced by parser is the string that it parsed.
'range' returns a parser that parses single characters within the range of the upper and lower character bounds (inclusive) given:
js> var p = range("0", "9")
js> uneval(p("5"))
({remaining:"", matched:"5", ast:"5"})
js> uneval(p("a"))
false
'negate' takes an existing single character parser, and returns once which parses anything but that which the original parser parsed. For example, 'negate(range("a", "z"))' will return a parser which parses anything except the letters from a to z inclusive:
js> var p = negate(range("a", "z"))
js> uneval(p("g"))
false
js> uneval(p("5"))
({remaining:"", matched:"5", ast:"5"})
'sequence' takes any number of parsers as arguments and returns a parser which suceeds if all the given parsers succeed in order. The AST it returns is an array of the results of each of the parsers.
js> var p = sequence(token("a"), token("b"), token("c"))
js> uneval(p("abcdef"))
({remaining:"def", matched:"abc", ast:["a", "b", "c"]})
js> uneval(p("abdef"))
false
'alternate' provides choice between parsers. It takes any number of parsers as arguments and will try each of them in order. The first one that succeeds results in a successful parse, and its result is the AST:
js> var p = alternate(token("a"), token("b"), token("c"))
js> uneval(p("a123"))
({remaining:"123", matched:"a", ast:"a"})
js> uneval(p("b123"))
({remaining:"123", matched:"b", ast:"b"})
js> uneval(p("c123"))
({remaining:"123", matched:"c", ast:"c"})
js> uneval(p("d123"))
false
'repeat0' does the equivalent of *
in regular expressions. It takes a parser and returns a parser which will parse zero or more occurrences of the original parser. The AST is an array containing the AST result of the original parser for each successful occurrence:
js> var p = repeat0(range("0", "9"))
js> uneval(p("12345abcd"))
({remaining:"abcd", matched:"12345", ast:["1", "2", "3", "4", "5"]})
js> uneval(p("123abcd"))
({remaining:"abcd", matched:"123", ast:["1", "2", "3"]})
js> uneval(p("abcd"))
({remaining:"abcd", matched:"", ast:[]})
'repeat1' does the equivalent of '+' in regular expressions. It takes a parser and results one which will parse one or more occurences of the original parser:
js> var p = repeat1(range("0", "9"))
js> uneval(p("12345abcd"))
({remaining:"abcd", matched:"12345", ast:["1", "2", "3", "4", "5"]})
js> uneval(p("abcd"))
false
'optional' takes a parser and returns one which matches exactly zero or one instances of the original. The AST result is 'false' for the case where there is no match or the result of the original parser if there is a match:
js> var p = sequence(optional(alternate("+", "-")), repeat1(range("0", "9")))
js> uneval(p("1234"))
({remaining:"", matched:"1234", ast:[false, ["1", "2", "3", "4"]]})
js> uneval(p("-1234"))
({remaining:"", matched:"-1234", ast:["-", ["1", "2", "3", "4"]]})
js> uneval(p("+1234"))
({remaining:"", matched:"+1234", ast:["+", ["1", "2", "3", "4"]]})
js> uneval(p("*1234"))
false
You'll notice in this example that I pass "+" and "-" directly instead of token("+") and token("-"). The parsers in this library will automatically convert strings to parsers where needed to make for terser and more readable code.
The AST produced by some of the generated parsers can be non-optimal. For example, a simple parser will produce an array of strings for each digit:
js> var p = repeat1(range("0", "9"))
js> p("123").ast
1,2,3
js> uneval(p("123").ast)
["1", "2", "3"]
The 'action' combinator takes a parser and a function. The function is called with the result of the AST produced by the parser, and the result of the function becomes the new AST. For example, compare these two:
js> var p = action(range("0", "9"), function(ast) { return parseInt(ast) })
js> uneval(p("1"))
({remaining:"", matched:"1", ast:1})
js> var p = range("0", "9")
js> uneval(p("1"))
({remaining:"", matched:"1", ast:"1"})
In the first the result is an actual number, in the second it is a string. Any object can be returned and used for the AST. A abstract syntax tree for the parsed language for example.
There are other combinators provided in the library but these basics do most of what is needed.
The 'example1.js' file shows a translation of some of the grammar productions in the EcmaScript grammar:
var zero
= action("0", function(ast) { return 0; });
var decimal_digit
= action(range("0", "9"), function(ast) { return parseInt(ast); });
var non_zero_digit
= action(range("1", "9"), function(ast) { return parseInt(ast); });
var decimal_digits
= repeat1(decimal_digit);
var decimal_integer_literal
= alternate(zero, sequence(non_zero_digit, optional(decimal_digits)));
var signed_integer
= alternate(decimal_digits,
sequence("+", decimal_digits),
sequence("-", decimal_digits));
var exponent_indicator
= alternate("e", "E");
var exponent_part
= sequence(exponent_indicator, signed_integer);
var decimal_literal =
alternate(sequence(decimal_integer_literal,
".",
optional(decimal_digits),
optional(exponent_part)),
sequence(".",
decimal_digits,
optional(exponent_part)),
sequence(decimal_integer_literal,
optional(exponent_part)));
var hex_digit
= alternate(range("0", "9"),
range("a", "f"),
range("A", "F"));
var hex_integer_literal
= sequence(alternate("0x", "0X"),
repeat1(hex_digit));
var numeric_literal
= alternate(hex_integer_literal, decimal_literal);
var single_escape_character
= alternate("'", "\"", "\\", "b", "f", "n", "r", "t", "v");
var non_escape_character
= negate(single_escape_character);
var character_escape_sequence
= alternate(single_escape_character, non_escape_character);
var hex_escape_sequence
= sequence("x", hex_digit, hex_digit);
var unicode_escape_sequence
= sequence("u", hex_digit, hex_digit, hex_digit, hex_digit);
var escape_sequence
= alternate(hex_escape_sequence,
unicode_escape_sequence,
character_escape_sequence);
var single_string_character
= alternate(negate(alternate("\'", "\\", "\r", "\n")),
sequence("\\", escape_sequence));
var double_string_character
= alternate(negate(alternate("\"", "\\", "\r", "\n")),
sequence("\\", escape_sequence));
var single_string_characters
= repeat1(single_string_character);
var double_string_characters
= repeat1(double_string_character);
var string_literal
= alternate(sequence("\"", optional(double_string_characters), "\""),
sequence("'", optional(single_string_characters), "'"));
var null_literal
= token("null");
var boolean_literal
= alternate("true", "false");
var literal
= alternate(null_literal,
boolean_literal,
numeric_literal,
string_literal);
I'd like to extend the library a bit and provide more examples. Any comments or ideas would be appreciated.