Grammar for Positron::Expression

2013-04-23

The Positron templates need an expression language. Most templating engines allow you to access more than just simple values. For example Smarty has a syntax like

{$map.key}

to access the key member of the map variable. Template Toolkit has the same thing with a different syntax:

[% map.key %]

Now I could go the HTML::Template route and only allow simple variables, but I've learned to like the power that comes with full expressions. So I need some of my own.

The first step is the parsing. Since regular expressions are not enough (I want to have nested expressions, say as function arguments), I turned to Parse::RecDescent. This is quite a powerful module, downright scary even, but it certainly gets the job done. This is one of my first attempts at a grammar for expressions:

expression: alternative ( [?:] alternative )(s?)
alternative: operand | '!' alternative
operand: string | number | lterm ( '\.' rterm )(s?)
lterm: identifier | '$' lterm | '(' expression ')' | funccall
lterm: identifier | '$' lterm | index | '(' expression ')' | funccall
funccall: lterm '(' argument(s?) ')'
argument: expression(s? /\s*,\s*/)
string: '"' /[^"]/ '"'
number: /\d+/
identifier: /[a-zA-Z_][a-zA-Z_0-9]+/
index: /\d+/

…which in fact does not work. No, don't even try it, I don't even think this thing compiles.

One thing to keep in mind, for example (which I forgot here) is that by default, Parse::RecDescent will try to match your text against the grammar, and if it matches, say "Yep, that matches all right", and leave you standing there. You have to take extra steps to find out how your text actually matched, and what it means.

The documentation mentions the <autottree> directive, explaning that

A commonly needed autoaction is one that builds a parse-tree. It is moderately tricky to set up such an action (which must treat terminals differently from non-terminals), so Parse::RecDescent simplifies the process by providing the <autotree> directive.

The tree-of-objects that autotree builds are a bit overkill, in my opinion, and on the other hand I don't feel it's "moderately tricky". So the first step is to add actions. For now, I'm building a simple tagged data structure:

expression: alternative '?' expression {  ['and', $item[1], $item[3]];} 
          | alternative ':' expression { ['or', $item[1], $item[3]]; }
          | alternative
alternative: '!' alternative { ['not', $item[1]]; }
           | operand
operand: string | number
       | lterm '.' rterm { ['dot', $item[1], $item[3]] } | lterm
…

Changing the expression rule gives me left association for free, leaving me with a nice binary tree for my "ternary operator" thingy.

Next is strings: the first version does not allow quotes inside a string. However, I'm fairly certain that (1) most users of these templates will prefer to embed escaped quotes, and (2) the parser should be up to the task. After some fiddling, I ended up with

string: '"' string_body '"' { $item[2] }
string_body: '\\' /./ string_body(s?)  { $item[2] . join('', @{$item[3]}) }
           | /[^"\\]/ string_body(s?) { $item[1] . join('', @{$item[2]}) } 

These rules actually go through the string character by character, and try to match a backslash followed by anything (for example `\\`, `\"` or `\a`, which is just an `a` again), or a character that is neither a quote or a backslash. (I've also eventually tripled these rules, since I want strings to be terminable by double quotes, single quotes, or backticks.)

After this worked, I tried adding function calls. I want to take basically any sub-expression that could evaluate to an anonymous function, and append an argument list to it to evaluate it. Of course, this function could evaluate to an anonymous function as well, so I'd like to be able to write

f(1)(2)(3)
(f(1).methods.0)(2,3)

and similar constructs. That originally lead to the rules

lterm: '(' expression ')' { $item[2] } | identifier 
     | '$' lterm { ['env', $item[2]] }
     | funccall
funccall: lterm '(' argument(s?) ')'

So an "lterm" ("left term", which cannot be a number, unlike an rterm, nevermind that now) can be an identifier like var, a complete new expression in parentheses like (var.field), an environment indirection like $var ("the variable whose name is stored in var") or a function call like f(0,1,2), where f can be anything that could evaluate to a function. Hey, we have a rule for that, that's an lterm again!

Unfortunately, this makes these rules left recursive, meaning the parser will never stop. It will try to match f(0) as an lterm, in this case a funccall, which starts with an lterm, which in this case looks like a funccall, which starts with… you get the idea.

Parse::RecDescent is smart enought to detect this, and refuses to use this grammar. It even gives a nice hint how to correct this situation, even if it's not applicable in my case. I'd have to break up the left recursion myself.

Now it just so happens that I actually have a copy of Compilers: Principles, Techniques, and Tools on my bookshelf, which shows how to deal with this problem. On the other hand, it just so happens that I'm writing this code on the train on my way to work. And of course it just so happens that I stumble on this on the going to work leg of the journey, not the coming back one, where the book is waiting. So either I waste the entire day, so to speak, or I figure out something on my own.

In the end I settled for a ruleset like this:

lterm: '(' expression ')' { $item[2] } | funccall 
     | identifier | '$' lterm { ['env', $item[2]] }
funccall: identifier '(' expression(s? /\s*,\s*/) ')'
        { ['funccall', $item[1], @{$item[3]}] }

which limits my functions (or methods) to simple names, with one argument list, as so:

f(0,1)                              (ok)
a.f(0)                              (also ok)
f(1)(2)                             (not ok)
($$indirect.methods.0)("one")(0,0)  (definitely not ok)

This is technically less powerful, but much saner. This should be fine for at least 99% of all use cases anyway.

As soon as the grammar is finished (I'm not sure about the expression in parentheses, for example), I should have more to show.