More grammar for expressions

2013-04-23

Well, the grammar from last time had some flaws. For the following discussion, here's the current version:

# We start with our "boolean / ternary" expressions
expression: <leftop: alternative /([:?])/ alternative> 
     { @{$item[1]} == 1 ? $item[1]->[0] : ['expression', @{$item[1]}]; }
alternative: '!' alternative { ['not', $item[2]]; } | operand

# strings and numbers cannot start a dotted expression
# in fact, numbers can have decimal points.
operand: string | number 
       | lterm ('.' rterm)(s) { ['dot', $item[1], @{$item[2]}] } | lterm

# The first part of a dotted expression is looked up in the environment.
# The following parts are parts of whatever came before, and consequently
# looked up there.
lterm: '(' expression ')' { $item[2] } | funccall | identifier 
     | '$' lterm { ['env', $item[2]] }
rterm: '(' expression ')' { $item[2] } | methcall | key | string | integer
     | '$' lterm { ['env', $item[2]] }

# Strings currently cannot contain their delimiters, sorry.
string: '"' /[^"]*/ '"' { $item[2] } 
      | /\'/ /[^\']*/ /\'/ { $item[2] } 
      | '`' /[^`]*/ '`' { $item[2] }

identifier: /[a-zA-Z_]\w*/ {['env', $item[1]]}
key: /[a-zA-Z_]\w*/ { $item[1] }
number: /[+-]?\d+(?:\.\d+)?/ { $item[1] }
integer: /[+-]?\d+/ { $item[1] }

# We need "function calls" and "method calls", since with the latter, the
# function is *not* looked up in the environment.
funccall: identifier '(' expression(s? /\s*,\s*/) ')'
          { ['funccall', $item[1], @{$item[3]}] }
methcall: key '(' expression(s? /\s*,\s*/) ')'
          { ['methcall', $item[1], @{$item[3]}] }

The first change is the expression rule. I had originally tried matching only the next ? or :, building up a binary tree. It turns out that the tree was the wrong direction, after all. I'm now using the <leftop> directive of Parse::RecDescent to form a complete list, and go through them linearly. This also forced me to really think about what an expression like a ? b : c ? d really means. (In this case: if the "ternary operator" a ? b : c is "true", then the answer is d. Otherwise the whole thing is "false".)

The next changes are in the lterm and rterm rules: in a.b, the a key is looked up in the environment, but b is not. For our purposes, it's literal. Similar with a.b(), which is why the funccall becomes methcall when calling a method.

Then there's the strings: the recursive definition last time allowed me to embed \" in a quote-delimited string, but at a huge cost. In a comparison with the current "can't embed your own delimiter" approach, the recursive version was a hundred times slower. A hundred! And in a twist of irony, I found out that I could not get Parse::RecDescent to accept a single quote in a literal string for love and money. Which is why that version uses regexes for the delimiters.

Finally, I distinguish between "identifiers" and "keys": the former are looked up in the environment, the latter are not. I also differentiate "numbers" (0.02) and "integers" (2). The former can't appear in dotted lists (a.b.0.02?), but integers can, being array index selectors.

Now I've got that working, I'm thinking about whether to provide arithmetic and comparison operators after all…