Goodbye, Parse::RecDescent


I previously outlined the grammar for Positron::Expression as well as an improved version. This grammar is written for the Parse::RecDescent parsing module.

Parse::RecDescent is a very powerful, versatile module. It is well documented and quite user-friendly; plenty of it's features are geared towards ease-of-use and what "most people probably mean". It takes a grammar in a kind of Backus-Naur form, with different production rules, and can be used to match a string against this grammar.

However, there are a few troublesome aspects as well, which are prompting me to replace it by a set of home-grown functions based on regular expressions. In particular, I had the following issues:

  1. By default, Parse::RecDescent just returns "matches" or "doesn't match". Yes, there's the <autotree> directive, but I wanted a pure-data model, so I ended up adding actions by hand.

  2. The grammar I had built does not catch superfluous text after an expression. So thing.attr and then more text! would parse the thing.attr without complaint. However, the text after the syntactically valid expression is a good indication that the user was trying for something more. An error would be appropriate here.

  3. For hardening, I needed to catch any syntax errors and report them with an appropriate context (e.g. the entire expression). Parse::RecDescent has quite intelligent error reporting, but insists on printing it to STDERR. Not even with warn, but really printing to STDERR. I could either fudge together a capture of STDERR (I succeeded after a few tries), which would leave me with another source of errors and maintenance, or use another module (read: "dependency") for this.

  4. I tried adding a few <error> and <commit> directives. This did not work as well as I had hoped.

  5. The grammar is very simple, linguistically speaking. I think aside from strings, one could write a simple DFA which simply consumes a single character at a time and never backtraces (well, DFAs don't do that anyway). The grammar is also very stable by now, and it's only one.

Writing a replacement

All of this prompted me to try replacing Parse::RecDescent. I ended up writing a new subroutine for each production rule in the grammar. I wanted to use perl regular expressions because I didn't want to lose Unicode character class handling. I passed the expression around by alias (using $_[0]), and used the Perl regular expression modifiers g and c and the pos function to navigate around the expression.

Here, for example, is the entire alternative handling function:

sub alternative {
    if ($_[0] =~ m{\G \s* (!) \s*}xmsgc) {
        return ['not', alternative($_[0])];
    } else {
        return operand($_[0]);

And here's the function to parse an integer, together with the helper function _critisize which should help pinpoint where things went wrong:

sub integer {
    if ($_[0] =~ m{ \G \s* ([+-]? \d+ ) \s* }xmscg) {
        return $1;
    } else {
        croak q{Syntax error: Invalid integer } . _critisize($_[0]);

sub _critisize {
    return qq{near '} . substr($_[0], pos($_[0]) || 0, 10) ⏎
    . q{' in '} . $_[0] . q{'};

I can peek ahead using regular non-gc matches, using the next character(s) to guide the DFA.


But how do these two methods stack up? Well, I wrote a little test script, which took a pretty complicated expression, and parsed it with both engines. I know there are benchmarking Perl modules, but I just used Time::HiRes and ran each method several thousand times.

The new method held up quite well. Even though I had not put any effort into optimization, the new method still only took about the same amount of time as the Parse::RecDescent solution. The difference was usually about 10% or so.

Oh wait. I'm timing the old solution both times.

Let's see… fix this, run the test again… and…

The new method is faster. By two orders of magnitude. Yes, doing it with explicit functions and good old regular expressions is a hundred times faster. 990% improvement.

Oh, and during testing, I uncovered an expression which slowed my Parse::RecDescent solution to a crawl (lots of nested parentheses). And one expression which, for some reason I will never be able to find out, would not parse correctly.


I've got a new method for parsing expressions. It's faster, more robust, and more maintainable for my specific purpose. This is not to knock Parse::RecDescent in any way. It's way more flexible than my solution (which is explicitly hard-coded, which is fine), can handle multiple (even dynamic!) grammars, several starting points, and is generally way more powerful, and expressive, than anything I can cook up.

And given all that power and flexibility, it has every right to take more time than a simple, hard-coded solution.

I just don't need all that power.