I want to share some Perl 6 grammar knowledge and a bit of my enthusiasm for what this language is going to let people do. This is based on work I’ve been doing for the Form.pm module, which will be the replacement for Perl 5’s rather clunky format function.
Imagine that Christmas has come: you’re writing a programme in Perl 6 (you can actually do this today, it’s just that there are large chunks of the language which nobody has implemented yet). In your wonderful Perl 6 programme, your world domination plan includes the need to parse some text, so you reach for a grammar, the overpowered, mutant grandson of Perl 5’s regular expressions.
Grammars are really a special kind of class, containing a special kind of subroutine (being regexes, rules or tokens rather than subs, methods or submethods). Say your grammar looks like this:
grammar Statements {
    regex TOP {
        ( <statement> ';' )*
        {*}
    }
    regex statement {
        'hello' <ws> <string> {*}      #= hello
        | 'goodbye' <ws> <string> {*}  #= goodbye
    }
    regex string {
        '"' (.*?) '"'
        {*}
    }
}
That is, you’re going to match against zero or more statements, which are separated by semicolons and consist of either ‘hello’ followed by some whitespace and a double-quoted string, or ‘goodbye’ followed by some whitespace and a double-quoted string. I know the string regex is bad in that it’s awful at delimiter matching, but it’ll do for now in the interests of keeping down the complexity. Also, in real life you’d probably have rather more flexible rules for separating statements, but I’m showing off AST-building here, not fancy regexes.
So what does it all mean? The syntax isn’t too difficult to get your head around: an identifier inside angle brackets invokes another regex as a submatch, | delimits alternatives and * is the zero-or-more quantifier. {*} is where the clever bit comes in. When the matcher sees {*} in a rule, it calls a method with the same name as the current rule on an object which was passed to the matcher when it was invoked. These are called action methods, and they offer a fantastic way to use the grammar system for building abstract syntax trees or other constructions which represent your parsed text. To make this even more useful, if the line with the {*} on it ends with a #= comment like those seen in the statement rule above, the text of the comment will be passed to the action method, so you can signal to it which alternative branch you want it to deal with.
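To see the name-based dispatch in isolation, here is a minimal sketch. It assumes a present-day Raku compiler (Raku being the later name of Perl 6), where, as far as I know, action methods fire automatically when a rule succeeds, so no {*} marker is written, and the named argument is spelled :actions rather than :action. The Greeting grammar and the Tracer class are hypothetical names invented for this illustration.

```raku
# Hypothetical grammar: a verb, some whitespace, then a word.
grammar Greeting {
    token TOP  { <verb> \s+ <who> }
    token verb { 'hello' | 'goodbye' }
    token who  { \w+ }
}

# Hypothetical actions class: each method has the same name as a rule
# and simply records that it was called, and with what text.
class Tracer {
    has @.calls;
    method TOP($/)  { @!calls.push: 'TOP' }
    method verb($/) { @!calls.push: 'verb:' ~ $/ }
    method who($/)  { @!calls.push: 'who:' ~ $/ }
}

my $tracer = Tracer.new;
Greeting.parse('hello world', :actions($tracer));
say $tracer.calls;
```

The inner rules report before TOP does, because a rule’s action method runs once that rule has finished matching.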
Now in real life your action methods might produce objects of one of several possible classes, maybe from a huge tree of classes which can represent your AST, or from a set of classes which share a common set of roles. Or maybe you’re going down the route of some of the calculator examples, which do the calculations right there in the action methods and represent nodes by their results, which are just numbers. In this example, to avoid having to define a class for statements (which are really quite simple), I’m just going to produce a hash with two keys: verb and string.
So we write a class to hold the action methods. We can call it whatever we like, but the method names and the rule names they’re triggered by have to match.
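class Actions {
    method TOP($/) {
        # One hash per statement; the backslash stops each hash flattening.
        my @statements = gather for @( $/[0] ) -> $submatch {
            take \$submatch<statement>.ast;
        }
        make @statements;
    }
    method statement($/, $key) {
        my %s;
        %s<verb> = $key;                # 'hello' or 'goodbye', from the #= comment
        %s<string> = $/<string>.ast;    # the result made by the string action method
        make %s;
    }
    method string($/) {
        make $/[0].ast;                 # the text between the double quotes
    }
}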
Now we see the real magic at work. First, each action method has a parameter $/, which is the default name for the match object. Calling it $/ lets you call make without any special considerations, because make operates on whatever match object is currently in $/. A match object, by the way, is what a regex returns when it matches something, and it contains all the information about where and what it matched, including other match objects for subpattern matches.
Next, we see that the statement method has a second parameter, $key – this is where the comment-syntax key strings arrive. It’s entirely optional, so I only put it on the method that would actually get some use from it.
The make function is really key to this business – it alters the match object to provide a different result. The match object itself stays intact, but it has a slot (accessed through the .ast method) which can carry any object you like which represents the matched text. It’s called .ast because of its frequent use in building abstract syntax trees, and here it’s going to carry our hashes.
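A minimal sketch of make and the result slot, assuming a present-day Raku compiler: there the slot is documented as .made, with .ast kept as an alias, and the named argument is :actions. The Digits grammar and DigitsActions class are hypothetical names used only for this illustration.

```raku
# Hypothetical one-rule grammar: a run of digits.
grammar Digits {
    token TOP { \d+ }
}

# The action method attaches a number to the match instead of text.
class DigitsActions {
    method TOP($/) { make +$/ }
}

my $m = Digits.parse('42', :actions(DigitsActions.new));

say ~$m;       # the matched text is still there, untouched
say $m.made;   # the object attached by make
```

The match object keeps its own text and positions; make only fills the extra slot, which is why the same match can be inspected both ways.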
Let’s look at each method in turn. The action method for string is straightforward. If you look back at the regex for string, you’ll see that it has a captured subpattern between the double-quote delimiters. Subpatterns like this appear in the Match object as array elements, so you can use the array indexing operator to access them. We know here that there will only be one, and as it’s an array it’ll be in slot 0, so we can just say $/[0]. We call .ast on it, because the default result is a string representing what was parsed by that pattern. In this case that’s exactly what we want, so we set it to our result and we’re done.
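The positional capture can be tried on its own, outside any grammar. This is a small sketch assuming a present-day Raku compiler; the variable names are invented for the example.

```raku
my $input = '"world"';

# Same pattern as the string regex: a quote, a captured body, a quote.
my $match = $input ~~ / '"' (.*?) '"' /;

say $match.Str;      # the whole match, quotes included
say $match[0].Str;   # only the first parenthesised capture
```

Here $match[0] is itself a match object, which is why it can be stringified or asked for its own result.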
Next, the statement method is a little more complicated. It creates a new hash, then sets the verb to the passed-in $key. Because string was invoked as a named subpattern, it can be found as a hash element of the match object, so we retrieve $/<string>.ast, which will be the string that was produced by the string action method, and store that in the hash before using make to set it in our ast slot.
Finally, TOP has to deal with a list of subpattern captures. Because there’s a repetition quantifier on the capture, $/[0] will be an array. We can iterate over that with a for loop to get at the match object for each repetition, and then use hash access to retrieve the match object for the statement subrule. Its ast slot is of course the hash made by the statement action method.
In this case, we use gather for to construct the list lazily (or at least, it will be lazy once Rakudo has lazy lists). gather takes a block or statement, here a for loop, and within it you can call take as many times as you like; each time, a new element is added to the end of the list being constructed by the entire gather expression. Ultimately that list is assigned to @statements, and we make that as the final result from the match.
We take the ast slot from the submatch, but you may notice that there’s a backslash before its $. Why is this? Perl 5 programmers are probably thinking ‘ah, he’s taking a reference so that he can make an array of hashes’, and they’d be sort of right. Without that, the hash wants to flatten into the array, which is not what we want, so we use \ which in Perl 6 creates a capture (an astonishing Perl 6 construct which is used for everything from stopping things flattening into lists to passing subroutine parameters around). Here it just prevents the hash from flattening into the array. There may be a more elegant way to do this by ensuring that the hash isn’t taken in a context that’s going to cause flattening, but I haven’t figured that part out yet. In any case, the end result is an array of hashes, which is exactly what we want. The captures are entirely transparent to everything that tries to use them.
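The gather and take pattern can be sketched without a grammar. This assumes a present-day Raku compiler, and it swaps the capture backslash for an explicit .item call, which is the way I know to keep a hash as a single element there; whether the itemization is strictly needed may depend on the compiler version. The @parsed list is a hypothetical stand-in for the statement submatches.

```raku
# Hypothetical stand-in for the list of statement submatches.
my @parsed = ('hello', 'world'), ('goodbye', 'moon');

my @statements = gather for @parsed -> ($verb, $string) {
    my %s = verb => $verb, string => $string;
    take %s.item;    # take the hash as one element, not as its pairs
}

say @statements.elems;        # one element per statement
say @statements[1]<string>;   # each element is still usable as a hash
```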
How do you run this? Very easily:
my $actions = Actions.new;
my $result = Statements.parsefile('somefilename', :action($actions));
for $result.ast -> $statement {
    given $statement<verb> {
        when 'hello'   { say "Hello, $statement<string>"; }
        when 'goodbye' { say "Goodbye, $statement<string>"; }
    }
}
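For readers on a present-day Raku compiler, here is a hedged end-to-end sketch of the same idea. It is not the code above: it assumes that action methods are called automatically (no {*} or #= keys), so the verb is captured by name instead; it uses tokens with a negated character class for the string, allows whitespace after each semicolon, parses an in-memory string with .parse rather than a file, and spells the named argument :actions. All of those are my substitutions for the illustration.

```raku
grammar Statements {
    token TOP       { [ <statement> ';' \s* ]* }
    token statement { $<verb>=[ 'hello' | 'goodbye' ] \s+ <string> }
    token string    { '"' ( <-["]>* ) '"' }
}

class Actions {
    method TOP($/) {
        # Collect the hash made for each statement into one list.
        make $<statement>.map(*.made).list;
    }
    method statement($/) {
        make %( verb => ~$<verb>, string => $<string>.made );
    }
    method string($/) {
        make ~$0;    # the text between the double quotes
    }
}

my $source = 'hello "world"; goodbye "moon";';
my $result = Statements.parse($source, :actions(Actions.new));

my @greetings;
for $result.made.list -> $statement {
    given $statement<verb> {
        when 'hello'   { @greetings.push: "Hello, $statement<string>" }
        when 'goodbye' { @greetings.push: "Goodbye, $statement<string>" }
    }
}
.say for @greetings;
```

Capturing the verb by name plays the role that the #= key plays in the original: the statement action method still learns which alternative matched.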
And that, as they say, is that, except that you can do all this in Rakudo Perl 6 today.