How to Use Regex Named Captures in Perl 5
I ran across some Perl 5 regular expression syntax the other day that used two features I’d never seen before:
(?{ }), a zero-width, non-capturing assertion that executes arbitrary Perl code.
$^N, a variable that holds the contents of the most recently closed capture group in a regular expression.
The cool thing is that, combined, these two features let you hack named captures into Perl regular expressions. Here’s an example:
use warnings;
use strict;
use Data::Dumper;
my $string = 'The quick brown fox jumps over the lazy dog';
my %found;
my @captures = $string =~ /
(?: (quick|slow) \s+ (?{ $found{speed} = $^N }) )
(?: (brown|blue) \s+ (?{ $found{color} = $^N }) )
(?: (sloth|fox) \s+ (?{ $found{animal} = $^N }) )
(?: (eats|jumps) (?{ $found{action} = $^N }) )
/xms;
print Dumper \@captures;
print Dumper \%found;
The output of running this program is:
$VAR1 = [
'quick',
'brown',
'fox',
'jumps'
];
$VAR1 = {
'color' => 'brown',
'speed' => 'quick',
'action' => 'jumps',
'animal' => 'fox'
};
So the positional captures are still returned, and we’ve assigned them to keys in a hash. This can be very convenient for complex regular expressions.
This is a cool feature, but there are a few caveats. First, according to the Perl regular expression documentation, (?{ }) is a highly experimental feature that could go away at any time. More importantly, if you rely on this feature you should be aware of its side effects: if a regular expression match fails overall, but some of its groups matched along the way, the code in the (?{ }) assertions for those groups may still have executed. For example, if you changed the word jumps to poops in the string above, the output becomes:
$VAR1 = [];
$VAR1 = {
'color' => 'brown',
'speed' => 'quick',
'animal' => 'fox'
};
In other words, the match failed, but there were still assignments to our hash, because some of the groups matched before the overall match failed. The upshot is that you should always check the return value of the match before relying on whatever the code inside the (?{ }) assertions did.
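Here’s a minimal sketch of that check. The %scratch hash and the message text are names I made up for illustration; the idea is simply that the (?{ }) blocks write to a throwaway hash, and the results get copied into %found only when the match as a whole succeeds:

```perl
use strict;
use warnings;

# Same sentence as before, but with "jumps" changed so the match fails.
my $string = 'The quick brown fox poops over the lazy dog';

my %scratch;   # written by the (?{ }) blocks; may hold partial results
my %found;     # only filled in after a successful match

if ( $string =~ /
        (?: (quick|slow)  \s+ (?{ $scratch{speed}  = $^N }) )
        (?: (brown|blue)  \s+ (?{ $scratch{color}  = $^N }) )
        (?: (sloth|fox)   \s+ (?{ $scratch{animal} = $^N }) )
        (?: (eats|jumps)      (?{ $scratch{action} = $^N }) )
    /xms )
{
    # The whole pattern matched, so the scratch values can be trusted.
    %found = %scratch;
}
else {
    print "No match; ignoring partial results\n";
}
```

Since the match fails here, %found stays empty no matter what ended up in %scratch along the way.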
The problem becomes even more subtle if your regular expression triggers backtracking. In that case, an optional group might match and have its value assigned to the hash, and then the next required group might fail. Perl will then backtrack, throw out the successful optional group match, and see whether the next required group matches without it. If it does, you can end up with a successful match and potentially invalid data in your hash. Here’s an example:
my @captures = $string =~ /
(?: (quick|slow) \s+ (?{ $found{speed} = $^N }) )
(?: (brown|blue) \s+ (?{ $found{color} = $^N }) )?
(?: (brown\s+fox) (?{ $found{animal} = $^N }) )
/xms;
print Dumper \@captures;
print Dumper \%found;
And the output is:
$VAR1 = [
'quick',
undef,
'brown fox'
];
$VAR1 = {
'color' => 'brown',
'speed' => 'quick',
'animal' => 'brown fox'
};
So while the second group returned undef for the color capture, the %found hash still had the color key in it. This may or may not be what you want.
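If it isn’t what you want, one possible workaround, modeled on the local trick described in the perlre documentation, is to localize each assignment so that Perl undoes it when it backtracks over the code block. This is only a sketch: the %tmp package hash and the final copying block are my own additions, and the copy is needed because the localized values are also restored once the match finishes.

```perl
use strict;
use warnings;

my $string = 'The quick brown fox jumps over the lazy dog';

our %tmp;    # hypothetical scratch hash; elements are localized in the regex
my %found;   # receives the values that survived backtracking

my $matched = $string =~ /
    (?: (quick|slow)  \s+ (?{ local $tmp{speed}  = $^N }) )
    (?: (brown|blue)  \s+ (?{ local $tmp{color}  = $^N }) )?
    (?: (brown\s+fox)     (?{ local $tmp{animal} = $^N }) )
    (?{ %found = %tmp })
/xms;

if ($matched) {
    # Print whatever keys made it into %found.
    print "$_ => $found{$_}\n" for sort keys %found;
}
```

With this version, the color assignment should be rolled back when the optional group is abandoned, leaving %found with only the speed and animal keys.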
Of course, all this seems cool, but since it’s a truly evil hack, you have to be careful. If you can wait, though, perhaps we’ll see named captures in Perl 5.10.
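For what it’s worth, Perl 5.10 did ship named captures, using the (?<name>...) syntax and the %+ hash. A rough sketch of the backtracking example rewritten that way, assuming Perl 5.10 or later, might look like this:

```perl
use strict;
use warnings;
use 5.010;   # named captures and %+ require Perl 5.10 or later

my $string = 'The quick brown fox jumps over the lazy dog';

my %found;
if ( $string =~ /
        (?<speed>  quick|slow )       \s+
        (?: (?<color> brown|blue )    \s+ )?
        (?<animal> brown\s+fox )
    /xms )
{
    # Copy right away: %+ refers to the last successful match only.
    %found = %+;
}

print "$_ => $found{$_}\n" for sort keys %found;
```

Because %+ only exposes groups that actually took part in the final match, the color key doesn’t appear here, and nothing is assigned at all if the match fails.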