How to Use Regex Named Captures in Perl 5

I ran across some Perl 5 regular expression syntax the other day that used two features I’d never seen before:

  • (?{ }), a zero-width, non-capturing assertion that executes arbitrary Perl code.
  • $^N, a variable for getting the contents of the most recent capture in a regular expression.

The cool thing is that, in combination, these two features can be used to hack named captures into Perl regular expressions. Here’s an example:

use warnings;
use strict;
use Data::Dumper;

my $string = 'The quick brown fox jumps over the lazy dog';

my %found;

my @captures = $string =~ /
    (?: (quick|slow) \s+    (?{ $found{speed}  = $^N  }) )
    (?: (brown|blue) \s+    (?{ $found{color}  = $^N  }) )
    (?: (sloth|fox)  \s+    (?{ $found{animal} = $^N  }) )
    (?: (eats|jumps)        (?{ $found{action} = $^N  }) )
/xms;

print Dumper \@captures;
print Dumper \%found;

The output of running this program is:

$VAR1 = [
          'quick',
          'brown',
          'fox',
          'jumps'
        ];
$VAR1 = {
          'color' => 'brown',
          'speed' => 'quick',
          'action' => 'jumps',
          'animal' => 'fox'
        };

So the positional captures are still returned, and we’ve assigned them to keys in a hash. This can be very convenient for complex regular expressions.

This is a cool feature, but there are a few caveats. First, according to the Perl regular expression documentation, (?{ }) is a highly experimental feature that could go away at any time. More importantly, if you’re relying on this feature you should be aware of the side effects: if the overall match fails, but some of its groups matched along the way, the code in their (?{ }) assertions will already have executed. For example, if you changed the word jumps to poops in the above example, the output becomes:

$VAR1 = [];
$VAR1 = {
          'color' => 'brown',
          'speed' => 'quick',
          'animal' => 'fox'
        };

This means that the match failed, but there were still assignments to our hash, because some of the captures succeeded before the overall match failed. The upshot is that you should always check the return value of the match before relying on whatever the code inside the (?{ }) assertions did.
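Here’s a minimal sketch of that check, using the same pattern against the failing string. The $matched variable and the decision to empty the hash on failure are my own choices for illustration, not the only way to handle it:

```perl
use strict;
use warnings;

# Same sentence as before, with "jumps" changed so the match fails.
my $string = 'The quick brown fox poops over the lazy dog';

my %found;

# In scalar context the match returns true or false. The (?{ }) blocks
# may already have written to %found even when the result is false.
my $matched = $string =~ /
    (?: (quick|slow) \s+    (?{ $found{speed}  = $^N  }) )
    (?: (brown|blue) \s+    (?{ $found{color}  = $^N  }) )
    (?: (sloth|fox)  \s+    (?{ $found{animal} = $^N  }) )
    (?: (eats|jumps)        (?{ $found{action} = $^N  }) )
/xms;

if ($matched) {
    print "Matched: the $found{animal} $found{action}\n";
}
else {
    # Throw away whatever the partial match left behind.
    %found = ();
    print "No match\n";
}
```

The point is simply that %found is only read inside the branch where the match is known to have succeeded.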

The problem becomes even more subtle if your regular expressions trigger backtracking. In that case, you might have an optional group match and its value assigned to the hash, and then have the next required group fail. Perl will then backtrack, throwing out the successful optional match, and see if the required group succeeds without it. If it does, you have a successful match, but the assignment made for the discarded group is not undone, so your hash can contain invalid data. Here’s an example:

my @captures = $string =~ /
    (?: (quick|slow) \s+    (?{ $found{speed}  = $^N  }) )
    (?: (brown|blue) \s+    (?{ $found{color}  = $^N  }) )?
    (?: (brown\s+fox)       (?{ $found{animal} = $^N  }) )
/xms;

print Dumper \@captures;
print Dumper \%found;

And the output is:

$VAR1 = [
          'quick',
          undef,
          'brown fox'
        ];
$VAR1 = {
          'color' => 'brown',
          'speed' => 'quick',
          'animal' => 'brown fox'
        };

So while the second group returned undef for the color capture, the %found hash still had the color key in it. This may or may not be what you want.
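If it isn’t what you want, one way to sidestep the stale key is to skip the (?{ }) blocks and name the positional captures after the match has finished. This is a different technique from the hack above, and just a sketch: the @names array is a hypothetical list of labels that has to be kept in the same order as the capture groups.

```perl
use strict;
use warnings;

my $string = 'The quick brown fox jumps over the lazy dog';

# Hypothetical labels, one per capture group, in group order.
my @names = qw(speed color animal);

my %found;
if (my @captures = $string =~ /
        (quick|slow) \s+
        (?: (brown|blue) \s+ )?
        (brown\s+fox)
    /xms)
{
    # Hash slice: pair each name with its positional capture.
    @found{@names} = @captures;

    # Groups that did not take part in the final match come back as
    # undef, so drop those keys.
    delete $found{$_} for grep { !defined $found{$_} } keys %found;
}

print "$_ => $found{$_}\n" for sort keys %found;
```

Because the captures are only read once the match is complete, a group that was tried and then discarded during backtracking never reaches the hash.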

Of course, all this seems cool, but since it’s a truly evil hack, you have to be careful. If you can wait, though, perhaps we’ll see named captures in Perl 5.10.
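For comparison, here’s a sketch of the first example written with the native syntax that Perl 5.10 ended up adding: (?<name>...) groups, with the results available in the %+ hash. It needs perl 5.10 or later, and copying %+ into %found is just my choice so the result matches the earlier examples.

```perl
use strict;
use warnings;
use 5.010;    # named captures require perl 5.10 or later

my $string = 'The quick brown fox jumps over the lazy dog';

my %found;
if ($string =~ /
        (?<speed>  quick|slow ) \s+
        (?<color>  brown|blue ) \s+
        (?<animal> sloth|fox  ) \s+
        (?<action> eats|jumps )
    /xms)
{
    # %+ reflects the last successful match in this scope, so copy it
    # if the values need to outlive later matches.
    %found = %+;
}

print "$_ => $found{$_}\n" for sort keys %found;
```

Since %+ only lists names whose groups actually captured something in the final match, it doesn’t suffer from the stale-key problem described above.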

Comments & Trackbacks

Corion wrote:

Regexp::NamedCaptures

It's been done for Perl 5.8 already, with Regexp::NamedCaptures - Yves took parts of that syntax (which was taken from .Net and/or Ruby) when implementing the named captures for Perl 5.10.

Aristotle Pagaltzis wrote:

It’s easy, if verbose, to overcome the limitation: if you use local to assign values, they will disappear on backtracking. But then they also disappear at the end of the match. So what you do is combine the two: while you try to match, you put things into dynamically scoped storage; once the match has succeeded but before it is finished, you save them for posterity.

my %found;
{
    my %pad;
    $string =~ /
        (?: (quick|slow) \s+    (?{ local $pad{speed}  = $^N  }) )
        (?: (brown|blue) \s+    (?{ local $pad{color}  = $^N  }) )?
        (?: (brown\s+fox)       (?{ local $pad{animal} = $^N  }) )
        (?{ %found = %pad })
    /xms;
}

That should work. Pretty? Not exactly.

Theory wrote:

Corion: Thanks, good to know about.

Aristotle: Yes, I forgot to mention local, though I did know about it. What I didn't know about was the need for the final code block to assign it. That bit is key, though, yes, U-uu-uu-gly. Thanks for the tip!

—Theory
