
SVN::Notify 2.70: Output Filtering and Character Encoding

I’m very pleased to announce the release of SVN::Notify 2.70. You can see an example of its colordiff output here. This is a major release that I’ve spent the last several weeks polishing and tweaking to get just right. There are quite a few changes, but the two most important are improved character encoding support and output filtering.

Improved Character Encoding Support

I’ve had a number of bug reports regarding issues with character encodings. Particularly for folks working in Europe and Asia, but really for anyone using multibyte characters in their source code and log messages (and we all do nowadays, don’t we?), it has been difficult to find the proper incantation to get SVN::Notify to convert data from and to the proper encodings. Using a patch from Toshikazu Kinkoh as a starting point, and with a lot of reading and experimentation, as well as regular and patient tests on Toshikazu’s and Martin Lindhe’s production systems, I think I’ve finally got it nailed down.

Now you can use the --encoding (formerly --charset), --svn-encoding, and --diff-encoding options—as well as --language—to get SVN::Notify to do the right thing. As long as your Subversion server’s OS supports an appropriate locale, you should be golden (mine is old, with no UTF-8 locales :\). And if all else fails, you can still set the $LANG environment variable before executing svnnotify.

There is actually a fair bit to know about encodings to get it to work properly, but if you use UTF-8 throughout and your OS supports UTF-8 locales, you shouldn’t have to do anything. You might have to set --language in order to get it to use the proper locale. See the new documentation of the encoding support for all the details. And if you still have problems, please do let me know.

Output Filtering

Much sexier is the addition of output filtering in SVN::Notify 2.70. I got pretty tired of getting feature requests for what are essentially formatting modifications, such as this one requesting KDE-style keyword support. I myself was using Trac wiki syntax in commit messages on a recent project and wanted to see them converted to HTML for messages output by SVN::Notify::HTML::ColorDiff.

So I finally sat down and gave some thought to how to implement a simple plugin architecture for SVN::Notify. When I realized that it was generally just formatting that people wanted, it became simpler: I just needed a way to allow folks to write simple output filters. The solution I came up with was to just use Perl. Output filters are simply subroutines named for the kind of output they filter. They live in Perl packages. That’s it.

For example, say that your developers write their commit log messages in Textile, and rather than receive them stuck inside <pre> tags, you’d like them converted to HTML. It’s simple. Just put this code in a Perl module file:

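package SVN::Notify::Filter::Textile;
use strict;
use warnings;
use Text::Textile ();

# Filter for the commit log message; receives the notifier object and
# an array reference of log message lines.
sub log_message {
    my ($notifier, $lines) = @_;
    # Only convert when the message is being sent as HTML.
    return $lines unless $notifier->content_type eq 'text/html';
    return [ Text::Textile->new->process( join $/, @$lines ) ];
}

1;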

Put the file, SVN/Notify/Filter/Textile.pm, somewhere in a Perl library directory. Then use the new --filter option to svnnotify to put it to work:

svnnotify -p "$1" -r "$2" --handler HTML::ColorDiff --filter Textile

Yep, that’s it! SVN::Notify will find the filter module, load it, register its filtering subroutine, and then call it at the appropriate time. Of course, there are a lot of things you can filter; consult the complete documentation for all of the details. But hopefully this gives you a flavor for how easy it is to write new filters for SVN::Notify. I’m hoping that all those folks who want features can now stop bugging me and start writing their own filters to do the job, and uploading them to CPAN for all to share!
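To show that the same contract works for other formatting tweaks, here is a minimal sketch of a second log_message filter, written so it can be tried without SVN::Notify installed. The TidyLog filter name and the My::FakeNotifier stand-in class are hypothetical, and the check against 'text/plain' assumes that plain-text notifications report that content type; only the ($notifier, $lines) signature and the array-reference return value come from the Textile example above.

```perl
package SVN::Notify::Filter::TidyLog;   # hypothetical filter name
use strict;
use warnings;

# Same contract as the Textile example: take the notifier and an array
# reference of log message lines, return an array reference of lines.
sub log_message {
    my ($notifier, $lines) = @_;
    # Assumption: plain-text messages report 'text/plain'. Leave others alone.
    return $lines unless $notifier->content_type eq 'text/plain';

    my @tidy;
    for my $line (@$lines) {
        (my $copy = $line) =~ s/\s+\z//;    # strip trailing whitespace
        # Collapse runs of blank lines into a single blank line.
        next if $copy eq '' && @tidy && $tidy[-1] eq '';
        push @tidy, $copy;
    }
    return \@tidy;
}

# Hypothetical stand-in for the notifier, so the filter runs on its own.
package My::FakeNotifier;
sub new          { my ($class, %args) = @_; return bless {%args}, $class }
sub content_type { return $_[0]{content_type} }

package main;

my $fake  = My::FakeNotifier->new( content_type => 'text/plain' );
my $lines = SVN::Notify::Filter::TidyLog::log_message(
    $fake, [ 'Fix the frobnicator.   ', '', '', 'Closes ticket 42.' ],
);
print "$_\n" for @$lines;
```

In a real filter module you would drop the stand-in class and the code under package main, and end the file with a true value (1;), as in the Textile module.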

To get things started, I scratched my own itch, writing a Trac filter myself. The filter is almost as simple as the Textile example above, but I also spent quite a bit of time tweaking the CSS so that most of the Trac-generated HTML looks good. You can see an example right here. Thanks to a number of bug fixes in Text::Trac, as well as Trac-specific CSS added via a filter on CSS output, it works beautifully. If I’m feeling motivated in the next week or so, I’ll create a separate CPAN distribution with just a Markdown filter and upload it. That will create a nice distribution example for folks to copy to create their own. Or maybe someone on the Lazy Web will do it for me! Maybe you?

I wish I’d thought to do this from the beginning; it would have saved me from having to add so many features/cruft to SVN::Notify over the years. Here’s a quick list of the features that likely could have been implemented via filters instead of added to the core:

  • --user-domain: Combine the SVN username with a domain for the From header.
  • --add-header: Add a header to the message.
  • --reply-to: Add a specific header to the message.
  • SVN::Notify::HTML::ColorDiff: Frankly, looking back on it, I don’t know why I didn’t just put this support right into SVN::Notify::HTML. But even if I hadn’t, it could have been implemented via filters.
  • --subject-prefix: Modify the message subject.
  • --subject-cx: Add the commit context to the subject.
  • --strip-cx-regex: More subject context modification.
  • --no-first-line: Another subject filter.
  • --max-sub-length: Yet another!
  • --max-diff-length: A filter could truncate the diff, although this might be tricky with the HTML formatting.
  • --author-url: Modify the metadata section to add a link to the author URL.
  • --revision-url: Ditto for the revision URL.
  • --ticket-map: Filter the log message for various ticketing system strings to convert to URLs. This also encompasses the old --rt-url, --bugzilla-url, --gnats-url, and --jira-url options.
  • --header: Filter the beginning of the message.
  • --footer: Filter the end of the message.
  • --linkize: Filter the log message to convert URLs to links for HTML messages.
  • --css-url: Filter the CSS to modify it, or filter the start of the HTML to add a link to an external CSS URL.
  • --wrap-log: Reformat the log message for HTML.

Yes, really! That’s about half the functionality right there. I’m glad that I won’t have to add any more like that; filters are a much better way to go.

So download it, install it, write some filters, get your multibyte characters output properly, and enjoy! And as usual, send me your bug reports, but implement your own improvements using filters!

How to Use Regex Named Captures in Perl 5

The other day I ran across some Perl 5 regular expression syntax that I’d never seen before. It used two features that were new to me:

  • (?{ }), a zero-width, non-capturing assertion that executes arbitrary Perl code.
  • $^N, a variable for getting the contents of the most recent capture in a regular expression.

The cool thing is that, used in combination, these two features can be used to hack named captures into Perl regular expressions. Here’s an example:

use warnings;
use strict;
use Data::Dumper;

my $string = 'The quick brown fox jumps over the lazy dog';

my %found;

my @captures = $string =~ /
    (?: (quick|slow) \s+    (?{ $found{speed}  = $^N  }) )
    (?: (brown|blue) \s+    (?{ $found{color}  = $^N  }) )
    (?: (sloth|fox)  \s+    (?{ $found{animal} = $^N  }) )
    (?: (eats|jumps)        (?{ $found{action} = $^N  }) )
/xms;

print Dumper \@captures;
print Dumper \%found;

The output of running this program is:

$VAR1 = [
          'quick',
          'brown',
          'fox',
          'jumps'
        ];
$VAR1 = {
          'color' => 'brown',
          'speed' => 'quick',
          'action' => 'jumps',
          'animal' => 'fox'
        };

So the positional captures are still returned, and we’ve assigned them to keys in a hash. This can be very convenient for complex regular expressions.

This is a cool feature, but there are a few caveats. First, according to the Perl regular expression documentation, (?{ }) is a highly experimental feature that could go away at any time. More importantly, if you rely on this feature you should be aware of its side effects: if a regular expression match fails overall, but some of its groups match along the way, the code in the (?{ }) assertions for those groups still executes. For example, if you changed the word jumps to poops in the above example, the output becomes:

$VAR1 = [];
$VAR1 = {
          'color' => 'brown',
          'speed' => 'quick',
          'animal' => 'fox'
        };

Which means that the match failed, but there were still assignments to our hash, because some of the captures succeeded before the overall match failed. The upshot is that you should always check the return value from the match before relying on whatever the code inside the (?{ }) assertions did.
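One way to follow that advice, sketched here rather than offered as the definitive idiom, is to let the (?{ }) blocks write to a scratch hash and copy it somewhere trusted only when the match as a whole succeeds. The %tmp and @results names are my own for illustration; the pattern and both test sentences are the ones used above.

```perl
use strict;
use warnings;

my %tmp;        # scratch hash that the (?{ }) blocks write to
my @results;    # one entry per sentence: a hash ref on success, undef on failure

for my $string (
    'The quick brown fox jumps over the lazy dog',
    'The quick brown fox poops over the lazy dog',
) {
    %tmp = ();    # start clean for every attempt
    my $matched = $string =~ /
        (?: (quick|slow) \s+    (?{ $tmp{speed}  = $^N }) )
        (?: (brown|blue) \s+    (?{ $tmp{color}  = $^N }) )
        (?: (sloth|fox)  \s+    (?{ $tmp{animal} = $^N }) )
        (?: (eats|jumps)        (?{ $tmp{action} = $^N }) )
    /xms;

    # Copy the scratch data only when the whole pattern matched.
    push @results, $matched ? +{ %tmp } : undef;
}

for my $found (@results) {
    my $summary = defined $found
        ? join( ', ', map { "$_=$found->{$_}" } sort keys %$found )
        : 'no match';
    print "$summary\n";
}
```

The first sentence yields all four keys, while the second reports no match and leaves nothing behind in the trusted results, whatever the code blocks wrote to %tmp along the way.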

The problem becomes even more subtle if your regular expressions trigger backtracking. In that case, an optional group might match and have its value assigned to the hash, and then the next required group might fail. Perl will then backtrack, throw out the successful group match, and see whether the next required group matches without it. If so, you can have a successful match and potentially invalid data in your hash. Here’s an example:

my @captures = $string =~ /
    (?: (quick|slow) \s+    (?{ $found{speed}  = $^N  }) )
    (?: (brown|blue) \s+    (?{ $found{color}  = $^N  }) )?
    (?: (brown\s+fox)       (?{ $found{animal} = $^N  }) )
/xms;

print Dumper \@captures;
print Dumper \%found;

And the output is:

$VAR1 = [
          'quick',
          undef,
          'brown fox'
        ];
$VAR1 = {
          'color' => 'brown',
          'speed' => 'quick',
          'animal' => 'brown fox'
        };

So while the second group returned undef for the color capture, the %found hash still had the color key in it. This may or may not be what you want.
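If stale keys like that are a problem, one alternative is to skip the code blocks entirely and build the hash from the positional captures after a successful match, since those reflect only the groups that took part in the final match. This is a minimal sketch under that assumption; the @names list is my own addition and has to be kept in the same order as the capture groups by hand.

```perl
use strict;
use warnings;

my $string = 'The quick brown fox jumps over the lazy dog';
my @names  = qw(speed color animal);    # one name per capture group, in order
my %found;

if ( my @captures = $string =~ /
        (quick|slow) \s+
        (?: (brown|blue) \s+ )?
        (brown\s+fox)
    /xms ) {
    # Hash slice: pair each name with the capture at the same position.
    @found{@names} = @captures;
    # Drop groups that did not participate in the final match.
    delete @found{ grep { !defined $found{$_} } keys %found };
}

print join( ', ', map { "$_=$found{$_}" } sort keys %found ), "\n";
```

Here the color group is backtracked out of, so its capture is undef and the key is removed, leaving only animal and speed in %found.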

Of course, all this seems cool, but since it’s a truly evil hack, you have to be careful. If you can wait, though, perhaps we’ll see named captures in Perl 5.10.

What's With These CPAN-Testers Failures?

So I just learned about and subscribed to the CPAN-Testers feed for my modules. There appear to be a number of odd failures. Take this one. It says, Can’t locate Algorithm/Diff.pm, despite the fact that I have properly specified the requirement for Text::Diff, which itself properly requires Algorithm::Diff. Is this an instance of CPAN.pm or CPANPLUS not following all prerequisites, or what?

Or take this failure. It says, [CP_ERROR] [Mon Sep 5 09:32:08 2005] No such module ‘mod_perl’ found on CPAN. Yet here it is. Maybe the CPANPLUS indexer has a bug? Or are people’s configurations just horked? Or am I just doing something braindead?

Opinions welcomed.

FSA::Rules Graphing Features Improved

FSA::Rules sample graph output

I just released FSA::Rules 0.25. This version came about as I returned to the module to handle setting up a PostgreSQL database and found the graphics that it churned out, well, wanting. I wanted a decision tree, but the graphics just had the names of the states for the nodes, and then long question-like labels on the edges. What I wanted instead was for each node to be a question (or a statement about what the node was doing), and for the edges to be simple answers to those questions (or indicators as to the success of the code run in a state).

So I added a new attribute to the state class, label. You can use this attribute to say something more about the state. In my case, I used it to store the question the state asks, or the description of the state’s activities. I then changed the code that creates the graph to use this attribute in preference to the state name when creating node labels. The result is a much more natural decision graph, as you see here.

The release features a number of other goodies, including the elimination of a dependence on the Clone module, and thus also a big memory savings. There is now a lot more control over the format of graphs, too. Enjoy!

Stepped Series of Numbers in Perl

In working on a Perl validation function for GTINs (recipe here), I found a need to generate a series of numbers with a step of two. For example, in the series 1-10, I first want 1, 3, 5, 7, and 9. And then later I want 2, 4, 6, 8, and 10. Here’s how I went about creating those series in my GTIN function to create array slices:

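use List::Util qw(sum);

sub isa_gtin {
    # Reverse the digits so the check digit is always at index 0.
    my @nums = reverse split q{}, shift;
    (
        sum( @nums[ grep {   $_ % 2  } 0..$#nums ] ) * 3   # odd indexes, weight 3
      + sum( @nums[ grep { !($_ % 2) } 0..$#nums ] )       # even indexes, weight 1
    ) % 10 == 0;
}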

But it seems wasteful to generate the series of numbers twice and to calculate whether they’re odd or even twice. Surely there’s a more efficient way to do this in Perl, perhaps even more expressive? Python seems to have a useful syntax for creating array slices that step. In Python, I’d do something like this:

  sum( nums[1::2] ) * 3 + sum( nums[0::2] )

But barring such a slice feature in Perl, is there some cleaner way than the ugly grep approach I created to generate a stepped series in Perl?
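One possible answer, offered as a sketch rather than the definitive idiom, is to compute each index directly with map instead of filtering every index with grep. The stepped helper below is a hypothetical name of my own, not a built-in or a List::Util function, and the isa_gtin variant passes a leading 0 to sum so that an empty slice does not produce undef.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: every $step-th integer from $from through $to.
sub stepped {
    my ($from, $to, $step) = @_;
    return if $to < $from;    # empty series
    return map { $from + $_ * $step } 0 .. int( ( $to - $from ) / $step );
}

# Variant of isa_gtin that builds its array slices from stepped series.
sub isa_gtin {
    my @nums = reverse split //, shift;
    my $odd  = sum 0, @nums[ stepped( 1, $#nums, 2 ) ];    # weight 3
    my $even = sum 0, @nums[ stepped( 0, $#nums, 2 ) ];    # weight 1
    return ( $odd * 3 + $even ) % 10 == 0;
}

print join( ',', stepped( 1, 10, 2 ) ), "\n";    # 1,3,5,7,9
print join( ',', stepped( 2, 10, 2 ) ), "\n";    # 2,4,6,8,10

my $verdict = isa_gtin('4006381333931') ? 'valid' : 'invalid';
print "$verdict\n";    # valid
```

This still walks the digits twice, once per slice, but each series is generated without testing every index for oddness.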
