home :: computers

CITEXT Patch Submitted to PostgreSQL Contrib

On Friday, I submitted a patch to add a locale-aware case-insensitive text type as a PostgreSQL contrib module. This has been among my top requests as a feature for PostgreSQL ever since I started using it. And as I started work on yet another application recently, I decided to look into what it would take to just make it happen myself. I’m hopeful that everyone will be able to benefit from this bit of yak shaving.

I started out by trying to use the citext project on pgFoundry, but immediately identified two issues with it:

  1. It does not install properly on PostgreSQL 8.3 (it uses a lot of casts that were removed in 8.3); and
  2. It only case-insensitively compared ASCII characters. So accented multibyte characters work just as they do in the text type.

So I set about trying to create my own, new type, originally called “lctext”, since what it does is not true case-insensitive comparisons, but lowercases text and then compares, just as millions of us developers already do by using LOWER() on both sides of a query:

SELECT *
  FROM tab
 WHERE lower(col) = LOWER(?);

I just finally got fed up with this. The last straw for me was wanting to create a primary key that would be stored case-insensitively, which would have required that I create two indexes for it: One created for the primary key by default, the other a functional UNIQUE INDEX on LOWER(col), which would just be stupid.

So this patch is the culmination of my work to make a locale-aware case-insensitive text type. It’s locale-aware in that it uses the same locale-aware string comparison code as that used for the text type, and it uses the same C function as LOWER() uses. The nice thing is that it works just as if you had used LOWER() in all your SQL, but now you don’t have to.

So while this is not a true case-insensitive text type, in the sense that it doesn’t do a case-insensitive comparison, but changes the cases and then compares, it is likely more efficient than the LOWER() workaround that we’ve all been using for years, and it neater, too. Using this type, it will now be much easier to create, e.g, an email domain, like so:

CREATE OR REPLACE FUNCTION is_email(text)
RETURNS BOOLEAN
AS $$
    use Email::Valid;
    return TRUE if Email::Valid->address( $_[0] );
    return FALSE;
$$ LANGUAGE ‘plperlu’ STRICT IMMUTABLE;

CREATE DOMAIN email AS CITEXT CHECK ( is_email( value ) );

No more nasty workarounds to account for the lack of case-insensitive comparisons for text types. It works great for time zones and other data types that are defined to compare case-insensitively:

CREATE OR REPLACE FUNCTION is_timezone( tz TEXT ) RETURNS BOOLEAN as $$
BEGIN
  PERFORM now() AT TIME ZONE tz;
  RETURN TRUE;
EXCEPTION WHEN invalid_parameter_value THEN
  RETURN FALSE;
END;
$$ language plpgsql STABLE;

CREATE DOMAIN timezone AS CITEXT
CHECK ( is_timezone( value ) );

And that should just work!

I’m hoping that this is accepted during the July CommitFest. Of course I will welcome suggestions for how to improve it. Since I sent the patch, for example, I’ve been thinking that I should suggest in the documentation that it is best used for short text entries (say, up to 256 characters), rather than longer entries (like email bodies or journal articles), and that for longer entries, one should really make use of tsearch2, instead. There are other notes and caveats in the patch submission. Please do let me know what you think.

pgTAP pgFoundry Project and Lightning Talk

A couple of quick announcements:

pgTAP on pgFoundry

First, the PostgreSQL community approved my project, so now there is a pgTAP project page, including a couple of mail lists, a bug tracker, and downloads. I uploaded a new version shortly after the project was approved, and 0.03 should be there soon, as well.

pgTAP YAPC::NA Lightning Talk

I gave a Lightning talk at YAPC::NA 2008 in Chicago this afternoon. I’ve now posted the slides for your enjoyment.

Care to help me with development? Want to add your own test functions or make it all integrate better with standard PostgreSQL regression tests? Want to help me get Module::Build or Module::Install to run Perl and PostgreSQL and whatever tests side-by-side, all at once? Join the pgtap-users mail list and join the fun!

Introducing pgTAP

So I started working on a new PostgreSQL data type this week. More on that soon; in the meantime, I wanted to create a test suite for it, and wasn’t sure where to go. The only PostgreSQL tests I’ve seen are those distributed with Elein Mustain’s tests for the email data type she created in a PostgreSQL General Bits posting from a couple of years ago. I used the same approach myself for my GTIN data type, but it was rather hard to use: I had to pay very close attention to what was output in order to tell the description output from the test output. It was quite a PITA, actually.

This time, I started down the same path, then then started thinking about Perl testing, where each unit test, or assertion, in the xUnit parlance, triggers output of a single line of information indicating whether or not a test succeeded. It occurred to me that I could just run a bunch of queries that returned booleans to do my testing. So my first stab looked something like this:

\set ON_ERROR_STOP 1
\set AUTOCOMMIT off
\pset format unaligned
\pset tuples_only
\pset pager
\pset null '[NULL]'

SELECT foo() = 'bar';
SELECT foo(1) = 'baz';
SELECT foo(2) = 'foo';

The output looked like this:

% psql try -f ~/Desktop/try.sql
t
t
t

Once I started down that path, and had written ten or so tests, It suddenly dawned on me that the Perl Test::More module and its core ok() subroutine worked just like that. It essentially just depends on a boolean value and outputs text based on that value. A couple minutes of hacking and I had this:

CREATE TEMP SEQUENCE __tc__;
CREATE OR REPLACE FUNCTION ok ( boolean, text ) RETURNS TEXT AS $$
    SELECT (CASE $1 WHEN TRUE THEN '' ELSE 'not ' END) || 'ok'
        || ' ' || NEXTVAL('__tc__')
        || CASE $2 WHEN '' THEN '' ELSE COALESCE( ' - ' || $2, '' ) END;
$$ LANGUAGE SQL;

I then rewrote my test queries like so:

\echo 1..3
SELECT ok( foo() = 'bar'   'foo() should return "bar"' );
SELECT ok( foo(1) = 'baz', 'foo(1) should return "baz"' );
SELECT ok( foo(2) = 'foo', 'foo(2) should return "foo"' );

Running these tests, I now got:

% psql try -f ~/Desktop/try.sql
1..3
ok 1 - foo() should return "bar"
ok 2 - foo(1) should return "baz"
ok 3 - foo(2) should return "foo"

And, BAM! I had the beginning of a test framework that emits pure TAP output.

Well, I was so excited about this that I put aside my data type for a few hours and banged out the rest of the framework. Why was this exciting to me? Because now I can use a standard test harness to run the tests, even mix them in with other TAP tests on any project I might work on. Just now, I quickly hacked together a quick script to run the tests:

use TAP::Harness;

my $harness = TAP::Harness->new({
    timer   => $opts->{timer},
    exec    => [qw( psql try -f )],
});

$harness->runtests( @ARGV );

Now I’m able to run the tests like so:

% try ~/Desktop/try.sql        
/Users/david/Desktop/try........ok   
All tests successful.
Files=1, Tests=3,  0 wallclock secs ( 0.00 usr  0.00 sys +  0.01 cusr  0.00 csys =  0.01 CPU)
Result: PASS

Pretty damn cool! And lest you wonder whether such a suite of TAP-emitting test functions is suitable for testing SQL, here are a few exmples of tests I’ve written:

-- Plan the tests.
SELECT plan(4);

-- Emit a diagnostic message for users of different locales.
SELECT diag(
    E'These tests expect LC_COLLATE to be en_US.UTF-8,\n'
  || 'but yours is set to ' || setting || E'.\n'
  || 'As a result, some tests may fail. YMMV.'
)
  FROM pg_settings
 WHERE name = 'lc_collate'
   AND setting <> 'en_US.UTF-8';

SELECT is( 'a', 'a', '"a" should = "a"' );
SELECT is( 'B', 'B', '"B" should = "B"' );

CREATE TEMP TABLE try (
    name lctext PRIMARY KEY
);

INSERT INTO try (name)
VALUES ('a'), ('ab'), ('â'), ('aba'), ('b'), ('ba'), ('bab'), ('AZ');

SELECT ok( 'a' = name, 'We should be able to select the value' )
  FROM try
 WHERE name = 'a';

SELECT throws_ok(
    'INSERT INTO try (name) VALUES (''a'')',
    '23505',
    'We should get an error inserting a lowercase letter'
);

-- Finish the tests and clean up.
SELECT * FROM finish();

As you can see, it’s just SQL. And yes, I have ported most of the test functions from Test::More, as well as a couple from Test::Exception.

So, without further ado, I’d like to introduce pgTAP, a lightweight test framework for PostgreSQL implemented in PL/pgSQL and PL/SQL. I’ll be hacking on it more in the coming days, mostly to get a proper client for running tests hacked together. Then I think I’ll see if pgFoundry is interested in it.

Whaddya think? Is this something you could use? I can see many uses, myself, not only for testing a custom data type as I develop it, but also custom functions in PL/pgSQL or PL/Perl, and, heck, just regular schema stuff. I’ve had to write a lot of Perl tests to test my database schema (triggers, rules, functions, etc.), all using the DBI and being very verbose. Being able to do it all in a single psql script seems so much cleaner. And if I cand end up mixing the output of those scripts in with the rest of my unit tests, so much the better!

Anyway, feedback welcome. Leave your comments, suggestions, complaints, patches, etc., below. Thanks!

How to Globally Change a Subversion Username

I successfully migrated the Kineticode Subversion repository to a new server yesterday. Everything works great. But after my first commit, I realized that, while my username on the old server was theory, on the new server it’s david. Subversion works fine, of course, and I was able to start committing from old checkouts using the new username, but I realized that sites like Ohloh would pick up the two usernames as separate usernames. So I wanted to update all of the 3630 existing revisions that were mine to use the new username.

Unfortunately, I couldn’t find much on how to do this in a quick Googling. But I quickly figured out that what I need to do was to svnadmin dump my repository, modify the dump, and then load it again. The Subversion dump format has all these fields for tracking the content-lengths of various, so doing the update was a bit tricky. But I wrote the script here to track things, and it worked great for me. So here it is for others to reference and use.

#!/usr/bin/perl -w

use strict;
use warnings;

while (<>) {
    print;
    next unless /^Revision-number:\s+\d+$/;

    # Grab the content lengths. Examples:
    # Prop-content-length: 139
    # Content-length: 139
    my $plen_line = <>;
    my $clen_line = <>;

    unless ( $plen_line =~ /^Prop-content-length:\s+\d+$/ ) {
        # Nothing we want to change.
        print $plen_line, $clen_line;
        next;
    }

    my @lines;
    while ( <> ) {
        if ( /^PROPS-END$/ ) {
            # finish.
            print $plen_line, $clen_line, @lines, $_;
            last;
        }

        push @lines, $_;

        if ( /^svn:author$/ ) {
            # Grab the author content length. Example:
            # V 6
            my $alen_line = <>;

            # Grab the author name.
            my $auth = <>;

            if ( $auth =~ s/^theory$/david/ ) {
                # Adjust the content lengths.
                for my $line ( $plen_line, $clen_line, $alen_line ) {
                    $line =~ s/(\d+)$/$1 - 1/e;
                }
            }
            print $plen_line, $clen_line, @lines, $alen_line, $auth;
            last;
        }
    }
}

To use it, save it to a file, say svn_author, then change line 40 to your old and new usernames. Then, on line 43, change the $1 - 1 bit to be correct for the difference between the usernames you’re changing. For example, if you’re changing your username from, say, shane to chromatic, the new name is five characters longer, so you’d make it $1 + 5.

Now, run it like so:

svnadmin dump /path/to/svnroot > svndump.out
perl svn_author svndump.out > svndump.in
svnadmin create /path/to/new/svnroot
svnadmin load /path/to/new/svnroot < svndump.in

And that’s it! Feel free to take this code and do with it what you like, including fix any bugs, add command-line options, support changing multiple authors at once, or whatever. Share and enjoy.

SVN::Notify 2.70: Output Filtering and Character Encoding

I’m very pleased to announce the release of SVN::Notify 2.70. You can see an example of its colordiff output here. This is a major release that I’ve spent the last several weeks polishing and tweaking to get just right. There are quite a few changes, but the two most important are imporoved character encoding support and output filtering.

Improved Character Encoding Support

I’ve had a number of bug reports regarding issues with character encodings. Particularly for folks working in Europe and Asia, but really for anyone using multibyte characters in their source code and log messages (and we all do nowadays, don’t we?), it has been difficult to find the proper incantation to get SVN::Notify to convert data from and to their proper encodings. Using a patch from Toshikazu Kinkoh as a starting-point, and with a lot of reading and experimentation, as well as regular and patient tests on Toshikazu’s and Martin Lindhe’s production systems, I think I’ve finally got it nailed down.

Now you can use the --encoding (formerly --charset), --svn-encoding, and --diff-encoding options—as well as --language—to get SVN::Notify to do the right thing. As long as your Subversion server’s OS supports an appropriate locale, you should be golden (mine is old, with no UTF-8 locales :\). And if all else fails, you can still set the $LANG environment variable before executing svnnotify.

There is actually a fair bit to know about encodings to get it to work properly, but if you use UTF-8 throughout and your OS supports UTF-8 locales, you shouldn’t have to do anything. You might have to set --language in order to get it to use the proper locale. See the new documentation of the encoding support for all the details. And if you still have problems, please do let me know.

Output Filtering

Much sexier is the addition of output filtering in SVN::Notify 2.70. I got pretty tired of getting feature requests for what are essentially formatting modifications, such as this one requesting support for KDE-style keyword support. I myself was using Trac wiki syntax in commit messages on a recent project and wanted to see them converted to HTML for messages output by SVN::Notify::HTML::ColorDiff.

So I finally sat down and gave some though on how to implement a simple plugin architecture for SVN::Notify. When I realized that it was generally just formatting that people wanted, it became simpler: I just needed a way to allow folks to write simple output filters. The solution I came up with was to just use Perl. Output filters are simply subroutines named for the kind of output they filter. They live in perl packages. That’s it.

For example, say that your developers write their commit log messages in Textile, and rather than receive them stuck inside <pre> tags, you’d like them converted to HTML. It’s simple. Just put this code in a Perl module file:

package SVN::Notify::Filter::Textile;
use Text::Textile ();

sub log_message {
    my ($notifier, $lines) = @_;
    return $lines unless $notify->content_type eq 'text/html';
    return [ Text::Textile->new->process( join $/, @$lines ) ];
}

Put the file, SVN/Notify/Filter/Textile.pm somewhere in a Perl library directory. Then use the new --filter option to svnnotify to put it to work:

svnnotify -p "$1" -r "$2" --handler HTML::ColorDiff --filter Textile

Yep, that’s it! SVN::Notify will find the filter module, load it, register its filtering subroutine, and then call it at the appropriate time. Of course, there are a lot of things you can filter; consult the complete documentation for all of the details. But hopefully this gives you a flavor for how easy it is to write new filters for SVN::Notify. I’m hoping that all those folks who want featurs can now stop bugging me and writing their own filters to do the job, and uploading them to CPAN for all to share!

To get things started, I scratched my own itch, writing a Trac filter myself. The filter is almost as simple as the Textile example above, but I also spent quite a bit of time tweaking the CSS so that most of the Trac-generated HTML looks good. You can see an example right here. Thanks to a number of bug fixes in Text::Trac, as well as Trac-specific CSS added via a filter on CSS output, it works beautifully. If I’m feeling motivated in the next week or so, I’ll create a separate CPAN distribution with just a Markdown filter and upload it. That will create a nice distriution example for folks to copy to creat their own. Or maybe someone on the Lazy Web Will do it for me! Maybe you?

I wish I’d thought to do this from the beginning; it would have saved me from having to add so many features/cruft to SVN::Notify over the years. Here’s a quick list of the features that likely could have been implemented via filters instead of added to the core:

  • --user-domain: Combine the SVN username with a domain for the From header.
  • --add-header: Add a header to the message.
  • --reply-to: Add a specific header to the message.
  • SVN::Notify::HTML::ColorDiff: Frankly, looking back on it, I don’t know why I didn’t just put this support right into SVN::Notify::HTML. But even if I hadn’t, it could have been implemented via filters.
  • --subject-prefix:: Modify the message subject.
  • --subject-cx: Add the commit context to the subject.
  • --strip-cx-regex: More subject context modification.
  • --no-first-line: Another subject filter.
  • --max-sub-length: Yet another!
  • --max-diff-length: A filter could truncate the diff, although this might be tricky with the HTML formatting.
  • --author-url: Modify the metadata section to add a link to the author URL.
  • --revision-url: Ditto for the revision URL.
  • --ticket-map: Filter the log message for various ticketing system strings to convert to URLs. This also encompasses the old --rt-url, --bugzilla-url, --gnats-url, and --jira-url options.
  • --header: Filter the beginning of the message.
  • --footer: Filter the end of the message.
  • --linkize: Filter the log message to convert URLs to links for HTML messages.
  • --css-url: Filter the CSS to modify it, or filter the start of the HTML to add a link to an external CSS URL.
  • --wrap-log: Reformat the log message for HTML.

Yes, really! That’s about half the functionality right there. I’m glad that I won’t have to add any more like that; filters are a much better way to go.

So download it, install it, write some filters, get your multibyte characters output properly, and enjoy! And as usual, send me your bug reports, but implement your own improvements using filters!

Powered by KinoSearch