CITEXT Patch Submitted to PostgreSQL Contrib

On Friday, I submitted a patch to add a locale-aware case-insensitive text type as a PostgreSQL contrib module. This has been among my top requests as a feature for PostgreSQL ever since I started using it. And as I started work on yet another application recently, I decided to look into what it would take to just make it happen myself. I’m hopeful that everyone will be able to benefit from this bit of yak shaving.

I started out by trying to use the citext project on pgFoundry, but immediately identified two issues with it:

  1. It does not install properly on PostgreSQL 8.3 (it uses a lot of casts that were removed in 8.3); and
  2. It only case-insensitively compared ASCII characters. So accented multibyte characters work just as they do in the text type.

So I set about trying to create my own, new type, originally called “lctext”, since what it does is not true case-insensitive comparisons, but lowercases text and then compares, just as millions of us developers already do by using LOWER() on both sides of a query:

SELECT *
  FROM tab
 WHERE lower(col) = LOWER(?);

I just finally got fed up with this. The last straw for me was wanting to create a primary key that would be stored case-insensitively, which would have required that I create two indexes for it: One created for the primary key by default, the other a functional UNIQUE INDEX on LOWER(col), which would just be stupid.

So this patch is the culmination of my work to make a locale-aware case-insensitive text type. It’s locale-aware in that it uses the same locale-aware string comparison code as that used for the text type, and it uses the same C function as LOWER() uses. The nice thing is that it works just as if you had used LOWER() in all your SQL, but now you don’t have to.

So while this is not a true case-insensitive text type, in the sense that it doesn’t do a case-insensitive comparison, but changes the cases and then compares, it is likely more efficient than the LOWER() workaround that we’ve all been using for years, and it neater, too. Using this type, it will now be much easier to create, e.g, an email domain, like so:

CREATE OR REPLACE FUNCTION is_email(text)
RETURNS BOOLEAN
AS $$
    use Email::Valid;
    return TRUE if Email::Valid->address( $_[0] );
    return FALSE;
$$ LANGUAGE ‘plperlu’ STRICT IMMUTABLE;

CREATE DOMAIN email AS CITEXT CHECK ( is_email( value ) );

No more nasty workarounds to account for the lack of case-insensitive comparisons for text types. It works great for time zones and other data types that are defined to compare case-insensitively:

CREATE OR REPLACE FUNCTION is_timezone( tz TEXT ) RETURNS BOOLEAN as $$
BEGIN
  PERFORM now() AT TIME ZONE tz;
  RETURN TRUE;
EXCEPTION WHEN invalid_parameter_value THEN
  RETURN FALSE;
END;
$$ language plpgsql STABLE;

CREATE DOMAIN timezone AS CITEXT
CHECK ( is_timezone( value ) );

And that should just work!

I’m hoping that this is accepted during the July CommitFest. Of course I will welcome suggestions for how to improve it. Since I sent the patch, for example, I’ve been thinking that I should suggest in the documentation that it is best used for short text entries (say, up to 256 characters), rather than longer entries (like email bodies or journal articles), and that for longer entries, one should really make use of tsearch2, instead. There are other notes and caveats in the patch submission. Please do let me know what you think.

pgTAP pgFoundry Project and Lightning Talk

A couple of quick announcements:

pgTAP on pgFoundry

First, the PostgreSQL community approved my project, so now there is a pgTAP project page, including a couple of mail lists, a bug tracker, and downloads. I uploaded a new version shortly after the project was approved, and 0.03 should be there soon, as well.

pgTAP YAPC::NA Lightning Talk

I gave a Lightning talk at YAPC::NA 2008 in Chicago this afternoon. I’ve now posted the slides for your enjoyment.

Care to help me with development? Want to add your own test functions or make it all integrate better with standard PostgreSQL regression tests? Want to help me get Module::Build or Module::Install to run Perl and PostgreSQL and whatever tests side-by-side, all at once? Join the pgtap-users mail list and join the fun!

Introducing pgTAP

So I started working on a new PostgreSQL data type this week. More on that soon; in the meantime, I wanted to create a test suite for it, and wasn’t sure where to go. The only PostgreSQL tests I’ve seen are those distributed with Elein Mustain’s tests for the email data type she created in a PostgreSQL General Bits posting from a couple of years ago. I used the same approach myself for my GTIN data type, but it was rather hard to use: I had to pay very close attention to what was output in order to tell the description output from the test output. It was quite a PITA, actually.

This time, I started down the same path, then then started thinking about Perl testing, where each unit test, or assertion, in the xUnit parlance, triggers output of a single line of information indicating whether or not a test succeeded. It occurred to me that I could just run a bunch of queries that returned booleans to do my testing. So my first stab looked something like this:

\set ON_ERROR_STOP 1
\set AUTOCOMMIT off
\pset format unaligned
\pset tuples_only
\pset pager
\pset null '[NULL]'

SELECT foo() = 'bar';
SELECT foo(1) = 'baz';
SELECT foo(2) = 'foo';

The output looked like this:

% psql try -f ~/Desktop/try.sql
t
t
t

Once I started down that path, and had written ten or so tests, It suddenly dawned on me that the Perl Test::More module and its core ok() subroutine worked just like that. It essentially just depends on a boolean value and outputs text based on that value. A couple minutes of hacking and I had this:

CREATE TEMP SEQUENCE __tc__;
CREATE OR REPLACE FUNCTION ok ( boolean, text ) RETURNS TEXT AS $$
    SELECT (CASE $1 WHEN TRUE THEN '' ELSE 'not ' END) || 'ok'
        || ' ' || NEXTVAL('__tc__')
        || CASE $2 WHEN '' THEN '' ELSE COALESCE( ' - ' || $2, '' ) END;
$$ LANGUAGE SQL;

I then rewrote my test queries like so:

\echo 1..3
SELECT ok( foo() = 'bar'   'foo() should return "bar"' );
SELECT ok( foo(1) = 'baz', 'foo(1) should return "baz"' );
SELECT ok( foo(2) = 'foo', 'foo(2) should return "foo"' );

Running these tests, I now got:

% psql try -f ~/Desktop/try.sql
1..3
ok 1 - foo() should return "bar"
ok 2 - foo(1) should return "baz"
ok 3 - foo(2) should return "foo"

And, BAM! I had the beginning of a test framework that emits pure TAP output.

Well, I was so excited about this that I put aside my data type for a few hours and banged out the rest of the framework. Why was this exciting to me? Because now I can use a standard test harness to run the tests, even mix them in with other TAP tests on any project I might work on. Just now, I quickly hacked together a quick script to run the tests:

use TAP::Harness;

my $harness = TAP::Harness->new({
    timer   => $opts->{timer},
    exec    => [qw( psql try -f )],
});

$harness->runtests( @ARGV );

Now I’m able to run the tests like so:

% try ~/Desktop/try.sql        
/Users/david/Desktop/try........ok   
All tests successful.
Files=1, Tests=3,  0 wallclock secs ( 0.00 usr  0.00 sys +  0.01 cusr  0.00 csys =  0.01 CPU)
Result: PASS

Pretty damn cool! And lest you wonder whether such a suite of TAP-emitting test functions is suitable for testing SQL, here are a few exmples of tests I’ve written:

-- Plan the tests.
SELECT plan(4);

-- Emit a diagnostic message for users of different locales.
SELECT diag(
    E'These tests expect LC_COLLATE to be en_US.UTF-8,\n'
  || 'but yours is set to ' || setting || E'.\n'
  || 'As a result, some tests may fail. YMMV.'
)
  FROM pg_settings
 WHERE name = 'lc_collate'
   AND setting <> 'en_US.UTF-8';

SELECT is( 'a', 'a', '"a" should = "a"' );
SELECT is( 'B', 'B', '"B" should = "B"' );

CREATE TEMP TABLE try (
    name lctext PRIMARY KEY
);

INSERT INTO try (name)
VALUES ('a'), ('ab'), ('â'), ('aba'), ('b'), ('ba'), ('bab'), ('AZ');

SELECT ok( 'a' = name, 'We should be able to select the value' )
  FROM try
 WHERE name = 'a';

SELECT throws_ok(
    'INSERT INTO try (name) VALUES (''a'')',
    '23505',
    'We should get an error inserting a lowercase letter'
);

-- Finish the tests and clean up.
SELECT * FROM finish();

As you can see, it’s just SQL. And yes, I have ported most of the test functions from Test::More, as well as a couple from Test::Exception.

So, without further ado, I’d like to introduce pgTAP, a lightweight test framework for PostgreSQL implemented in PL/pgSQL and PL/SQL. I’ll be hacking on it more in the coming days, mostly to get a proper client for running tests hacked together. Then I think I’ll see if pgFoundry is interested in it.

Whaddya think? Is this something you could use? I can see many uses, myself, not only for testing a custom data type as I develop it, but also custom functions in PL/pgSQL or PL/Perl, and, heck, just regular schema stuff. I’ve had to write a lot of Perl tests to test my database schema (triggers, rules, functions, etc.), all using the DBI and being very verbose. Being able to do it all in a single psql script seems so much cleaner. And if I cand end up mixing the output of those scripts in with the rest of my unit tests, so much the better!

Anyway, feedback welcome. Leave your comments, suggestions, complaints, patches, etc., below. Thanks!

Moving Towards Bricolage 2.0

Today I’ve finished just about over two and a half weeks of hacking on Bricolage. It has been a couple of years since I gave it much attention, but there was so much good stuff that other people have contributed that, since I had a little time, it seemed worth it to give it some love. So here’s a quick list of all that I’ve done in the last two weeks:

  • Fixed all reported issues with Bricolage 1.10. Scott Lanning kindly released 1.10.5 yesterday with all of those fixes.

  • I integrated the element occurrence branch that Christian Muise had worked on as his 2006 Google Summer of Code project. Christian’s project added support for maximum and mininum specfications for subelements in Bricolage, which allows administrators to define how many fields and elements can occur in a story or media document. All I had to do was add a few UI tweaks to support the new fields and their specification in the story profile, and all was ready to go. Oh, and I did have to go back and make the SOAP interface work with the feature, but the only reason it never did was lazy hacking of the SOAP interface (way before Christian’s time). Nice work, Christian, and thank you for your contribution!

  • I fixed a few bugs with Arsu Andrei’s port of Bricolage to MySQL, which was his 2006 Google Summer of Code project. Arsu did a terrific job with the port, with only a few minor things missed that he likely could not have caught anyway. This work had already been merged into the trunk. Thanks Arsu!

  • I fixed a bunch of bugs from Marshall Roch’s AJAXification of Bricolage, carried out during his 2006 Google Summer of Code project. Marshall actually did a lot more stuff than he’d planned, as it all went quite smoothly. I found only a few minor oversights that I was able to eaily address. This work represents the single most visible change to how users user Bricolage since we launched the probject back in 2001. Editing stories, in particular, is now a lot cleaner, with far fewer page loads. Thanks a million, Marshall!

  • I completed the work started by Chris Heiland of the University of Washington, Bothell, and Scott Lanning of the World Health Organization to port Bricolage to Apache 2. They really did most of the hard work, and I just spent several days integrating everything, making sure all the features work, and updating the installer to handle differences in configuration. I thought this would take me a day or two, but it actually took the better part of a week! So much has changed, but in truth Bricolage is now better for running on mod_perl 2. Expect to see Apache 2 bet the recommended platform for Bricolage in a release in the near future.

  • I integrated a number of patches from Brian Smith of Gossamer Threads to allow the installer to be run as a non-root user. The key here is if the installer has to become the database super user, which is required for ident authentication, and of course whether files are to be installed somewhere on the system requiring super user access. This work is not done, yet, as make upgrade and make uninstall are not quite there yet. But we’re getting there, and it should be all done in time for 2.0, thanks to Brian.

  • I added support for a whole slew of environment variables to the installer. Now you can set environment variables to override default settings for installation parameters, such as choice of RDBMS, Apache, location of an SSL cert and key, whether to support SLL, and lots of other stuff, besides. This is all documented in the Quick Installation Instructions section of Bric::Admin/INSTALL.

  • I fully tested and fixed a lot of bugs leftover from making the installer database- and Apache-neutral. Now all of these commands should work perfectly:

    • make
    • make cpan
    • make test
    • make install
    • make devtest
    • make clone
    • make uninstall
  • I improved the DHTML functionality of the Add More widget, which is used to add contact information to users and contributors, rules to alert types, and extensions to media types. I think it’s pretty slick, now! This was built on Marshall’s AJAX work.

All of these changes have been integrated into the Bricolage trunk and I’ve pushed out a developer release today. Please do check out all the goodness on a test box and send feedback or file bug reports! There are only a couple of other features waiting to go into Bricolage before we start the release candidate process. And, oh yeah, tht title of this blog post? It’s not a lie. The next production release of Bricolage, based on all this work, will be Bricolage 2.0. Enough of the features we’d planned for Bricolage lo these many years ago are in the trunk that the new version number is warranted. I for one will be thrilled to see 2.0 ship in time for OSCON.

And in case it isn’t already clear, many thanks to the Google Summer of Code and participating students for the great contributions! This release would not have been possible without them.

Also in the news today, the Bricolage server has been replaced! The new server, which hosts the Web site, the wiki and the instance of Bricolage used to manage the site itself, is graciously provided by the kind folks at Gossamer Threads. The server is Gossamer Threads’s way of giving back to the Bricolage community as they prepare to launch a hosted Bricolage solution. Thaks GT!

The old Bricolage server was provided by pair Networds for the last five years. I’d just like to thank pair for the generous five-year loan of that box, which helped provided infrastructure for both Bricolage and Kineticode. Thank you, pair!

And with that, I’m going heads-down on some other projects. I’ll pop back up to make sure that Bricolage 2.0 is released in a few months, but otherwise, I’m on to other things again for a while. Watch this space for details!

How to Globally Change a Subversion Username

I successfully migrated the Kineticode Subversion repository to a new server yesterday. Everything works great. But after my first commit, I realized that, while my username on the old server was theory, on the new server it’s david. Subversion works fine, of course, and I was able to start committing from old checkouts using the new username, but I realized that sites like Ohloh would pick up the two usernames as separate usernames. So I wanted to update all of the 3630 existing revisions that were mine to use the new username.

Unfortunately, I couldn’t find much on how to do this in a quick Googling. But I quickly figured out that what I need to do was to svnadmin dump my repository, modify the dump, and then load it again. The Subversion dump format has all these fields for tracking the content-lengths of various, so doing the update was a bit tricky. But I wrote the script here to track things, and it worked great for me. So here it is for others to reference and use.

#!/usr/bin/perl -w

use strict;
use warnings;

while (<>) {
    print;
    next unless /^Revision-number:\s+\d+$/;

    # Grab the content lengths. Examples:
    # Prop-content-length: 139
    # Content-length: 139
    my $plen_line = <>;
    my $clen_line = <>;

    unless ( $plen_line =~ /^Prop-content-length:\s+\d+$/ ) {
        # Nothing we want to change.
        print $plen_line, $clen_line;
        next;
    }

    my @lines;
    while ( <> ) {
        if ( /^PROPS-END$/ ) {
            # finish.
            print $plen_line, $clen_line, @lines, $_;
            last;
        }

        push @lines, $_;

        if ( /^svn:author$/ ) {
            # Grab the author content length. Example:
            # V 6
            my $alen_line = <>;

            # Grab the author name.
            my $auth = <>;

            if ( $auth =~ s/^theory$/david/ ) {
                # Adjust the content lengths.
                for my $line ( $plen_line, $clen_line, $alen_line ) {
                    $line =~ s/(\d+)$/$1 - 1/e;
                }
            }
            print $plen_line, $clen_line, @lines, $alen_line, $auth;
            last;
        }
    }
}

To use it, save it to a file, say svn_author, then change line 40 to your old and new usernames. Then, on line 43, change the $1 - 1 bit to be correct for the difference between the usernames you’re changing. For example, if you’re changing your username from, say, shane to chromatic, the new name is five characters longer, so you’d make it $1 + 5.

Now, run it like so:

svnadmin dump /path/to/svnroot > svndump.out
perl svn_author svndump.out > svndump.in
svnadmin create /path/to/new/svnroot
svnadmin load /path/to/new/svnroot < svndump.in

And that’s it! Feel free to take this code and do with it what you like, including fix any bugs, add command-line options, support changing multiple authors at once, or whatever. Share and enjoy.

Powered by KinoSearch