Just a Theory


Posts about Unicode

Encoding is a Headache

I have to spend way too much of my programming time worrying about character encodings. Take my latest module, Text::Markup, for example. The purpose of the module is very simple: give it the name of a file, and it will figure out the markup it uses (HTML, Markdown, Textile, whatever) and return a string containing the HTML generated from the file. Simple, right?
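
To give a sense of the intended interface, usage looks something like this (the parse(file => ...) call reflects how I mean the API to work; treat the details as illustrative rather than definitive):

use Text::Markup;

# Hand it a file name; the module works out the markup format and returns HTML.
my $parser = Text::Markup->new;
my $html   = $parser->parse(file => 'README.md');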

But, hang on. Should the HTML it returns be decoded to Perl’s internal form? I’m thinking not, because the HTML itself might declare the encoding, either in an XML declaration or via something like

<meta http-equiv="Content-type" content="text/html;charset=Big5" />

And as you can see, it’s not UTF-8. So if the string were decoded, that declaration would be lying. So it should be returned encoded, right? Parsers like XML::LibXML::Parser are smart enough to see such declarations and decode as appropriate.

But wait a minute! Some markup languages, like Markdown, don’t have XML declarations or headers. They’re HTML fragments. So there’s no way to tell the encoding of the resulting HTML unless it’s decoded. So maybe it should be decoded. Or perhaps it should be decoded, given an XML declaration that declares the encoding as UTF-8, and then encoded as UTF-8 before being returned.

But, hold the phone! When reading in a markup file, should it be decoded before it’s passed to the parser? Does Text::Markdown know or care about encodings? And if it should be decoded, what encoding should one assume the source file uses? Unless it uses a BOM, how do you know what its encoding is?
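
To make the trade-off concrete, here’s a minimal sketch of the “decode first” option, assuming (and that assumption is exactly the problem) that the source file is UTF-8, with Text::Markdown standing in as the parser:

use Text::Markdown ();

my $file = 'doc.md';    # hypothetical input file

# Assume UTF-8 -- the very assumption in question.
open my $fh, '<:encoding(UTF-8)', $file or die "Cannot open $file: $!\n";
my $markdown = do { local $/; <$fh> };    # slurped and decoded to Perl's internal form

# The parser then operates on characters rather than raw bytes.
my $html = Text::Markdown::markdown($markdown);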

Text::Markup is a dead simple idea, but virtually all of my time is going into thinking about this stuff. It drives me nuts. When will the world cease to be this way?

Oh, and if you have answers to any of these questions, please do feel free to leave a comment. I hate having to spend so much time on this, but I’d much rather do so and get things right (or close to right) than wrong.


Make the Pragmas Stop!

I’ve been following the development of a few things in the Perl community lately, and it’s leaving me very frustrated. For years now, I’ve written modules that start with the same incantation:

package My::Module;

use strict;
our $VERSION = '0.01';

Pretty simple: declare the module name and version, and turn on strictures to make sure I’m not doing anything stupid. More recently I’ve added use warnings; as a best practice. And even more recently, I’ve started adding use utf8;, too, because I like to write my code in UTF-8. And I like to turn on all of the Perl 5.10 features. It’s mildly annoying to have the same incantation at the start of every module, but I could deal with it:

package My::Module;

use strict;
use warnings;
use feature ':5.10';
use utf8;

our $VERSION = '0.01';

Until now, that is. Last year, chromatic started something with his Modern::Perl module. It was a decent idea for newbies: help them get started with Perl by requiring only one declaration at the tops of their modules:

package My::Module;

use Modern::Perl;
our $VERSION = '0.01';

Alas, it wasn’t really designed for me, but for more casual users of Perl, so that they don’t have to think about the pragmas they need to use. The fact that it doesn’t include the utf8 pragma also made it a non-starter for me. Or did it? Someone recently suggested that the utf8 pragma has problems (I can’t find the Perl Monks thread at the moment). Others report that the encoding pragma has issues, too. So what’s the right thing to do with regard to assuming everything is UTF-8 in my program and its inputs (unless I say otherwise)? I’m not at all sure.
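
For what it’s worth, the incantation I keep circling back to for “assume UTF-8 unless I say otherwise” is the utf8 pragma for the source plus the open pragma for I/O defaults. I’m not claiming it’s the right answer; it’s just a sketch of the usual combination:

use utf8;                              # the source code itself is UTF-8
use open ':std', ':encoding(UTF-8)';   # default encoding layer for newly opened
                                       # file handles and for STDIN/STDOUT/STDERR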

Not only that, but Modern::Perl has led to an explosion of other pragma-like modules on CPAN that promise best pragma practices. There’s common::sense, which loads utf8 but only some of the features of strict, warnings, and feature. uni::perl looks almost exactly the same. There’s also Damian Conway’s Toolkit, which allows you to write your own pragma-like loader module. There’s even Acme::Very::Modern::Perl, which is meant to be a joke, but is it really?

If I want to simplify the incantation at the top of every file, what do I use?

And now it’s getting worse. In addition to feature, Perl 5.11 introduces the legacy pragma, which allows one to get back behaviors from older Perls. For example, to get back the old Unicode semantics, you’d use legacy 'unicode8bit';. I mean, WTF?

I’ve had it. Please make the pragma explosion stop! Make it so that the best practices known at the time of the release of any given version of Perl can automatically be imported if I just write:

package My::Module '0.01';
use 5.12;

That’s it. Nothing more. Whatever has been deemed the best practice at the time 5.12 is released will simply be used. If the best practices change in 5.14, I can switch to use 5.14; and get them, or just leave it at use 5.12; and keep what were the best practices in 5.12 (yay future-proofing!).

What should the best practices be? My list would include:

  • strict
  • warnings
  • features — all of them
  • UTF-8 — all input and output to the scope, as well as the source code

Maybe you disagree with that list. Maybe I’d disagree with what the Perl 5 Porters settle on. But then you and I can read what’s included and just add or remove pragmas as necessary. But surely there’s a core list of likely candidates that should be included the vast majority of the time, including for all novices.

In personal communication, chromatic tells me, with regard to Modern::Perl, “Experienced Perl programmers know the right incantations to get the behavior they want. Novices don’t, and I think we can provide them much better defaults without several lines of incantations.” I’m fine with the second assertion, but disagree with the first. I’ve been hacking Perl for almost 15 years, and I no longer have any fucking idea what incantation is best to use in my modules. Do help the novices, and make the power tools available to experienced hackers, but please make life easier for the experienced hackers, too.

I think that declaring the semantics of a particular version of Perl is where the Perl 5 Porters are headed. I just hope that includes handling all of the likely pragmas too, so that I don’t have to.


Unicode Normalization in SQL

I’ve been peripherally aware of the need for Unicode normalization in my code for a while, but only got around to looking into it today. Although I use Encode to convert text inputs into Perl’s internal form, and to encode my outputs as UTF-8 or another appropriate encoding, it does nothing about normalization.

What’s normalization you ask?

Well, Unicode allows some characters to be encoded in different ways. For example, “é” can be written either as U+00E9, a “precomposed character,” or as U+0065 followed by U+0301, the letter “e” plus a combining acute accent. This is all well and good, but the trouble comes when you want to compare values. Observe this Perl one-liner:

% perl -le 'print "\x{00e9}" eq "\x{0065}\x{0301}" ? "t" : "f"'
f

The same issue exists in your database. Here’s an example from PostgreSQL:

try=# select U&'\00E9' = U&'\0065\0301';
 ?column? 
----------
 f
(1 row)

If you have a user enter data in your Web app using combining characters, and then another does a search with canonical characters, the search will fail. This won’t do at all.

The solution is to normalize your Unicode data. In Perl, you can use Unicode::Normalize, a C/XS module that uses Perl’s internal Unicode tables to convert things as appropriate. For general use the NFC normalization is probably best:

use Unicode::Normalize;
$string = NFC $string;
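
As a quick sanity check, rerunning the earlier one-liner with both sides passed through NFC makes the comparison come out true:

% perl -MUnicode::Normalize -le 'print NFC("\x{00e9}") eq NFC("\x{0065}\x{0301}") ? "t" : "f"'
t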

PostgreSQL offers no normalization routines. However, the SQL standard mandates one (as of SQL 2008, at least). It looks like this:

<normalize function> ::= NORMALIZE <left paren> <character value expression> [ <comma> <normal form> [ <comma> <normalize function result length> ] ] <right paren>
<normal form> ::= NFC | NFD | NFKC | NFKD

The second argument defaults to NFC and the third, which specifies a maximum length of the return value, is optional. The fact that it looks like a function means that we can use PL/PerlU to emulate it in PostgreSQL until a proper implementation makes it into core. Here’s how:

CREATE OR REPLACE FUNCTION NORMALIZE(
    string TEXT,
    form   TEXT,
    maxlen INT
) RETURNS TEXT LANGUAGE plperlu AS $$
    use Unicode::Normalize 'normalize';
    my ($string, $form, $maxlen) = @_;
    my $ret = normalize($form, $string);
    elog(ERROR, 'Normalized value is too long') if length $ret > $maxlen;
    return $ret;
$$;

CREATE OR REPLACE FUNCTION NORMALIZE(
    string TEXT,
    form   TEXT
) RETURNS TEXT LANGUAGE plperlu AS $$
    use Unicode::Normalize 'normalize';
    return normalize($_[1], $_[0]);
$$;

CREATE OR REPLACE FUNCTION NORMALIZE(
    string TEXT
) RETURNS TEXT LANGUAGE plperlu AS $$
    use Unicode::Normalize 'normalize';
    return normalize('NFC', shift);
$$;

I wrote a few tests to make sure it was sane:

SELECT U&'\0065\0301' as combined,
       char_length(U&'\0065\0301'),
       NORMALIZE(U&'\0065\0301') as normalized,
       char_length(NORMALIZE(U&'\0065\0301'));

SELECT NORMALIZE(U&'\0065\0301', 'NFC')  AS NFC,
       NORMALIZE(U&'\0065\0301', 'NFD')  AS NFD,
       NORMALIZE(U&'\0065\0301', 'NFKC') AS NFKC,
       NORMALIZE(U&'\0065\0301', 'NFKD') AS NFKD
;

SELECT NORMALIZE(U&'\0065\0301', 'NFC', 1)  AS NFC,
       NORMALIZE(U&'\0065\0301', 'NFD', 2)  AS NFD,
       NORMALIZE(U&'\0065\0301', 'NFKC', 1) AS NFKC,
       NORMALIZE(U&'\0065\0301', 'NFKD', 2) AS NFKD;

SELECT NORMALIZE(U&'\0065\0301', 'NFD', 1);

And the output:

 combined | char_length | normalized | char_length 
----------+-------------+------------+-------------
 é        |           2 | é          |           1
(1 row)

 nfc | nfd | nfkc | nfkd 
-----+-----+------+------
 é   | é   | é    | é
(1 row)

 nfc | nfd | nfkc | nfkd 
-----+-----+------+------
 é   | é   | é    | é
(1 row)

psql:try.sql:45: ERROR:  error from Perl function "normalize": Normalized value is too long at line 5.

Cool! So that’s fairly close to the standard. The main difference is that the form argument must be a string instead of a constant literal. But PostgreSQL would likely support both. The length argument is also a literal, and can be 10 characters or 64 bytes, but for our purposes, this is fine. The only downside to it is that it’s slow: PostgreSQL must convert its text value to a Perl string to pass to the function, and then Unicode::Normalize turns it into a C string again to do the conversion, then back to a Perl string which, in turn, is returned to PostgreSQL and converted back into the text form. Not the quickest process, but may prove useful anyway.

Update: 1 Hour Later

Note that this issue applies when using full text search, too. Alas, it does not normalize Unicode characters for you:

try=# select to_tsvector(U&'\00E9clair') @@ to_tsquery(U&'\0065\0301clair');
 ?column? 
----------
 f
(1 row)

But normalizing with the functions I introduced does work:

try=# select to_tsvector(U&'\00E9clair') @@ to_tsquery(normalize(U&'\0065\0301clair'));
 ?column? 
----------
 t
(1 row)

So yes, this really can be an issue in your applications.


CITEXT Patch Submitted to PostgreSQL Contrib

On Friday, I submitted a patch to add a locale-aware case-insensitive text type as a PostgreSQL contrib module. This has been among my top requests as a feature for PostgreSQL ever since I started using it. And as I started work on yet another application recently, I decided to look into what it would take to just make it happen myself. I’m hopeful that everyone will be able to benefit from this bit of yak shaving.

I started out by trying to use the citext project on pgFoundry, but immediately identified two issues with it:

  1. It does not install properly on PostgreSQL 8.3 (it uses a lot of casts that were removed in 8.3); and
  2. It only case-insensitively compares ASCII characters, so accented multibyte characters compare just as they do in the text type.

So I set about creating my own new type, originally called “lctext”, since what it does is not a true case-insensitive comparison; rather, it lowercases the text and then compares, just as millions of us developers already do by using LOWER() on both sides of a query:

SELECT *
  FROM tab
 WHERE lower(col) = LOWER(?);

I just finally got fed up with this. The last straw for me was wanting to create a primary key that would be stored case-insensitively, which would have required that I create two indexes for it: One created for the primary key by default, the other a functional UNIQUE INDEX on LOWER(col), which would just be stupid.

So this patch is the culmination of my work to make a locale-aware case-insensitive text type. It’s locale-aware in that it uses the same locale-aware string comparison code as that used for the text type, and it uses the same C function as LOWER() uses. The nice thing is that it works just as if you had used LOWER() in all your SQL, but now you don’t have to.

So while this is not a true case-insensitive text type, in the sense that it doesn’t do a case-insensitive comparison but changes the case and then compares, it is likely more efficient than the LOWER() workaround that we’ve all been using for years, and it’s neater, too. Using this type, it will now be much easier to create, e.g., an email domain, like so:

CREATE OR REPLACE FUNCTION is_email(text)
RETURNS BOOLEAN
AS $$
    use Email::Valid;
    return TRUE if Email::Valid->address( $_[0] );
    return FALSE;
$$ LANGUAGE 'plperlu' STRICT IMMUTABLE;

CREATE DOMAIN email AS CITEXT CHECK ( is_email( value ) );

No more nasty workarounds to account for the lack of case-insensitive comparisons for text types. It works great for time zones and other data types that are defined to compare case-insensitively:

CREATE OR REPLACE FUNCTION is_timezone( tz TEXT ) RETURNS BOOLEAN as $$
BEGIN
  PERFORM now() AT TIME ZONE tz;
  RETURN TRUE;
EXCEPTION WHEN invalid_parameter_value THEN
  RETURN FALSE;
END;
$$ language plpgsql STABLE;

CREATE DOMAIN timezone AS CITEXT
CHECK ( is_timezone( value ) );

And that should just work!

I’m hoping that this is accepted during the July CommitFest. Of course I will welcome suggestions for how to improve it. Since I sent the patch, for example, I’ve been thinking that I should suggest in the documentation that it is best used for short text entries (say, up to 256 characters), rather than longer entries (like email bodies or journal articles), and that for longer entries, one should really make use of tsearch2, instead. There are other notes and caveats in the patch submission. Please do let me know what you think.


Intelligent MySQL Configuration

James Duncan Davidson’s Configuring MySQL on MacOS X post earlier today reminded me that I wanted to blog about the configuration I came up with while installing MySQL 5 on my box. Nothing has irritated me more than when MySQL’s syntax has violated the ANSI SQL standards in the most blatant ways, or when transactions have appeared to work, but mysteriously not worked. Yes, I use Duncan’s settings to make sure that the MySQL box on my PowerBook only listens on local sockets, but I additionally add this configuration to /etc/my.cnf:

[mysqld]
sql-mode=ansi,strict_trans_tables,no_auto_value_on_zero,no_zero_date,no_zero_in_date,only_full_group_by
character-set-server=utf8
default-storage-engine=InnoDB
default-time-zone=utc

That last configuration can actually only be added after running this command:

/usr/local/mysql/bin/mysql_tzinfo_to_sql /usr/share/zoneinfo | \
mysql -u root mysql

But then the upshot is that I have everything configured to be as compliant as possible (although the time zone stuff is just my personal preference):

mysql> SELECT @@global.sql_mode;
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| @@global.sql_mode                                                                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| REAL_AS_FLOAT,PIPES_AS_CONCAT,ANSI_QUOTES,IGNORE_SPACE,ONLY_FULL_GROUP_BY,ANSI,NO_AUTO_VALUE_ON_ZERO,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> show variables like '%character_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | utf8   |
| character_set_connection | utf8   |
| character_set_database   | utf8   |
| character_set_filesystem | binary |
| character_set_results    | utf8   |
| character_set_server     | utf8   |
| character_set_system     | utf8   |
+--------------------------+--------+
7 rows in set (0.01 sec)

mysql> show variables like '%table_ty%';
+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| table_type    | InnoDB |
+---------------+--------+
1 row in set (0.00 sec)

mysql> show variables like 'time_zone%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| time_zone     | utc   |
+---------------+-------+
1 row in set (0.00 sec)

Now that’s the way things should be! Or at least as close as I’m going to get to it in MySQL 5.

Update 2006-11-04: Ask Bjørn Hansen turned me on to the “strict_trans_tables” mode, which prevents MySQL from trying to guess what you mean when you leave out a value for a required column. So I’ve now updated my configuration with sql-mode=ansi,strict_trans_tables.

Update 2009-11-05: I found myself configuring MySQL again today, and there were some other settings I found it useful to add:

  • no_auto_value_on_zero forces AUTO_INCREMENT columns to increment only when inserting a NULL, rather than when inserting a NULL or a zero(!).
  • no_zero_date and no_zero_in_date disallow dates where the year or month is set to 0.
  • only_full_group_by requires that non-aggregated columns in a select list be included in a GROUP BY clause, as is mandated by the SQL standard. This only applies if an aggregate function is used in a query.

I’ve added all of these to the example above.


Splitting Words in Perl

I’ve created a new module, Text::WordDiff, now on its way to CPAN, to show the differences between two documents using words as tokens, rather than lines as Text::Diff does. I plan to use it in Bricolage to give people a change tracking-type view (as seen in word processors) comparing two versions of a document. Fortunately, Algorithm::Diff makes this extremely easy to do. My only real problem was figuring out how to tokenize a string into words.
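
Usage is meant to be a thin wrapper around Algorithm::Diff, along these lines (treat the exact function name and options as a sketch of the intended interface rather than final documentation):

use Text::WordDiff;

# Compare two files word by word; STYLE selects the output format.
my $diff = word_diff 'old_version.txt', 'new_version.txt', { STYLE => 'HTML' };
print $diff;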

After looking at discussions in The Perl Cookbook and Mastering Regular Expressions, I settled on using Friedl’s pattern for identifying the starting boundary of words, which is qr/(?<!\w)(?=\w)/msx. This pattern will turn the string “this is O’Reilly’s string” into the following tokens:

[
    q{this },
    q{is },
    q{O'},
    q{Reilly'},
    q{s },
    q{string},
];

So it’s imperfect, but it works well enough for me. I’m thinking of using the Unicode character class for words, instead, at least for more recent versions of Perl that understand them (5.8.0 and later?). That would be /(?<!\p{IsWord})(?=\p{IsWord})/msx. The results using that regular expression are the same.
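
In case it isn’t obvious how the pattern gets used, here’s the tokenization in miniature; splitting on the zero-width word-start boundary yields exactly the token list above:

# A zero-width match at the very start of the string produces no leading empty field.
my @tokens = split m/(?<!\w)(?=\w)/msx, q{this is O'Reilly's string};
# ("this ", "is ", "O'", "Reilly'", "s ", "string")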

But otherwise, I’m not sure whether or not this is the best approach. I think that it’s good enough for the general cases I have, and the matching of words in and of themselves is not that important. What I mean is that, as long as most tokens are words, it’s okay with me if some, such as “O’”, “Reilly’”, and “s” in the above example, are not words. What I don’t know is how well it’ll work for non-Roman glyphs, such as in Japanese or Korean text. I tried a test on a Korean string I have lying around (borrowed from the Encode.pm test suite), but it didn’t split it up at all (with use utf8;).

So what do you think? Does Text::WordDiff work for your text? Is there a better and more general solution for tokenizing the words in a string?


No UTF-8 Support on Windows?

This just blows. It will be a while before Bricolage runs on Windows, then. The PostgreSQL team is understandably reluctant to simply include the whole ICU library in PostgreSQL. Maybe it could be compiled into the binaries, though?


Lessons Learned with Perl and UTF-8

I learned quite a lot last week as I was making Bricolage much more Unicode-aware. Bricolage has always managed Unicode content and stored it in a PostgreSQL Unicode-encoded database. And by “Unicode” I of course mean “UTF-8”. By far the biggest nightmare was figuring out the bug with Apache::Util::escape_html(), but ultimately it came down to an interesting lesson.

Why was I making Bricolage Unicode-aware? Well, it all started with a bug report from Kang-min Liu (a.k.a. “Gugod”). I had naïvely thought that if strings were Unicode that Perl would know it and do the right thing. It turns out I was wrong. Perl assumes that everything is binary unless you tell it otherwise. This means that Perl operators such as length and substr will count bytes instead of characters. And in the case of Unicode, where characters can be multiple bytes, this can cause serious problems. Not only were strings improperly concatenated mid-character for Gugod, but PostgreSQL could refuse to accept such strings, since a chopped-up multibyte character isn’t valid Unicode!
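
Here’s the problem in miniature: the byte string and the decoded string hold the “same” text, but length sees them very differently:

use Encode qw(decode);

my $bytes = "caf\xC3\xA9";             # the UTF-8 bytes for "café"
my $chars = decode('UTF-8', $bytes);   # Perl's internal (character) form

print length $bytes, "\n";             # 5 -- counting bytes
print length $chars, "\n";             # 4 -- counting characters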

So I had to make some decisions: Either stop using Perl operators that count bytes, or let Perl know that all the strings that Bricolage deals with are Unicode strings. The former wasn’t really an option, of course, since users can specify that certain content fields be a certain length of characters. So with a lot of testing help from Gugod and his Bricolage install full of multibyte characters, I set about doing so. The result is in the recently released Bricolage 1.8.2 and I’m blogging what I learned for both your reference and mine.

Perl considers its internal representation of strings to be UTF-8 strings, and it knows what variables contain valid UTF-8 strings because they have a special flag set on them, called, strangely enough, utf8. This flag isn’t set by default, but can be set in a number of ways. The ways I’ve found so far are:

  • Using Encode::decode() to decode a string from binary to Perl’s internal representation. The use of the word “decode” here had confused me for a while, because I thought it was a special encoding. But the truth is that it’s not. Strings can have any number of encodings, such as “ISO-8859-1”, “GB2312”, “EUC-KR”, “UTF-8”, and the like. But when you “decode” a string, you’re telling Perl that it’s not any of those encodings, but Perl’s own representation. I was confused because Perl’s internal representation is UTF-8, which is an encoding. But really it’s not UTF-8; it’s “utf8”, which isn’t an encoding, but Perl’s own thing.

  • Cheat: Use Encode::_set_utf8_on(). This private function is nevertheless documented by the Encode module, and therefore usable. What it does is simply turn on the utf8 flag on a variable. You need be confident that the variable contains only valid UTF-8 characters, but if it does, then you should be pretty safe.

  • Using the three-argument version of open, such as

    open my $fh, "<:utf8", "/foo/bar" or die "Cannot open file: $!\n";
    

    Now when you read lines from this file, they will automatically be decoded to utf8.

  • Using binmode to set the mode on a file handle:

    binmode $fh, ":utf8";
    

    As with the three-argument version of open, this forces Perl to decode the strings read from the file handle.

  • use utf8;. This Perl pragma indicates that everything within its scope is UTF-8, and therefore should be decoded to utf8.

So I started applying these approaches in various places. The first thing I did was to set the utf8 flag on data coming from the browser with Encode::_set_utf8_on(). Shitty browsers can of course send shitty data, but I’m deciding, for the moment at least, to trust browsers to send only UTF-8 when I tell them that’s what I want. This solved Gugod’s immediate problem, and I happily closed the bug. But then he started to run into places where strings appeared properly in some places but not in others. We spent an entire day (night for Gugod–I really appreciated the help!) tracking down the problem, and there turned out to be two of them. One was the bug with Apache::Util::escape_html() that I’ve described elsewhere, but the other proved more interesting.

It seems that if you concatenate a UTF-8 string with the utf8 flag turned on with a UTF-8 string without utf8 turned on, the text in the unflagged variable turns to crap! I have no idea why this is, but Gugod noticed that strings pulled into the UI from the Bricolage zh_tw localization library simply didn’t display properly. I had him add use utf8; to the zh_tw module, and the problem went away!
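
Here’s a small demonstration of the effect; the unflagged bytes apparently get treated as Latin-1 characters when the strings are combined, so every multibyte character in the unflagged string turns into two bogus characters:

use Encode qw(decode);

my $flagged   = decode('UTF-8', "caf\xC3\xA9");   # "café", utf8 flag on (4 characters)
my $unflagged = "na\xC3\xAFve";                   # raw UTF-8 bytes for "naïve", flag off

my $mixed = $flagged . $unflagged;
print length $mixed, "\n";   # 10, not the expected 9: the 0xC3 0xAF bytes were
                             # upgraded as two Latin-1 characters instead of one "ï"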

So the lesson learned here is: If you’re going to make Perl strings Unicode-aware, then all of your Perl strings need to be Unicode-aware. It’s an all or nothing kind of thing.

So while setting the utf8 flag on browser submits and adding use utf8; to the localization modules got us part of the way toward a solution, it turned out to be trickier than I expected to get the utf8 flag set on everything. The places I needed to get it working were in the UI Mason components, in templates, and in strings pulled from the database.

It took a bit of research, but I think I successfully figured out how to make the UI Mason components UTF-8 aware. I just added preamble => "use utf8\n;" to the creation of the Mason interpreter. This gets passed on to its compiler, and now that string is added to the beginning of every template. This made things behave better in the UI. I applied the same approach to the interpreter created for Mason templates with equal success.
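
For reference, the relevant bit looks something like this (the paths are made up; the point is the preamble parameter, which the interpreter hands on to its compiler):

use HTML::Mason::Interp;

my $interp = HTML::Mason::Interp->new(
    comp_root => '/path/to/comp_root',   # hypothetical paths
    data_dir  => '/path/to/data_dir',
    preamble  => "use utf8\n;",          # added to the top of every compiled component
);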

I’m less confident that I pulled it off for the HTML::Template and Template Toolkit templating architectures. In a discussion on the templates mailing list, Andy Wardley suggested that it wasn’t currently possible. But I wasn’t so sure. It seemed to me that, since Bricolage reads in the templates and asks TT to execute them within a certain scope, I could just set the mode to utf8 on the file handle and then execute the template within the scope of a use utf8; statement. So that’s what I did. Feedback on whether it works or not would be warmly welcomed.
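
In outline, the approach amounts to something like this (a simplified sketch, not the actual burner code; the file name and variables are hypothetical):

use utf8;
use Template;

my $template_file = 'story.tmpl';   # hypothetical template
my %vars          = ();             # template variables

# Read the template with a utf8 layer so its contents are decoded...
open my $fh, '<:utf8', $template_file or die "Cannot open $template_file: $!\n";
my $tmpl = do { local $/; <$fh> };

# ...then hand the decoded text to Template Toolkit as a scalar reference.
my $tt = Template->new;
my $output;
$tt->process(\$tmpl, \%vars, \$output) or die $tt->error;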

I tried a similar approach with the HTML::Template burner. Again, the burner reads the templates from files and passes them to HTML::Template for execution (as near as I could tell, anyway; I’m not an HTML::Template template user). Hopefully it’ll just work.

So that just left the database. Since the database is Unicode-only, all I needed to do was to turn on the utf8 flag for all content pulled from the database. Amazingly, this hasn’t come up as an issue for people very much, because DBI doesn’t do anything about Unicode. I picked up an older discussion started by Matt Sergeant on the dbi-dev mail list, but it looks like it might be a while before DBI has fast, integrated support for turning utf8 on and off for various database handles and columns. I look forward to it, though, because it’s likely to be very efficient. I greatly look forward to seeing the results of Tim’s work in the next release of DBI. I opened another bug report to remind myself to take advantage of the new feature when it’s ready.

So in the meantime, I needed to find another solution. Fortunately, my fellow PostgreSQL users had run into it before, and added what I needed to DBD::Pg back in version 1.22. The pg_enable_utf8 database handle parameter forces the utf8 flag to be turned on for all string data returned from the database. I added this parameter to Bricolage, and now all data pulled from the database is utf8. And so are the UI components, templates, localization libraries, and data submitted from browsers. I think that nailed everything, but I know that Unicode issues are a slippery slope. I can’t wait until I have to deal with them again!
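
In DBI terms it’s just one more attribute on the connection (the DSN and credentials here are placeholders):

use DBI;

my $dbh = DBI->connect(
    'dbi:Pg:dbname=bric',   # placeholder DSN
    $user, $pass,           # placeholder credentials
    {
        RaiseError     => 1,
        pg_enable_utf8 => 1,   # flag all strings coming back from the database as utf8
    },
);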

Not.


Always use the C Locale with PostgreSQL

I ran into the weirdest bug with Bricolage today. We use the LIKE operator to do string comparisons throughout Bricolage. In one usage, the code checks to see if there’s a record in the “keyword” table before creating it. This is because keyword names are unique. So it looks for a keyword record like this:

SELECT name, screen_name, sort_name, active
  FROM   keyword
 WHERE  LOWER(name) LIKE ?

If it finds a keyword, it creates a relationship between it and a story document. If it doesn’t find it, it creates a new keyword record and then associates the new keyword with a story document.

However, one of our customers was getting SQL errors when attempting to add keywords to a story, and it took me a while to figure out what the problem was. This is because I couldn’t replicate the problem until I started trying to create multibyte keywords. Now, Bricolage uses a UTF-8 PostgreSQL database, but something very odd was going on. When I attempted to add the keyword “북한의”, it didn’t find an existing keyword, but then threw an error when the unique index thought it existed already! Running tests in psql, I found that = would find the existing record, but LIKE wouldn’t!

Once I posted a query on the pgsql-general list, someone noticed that the record returned when using = actually had a different value than the one queried for. I had searched for “북한의”, but the database found “국방비”. It seems that = compares bytes, while LIKE compares characters. The error I was getting meant that the unique index was also using bytes. And because of the locale used when initdb was run, PostgreSQL thought that they were the same!

The solution to this problem, it turns out, was to dump the database, shut down PostgreSQL, move the old data directory, and create a new one with initdb --locale=C. I then restored the database, and suddenly = and LIKE (and the unique index) were doing the same thing. Hallelujah!

Naturally, I’m not the first to notice this issue. It’s particularly an issue with RedHat Linux installations, since RedHat has lately decided to set a system-wide locale. In my case, it was “en_US.UTF-8.” This apparently can break collations in other languages, and this affects indices, of course. So I was led to wonder if initdb shouldn’t default to a locale of C instead of the system default. What do you think?

You can read the whole thread here.
