I ran into the weirdest bug with Bricolage today. We use the
operator to do string comparisions throughout Bricolage. In one usage, the
code checks to see if there's a record in the
keyword table before
creating it. This is because keyword names are unique. So it looks for a keyword
record like this:
SELECT name, screen_name, sort_name, active FROM keyword WHERE LOWER(name) LIKE ?
If it finds a keyword, it creates a relationship between it and a story document. If it doesn't find it, it creates a new keyword record and then associates the new keyword with a story document.
However, one of our customers was getting SQL errors when attempting to add
keywords to a story, and it took me a while to figure out what the problem
was. This is because I couldn't replicate the problem until I started trying
to create multibyte keywords. Now, Bricolage uses a UTF-8 PostgreSQL database,
but something very odd was going on. When I attempted to add the
북한의, it didn't find an existing keyword, but then threw an
error when the unique index thought it existed already! Running tests
psql, I found that
= would find the existing
Once I posted a query on the pgsql-general list, someone noticed that the record returned
= actually had a different value than was actually
queried for. I had searched for
북한의, but the database found
국방비. It seems that
= compares bytes, while
characters. The error I was getting meant that the unique index was also using bytes. And
because of the locale used when
initdb was run, PostgreSQL thought that they
actually were the same!
The solution to this problem, it turns out, was to dump the database, shut down
PostgreSQL, move the old data directory, and create a new one with
I then restored the database, and suddenly
LIKE (and the unique
index) were doing the same thing. Hallelujah!
Naturally, I'm not the first to notice this issue. It's particularly an issue with RedHat Linux
installations, since RedHat has lately decided to set a system-wide locale. In my case, it was
en_US.UTF-8. This apparently can break collations in other languages, and this affects indices, of course. So I
was led to wonder if
initdb shouldn't default to a locale of
C instead of
the system default. What do you think?
You can read the whole thread here.