Always use the C Locale with PostgreSQL
I ran into the weirdest bug with Bricolage today. We use the LIKE
operator to do string comparisions throughout Bricolage. In one usage, the
code checks to see if there's a record in the keyword
table before
creating it. This is because keyword names are unique. So it looks for a keyword
record like this:
SELECT name, screen_name, sort_name, active FROM keyword WHERE LOWER(name) LIKE ?
If it finds a keyword, it creates a relationship between it and a story document. If it doesn't find it, it creates a new keyword record and then associates the new keyword with a story document.
However, one of our customers was getting SQL errors when attempting to add
keywords to a story, and it took me a while to figure out what the problem
was. This is because I couldn't replicate the problem until I started trying
to create multibyte keywords. Now, Bricolage uses a UTF-8 PostgreSQL database,
but something very odd was going on. When I attempted to add the
keyword 북한의
, it didn't find an existing keyword, but then threw an
error when the unique index thought it existed already! Running tests
in psql, I found that = would find the existing
record, but LIKE wouldn't!
Once I posted a query on the pgsql-general list, someone noticed that the record returned
when using = actually had a different value than was actually
queried for. I had searched for 북한의
, but the database found 국방비
. It seems that = compares bytes, while LIKE compares
characters. The error I was getting meant that the unique index was also using bytes. And
because of the locale used when initdb was run, PostgreSQL thought that they
actually were the same!
The solution to this problem, it turns out, was to dump the database, shut down
PostgreSQL, move the old data directory, and create a new one with initdb -locale=C.
I then restored the database, and suddenly = and LIKE (and the unique
index) were doing the same thing. Hallelujah!
Naturally, I'm not the first to notice this issue. It's particularly an issue with RedHat Linux
installations, since RedHat has lately decided to set a system-wide locale. In my case, it was en_US.UTF-8.
This apparently can break collations in other languages, and this affects indices, of course. So I
was led to wonder if initdb shouldn't default to a locale of C instead of
the system default. What do you think?
You can read the whole thread here.
Backtalk
Chris wrote:
Theory wrote: