I learned quite a lot last week as I was making Bricolage much more
Unicode-aware. Bricolage has always managed Unicode content and stored it in a
PostgreSQL Unicode-encoded database. And by “Unicode” I of course mean “UTF-8”.
By far the biggest nightmare was figuring out the bug with
Apache::Util::escape_html()
, but ultimately it came down to an interesting
lesson.
Why was I making Bricolage Unicode-aware? Well, it all started with a bug
report from Kang-min Liu (a.k.a. “Gugod”). I had naïvely thought that if
strings were Unicode that Perl would know it and do the right thing. It turns
out I was wrong. Perl assumes that everything is binary unless you tell it
otherwise. This means that Perl operators such as length
and substr
will
count bytes instead of characters. And in the case of Unicode, where characters
can be multiple bytes, this can cause serious problems. Not only were strings
improperly concatenated mid-character for Gugod, but PostgreSQL could refuse to
accept such strings, since a chopped-up multibyte character isn’t valid
Unicode!
So I had to make some decisions: Either stop using Perl operators that count
bytes, or let Perl know that all the strings that Bricolage deals with are
Unicode strings. The former wasn’t really an option, of course, since users can
specify that certain content fields be a certain length of characters. So with a
lot of testing help from Gugod and his Bricolage install full of multibyte
characters, I set about doing so. The result is in the recently released
Bricolage 1.8.2 and I’m blogging what I learned for both your reference and
mine.
Perl considers its internal representation of strings to be UTF-8 strings, and
it knows what variables contain valid UTF-8 strings because they have a special
flag set on them, called, strangely enough, utf8
. This flag isn’t set by
default, but can be set in a number of ways. The ways I’ve found so far are:
-
Using Encode::decode()
to decode a string from binary to Perl’s internal
representation. The use of the word “decode” here had confused me for a
while, because I thought it was a special encoding. But the truth is that
it’s not. Strings can have any number of encodings, such as “ISO-8859-1”,
“GB3212”, “EUC-KR”, “UTF-8”, and the like. But when you “decode” a string,
you’re telling Perl that it’s not any of those encodings, but Perl’s own
representation. I was confused because Perl’s internal representation is
UTF-8, which is an encoding. But really it’s not UTF-8, It’s “utf8”, which
isn’t an encoding, but Perl’s own thing.
-
Cheat: Use Encode::_set_utf8_on()
. This private function is nevertheless
documented by the Encode module, and therefore usable. What it does is
simply turn on the utf8
flag on a variable. You need be confident that the
variable contains only valid UTF-8 characters, but if it does, then you
should be pretty safe.
-
Using the three-argument version of open
, such as
open my $fh, "<utf8", "/foo/bar" or die "Cannot open file: $!\n"
Now when you read lines from this file, they will automatically be decoded
to utf8
.
-
Using binmode
to set the mode on a file handle:
As with the three-argument version of open
this forces Perl to decode the
strings read from the file handle.
-
use utf8;
. This Perl pragma indicates that everything within its scope is
UTF-8, and therefore should be decoded to utf8
.
So I started applying these approaches in various places. The first thing I did
was to set the utf8
flag on data coming from the browser with
Encode::_set_utf8_on()
. Shitty browsers can of course send shitty data, but
I’m deciding, for the moment at least, to trust browser to send only UTF-8 when
I tell them that’s what I want. This solved Gugod’s immediate problem, and I
happily closed the bug. But then he started to run into places where strings
appeared properly in some places but not in others. We spent an entire day
(night for Gugod–I really appreciated the help!) tracking down the problem, and
there turned out to be two of them. One was the the bug with
Apache::Util::escape_html()
that I’ve described elsewhere, but the other proved more interesting.
It seems that if you concatenate a UTF-8 string with the utf8
flagged turned
on with a UTF-8 string without utf8
turned on, the text in the unflagged
variable turns to crap! I have no idea why this is, but Gugod noticed that
strings pulled into the UI from the Bricolage zh_tw localization library simply
didn’t display properly. I had him add use utf8;
to the zh_tw module, and the
problem went away!
So the lesson learned here is: If you’re going to make Perl strings
Unicode-aware, then all of your Perl strings need to be Unicode-aware. It’s
an all or nothing kind of thing.
So while setting the utf8
flag on browser submits and adding use utf8;
to
the localization modules got us part of the way toward a solution, it turned out
to be trickier than I expected to get the utf8
flag set on everything. The
places I needed to get it working were in the UI Mason components, in templates,
and in strings pulled from the database.
It took a bit of research, but I think I successfully figured out how to make
the UI Mason components UTF-8 aware. I just added preamble => "use utf8\n;"
to
the creation of the Mason interpretor. This gets passed on to is compiler, and
now that string is added to the beginning of every template. This made things
behave better in the UI. I applied the same approach to the interpreter created
for Mason templates with equal success.
I’m less confident that I pulled it off for the HTML::Template and Template
Toolkit templating architectures. In a discussion on the templates mailing
list, Andy Wardley suggested that it wasn’t currently possible. But I wasn’t
so sure. It seemed to me that, since Bricolage reads in the templates and asks
TT to execute them within a certain scope, that I could just set the mode to
utf8
on the file handle and then execute the template within the scope of a
use utf8;
statement. So that’s what I did. Feedback on whether it works or not
would be warmly welcomed.
I tried a similar approach with the HTML::Template burner. Again, the burner
reads the templates from files and passes them to HTML::Template for execution
(as near as I could tell, anyway; I’m not an HTML::Template template user).
Hopefully it’ll just work.
So that just left the database. Since the database is Unicode-only, all I needed
to do was to turn on the utf8
flag for all content pulled from the database.
Amazingly, this hasn’t come up as an issue for people very much, because DBI
doesn’t do anything about Unicode. I picked up an older discussion started by
Matt Sergeant on the dbi-dev mail list, but it looks like it might be a while
before DBI has fast, integrated support for turning utf8
on and off for
various database handles and columns. I look forward to it, though, because it’s
likely to be very efficient. I greatly look forward to seeing the results of
Tim’s work in the next release of DBI. I opened another bug report to
remind myself to take advantage of the new feature when it’s ready.
So in the meantime, I needed to find another solution. Fortunately, my fellow
PostgreSQL users had run into it before, and added what I needed to DBD::Pg
back in version 1.22. The pg_enable_utf8
database handle parameter forces the
utf8
flag to be turned on for all string data returned from the database. I
added this parameter to Bricolage, and now all data pulled from the database is
utf8
. And so are the UI components, templates, localization libraries, and
data submitted from browsers. I think that nailed everything, but I know that
Unicode issues are a slippery slope. I can’t wait until I have to deal with them
again!
Not.
Looking for the comments? Try the old layout.