Apache::Util::escape_html() Doesn't Like Perl UTF-8 Strings

I got bitten by a bug in Apache::Util's escape_html() function in mod_perl 1. It seems that it doesn't like Perl's Unicode-encoded strings! This patch demonstrates the issue (be sure that your editor understands UTF-8):

--- modperl/t/net/perl/util.pl.~1.18.~	Sun May 25 03:54:08 2003
+++ modperl/t/net/perl/util.pl	Thu Sep  9 19:38:40 2004
@@ -74,6 +74,25 @@
 
 #print $esc_2;
 test ++$i, $esc eq $esc_2;
+
+# Make sure that escape_html() understands multibyte characters.
+my $utf8 = '<專輯>';
+my $esc_utf8 = '&lt;專輯&gt;';
+my $test_esc_utf8 = Apache::Util::escape_html($utf8);
+test ++$i, $test_esc_utf8 eq $esc_utf8;
+#print STDERR "Compare '$test_esc_utf8'\n     to '$esc_utf8'\n";
+
+eval { require Encode };
+unless ($@) {
+    # Make sure escape_html() properly handles strings with Perl's
+    # Unicode encoding.
+    $utf8 = Encode::decode_utf8($utf8);
+    $esc_utf8 = Encode::decode_utf8($esc_utf8);
+    $test_esc_utf8 = Apache::Util::escape_html($utf8);
+    test ++$i, $test_esc_utf8 eq $esc_utf8;
+    #print STDERR "Compare '$test_esc_utf8'\n     to '$esc_utf8'\n";
+}
+
 use Benchmark;
 
 =pod

If I enable the print statements and look at the log, I see this:

Compare '&lt;專輯&gt;'
     to '&lt;專輯&gt;'
Compare '&lt;å°è¼¯&gt;'
     to '&lt;專輯&gt;'

The first escape appears to work correctly, but once I decode the string to Perl's Unicode representation, you can see how badly escape_html() munges the text!
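One plausible reading of that log can be sketched in plain Perl, without mod_perl. The sketch assumes (the patch does not confirm this) that escape_html() hands back the raw UTF-8 octets with Perl's UTF-8 flag turned off. The variable names and the hard-coded octet string are illustrative only. When such an octet string is interpolated next to a flagged character string, Perl upgrades the octets as if each one were a Latin-1 character, which produces the same kind of garbage seen in the second "Compare" line.

```perl
use strict;
use warnings;
use Encode ();

# ASSUMPTION: this stands in for what escape_html() might return for a
# decoded string: the UTF-8 octets of the escaped text, UTF-8 flag off.
my $octets = "&lt;\xE5\xB0\x88\xE8\xBC\xAF&gt;";

# The expected value from the test, decoded to a character string (flag on).
my $expected = Encode::decode_utf8($octets);

# Interpolating both into one string forces Perl to upgrade $octets,
# treating each octet as a separate Latin-1 character.
my $line = "Compare '$octets'\n     to '$expected'\n";

# Print through a UTF-8 layer so the result is visible on a UTF-8 terminal.
binmode STDOUT, ':encoding(UTF-8)';
print $line;
```

On a UTF-8 terminal the first quoted value shows up as mojibake and the second as the intended characters. If the assumption holds, decoding the return value with Encode::decode_utf8() before using it would be one way to get a proper character string back.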

Curiously, both tests fail, even though the first conversion appears to be correct. This could be due to the behavior of eq, though I'm not sure why. But it's the second test that's the more interesting one, since that's where the text really gets screwed up.
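As for eq, here is a small sketch of how it treats byte strings and character strings. It compares code points, so a string of UTF-8 octets is never equal to its decoded form, even though both may render identically on a terminal. The variable names are made up for illustration. This would account for the second failure if escape_html() returns undecoded octets (an assumption), but it does not explain the first one.

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "\xE5\xB0\x88\xE8\xBC\xAF";      # UTF-8 octets for two CJK characters, flag off
my $chars = Encode::decode_utf8($bytes);     # the same text as two characters, flag on

# Octets vs. characters: the octets are compared as six Latin-1 characters.
my $mixed_equal = ($bytes eq $chars) ? 1 : 0;

# Octets vs. identical octets: plain byte-for-byte comparison.
my $bytes_equal = ($bytes eq "\xE5\xB0\x88\xE8\xBC\xAF") ? 1 : 0;

# Characters vs. characters: decode both sides before comparing.
my $chars_equal = (Encode::decode_utf8($bytes) eq $chars) ? 1 : 0;

print "octets eq chars:  $mixed_equal\n";
print "octets eq octets: $bytes_equal\n";
print "chars eq chars:   $chars_equal\n";
```

The practical upshot is to make sure both sides of a comparison are in the same form, either both octets or both decoded characters, before trusting eq.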

Backtalk

Mark Fowler wrote:

It's very hard to tell what's going on from what gets printed out with UTF-8. That's because it's hard to tell whether the Perl scalar you printed contained the right thing, or whether it just printed a stream of bytes that happens to make your terminal render the correct thing. I have two bits of advice:

1. Use Devel::Peek. Devel::Peek is the only thing I've found that'll print the internals of a Perl scalar, to show you exactly what bytes and flags are set in it.

2. Make use of my module Test::utf8 to check that things are encoded correctly. I wrote it because I got too confused without it ;-)
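A minimal sketch of the Devel::Peek suggestion follows. It uses only modules that ship with Perl, and the variable names are illustrative. Dump() writes to STDERR; the thing to look for is whether UTF8 appears in the FLAGS line and what the PV line shows.

```perl
use strict;
use warnings;
use Devel::Peek;    # exports Dump()
use Encode ();

my $octets = "<\xE5\xB0\x88\xE8\xBC\xAF>";   # UTF-8 octets, flag off
my $chars  = Encode::decode_utf8($octets);   # decoded characters, flag on

Dump($octets);   # FLAGS should not include UTF8
Dump($chars);    # FLAGS should include UTF8

# The same facts without reading the dump: the flag and the length differ.
my $octets_flagged = utf8::is_utf8($octets) ? 1 : 0;
my $chars_flagged  = utf8::is_utf8($chars)  ? 1 : 0;
print "octets: flagged=$octets_flagged length=", length($octets), "\n";
print "chars:  flagged=$chars_flagged length=",  length($chars),  "\n";
```

Both scalars can look the same when printed to a terminal, which is exactly the trap described above. Test::utf8 offers similar checks in test form (its documentation lists functions such as is_flagged_utf8 and is_sane_utf8), but it is not used here so that the sketch needs no CPAN install.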