Best practice for handling UTF-8 when percent-encoding?


In my previous post I summarized the current state of Percent-encoding in Perl. One of my conclusions was that the perfect percent-encoding solution would automatically handle UTF-8 encoding, using logic like this:

utf8::encode $string if utf8::is_utf8 $string;

Respected Plack author miyagawa quickly responded with a post of his own, arguing that the above code is a bug, even though the pattern is already in wide use: it appears in Catalyst, CGI.pm (and by extension CGI::Application and other frameworks), as well as Mojo.

In one sense, he’s right. The pattern goes against the advice found in the official perlunifaq documentation which states that

It’s better to pretend that the internal format is some unknown encoding, and that you always have to encode and decode explicitly.

In other words: don’t use the “is_utf8()” function.
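To see why the flag check is risky, here is a small sketch of my own (not taken from miyagawa’s post): two strings holding the identical character can carry different internal flag states, and the pattern above encodes them differently.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape);

# Two strings containing the identical character: "é" (U+00E9).
my $latin1   = "\xE9";       # stored as a single byte; utf8 flag off
my $upgraded = "\xE9";
utf8::upgrade($upgraded);    # same character, but the utf8 flag is now on

# Apply the flag-based pattern to each:
my @escaped;
for my $s ($latin1, $upgraded) {
    my $copy = $s;
    utf8::encode $copy if utf8::is_utf8 $copy;
    push @escaped, uri_escape($copy);
}
print "$escaped[0]\n";   # %E9
print "$escaped[1]\n";   # %C3%A9
```

Same character, two different percent-encodings, determined by an internal bookkeeping flag the programmer never set deliberately: exactly the kind of surprise perlunifaq warns about.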

Before drawing a conclusion about whether this code pattern is the best design in practice, let’s look at some related facts about the matter.

  • The W3C is clear that it is a best practice to convert text to UTF-8 before percent encoding it.
  • However, as Miyagawa points out, only newer protocols, since about 2005, are expected to follow that. The HTTP protocol is far older, and it is valid to pass binary data through HTTP query strings.
  • But conveniently, Perl reads binary data as 8-bit characters. You can read in /usr/bin/perl, pass it to uri_escape, and it will percent-encode it just fine— without blowing up on the high-bit bytes that some UTF-8 characters correspond to.
  • In practice, the automatic-encoding approach used by Catalyst and CGI.pm has not been problematic. From reviewing both bug queues, I don’t see any open or previously-closed bugs caused by the current behavior. It does seem reasonable and logical that it should be safe to UTF-8-encode text that is already marked as UTF-8.
  • People would reasonably expect the percent-encode/decode process to be a symmetric round trip— you should get back exactly the data you started with. When UTF-8 encoding is added to the process, this isn’t necessarily the case, since UTF-8 encoding happens only before percent-encoding; the reverse does not happen when percent-decoding. The one-way trip to UTF-8 is probably what you wanted, but since the conversion is automatic, the intent of the programmer is not explicit.
  • The most popular percent-encoding solutions for Perl— CGI.pm and URI::Escape— both pre-date proper Unicode support in Perl, and both include built-in code to re-implement UTF-8 functionality in case you are using Perl 5.6 which lacks proper Unicode support. So in the context of designing those modules, it wasn’t really an option to tell people just call encode($string) if they wanted to UTF-8 encode their text before it was URI-encoded.
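As a quick check of the binary-data point above, here is a sketch using URI::Escape’s standard uri_escape/uri_unescape interface:

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape uri_unescape);

# Arbitrary binary bytes, including high-bit ones: Perl treats them
# as 8-bit characters and percent-encodes them without complaint.
my $binary  = "\x00\x7F\x80\xFF";
my $escaped = uri_escape($binary);
print "$escaped\n";   # %00%7F%80%FF

# And the round trip is exact: we get back the bytes we started with.
die "round trip failed" unless uri_unescape($escaped) eq $binary;
```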

So, then, which of these two designs represents a best practice for handling UTF-8 encoding in combination with percent-encoding in Perl?

  1. First option: the popular in-practice solution of UTF-8-encoding the data if the UTF-8 flag is set. This approach automatically handles the recommended practice of UTF-8 encoding, so users don’t have to think about it.
  2. Second option: UTF-8 handling should not be built into a percent-encoding solution. Following the advice of perlunifaq, an ideal solution would not check the UTF-8 flag, but would instead offer clear documentation advising the user about best practices for UTF-8-encoding text before it is percent-encoded. This design cleanly externalizes all the character-encoding issues that sometimes get bundled with percent-encoding, such as translating from EBCDIC or UTF-16 surrogate pairs. It also gives users a bit of education about UTF-8 best practices which they may be able to apply in other areas. As long as UTF-8 handling is made automatic, programmers can remain in the dark about the best practice of explicitly handling character encodings themselves. The solution might be as simple as documenting this in the synopsis:

use Encode 'encode_utf8';
$uri = uri_percent_encode(encode_utf8($string));

Note that URI::Escape essentially already has the code part of the second alternative in place. It offers percent-encoding without any character encoding, as well as uri_escape_utf8, which is basically just sugar for the code sample above. The URI::Escape documentation explains in precise technical terms what the two alternatives do, but offers very little guidance on which to choose—the plain one or the UTF-8 one? It doesn’t address which reflects a best practice, or cover the possibility of any drawbacks to choosing the UTF-8 version.

A Google code search shows about 10x more hits for “uri_escape()” than for “uri_escape_utf8()”. I suspect that many of those cases would benefit from being able to handle UTF-8 characters, but the programmers weren’t really educated about when to choose that option.

Considering that masses of programmers aren’t suddenly going to become UTF-8 experts, it’s clear to me that an ideal solution should go with one of the options above: either try to handle UTF-8 automatically, or offer strong guidance in the percent-encoding documentation about best practices regarding character encodings. I think that with clear, informative documentation, the second option could be the better way to go.
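To make the “sugar” claim concrete, here is a quick sketch showing that uri_escape_utf8 produces the same result as encoding explicitly and then escaping:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use URI::Escape qw(uri_escape uri_escape_utf8);

my $string = "caf\x{E9}";   # "café" as a Perl character string

my $explicit = uri_escape(encode_utf8($string));  # encode, then escape
my $sugar    = uri_escape_utf8($string);          # the shortcut

print "$explicit\n";   # caf%C3%A9
die "not equivalent" unless $explicit eq $sugar;
```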

What do you think?

1 Comment

You seem so convinced that Catalyst has had no bugs about this utf8::encode, but IMO that should not be true - there have been lots of complaints about this utf8-related behavior in Catalyst, at least as I recall from my time in the #catalyst IRC channel.

If I understand it right, Catalyst doesn't do this kind of utf8 handling by default, but provides it through the "Unicode" or "Encoding" plugin.

And here's a quote from the Catalyst::Plugin::Unicode pod:

Note that this plugin tries to autodetect if your response is encoded into characters before trying to encode it into a byte stream. This is *bad* as sometimes it can guess wrongly and cause problems.

As an example, latin1 characters such as é (e-acute) will not actually cause the output to be encoded as utf8.

See that it admits that what it does is bad?

Also, your second option sounds like leaving the users to "force" encoding before passing to the function. On one hand, modules such as URI deal with pretty much user-facing data (i.e. you see the exact value in the browser) and should be considered as I/O, so per Perl's Unicode best practice of "decode on input and encode on output", it's good to encode before passing to the URI-encoding module.

However, the URI escaping itself is *encoding*, and it would be somewhat nasty to force users to do the encoding before passing to it, like you suggest in the post. For this situation, providing a function that takes strings and automatically encodes them as utf8 inside is not a bad idea either, like URI::Escape does with uri_escape_utf8. The point is that the "automatic" encoding should NOT rely on the utf8 flag at all, like perlunifaq says: that way latin-1 range characters get broken.
