Best practice for handling UTF-8 when percent-encoding?


In my previous post I summarized the current state of Percent-encoding in Perl. One of my conclusions was that the perfect percent-encoding solution would automatically handle UTF-8 encoding, using logic like this:

utf8::encode $string if utf8::is_utf8 $string;
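
For context, here is a minimal sketch of how that pattern typically sits inside a percent-encoder. The helper name is made up, and the real implementations in Catalyst, CGI.pm, and Mojo differ in detail, but the shape is the same:

sub my_uri_escape {
    my ($string) = @_;

    # Downgrade to UTF-8 bytes only when Perl's internal UTF-8 flag is set.
    utf8::encode $string if utf8::is_utf8 $string;

    # Percent-encode everything outside the RFC 3986 unreserved set.
    $string =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $string;
}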

Respected Plack author miyagawa quickly responded with a post of his own to say that the above code is a bug, although the pattern is already in wide use: it appears in Catalyst, CGI.pm (and by extension CGI::Application and other frameworks), as well as in Mojo.

In one sense, he's right. The pattern goes against the advice found in the official perlunifaq documentation, which states:

It's better to pretend that the internal format is some unknown encoding, and that you always have to encode and decode explicitly.

In other words: don't use the "is_utf8()" function.

Before drawing a conclusion about whether this code pattern is the best design in practice, let's look at some related facts about the matter.

  • The W3C is clear that it is a best practice to convert text to UTF-8 before percent-encoding it.
  • However, as Miyagawa points out, only newer protocols, since about 2005, are expected to follow that. The HTTP protocol is far older, and it is valid to pass binary data through HTTP query strings.
  • But conveniently, Perl reads binary data as 8-bit characters. You can read in /usr/bin/perl, pass it to uri_escape, and it will percent-encode it just fine-- without blowing up due to the high-bit characters that some UTF-8 characters correspond to (see part 1 of the sketch after this list).
  • In practice, the approach of automatic encoding used by Catalyst and CGI.pm has not been problematic. From reviewing both bug queues, I don't see any open or previously-closed bugs that have been caused by the current behavior. It does seem reasonable and logical that it should be safe to UTF-8-encode text that is already marked as UTF-8.
  • People would reasonably expect the percent-encode/decode process to be a symmetric round trip-- you should get back exactly the data that you started with. When UTF-8 encoding is added to the process, this isn't necessarily the case: UTF-8 encoding happens only before percent-encoding, and the reverse does not happen when percent-decoding (see part 2 of the sketch after this list). The one-way trip to UTF-8 is probably what you wanted, but since the conversion is automatic, the programmer's intent is not explicit.
  • The most popular percent-encoding solutions for Perl-- CGI.pm and URI::Escape-- both pre-date proper Unicode support in Perl, and both include built-in code to re-implement UTF-8 functionality in case you are using Perl 5.6, which lacks proper Unicode support. So in the context of designing those modules, it wasn't really an option to tell people to just call encode_utf8($string) if they wanted to UTF-8-encode their text before it was URI-encoded.
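
To make a couple of those points concrete, here is a small sketch using URI::Escape and Encode. It is illustrative only: part 1 shows raw binary data surviving uri_escape, and part 2 shows the one-way nature of the UTF-8 step (here via the explicit uri_escape_utf8, which performs the same encode-before-escape step as the automatic approach).

use Encode qw(decode_utf8);
use URI::Escape qw(uri_escape uri_escape_utf8 uri_unescape);

# 1) Binary data passes through fine: slurp raw bytes and percent-encode them.
open my $fh, '<:raw', '/usr/bin/perl' or die "open: $!";
my $binary = do { local $/; <$fh> };
close $fh;
my $escaped_binary = uri_escape($binary);    # high-bit bytes become %XX sequences

# 2) The UTF-8 step is one-way: it happens before escaping, but unescaping
#    does not decode the bytes back into characters for you.
my $original = "caf\x{e9}";                  # a Perl character string ("café")
my $escaped  = uri_escape_utf8($original);   # "caf%C3%A9"
my $back     = uri_unescape($escaped);       # "caf\xC3\xA9" -- UTF-8 bytes, not eq $original
my $restored = decode_utf8($back);           # only now eq $original again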

So, then, which of these two designs represents a best practice for handling UTF-8 encoding in combination with percent-encoding in Perl?

  1. First option: the popular solution already used in practice, which UTF-8-encodes the data if the UTF-8 flag is set. This approach automatically handles the recommended practice of UTF-8 encoding, so users don't have to think about it.
  2. Second option: UTF-8 handling should not be built into a percent-encoding solution. Following the advice of perlunifaq, an ideal solution would not check the UTF-8 flag, but would instead offer clear documentation advising the user about best practices for UTF-8-encoding text before it is percent-encoded. This design cleanly externalizes all the character-encoding issues that sometimes get bundled with percent-encoding, such as translating from EBCDIC or UTF-16 surrogate pairs. It also gives users a bit of education about UTF-8 best practices which they may be able to apply in other areas. As long as UTF-8 handling is attempted automatically, programmers can continue to stay in the dark about the best practice of handling character encodings explicitly themselves. The solution might be as simple as documenting this in the synopsis:

use Encode 'encode_utf8';
$uri = uri_percent_encode(encode_utf8($string));

Note that URI::Escape essentially already has the code part of the second alternative in place. It offers percent-encoding without any character encoding, as well as uri_escape_utf8, which is basically just sugar for the code sample above (see the sketch below). The URI::Escape documentation explains in precise technical terms what the two alternatives do, but offers very little guidance on which to choose--the plain one or the UTF-8 one? It doesn't address which reflects a best practice, or cover the possibility of any drawbacks to choosing the UTF-8 version.

A Google code search shows about 10x more hits for "uri_escape()" than for "uri_escape_utf8()". I suspect that many of these cases would benefit from being able to handle UTF-8 characters, but the programmers weren't really educated about when to choose that option.

Considering that masses of programmers aren't suddenly going to become UTF-8 experts, it's clear to me that an ideal solution should go with one of the options above: either try to handle UTF-8 automatically, or offer strong guidance in the percent-encoding documentation about best practices regarding character encodings. I think that with clear, informative documentation, the second option could be the better way to go.
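
As a concrete reference for that second option, here is the "sugar" relationship spelled out. This is a sketch: $string is assumed to hold decoded Perl characters, and both calls should yield the same escaped result.

use Encode qw(encode_utf8);
use URI::Escape qw(uri_escape uri_escape_utf8);

my $string = "r\x{e9}sum\x{e9}";    # "résumé" as a Perl character string

# These two calls produce the same result; uri_escape_utf8 simply
# folds the encode step in for you.
my $sugar    = uri_escape_utf8($string);
my $explicit = uri_escape(encode_utf8($string));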

What do you think?