This post is about the current state of URI encoding in Perl. This is the problem space of being able to safely pass arbitrary text into and out of a URI format. If you’ve ever seen a space in a URL represented as “%20”, that’s the topic of the moment.
The best general introduction I’ve found on the topic is the Wikipedia page on Percent-encoding.
RFCs on the topic include the 2005 RFC 3986 that defined the generic syntax of URIs. It replaces RFC 1738 from 1994 which defined Uniform Resource Locators (URLs), and RFC 1808 from 1995 which defined Relative Uniform Resource Locators. Sometimes this transformation is called “URI escaping” and sometimes it’s referred to as “URL encoding”. RFC 3986 clarified the naming issue:
“In general, the terms “escaped” and “unescaped” have been replaced with “percent-encoded” and “decoded”, respectively, to reduce confusion with other forms of escape mechanisms.”
Elsewhere it’s clarified that percent encoding applies to all URIs, not just URLs.
I think the Perl community would do well to adopt “percent encode URI” and “percent decode URI” as ways to describe this process that are unambiguous and in line with the RFC.
There are two URI percent-encoding solutions in Perl that seem to be in the widest use. Both have a significant deficiency.
Percent-encoding with CGI.pm
The first is CGI::Util, which provides
escape() and unescape() as a pair. This solution has a lot going for it: it’s
been in the core for years, it works back to Perl 5.6, it automatically handles
UTF-8 encoding, and it handles some edge cases like EBCDIC encoding and
UTF-16 surrogate pairs. Further you can use escape() and unescape() without
using the rest of CGI.pm or ever creating a CGI.pm object. There’s just one
major deficiency: These methods have never been documented! Many take advantage
of them by using CGI.pm directly or indirectly, as CGI.pm uses them internally.
A few people have found them and use them directly. As someone with commit access
to the CGI.pm repo, I’ll be documenting them shortly, once I’m done with the detour
that became this post.
Percent-encoding with URI::Escape
Probably the most intentionally widely used module for URI percent encoding is
URI::Escape. URI::Escape is not
in the core, but the URI distribution depends only on MIME::Base64, and that
module is not actually needed for the URI::Escape functionality. Like CGI.pm,
URI::Escape also advertises support back to Perl 5.6.1. It does not handle
EBCDIC or UTF-16 surrogate pairs, but as I’ll explain later, it’s questionable
whether those abilities are truly desirable to build into a percent-encoding
solution. The deficiency with URI::Escape is that it doesn’t handle UTF-8 automatically like most other solutions do.
Many Perl scripts and modules have called uri_escape()
expecting that it will always “just work” for encoding all text.
Instead, you have to explicitly ask for UTF-8 handling by calling
uri_escape_utf8(). To credit URI::Escape, it has clearly
documented how it behaves in this regard, but it seems like a missed
opportunity to handle UTF-8 input automatically. By contrast, most other
solutions handle either case automatically with a single line like this:
utf8::encode $_ if utf8::is_utf8 $_;
RFC 3986 is quite clear that UTF-8 encoding should be part of the solution:
“Non-ASCII characters must first be encoded according to UTF-8 [STD63], and then each octet of the corresponding UTF-8 sequence must be percent-encoded to be represented as URI characters”
URI::Escape is likely suffering from being far older than RFC 3986, and added a
new method specific to UTF-8 to keep
uri_escape() perfectly backwards compatible.
In hindsight from 2010, I think that was an unfortunate choice.
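To make the RFC rule quoted above concrete, here is a minimal sketch of the two steps it describes: first turn the characters into UTF-8 octets, then percent-encode each octet that is not in the RFC 3986 “unreserved” set. This is only an illustration using Perl’s built-in utf8::encode(); it is not the code of any module discussed in this post, and the variable names are my own.

```perl
use strict;
use warnings;

# One non-ASCII character (U+00E5), a space, and an ASCII letter.
my $text = "\x{E5} X";

# Step 1: characters -> UTF-8 octets (work on a copy, keep $text intact).
my $octets = $text;
utf8::encode($octets);

# Show the octets so the first step is visible on its own.
my @hex = map { sprintf '%02X', $_ } unpack 'C*', $octets;
print "@hex\n";       # C3 A5 20 58

# Step 2: percent-encode every octet outside the "unreserved" set
# (letters, digits, hyphen, period, underscore, tilde).
(my $encoded = $octets) =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
print "$encoded\n";   # %C3%A5%20X
```

Skipping the first step is what produces a lone %E5 for the same character instead of %C3%A5.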
Summary of all known percent-encoding solutions for Perl
I researched further to see what other percent-encoding solutions exist for Perl and how they differ. Here’s what I found, including CGI.pm and URI::Escape again for completeness.
CGI::Util, as shipped with CGI.pm, has the benefit of being in the core, but the drawback of being undocumented as of version 3.50.
escape / unescape
- Min Perl version: 5.6.0
- Handles UTF-8 encoding: Yes, on Perl 5.8 and newer
- Notes: also handles EBCDIC and UTF-16 surrogate pairs.
CGI::Simple 1.112 appears to have a bug regarding RFC 2396, section 2.2, concerning reserved characters. It explicitly translates spaces to “+”, unlike most other solutions here, which translate them to %20. It also lacks automatic UTF-8 handling. Its implementation is notably not compatible with the one in CGI.pm, as some would assume.
url_encode / url_decode
- Min Perl Version: 5.6.1
- Handles UTF-8 encoding: No.
- Notes: The implementation here isn’t the same as a second one in the distribution, in CGI::Simple::Util.
CGI::Simple::Util provides a second percent-encoding implementation in CGI::Simple 1.112. It is not compatible with CGI.pm’s implementation either. Compare:
CGI::escape                    å -> %C3%A5%20X
URI::Escape::uri_escape_utf8   å -> %C3%A5%20X
CGI::Simple->url_encode        å -> %E5+X
CGI::Simple::Util::escape      å -> %E5%20X
escape / unescape
- Min Perl Version: 5.6.1
- Handles UTF-8 encoding: No.
- Notes: Handles EBCDIC encoding, inherited from CGI.pm before the fork.
Mojo::Util 0.999941 provides a modern, simple implementation with automatic UTF-8
encoding. My gripes with it are that the names say “url” and “escape” instead
of “uri” and “encode” to follow the RFCs more closely. It also doesn’t allow
you to use the rather normal syntax of assigning the encoded result to a new variable.
That’s because Mojo uses the unconventional implementation of modifying the
input by reference instead of returning a modified copy. Presumably this is
done for performance.
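The difference between the two calling styles can be sketched like this. Both functions below are hypothetical names of my own, written only to contrast the styles; neither is the Mojo::Util implementation.

```perl
use strict;
use warnings;

# Style 1: modify the caller's variable in place.
# Elements of @_ are aliases to the caller's arguments.
sub encode_in_place {
    utf8::encode($_[0]);
    $_[0] =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return;
}

# Style 2: leave the argument alone and return a modified copy.
sub encode_copy {
    my ($text) = @_;          # copying breaks the alias
    encode_in_place($text);
    return $text;
}

my $title   = "a b";
my $encoded = encode_copy($title);   # $title is still "a b" here
encode_in_place($title);             # now $title itself has changed
print "$encoded $title\n";
```

With the in-place style there is nothing useful to assign from the call, and passing a literal string would fail because a constant cannot be modified.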
url_escape / url_unescape
- Min Perl Version: 5.8.7.
- Handles UTF-8 encoding: Yes.
Tie::UrlEncoder 0.02 provides a unique interface through a %urlencode hash. However,
it doesn’t provide a decoding routine. Basic UTF-8 tests pass for it, but the
solution employed is unorthodox. Instead of calling UTF-8 related functions, it relies on
use bytes;. Official Perl documentation is clearly
opinionated about this approach. In
perlunifaq, it says plainly
“Don’t use it.” in regard to the bytes pragma.
- Min Perl version: 5.6.
- Handles UTF-8 encoding: The implementation does not follow best practices. See above.
Not to be confused with URI::Escape, URI::Encode is meant to be a newer and simpler take on the problem space. It offers automatic UTF-8 encoding, and includes an option on whether or not to encode reserved characters. The option to not encode reserved characters is nice for those who know what they are doing. Unfortunately, it has a poor object-oriented UI. It offers a constructor which does nothing, when the reserved characters option could have been accepted as an option there. It also doesn’t document that you can call the key methods as class methods to bypass the do-nothing constructor. While it also offers a procedural interface, that interface is implemented in terms of calling the do-nothing constructor every time, adding an unnecessary penalty.
uri_encode / uri_decode
- Min Perl version: Perl 5.8.1.
- Handles UTF-8 encoding: Yes
URI::Escape provides three APIs, two that don’t handle UTF-8 encoding and one
that does. It’s popular, works well and is well documented. Its main drawback
is that UTF-8 encoding is not automatic in
uri_escape(), and as a result UTF-8 encoding has
not been used by many applications, when UTF-8 support here could have
otherwise been a free benefit.
uri_escape_utf8() can be used for UTF-8 support.
uri_escape / uri_escape_utf8 / uri_unescape / %escapes
- Min Perl version: 5.6.1
- Handles UTF-8 encoding: Not automatically
It sounds like a module that’s compatible with URI::Escape, only faster due to
a C-based XS implementation. It does benchmark to be much faster, and it is
somewhat compatible, but it lacks a
uri_escape_utf8 method, which
could be a valuable addition for better compatibility. Instead, it has a
uri_escape method that includes UTF-8 support automatically. It
also has a higher minimum Perl requirement (Perl 5.8 vs 5.6), which is another
important difference that’s not documented. As an additional benefit, there is a wrapper module
which will use the XS version if it exists, and the Pure-Perl version otherwise.
The wrapper module also unfortunately glosses over the difference in UTF-8 handling
between the XS version and the pure-Perl version.
uri_escape / uri_unescape
- Min Perl version: 5.8.1
- Handles UTF-8 encoding: Yes
- Notes: Requires a C-compiler (but very fast)
All the URI percent encoding solutions I reviewed had flaws, but the pieces are all there to produce an optimal solution. Here’s my recommendation for designing a perfect solution:
- Name the module URI::PercentEncode;
- Name the functions
- Return the changed value (don’t modify by reference)
- Require at least Perl 5.8.1. Supporting older versions is unnecessary baggage at this point.
- Don’t build in support for getting data into UTF-8 beyond a simple call to utf8::encode(). Anything else belongs in the domain of the “Encode” module. If I’m wrong about excluding support for UTF-16 surrogate pairs from a percent encoding solution, let me know.
- Automatically handle UTF-8 encoding (like this:
utf8::encode $_ if utf8::is_utf8 $_;)
- Use faster XS-based code by default, but allow building a Pure-Perl version for those who need or want it. (Follow the model of Params::Validate here).
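As a rough pure-Perl sketch of the interface these recommendations point toward, here is an encode/decode pair that returns the changed value and limits its UTF-8 handling to utf8::encode() and utf8::decode(). The function names percent_encode_uri and percent_decode_uri are my own placeholders borrowed from the wording earlier in this post, not an existing module’s API. This sketch calls utf8::encode() unconditionally, without checking the UTF-8 flag first, and it treats every character outside the RFC 3986 “unreserved” set as needing encoding.

```perl
use strict;
use warnings;

# Hypothetical function: returns a percent-encoded copy of its argument.
sub percent_encode_uri {
    my ($text) = @_;
    utf8::encode($text);      # characters -> UTF-8 octets
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $text;
}

# Hypothetical function: returns a decoded copy of its argument.
sub percent_decode_uri {
    my ($encoded) = @_;
    $encoded =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # %XX -> octet
    utf8::decode($encoded);   # octets -> characters, if they are valid UTF-8
    return $encoded;
}

my $original = "caf\x{E9} & cr\x{E8}me";
my $encoded  = percent_encode_uri($original);
print "$encoded\n";   # caf%C3%A9%20%26%20cr%C3%A8me
print "round trip ok\n" if percent_decode_uri($encoded) eq $original;
```

An XS implementation with a pure-Perl fallback could sit behind the same two function names.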
That’s my take on URI percent-encoding in Perl. What do you have to add?
Update: See the reply by miyagawa, who states that this code is a bug:
utf8::encode $_ if utf8::is_utf8 $_;. It is used by CGI::Util and Mojo::Util in the versions given above, as well as in Catalyst. URI::Escape and URI::Encode do UTF-8 encoding without checking the UTF-8 flag. He has more experience with UTF-8, and I defer to his advice here.
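Here is a small sketch of the kind of problem the flag check can cause, as I understand it. The two helper functions are hypothetical and exist only for this comparison. The same one-character string is stored twice, once without Perl’s internal UTF-8 flag and once with it; Perl considers the two strings equal, yet the flag-checking encoder produces different output for them.

```perl
use strict;
use warnings;

# Hypothetical helper: encodes to UTF-8 only when the internal flag is set.
sub encode_if_flagged {
    my ($s) = @_;
    utf8::encode($s) if utf8::is_utf8($s);
    $s =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $s;
}

# Hypothetical helper: always encodes to UTF-8, ignoring the flag.
sub encode_always {
    my ($s) = @_;
    utf8::encode($s);
    $s =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $s;
}

my $plain    = "\x{E5}";    # one character, stored without the UTF-8 flag
my $upgraded = "\x{E5}";
utf8::upgrade($upgraded);   # same character, now stored with the flag set

print "strings are equal\n" if $plain eq $upgraded;
print encode_if_flagged($plain), " ", encode_if_flagged($upgraded), "\n";  # %E5 %C3%A5
print encode_always($plain), " ", encode_always($upgraded), "\n";          # %C3%A5 %C3%A5
```

The internal flag describes how Perl stores a string, not what the string means, so output that depends on it can change for reasons the caller never sees.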