Percent-encoding URIs in Perl


This post is about the current state of URI encoding in Perl: the problem of safely passing arbitrary text into and out of a URI. If you've ever seen a space in a URL represented as "%20", that's the topic at hand.

The best general introduction I've found on the topic is the Wikipedia page on Percent-encoding.

RFCs on the topic include RFC 3986 from 2005, which defines the generic syntax of URIs. It updates RFC 1738 from 1994, which defined Uniform Resource Locators (URLs), and obsoletes RFC 1808 from 1995, which defined Relative Uniform Resource Locators. Sometimes this transformation is called "URI escaping" and sometimes it's referred to as "URL encoding". RFC 3986 clarified the naming issue:

"In general, the terms "escaped" and "unescaped" have been replaced with "percent-encoded" and "decoded", respectively, to reduce confusion with other forms of escape mechanisms."

Elsewhere it's clarified that percent encoding applies to all URIs, not just URLs.

I think the Perl community would do well to adopt "percent encode URI" and "percent decode URI" as terms for this process that are unambiguous and in line with the RFC.

There are two URI percent-encoding solutions in Perl that seem to be in the widest use. Both have a significant deficiency.

Percent-encoding with CGI.pm

The first is CGI::Util, which provides escape() and unescape() as a pair. This solution has a lot going for it: it's been in the core for years, it works back to Perl 5.6, it automatically handles UTF-8 encoding, and it handles some edge cases like EBCDIC encoding and UTF-16 surrogate pairs. Further, you can use escape() and unescape() without using the rest of CGI.pm or ever creating a CGI.pm object. There's just one major deficiency: these functions have never been documented! Many people take advantage of them by using CGI.pm directly or indirectly, as CGI.pm uses them internally. A few people have found them and call them directly. As someone with commit access to the CGI.pm repo, I'll be documenting them shortly, once I'm done with the detour that became this post.
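
Since the pair is undocumented, here is a minimal sketch of calling it directly. It assumes CGI.pm is installed (newer Perls no longer bundle it) and sticks to ASCII input, so the result does not depend on how UTF-8 is handled. The variable names are mine.

```perl
use strict;
use warnings;
use CGI::Util qw(escape unescape);   # no CGI.pm object is needed

my $raw     = 'a b&c=d';
my $encoded = escape($raw);          # the space, "&" and "=" become %XX
my $decoded = unescape($encoded);

print "$encoded\n";                  # a%20b%26c%3Dd
print "round trip ok\n" if $decoded eq $raw;
```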

Percent-encoding with URI::Escape

Probably the module most widely and deliberately used for URI percent-encoding is URI::Escape. URI::Escape is not in the core, but the URI distribution depends only on MIME::Base64, and that module is not actually needed for the URI::Escape functionality. Like CGI.pm, URI::Escape supports Perl 5.6, advertising support back to 5.6.1. It does not handle EBCDIC or UTF-16 surrogate pairs, but as I'll explain later, it's questionable whether those abilities are truly desirable to build into a percent-encoding solution. The deficiency with URI::Escape is that it doesn't handle UTF-8 automatically like most other solutions do. Many Perl scripts and modules have called URI::Escape::uri_escape expecting that it will always "just work" for encoding all text. Instead, you have to explicitly ask for UTF-8 handling by calling uri_escape_utf8(). To credit URI::Escape, it has clearly documented how it behaves in this regard, but it seems like a missed opportunity to handle UTF-8 input automatically. By contrast, most other solutions handle either case automatically with a single line like this:

utf8::encode $_[0] if utf8::is_utf8 $_[0];

RFC 3986 is quite clear that UTF-8 encoding should be part of the solution:

"Non-ASCII characters must first be encoded according to UTF-8 [STD63], and then each octet of the corresponding UTF-8 sequence must be percent-encoded to be represented as URI characters"

URI::Escape is likely suffering from being far older than RFC 3986: it added a separate UTF-8 function to keep uri_escape() perfectly backwards compatible. In hindsight from 2010, I think that was an unfortunate choice.
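
To make the two steps in the RFC quote concrete, here is a rough sketch using only core Perl. It is an illustration of the rule, not any module's actual code. The sample string and variable names are my own, and the character class is my reading of the RFC 3986 "unreserved" set.

```perl
use strict;
use warnings;

# "\x{E5}" is the single character U+00E5 (a with ring above).
my $text = "\x{E5} X";

# Step 1: encode the character string as UTF-8 octets (on a copy).
my $octets = $text;
utf8::encode($octets);               # now 4 octets: C3 A5 20 58

# Step 2: percent-encode every octet that is not an unreserved character.
(my $encoded = $octets) =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;

print "$encoded\n";                  # %C3%A5%20X
```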

Summary of all known percent-encoding solutions for Perl

I researched further to see what other percent-encoding solutions exist for Perl and how they differ. Here's what I found, including CGI.pm and URI::Escape again for completeness.

CGI::Util

Has the benefit of being in the core, but the drawback of being undocumented as of version 3.50.

  • Names: escape / unescape
  • Min Perl version: 5.6.0
  • Handles UTF-8 encoding: Yes, on Perl 5.8 and newer
  • Notes: also handles EBCDIC and UTF-16 surrogate pairs.

CGI::Simple

CGI::Simple 1.112 appears to have a bug regarding RFC 2396, section 2.2, concerning reserved characters. It explicitly translates spaces to "+", unlike most other solutions here, which translate them to %20. It also lacks automatic UTF-8 handling. Its implementation is notably not compatible with the one in CGI.pm, as some would assume.

  • Names: url_encode / url_decode
  • Min Perl Version: 5.6.1
  • Handles UTF-8 encoding: No.
  • Notes: The implementation here isn't the same as a second one in the distribution, in CGI::Simple::Util.

CGI::Simple::Util

This is a second percent-encoding implementation in CGI::Simple 1.112. It is not compatible with CGI.pm's implementation either. Compare:

CGI::escape                  å X -> %C3%A5%20X
URI::Escape::uri_escape_utf8 å X -> %C3%A5%20X
CGI::Simple->url_encode      å X -> %E5+X
CGI::Simple::Util::escape    å X -> %E5%20X

  • Names: escape / unescape
  • Min Perl Version: 5.6.1
  • Handles UTF-8 encoding: No.
  • Notes: Handles EBCDIC encoding, inherited from CGI.pm before the fork.
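
The three distinct results in the comparison above come down to which octets get percent-encoded and how the space is written. The sketch below does not call any of the modules. It reproduces the same three outputs with the core Encode module, assuming the input is the character "å" followed by a space and "X". The helper name percent_encode_octets is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Percent-encode a string of octets, leaving unreserved characters alone.
sub percent_encode_octets {
    my ($octets) = @_;
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $octets;
}

my $text = "\x{E5} X";               # the characters U+00E5, space, X

my $as_utf8   = percent_encode_octets(encode('UTF-8',      $text));
my $as_latin1 = percent_encode_octets(encode('ISO-8859-1', $text));

# Form-style encoding: same octets, but the space is written as "+".
(my $form_style = $as_latin1) =~ s/%20/+/g;

print "UTF-8 octets:   $as_utf8\n";      # %C3%A5%20X
print "Latin-1 octets: $as_latin1\n";    # %E5%20X
print "form style:     $form_style\n";   # %E5+X
```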

Mojo::Util

Mojo::Util 0.999941 provides a modern, simple implementation with automatic UTF-8 encoding. My gripe with it is that the names say "url" and "escape" instead of "uri" and "encode", which would follow the RFCs more closely. It also doesn't allow you to use a rather normal syntax: Mojo::Util::url_escape('å'). That's because Mojo uses the unconventional implementation of modifying the input by reference instead of returning a modified copy. Presumably this is done for performance.

  • Names: url_escape / url_unescape
  • Min Perl Version: 5.8.7.
  • Handles UTF-8 encoding: Yes.
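
To show why the in-place style rules out passing a literal, here is a small sketch contrasting the two designs. Both subs are hypothetical stand-ins written for this comparison. Neither is Mojo's code.

```perl
use strict;
use warnings;

# Hypothetical in-place style: the caller's variable is modified via the @_ alias.
sub encode_in_place {
    utf8::encode($_[0]);
    $_[0] =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return;
}

# Hypothetical copy-returning style: the argument is left untouched.
sub encode_copy {
    my ($text) = @_;
    utf8::encode($text);
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

my $value = "\x{E5}";
encode_in_place($value);                   # $value is now "%C3%A5"

my $copy = encode_copy("\x{E5}");          # a literal is fine here

# A literal is read-only, so the in-place version dies at run time.
my $ok = eval { encode_in_place("\x{E5}"); 1 };
print "in-place on a literal failed: $@" unless $ok;
```

The copy-returning form costs one extra string copy per call, which is presumably the performance trade-off being made.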

Tie::UrlEncoder

Tie::UrlEncoder 0.02 provides a unique interface through a %urlencode hash. However, it doesn't provide a decoding routine. Basic UTF-8 tests pass for it, but the solution employed is unorthodox. Instead of calling UTF-8 related functions, it calls use bytes;. The official Perl documentation is clearly opinionated about this approach. In perlunifaq, it says plainly "Don't use it." in regard to use bytes;.

  • Names: %urlencode
  • Min Perl version: 5.6.
  • Handles UTF-8 encoding: The implementation does not follow best practices. See above.
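
For readers who haven't seen a hash used as an encoding interface, here is a rough sketch of how such a tied hash could work with an explicit utf8::encode() call rather than use bytes;. The package name My::PercentEncodeHash and the %encode variable are hypothetical. This is not Tie::UrlEncoder's actual code.

```perl
use strict;
use warnings;

package My::PercentEncodeHash;       # hypothetical name, for illustration only

sub TIEHASH { my ($class) = @_; return bless {}, $class }

# Every lookup returns the percent-encoded form of the key.
sub FETCH {
    my ($self, $key) = @_;
    utf8::encode($key);              # explicit UTF-8 encoding of the copy
    $key =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $key;
}

package main;

tie my %encode, 'My::PercentEncodeHash';

my $query = 'a b';
print "http://example.com/?q=$encode{$query}\n";   # http://example.com/?q=a%20b
```

The appeal of the hash form is that it interpolates directly into strings, as the last line shows.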

URI::Encode

Not to be confused with URI::Escape, URI::Encode is meant to be a newer and simpler take on the problem space. It offers automatic UTF-8 encoding and includes an option for whether or not to encode reserved characters. The option to not encode reserved characters is nice for those who know what they are doing. Unfortunately, it has a poor object-oriented UI. It offers a constructor which does nothing, when the reserved-characters option could have been accepted there. It also doesn't document that you can call the key methods as class methods to bypass the do-nothing constructor. While it also offers a procedural interface, that interface calls the do-nothing constructor every time, adding an unnecessary penalty.

  • Names: uri_encode / uri_decode
  • Min Perl version: Perl 5.8.1.
  • Handles UTF-8 encoding: Yes

URI::Escape

URI::Escape provides three APIs, two that don't handle UTF-8 encoding and one that does. It's popular, works well and is well documented. Its main drawback is that UTF-8 encoding is not automatic in uri_escape(), and as a result UTF-8 encoding has not been used by many applications, when UTF-8 support here could otherwise have been a free benefit. uri_escape_utf8() can be used for UTF-8 support.

  • Names: uri_escape / uri_escape_utf8 / uri_unescape / %escapes
  • Min Perl version: 5.6.1
  • Handles UTF-8 encoding: Not automatically
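
The difference between the two escaping functions is easiest to see side by side. This is a short sketch assuming the URI distribution is installed. The sample strings are my own.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape uri_escape_utf8 uri_unescape);

my $text = "\x{E5} X";                     # U+00E5, space, X

print uri_escape($text), "\n";             # %E5%20X     (code points used as octets)
print uri_escape_utf8($text), "\n";        # %C3%A5%20X  (UTF-8 encoded first)

# uri_escape() cannot represent characters above 255 and dies instead.
my $ok = eval { uri_escape("\x{263A}"); 1 };
print "uri_escape died: $@" unless $ok;

# uri_unescape() returns octets, so decode them to get characters back.
my $octets = uri_unescape('%C3%A5%20X');
utf8::decode($octets);
print "round trip ok\n" if $octets eq $text;
```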

URI::Escape::XS

It sounds like a module that's compatible with URI::Escape, only faster due to a C-based XS implementation. It does benchmark as much faster, and it is somewhat compatible, but it lacks a uri_escape_utf8 function, which could be a valuable addition for better compatibility. Instead, it has a uri_escape function that includes UTF-8 support automatically. It also has a higher minimum Perl requirement, Perl 5.8 vs 5.6, which is another important difference that's not documented. As an additional benefit, Any::URI::Escape exists, which will use the XS version if it is installed and the pure-Perl version otherwise. Unfortunately, the wrapper module also glosses over the difference in UTF-8 handling between the XS version and the pure-Perl version.

  • Names: uri_escape / uri_unescape
  • Min Perl version: 5.8.1
  • Handles UTF-8 encoding: Yes
  • Notes: Requires a C compiler (but very fast)
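
The "use XS if available, pure Perl otherwise" idea can be sketched in a few lines. This is not Any::URI::Escape's code. It assumes both modules expose a uri_escape function, as the name lists above suggest, and it adds a hypothetical inline fallback so the sketch runs even when neither module is installed.

```perl
use strict;
use warnings;

# Last-resort encoder, used only when neither module is installed (hypothetical helper).
sub _pure_perl_escape {
    my ($text) = @_;
    utf8::encode($text);
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

# Prefer the XS module, then the pure-Perl module, then the inline fallback.
sub pick_uri_escape {
    for my $module (qw(URI::Escape::XS URI::Escape)) {
        (my $file = "$module.pm") =~ s{::}{/}g;
        next unless eval { require $file; 1 };
        my $code = $module->can('uri_escape') or next;
        return ($module, $code);
    }
    return ('inline fallback', \&_pure_perl_escape);
}

my ($backend, $escape) = pick_uri_escape();
print "using $backend: ", $escape->('a b'), "\n";   # ASCII input gives a%20b
```

The backends agree on ASCII input like this. As described above, they can disagree on non-ASCII input, and that is the difference a wrapper ought to call out.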

Recommendations

All the URI percent encoding solutions I reviewed had flaws, but the pieces are all there to produce an optimal solution. Here's my recommendation for designing a perfect solution:

  • Name the module URI::PercentEncode
  • Name the functions uri_percent_encode() and uri_percent_decode()
  • Return the changed value (don't modify by reference)
  • Require at least Perl 5.8.1. Supporting older versions is unnecessary baggage at this point.
  • Don't build in support for getting data into UTF-8 beyond a simple call to utf8::encode(). Anything else belongs in the domain of the "Encode" module. If I'm wrong about whether support for UTF-16 surrogate pairs belongs in a percent-encoding solution, let me know.
  • Automatically handle UTF-8 encoding (like this: utf8::encode $_[0] if utf8::is_utf8 $_[0];)
  • Use faster XS-based code by default, but allow building a Pure-Perl version for those who need or want it. (Follow the model of Params::Validate here).
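
As a starting point for a volunteer, here is a rough pure-Perl sketch of the two functions. The function names come from the list above, but no such module is assumed to exist, and the XS part is not attempted. It calls utf8::encode() unconditionally rather than checking the UTF-8 flag, in line with the correction in the update at the end of this post.

```perl
use strict;
use warnings;

# Hypothetical functions named after the recommendation above.
sub uri_percent_encode {
    my ($text) = @_;                  # work on a copy and return the result
    return unless defined $text;
    utf8::encode($text);              # characters -> UTF-8 octets, unconditionally
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

sub uri_percent_decode {
    my ($encoded) = @_;
    return unless defined $encoded;
    $encoded =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;   # %XX -> octet
    utf8::decode($encoded);           # UTF-8 octets -> characters, if valid
    return $encoded;
}

my $original = "caf\x{E9} & cr\x{E8}me";
my $encoded  = uri_percent_encode($original);
print "$encoded\n";                   # caf%C3%A9%20%26%20cr%C3%A8me
print "round trip ok\n" if uri_percent_decode($encoded) eq $original;
```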

Any volunteers?

That's my take on URI percent-encoding in Perl. What do you have to add?

Update: See the reply by miyagawa, who states that the code utf8::encode $_[0] if utf8::is_utf8 $_[0]; is a bug. It is used by CGI::Util and Mojo::Util in the versions given above, as well as in Catalyst. URI::Escape and URI::Encode do UTF-8 encoding without checking the UTF-8 flag. He has more experience with UTF-8, and I defer to his advice here.
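
Here is my understanding of the problem, as a small sketch with two hypothetical subs. The same one-character string can be stored internally with or without the UTF-8 flag, and the flag-checked idiom gives a different answer for each.

```perl
use strict;
use warnings;

# The flag-checked idiom, wrapped in a sub for comparison.
sub encode_flag_checked {
    my ($text) = @_;
    utf8::encode($text) if utf8::is_utf8($text);
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

# The unconditional variant.
sub encode_always {
    my ($text) = @_;
    utf8::encode($text);
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

my $plain    = "\x{E5}";             # stored internally as one Latin-1 byte
my $upgraded = "\x{E5}";
utf8::upgrade($upgraded);            # same character, UTF-8 flag now set

print "same string\n" if $plain eq $upgraded;
print encode_flag_checked($plain),    "\n";   # %E5
print encode_flag_checked($upgraded), "\n";   # %C3%A5
print encode_always($plain),          "\n";   # %C3%A5
print encode_always($upgraded),       "\n";   # %C3%A5
```

Two strings that compare equal should not encode differently, which is why the unconditional call is the safer default when the input is known to be a character string.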