URL Inter-op

Mottikumar
5 min read · Feb 21, 2021


(or RFC 3986 vs WHATWG URL Specification vs the real world)

Sample Curl Command

In this blog post I attempt to describe where and how RFC 3986 (86), RFC 3987 (87) and the WHATWG URL Specification (TWUS) differ.

This might be useful input when trying to interop with URLs on the modern Internet.

In my previous post, “Learning the Art of Bug bounty” (currently being edited), we set out to learn about bug bounties and came across an interesting report submitted to curl by Mr. Jonathan Leitschuh. Mr. Bagder (curl staff) responded to the report and shared what he knew about the bug, which WHATWG introduced when it broke backward compatibility with RFC 3986.


“URL” (in the scope of this post) refers to the sequence of characters that is passed into APIs, sent over the network, and passed from machine to machine. It does not necessarily match what typical web browsers accept or support in their Graphical User Interface (GUI) address bars.

We will focus on network-using URL schemes such as http, https and ftp, as well as ‘file’.

URL components

A URL may consist of the following components — many of them are optional:

[scheme][divider][userinfo][hostname][port number][path][query][fragment]

Each component is separated from the following component with a divider character.

An example could look like this:

http://user:password@www.example.com:80/index.html?foo=bar#top
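To make the component list concrete, here is a minimal sketch that splits a URL shaped like the example above using Python's standard `urllib.parse.urlsplit`. It shows how one particular parser slices the string, not how every parser would; the `components` dictionary is just a name chosen for this illustration.

```python
from urllib.parse import urlsplit

url = "http://user:password@www.example.com:80/index.html?foo=bar#top"
parts = urlsplit(url)

# Collect the pieces under the names used in this post.
components = {
    "scheme": parts.scheme,
    "username": parts.username,
    "password": parts.password,
    "hostname": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}

for name, value in components.items():
    print(f"{name:9} {value!r}")
```

Note that Python exposes the userinfo as separate `username` and `password` attributes and returns the port as an integer.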

Scheme

There are no known interop issues with the scheme component.

Divider

86: specifies this to be exactly “://” for all network-using (hierarchical) schemes.

TWUS: says a parser must accept any number of slashes, from zero upward, but a producer should use two.

Real world: one and three slash URLs occur, possibly a few using even more. ‘file’ URLs are notoriously often malformed.

Browsers also happily accept backslashes instead of slashes, so redirects to http:\\\\\example.com work.
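As a rough illustration of the lenient behaviour described above, the sketch below rewrites whatever run of slashes and backslashes follows the scheme into the “://” that 86 expects. The function name `normalize_divider` and the `NETWORK_SCHEMES` set are assumptions made for this example, not part of any specification or library, and ‘file’ URLs are deliberately left alone.

```python
import re

# Schemes this sketch treats as network-using; an assumption for illustration.
NETWORK_SCHEMES = {"http", "https", "ftp"}

# scheme, then ":", then any run of "/" or "\", then the rest of the URL.
_DIVIDER = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):([/\\]*)(.*)$", re.DOTALL)


def normalize_divider(url: str) -> str:
    """Rewrite zero or more slashes/backslashes after the scheme to '://'."""
    match = _DIVIDER.match(url)
    if match is None:
        return url  # no scheme found, leave the input untouched
    scheme, _slashes, rest = match.groups()
    if scheme.lower() not in NETWORK_SCHEMES:
        return url  # e.g. file: or mailto:, not handled by this sketch
    return f"{scheme}://{rest}"


print(normalize_divider("http:/example.com/"))
print(normalize_divider("http:" + "\\" * 5 + "example.com"))
```

Whether a client should repair such URLs or reject them is exactly the point where 86 and TWUS disagree, so treat this as a demonstration of the browser-style behaviour rather than a recommendation.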

Userinfo

The userinfo field can be used to set user name and password to pass on to the server. The use of this field is discouraged since it often means passing around the password in plain text and is thus a security risk.

86: specifies that ‘@’ is the separator between the userinfo field and the host name. In practice that means the first ‘@’ character.

TWUS: instead takes the last ‘@’ before the host name to be the separator.

This is an interop collision.
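The difference is easy to see with plain string splitting. The sketch below is not a real URL parser; `split_first_at` and `split_last_at` are hypothetical helpers that only mimic the two separator rules described above, applied to a made-up authority string containing two ‘@’ characters.

```python
def split_first_at(authority: str):
    """Split userinfo from host at the FIRST '@' (the 86 reading above)."""
    userinfo, sep, host = authority.partition("@")
    return (userinfo, host) if sep else (None, authority)


def split_last_at(authority: str):
    """Split userinfo from host at the LAST '@' (the TWUS reading above)."""
    userinfo, sep, host = authority.rpartition("@")
    return (userinfo, host) if sep else (None, authority)


# A made-up authority component with two '@' characters.
authority = "user@evil.example@www.example.com"

print(split_first_at(authority))
print(split_last_at(authority))
```

With the first rule the host part becomes `evil.example@www.example.com`; with the second it becomes `www.example.com`. Two components that disagree on this can end up validating one host and connecting to another.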

Hostname

Numerical IP addresses

86: mentions how IPv4 addresses in dot-notation are valid.

TWUS: specifies that both 32-bit numbers (“12345677”) and partial dotted addresses (“127.0”) are valid.

Real world: 32-bit numbers occur, and they work automatically when typical OS-level name resolver functions are used, since those often support them out of the box.

Different base numericals

86: mentions how each number in an IPv4 dotted address is a decimal number between 0 and 255.

TWUS: doesn’t specify which base the numbers should be written in, which I presume makes “http://0177.0.0.1” etc. valid, with an octal number as the first 8-bit value.

Real world: getaddrinfo() handles octals and hex in IPv4 addresses.
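To show what the lenient forms mentioned above expand to, here is a small pure-Python sketch. `parse_lenient_ipv4` is a hypothetical helper written for this post; it follows the classic resolver convention (a leading `0x` means hex, a leading `0` means octal, and the last part fills all remaining bytes) and is not taken from either specification or from any library.

```python
def parse_lenient_ipv4(text: str) -> str:
    """Expand a lenient IPv4 string into plain dotted-decimal form."""
    parts = text.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"expected 1 to 4 parts, got {len(parts)}")

    numbers = []
    for part in parts:
        lower = part.lower()
        if lower.startswith("0x"):
            digits, base = lower[2:], 16
        elif len(lower) > 1 and lower.startswith("0"):
            digits, base = lower[1:], 8
        else:
            digits, base = lower, 10
        allowed = "0123456789abcdef"[:base]
        if not digits or any(ch not in allowed for ch in digits):
            raise ValueError(f"bad IPv4 part: {part!r}")
        numbers.append(int(digits, base))

    # Every part but the last is one byte; the last fills the remaining bytes.
    *leading, last = numbers
    if any(n > 255 for n in leading) or last >= 256 ** (4 - len(leading)):
        raise ValueError("IPv4 part out of range")

    value = last
    for index, n in enumerate(leading):
        value += n << (8 * (3 - index))
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


for sample in ("0177.0.0.1", "127.0", "12345677", "0x7f.1"):
    print(sample, "->", parse_lenient_ipv4(sample))
```

A filter that only compares host strings against “127.0.0.1” would miss every one of these spellings, which is why the base question matters for interop.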

IDNA

Host names were traditionally ASCII-based. The introduction of IDN host names caused problems for the specifications, which are lacking in this area.

86: Is written to work without IDN (ASCII characters only), so it basically works with already punycoded domain names.

87: Specifies IDNA 2003 to be used

TWUS: Doesn’t specify IDNA 2003 nor 2008, but somehow that’s still clear

Real world: A total mess. Some national registries (the German DENIC, for example) require IDNA 2008, which causes user agents that treat the host name according to IDNA 2003 or IDNA 2008 TR46/transitional to fail, or even to resolve the wrong IP address. Some user agents use IDNA 2003, some use IDNA 2008 TR46/transitional and some use IDNA 2008 TR46/non-transitional.

EURid using IDNA 2008 with Homoglyph Bundling rules

DENIC describing IDNA 2003/2008 collisions

This is an interop collision

This is a security issue
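The German sharp s (ß) is a well-known character where IDNA 2003 and IDNA 2008 diverge, so it makes a compact demonstration. The sketch below uses Python's built-in `idna` codec, which implements IDNA 2003, and compares it with a plain punycode encoding of the unmapped label. The second function, `to_ascii_unmapped`, is a simplification invented for this example and only approximates what an IDNA 2008 non-transitional client does; the host name `faß.de` is used purely as an illustration.

```python
def to_ascii_idna2003(hostname: str) -> str:
    """Encode with Python's built-in codec, which follows IDNA 2003."""
    return hostname.encode("idna").decode("ascii")


def to_ascii_unmapped(hostname: str) -> str:
    """Punycode each non-ASCII label as-is, without IDNA 2003 mapping.

    Simplified stand-in for IDNA 2008 non-transitional processing;
    it performs no validation or normalization.
    """
    labels = []
    for label in hostname.split("."):
        if label.isascii():
            labels.append(label)
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


hostname = "faß.de"  # illustrative name only
print("IDNA 2003:", to_ascii_idna2003(hostname))
print("unmapped: ", to_ascii_unmapped(hostname))
```

The IDNA 2003 path maps ß to “ss” before encoding, so the two functions produce different ASCII names for the same input, and those names can be registered by different parties.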

Port number

The port number is a TCP port number between 0 and 65535.

86 and TWUS agree that this is a base-10 number that is virtually unbounded in length. 00000000000000000000000000000000000080 means 80 in both specs.

TWUS: limits the number to a 16-bit unsigned value (0–65535) while 86 has no such language.

Real world: at least curl and wget2 ignore “rubbish” entered after the number all the way to the next component divider (a slash, a pound sign, or a question mark). That seems to be a bug according to both 86 and TWUS.

Also, when using URLs containing multiple port numbers like “http://[127.0.0.1]:11211:80”, many URL parsers (Ruby, JavaScript, PHP, Perl) will extract and use the latter port number (80) and ignore the first one, some other parsers will extract and use the first one, and some will report errors.
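A strict reading of the rules above fits in a few lines. The sketch below uses a hypothetical `parse_port` helper (not any library's API) that accepts only ASCII digits, tolerates any number of leading zeros, applies the 16-bit limit from TWUS, and rejects trailing “rubbish” instead of ignoring it.

```python
import re
from typing import Optional


def parse_port(text: str) -> Optional[int]:
    """Parse the text between ':' and the next component divider."""
    if text == "":
        return None  # an empty port means "use the scheme default"
    if re.fullmatch(r"[0-9]+", text) is None:
        raise ValueError(f"port contains non-digit characters: {text!r}")
    port = int(text)  # leading zeros are harmless in base 10
    if port > 65535:
        raise ValueError(f"port out of 16-bit range: {port}")
    return port


print(parse_port("00000000000000000000000000000000000080"))
print(parse_port("443"))
```

Under this sketch, inputs such as “80rubbish” or “11211:80” raise `ValueError` rather than silently picking one of the possible interpretations.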

Path

8bit

86: says that a URL path is specified as ASCII characters or needs to be URL-encoded. That makes 8-bit characters illegal.

TWUS:

Real world: 8-bit characters are occasionally seen in URLs in the wild, and when used in redirects, browsers are known to URL-encode them in the next outgoing request.

U+0020, space

86: does not allow spaces (U+0020) to be part of the path. A space instead ends the URL.

TWUS: allows spaces in URLs and URL-encodes them as %20 when sent in a request. A TWUS URL thus needs to end on another character, or there must be another way to know where it ends.

Real world: Spaces are occasionally seen in URLs in the wild, and when used in redirects, browsers are known to URL-encode them in the next outgoing request.
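The browser behaviour described for spaces and 8-bit characters can be approximated with `urllib.parse.quote` from the Python standard library. This is a sketch under assumptions: `encode_path` is a name made up for this post, UTF-8 is assumed as the byte encoding, and existing percent signs are left alone so that already-encoded input is not encoded twice.

```python
from urllib.parse import quote


def encode_path(path: str) -> str:
    """Percent-encode spaces and non-ASCII characters in a URL path.

    "/" and "%" are kept as-is so that path separators and existing
    escapes survive unchanged.
    """
    return quote(path, safe="/%")


print(encode_path("/a b/caf\u00e9"))
print(encode_path("/a%20b"))
```

The first call turns the space into `%20` and the “é” into its two UTF-8 bytes, `%C3%A9`; the second call returns its input unchanged.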

Leading slashes in file: URLs

86: Since Unix file systems can handle any number of leading slashes, they have been fine.

TWUS: No leading slashes on file: URLs

Pretending backslashes are slashes

86: Backslashes are not slashes

TWUS: Backslashes should be converted to slashes and then treated as such!

Query

Some interop problems exist in the query component as well, but they are not detailed here.

Fragment

Web browsers have not entirely decided how fragments work, at least not for data: URLs. Should they support fragments or not? WebKit differs from the others.

Test suite

TWUS has a test suite, which you can find at the link below:

https://github.com/web-platform-tests/wpt/blob/master/url/resources/urltestdata.json
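If you want to run your own parser against that data, a starting point could look like the sketch below. It assumes the file is a JSON array that mixes plain comment strings with test objects carrying an `input`, a `base`, and either the expected components or `"failure": true`. The helper names and the tiny inline `sample` are made up for illustration and are not copied from the real file.

```python
import json


def load_cases(json_text: str):
    """Return only the test objects, skipping the comment strings."""
    entries = json.loads(json_text)
    return [entry for entry in entries if isinstance(entry, dict)]


def split_by_expectation(cases):
    """Separate inputs that should parse from inputs that should fail."""
    must_fail = [case for case in cases if case.get("failure")]
    must_parse = [case for case in cases if not case.get("failure")]
    return must_parse, must_fail


# Made-up sample in the same general shape as the test data.
sample = """[
  "A comment string between groups of cases",
  {"input": "http://example.com/", "base": null, "href": "http://example.com/"},
  {"input": "http://[", "base": null, "failure": true}
]"""

must_parse, must_fail = split_by_expectation(load_cases(sample))
print(len(must_parse), "should parse,", len(must_fail), "should fail")
```

From here you would feed each `input` (resolved against `base` when it is not null) to the parser under test and compare the result with the expected fields.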

I hope this post was interesting and informative on the various technical aspects. Stay tuned for more content.

Thank you.
