Goals
The URL standard takes the following approach towards making URLs fully interoperable:
- Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete the RFCs in the process. (E.g., spaces, other "illegal" code points, query encoding, equality, canonicalization, are all concepts not entirely shared, or defined.) URL parsing needs to become as solid as HTML parsing. [RFC3986] [RFC3987]
- Standardize on the term URL. URI and IRI are just confusing. In practice a single algorithm is used for both so keeping them distinct is not helping anyone. URL also easily wins the search result popularity contest.
- Supplanting Origin of a URI [sic]. [RFC6454]
- Define URL’s existing JavaScript API in full detail and add enhancements to make it easier to work with. Add a new URL object as well for URL manipulation without usage of HTML elements. (Useful for JavaScript worker environments.)
- Ensure the combination of parser, serializer, and API guarantee idempotence. For example, a non-failure result of a parse-then-serialize operation will not change with any further parse-then-serialize operations applied to it. Similarly, manipulating a non-failure result through the API will not change from applying any number of serialize-then-parse operations to it.
As the editors learn more about the subject matter the goals might increase in scope somewhat.
1. Infrastructure
This specification depends on Infra. [INFRA]
Some terms used in this specification are defined in the following standards and specifications:
- Encoding [ENCODING]
- File API [FILEAPI]
- HTML [HTML]
- Unicode IDNA Compatibility Processing [UTS46]
- Web IDL [WEBIDL]
To serialize an integer, represent it as the shortest possible decimal number.
1.1. Writing
A validation error indicates a mismatch between input and valid input. User agents, especially conformance checkers, are encouraged to report them somewhere.
A validation error does not mean that the parser terminates. Termination of a parser is always stated explicitly, e.g., through a return statement.
It is useful to signal validation errors as error-handling can be non-intuitive, legacy user agents might not implement correct error-handling, and the intent of what is written might be unclear to other developers.
Error type | Error description | Failure |
---|---|---|
IDNA | | |
domain-to-ASCII | Unicode ToASCII records an error or returns the empty string. [UTS46] If details about Unicode ToASCII errors are recorded, user agents are encouraged to pass those along. | Yes |
domain-invalid-code-point | The input’s host contains a forbidden domain code point. Hosts are percent-decoded before being processed when the URL is special, which would result in the following host portion becoming " " | Yes |
domain-to-Unicode | Unicode ToUnicode records an error. [UTS46] The same considerations as with domain-to-ASCII apply. | · |
Host parsing | | |
host-invalid-code-point | An opaque host (in a URL that is not special) contains a forbidden host code point. | Yes |
IPv4-empty-part | An IPv4 address ends with a U+002E (.). | · |
IPv4-too-many-parts | An IPv4 address does not consist of exactly 4 parts. | Yes |
IPv4-non-numeric-part | An IPv4 address part is not numeric. | Yes |
IPv4-non-decimal-part | The IPv4 address contains numbers expressed using hexadecimal or octal digits. | · |
IPv4-out-of-range-part | An IPv4 address part exceeds 255. | Yes (only if applicable to the last part) |
IPv6-unclosed | An IPv6 address is missing the closing U+005D (]). | Yes |
IPv6-invalid-compression | An IPv6 address begins with improper compression. | Yes |
IPv6-too-many-pieces | An IPv6 address contains more than 8 pieces. | Yes |
IPv6-multiple-compression | An IPv6 address is compressed in more than one spot. | Yes |
IPv6-invalid-code-point | An IPv6 address contains a code point that is neither an ASCII hex digit nor a U+003A (:). Or it unexpectedly ends. | Yes |
IPv6-too-few-pieces | An uncompressed IPv6 address contains fewer than 8 pieces. | Yes |
IPv4-in-IPv6-too-many-pieces | An IPv6 address with IPv4 address syntax: the IPv6 address has more than 6 pieces. | Yes |
IPv4-in-IPv6-invalid-code-point | An IPv6 address with IPv4 address syntax: | Yes |
IPv4-in-IPv6-out-of-range-part | An IPv6 address with IPv4 address syntax: an IPv4 part exceeds 255. | Yes |
IPv4-in-IPv6-too-few-parts | An IPv6 address with IPv4 address syntax: an IPv4 address contains too few parts. | Yes |
URL parsing | | |
invalid-URL-unit | A code point is found that is not a URL unit. | · |
special-scheme-missing-following-solidus | The input’s scheme is not followed by " | · |
missing-scheme-non-relative-URL | The input is missing a scheme, because it does not begin with an ASCII alpha, and either no base URL was provided or the base URL cannot be used as a base URL because it has an opaque path. Input’s scheme is missing and no base URL is given: Input’s scheme is missing, but the base URL has an opaque path. | Yes |
invalid-reverse-solidus | The URL has a special scheme and it uses U+005C (\) instead of U+002F (/). | · |
invalid-credentials | The input includes credentials. | · |
host-missing | The input has a special scheme, but does not contain a host. | Yes |
port-out-of-range | The input’s port is too big. | Yes |
port-invalid | The input’s port is invalid. | Yes |
file-invalid-Windows-drive-letter | The input is a relative-URL string that starts with a Windows drive letter and the base URL’s scheme is " | · |
file-invalid-Windows-drive-letter-host | A | · |
1.2. Parsers
The EOF code point is a conceptual code point that signifies the end of a string or code point stream.
A pointer for a string input is an integer that points to a code point within input. Initially it points to the start of input. If it is −1 it points nowhere. If it is greater than or equal to input’s code point length, it points to the EOF code point.
When a pointer is used, c references the code point the pointer points to as long as it does not point nowhere. When the pointer points to nowhere c cannot be used.
When a pointer is used, remaining references the code point substring from the pointer + 1 to the end of the string, as long as c is not the EOF code point. When c is the EOF code point remaining cannot be used.
If "mailto:username@example" is a string being processed and a pointer points to @, c is U+0040 (@) and remaining is "example".
If the empty string is being processed and a pointer points to the start and is then decreased by 1, using c or remaining would be an error.
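The pointer, c, and remaining definitions above can be illustrated with a minimal Python sketch. The `Pointer` class, the `EOF` sentinel, and the use of exceptions for "cannot be used" are illustrative assumptions, not something this specification defines.

```python
# Hypothetical stand-in for the conceptual EOF code point.
EOF = None


class Pointer:
    """Illustrative pointer over a string; Python indexes str by code point."""

    def __init__(self, text: str):
        self.text = text
        self.index = 0  # initially points to the start of the input

    @property
    def c(self):
        if self.index < 0:
            # The pointer points nowhere, so c cannot be used.
            raise ValueError("pointer points nowhere; c cannot be used")
        if self.index >= len(self.text):
            return EOF
        return self.text[self.index]

    @property
    def remaining(self):
        if self.c is EOF:
            raise ValueError("c is the EOF code point; remaining cannot be used")
        # Code point substring from pointer + 1 to the end of the string.
        return self.text[self.index + 1:]
```

With the input "mailto:username@example" and the pointer moved to the @, `c` is "@" and `remaining` is "example", matching the example above.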
1.3. Percent-encoded bytes
A percent-encoded byte is U+0025 (%), followed by two ASCII hex digits.
It is generally a good idea for sequences of percent-encoded bytes to be such that, when percent-decoded and then passed to UTF-8 decode without BOM or fail, they do not end up as failure. How important this is depends on where the percent-encoded bytes are used. E.g., for the host parser not following this advice is fatal, whereas for URL rendering the percent-encoded bytes would not be rendered percent-decoded.
To percent-encode a byte byte, return a string consisting of U+0025 (%), followed by two ASCII upper hex digits representing byte.
To percent-decode a byte sequence input, run these steps:
Using anything but UTF-8 decode without BOM when input contains bytes that are not ASCII bytes might be insecure and is not recommended.
- Let output be an empty byte sequence.
- For each byte byte in input:
  - If byte is not 0x25 (%), then append byte to output.
  - Otherwise, if byte is 0x25 (%) and the next two bytes after byte in input are not in the ranges 0x30 (0) to 0x39 (9), 0x41 (A) to 0x46 (F), and 0x61 (a) to 0x66 (f), all inclusive, append byte to output.
  - Otherwise:
    - Let bytePoint be the two bytes after byte in input, decoded, and then interpreted as hexadecimal number.
    - Append a byte whose value is bytePoint to output.
    - Skip the next two bytes in input.
- Return output.
To percent-decode a scalar value string input:
- Let bytes be the UTF-8 encoding of input.
- Return the percent-decoding of bytes.
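The two percent-decode algorithms above can be sketched in Python as follows. The function names are illustrative, and Python's `bytes` and `str` types are assumed as stand-ins for byte sequences and scalar value strings.

```python
# Bytes in the ranges 0-9, A-F, and a-f.
HEX_DIGITS = b"0123456789ABCDEFabcdef"


def percent_decode(data: bytes) -> bytes:
    """Sketch of percent-decoding a byte sequence."""
    output = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte != 0x25:
            output.append(byte)
        elif (
            i + 2 >= len(data)
            or data[i + 1] not in HEX_DIGITS
            or data[i + 2] not in HEX_DIGITS
        ):
            # "%" not followed by two hex digits is copied through unchanged.
            output.append(byte)
        else:
            # Interpret the two bytes after "%" as a hexadecimal number.
            output.append(int(data[i + 1:i + 3].decode("ascii"), 16))
            i += 2  # skip the two bytes just consumed
        i += 1
    return bytes(output)


def percent_decode_string(text: str) -> bytes:
    """Sketch of percent-decoding a scalar value string."""
    return percent_decode(text.encode("utf-8"))
```

Note that the result is a byte sequence, not a string; decoding it (for example as UTF-8) is a separate step left to the caller.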
In general, percent-encoding results in a string with more U+0025 (%) code points than the input, and percent-decoding results in a byte sequence with fewer 0x25 (%) bytes than the input.
The C0 control percent-encode set are the C0 controls and all code points greater than U+007E (~).
The fragment percent-encode set is the C0 control percent-encode set and U+0020 SPACE, U+0022 ("), U+003C (<), U+003E (>), and U+0060 (`).
The query percent-encode set is the C0 control percent-encode set and U+0020 SPACE, U+0022 ("), U+0023 (#), U+003C (<), and U+003E (>).
The query percent-encode set cannot be defined in terms of the fragment percent-encode set due to the omission of U+0060 (`).
The special-query percent-encode set is the query percent-encode set and U+0027 (').
The path percent-encode set is the query percent-encode set and U+003F (?), U+005E (^), U+0060 (`), U+007B ({), and U+007D (}).
The userinfo percent-encode set is the path percent-encode set and U+002F (/), U+003A (:), U+003B (;), U+003D (=), U+0040 (@), U+005B ([) to U+005D (]), inclusive, and U+007C (|).
The component percent-encode set is the userinfo percent-encode set and U+0024 ($) to U+0026 (&), inclusive, U+002B (+), and U+002C (,).
This is used by HTML for registerProtocolHandler(), and could also be used by other standards to percent-encode data that can then be embedded in a URL’s path, query, or fragment; or in an opaque host. Using it with UTF-8 percent-encode gives identical results to JavaScript’s encodeURIComponent() [sic]. [HTML] [ECMA-262]
The application/x-www-form-urlencoded percent-encode set is the component percent-encode set and U+0021 (!), U+0027 (') to U+0029 RIGHT PARENTHESIS, inclusive, and U+007E (~).
The application/x-www-form-urlencoded percent-encode set contains all code points, except the ASCII alphanumeric, U+002A (*), U+002D (-), U+002E (.), and U+005F (_).
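The chain of percent-encode sets defined above can be written down as a small Python sketch. The constant and function names are illustrative assumptions; each set is modelled as the C0 control percent-encode set (a predicate, since it covers every code point above U+007E) plus a finite set of extra ASCII code points.

```python
import string


def in_c0_control_percent_encode_set(ch: str) -> bool:
    # C0 controls and all code points greater than U+007E (~).
    return ord(ch) <= 0x1F or ord(ch) > 0x7E


# Extra ASCII code points each set adds on top of the C0 control set.
FRAGMENT = frozenset(' "<>`')
QUERY = frozenset(' "#<>')
SPECIAL_QUERY = QUERY | {"'"}
PATH = QUERY | frozenset("?^`{}")
USERINFO = PATH | frozenset("/:;=@[\\]|")
COMPONENT = USERINFO | frozenset("$%&+,")
FORM_URLENCODED = COMPONENT | frozenset("!'()~")


def in_percent_encode_set(ch: str, extra: frozenset) -> bool:
    return in_c0_control_percent_encode_set(ch) or ch in extra


def unencoded_ascii(extra: frozenset) -> set:
    """ASCII code points that the given set leaves untouched."""
    return {
        chr(cp) for cp in range(0x80)
        if not in_percent_encode_set(chr(cp), extra)
    }
```

Computing `unencoded_ascii(FORM_URLENCODED)` yields exactly the ASCII alphanumerics plus `*`, `-`, `.`, and `_`, which agrees with the statement above.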
To percent-encode after encoding, given an encoding encoding, scalar value string input, a percentEncodeSet, and an optional boolean spaceAsPlus (default false):
- Let encoder be the result of getting an encoder from encoding.
- Let inputQueue be input converted to an I/O queue.
- Let output be the empty string.
- Let potentialError be 0.
  This needs to be a non-null value to initiate the subsequent while loop.
- While potentialError is non-null:
  - Let encodeOutput be an empty I/O queue.
  - Set potentialError to the result of running encode or fail with inputQueue, encoder, and encodeOutput.
  - For each byte of encodeOutput converted to a byte sequence:
    - If spaceAsPlus is true and byte is 0x20 (SP), then append U+002B (+) to output and continue.
    - Let isomorph be a code point whose value is byte’s value.
    - Assert: percentEncodeSet includes all non-ASCII code points.
    - If isomorph is not in percentEncodeSet, then append isomorph to output.
    - Otherwise, percent-encode byte and append the result to output.
  - If potentialError is non-null, then append "%26%23", followed by the shortest sequence of ASCII digits representing potentialError in base ten, followed by "%3B", to output.
    This can happen when encoding is not UTF-8.
- Return output.
Of the possible values for the percentEncodeSet argument only two end up encoding U+0025 (%) and thus give “roundtripable data”: component percent-encode set and application/x-www-form-urlencoded percent-encode set. The other values for the percentEncodeSet argument, which happen to be used by the URL parser, leave U+0025 (%) untouched and as such it needs to be percent-encoded first in order to be properly represented.
To UTF-8 percent-encode a scalar value scalarValue using a percentEncodeSet, return the result of running percent-encode after encoding with UTF-8, scalarValue as a string, and percentEncodeSet.
To UTF-8 percent-encode a scalar value string input using a percentEncodeSet, return the result of running percent-encode after encoding with UTF-8, input, and percentEncodeSet.
Here is a summary, by way of example, of the operations defined above:
Operation | Input | Output |
---|---|---|
Percent-encode input | 0x23 | "%23" |
Percent-encode input | 0x7F | "%7F" |
Percent-decode input | `%25%s%1G` | `%%s%1G` |
Percent-decode input | "‽%25%2E" | 0xE2 0x80 0xBD 0x25 0x2E |
Percent-encode after encoding with Shift_JIS, input, and the userinfo percent-encode set | " " | "%20" |
Percent-encode after encoding with Shift_JIS, input, and the userinfo percent-encode set | "≡" | "%81%DF" |
Percent-encode after encoding with Shift_JIS, input, and the userinfo percent-encode set | "‽" | "%26%238253%3B" |
Percent-encode after encoding with ISO-2022-JP, input, and the userinfo percent-encode set | "¥" | "%1B(J\%1B(B" |
Percent-encode after encoding with Shift_JIS, input, the userinfo percent-encode set, and true | "1+1 ≡ 2%20‽" | "1+1+%81%DF+2%20%26%238253%3B" |
UTF-8 percent-encode input using the userinfo percent-encode set | U+2261 (≡) | "%E2%89%A1" |
UTF-8 percent-encode input using the userinfo percent-encode set | U+203D (‽) | "%E2%80%BD" |
UTF-8 percent-encode input using the userinfo percent-encode set | "Say what‽" | "Say%20what%E2%80%BD" |
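Several rows of this summary can be reproduced with a rough Python sketch of percent-encode after encoding. This is an approximation under stated assumptions: Python's built-in codecs stand in for the Encoding Standard's encoders (their mappings can differ), the input is encoded one code point at a time (so stateful encodings such as ISO-2022-JP are not modelled), and all names are illustrative.

```python
# Extra ASCII code points of the userinfo percent-encode set.
USERINFO_EXTRA = frozenset(' "#<>?^`{}/:;=@[\\]|')


def in_userinfo_percent_encode_set(cp: int) -> bool:
    # C0 controls, everything above U+007E, plus the extras above.
    return cp <= 0x1F or cp > 0x7E or chr(cp) in USERINFO_EXTRA


def percent_encode_after_encoding(encoding, text, in_set, space_as_plus=False):
    output = []
    for ch in text:
        try:
            # ASSUMPTION: a Python codec approximates the spec's encoder.
            encoded = ch.encode(encoding)
        except UnicodeEncodeError:
            # Unencodable code point: emit percent-encoded "&#<decimal>;".
            output.append(f"%26%23{ord(ch)}%3B")
            continue
        for byte in encoded:
            if space_as_plus and byte == 0x20:
                output.append("+")
            elif not in_set(byte):
                output.append(chr(byte))  # the "isomorph" code point
            else:
                output.append(f"%{byte:02X}")
    return "".join(output)


def utf8_percent_encode(text, in_set):
    return percent_encode_after_encoding("utf-8", text, in_set)
```

For example, `utf8_percent_encode("Say what‽", in_userinfo_percent_encode_set)` gives "Say%20what%E2%80%BD", as in the last row of the table.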
2. Security considerations
The security of a URL is a function of its environment. Care is to be taken when rendering, interpreting, and passing URLs around.
When rendering and allocating new URLs, "spoofing" needs to be considered: an attack whereby one host or URL can be confused for another. For instance, consider how 1/l/I, m/rn/rri, 0/O, and а/a can all appear eerily similar. Or worse, consider how U+202A LEFT-TO-RIGHT EMBEDDING and similar code points are invisible. [UTR36]
When passing a URL from party A to B, both need to carefully consider what is happening. A might end up leaking data it does not want to leak. B might receive input it did not expect and take an action that harms the user. In particular, B should never trust A, as at some point URLs from A can come from untrusted sources.
3. Hosts (domains and IP addresses)
At a high level, a host, valid host string, host parser, and host serializer relate as follows:
- The host parser takes an arbitrary scalar value string and returns either failure or a host.
- A host can be seen as the in-memory representation.
- A valid host string defines what input would not trigger a validation error or failure when given to the host parser. I.e., input that would be considered conforming or valid.
- The host serializer takes a host and returns an ASCII string. (If that string is then parsed, the result will equal the host that was serialized.)
A parse-serialize roundtrip gives the following results, depending on the isOpaque argument to the host parser:
Input | Output (isOpaque = false) | Output (isOpaque = true) |
---|---|---|
EXAMPLE.COM | example.com (domain) | EXAMPLE.COM (opaque host) |
example%2Ecom | example.com (domain) | example%2Ecom (opaque host) |
faß.example | xn--fa-hia.example (domain) | fa%C3%9F.example (opaque host) |
0 | 0.0.0.0 (IPv4) | 0 (opaque host) |
%30 | 0.0.0.0 (IPv4) | %30 (opaque host) |
0x | 0.0.0.0 (IPv4) | 0x (opaque host) |
0xffffffff | 255.255.255.255 (IPv4) | 0xffffffff (opaque host) |
[0:0::1] | [::1] (IPv6) | [::1] (IPv6) |
[0:0::1%5D | Failure | Failure |
[0:0::%31] | Failure | Failure |
09 | Failure | 09 (opaque host) |
example.255 | Failure | example.255 (opaque host) |
example^example | Failure | Failure |
3.1. Host representation
A host is a domain, an IP address, an opaque host, or an empty host. Typically a host serves as a network address, but it is sometimes used as an opaque identifier in URLs where a network address is not necessary.
A typical URL whose host is an opaque host is git://github.com/whatwg/url.git.
The RFCs referenced in the paragraphs below are for informative purposes only. They have no influence on host writing, parsing, and serialization, unless stated otherwise in the sections that follow.
A domain is a non-empty ASCII string that identifies a realm within a network. [RFC1034]
The domain labels of a domain domain are the result of strictly splitting domain on U+002E (.).
The example.com and example.com. domains are not equivalent and typically treated as distinct.
An IP address is an IPv4 address or an IPv6 address.
An IPv4 address is a 32-bit unsigned integer that identifies a network address. [RFC791]
An IPv6 address is a 128-bit unsigned integer that identifies a network address. This integer is composed of a list of 8 16-bit unsigned integers, also known as an IPv6 address’s pieces. [RFC4291]
Support for <zone_id> is intentionally omitted.
An opaque host is a non-empty ASCII string that can be used for further processing.
An empty host is the empty string.
3.2. Host miscellaneous
A forbidden host code point is U+0000 NULL, U+0009 TAB, U+000A LF, U+000D CR, U+0020 SPACE, U+0023 (#), U+002F (/), U+003A (:), U+003C (<), U+003E (>), U+003F (?), U+0040 (@), U+005B ([), U+005C (\), U+005D (]), U+005E (^), or U+007C (|).
A forbidden domain code point is a forbidden host code point, a C0 control, U+0025 (%), or U+007F DELETE.
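As a minimal Python sketch, the two definitions above can be expressed as a set and a predicate. The names are illustrative assumptions.

```python
# The 17 forbidden host code points listed above.
FORBIDDEN_HOST_CODE_POINTS = frozenset("\x00\t\n\r #/:<>?@[\\]^|")


def is_forbidden_domain_code_point(ch: str) -> bool:
    cp = ord(ch)
    return (
        ch in FORBIDDEN_HOST_CODE_POINTS
        or cp <= 0x1F   # C0 control
        or ch == "%"    # U+0025
        or cp == 0x7F   # U+007F DELETE
    )


def contains_forbidden_host_code_point(text: str) -> bool:
    return any(ch in FORBIDDEN_HOST_CODE_POINTS for ch in text)
```

For instance, `contains_forbidden_host_code_point("example^example")` is true because of U+005E (^), while U+0025 (%) is forbidden only in domains.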
To obtain the public suffix of a host host, run these steps. They return null or a domain representing a portion of host that is included on the Public Suffix List. [PSL]
- If host is not a domain, then return null.
- Let trailingDot be "." if host ends with "."; otherwise the empty string.
- Let publicSuffix be the public suffix determined by running the Public Suffix List algorithm with host as domain. [PSL]
- Assert: publicSuffix is an ASCII string that does not end with ".".
- Return publicSuffix and trailingDot concatenated.
To obtain the registrable domain of a host host, run these steps. They return null or a domain formed by host’s public suffix and the domain label preceding it, if any.
- If host’s public suffix is null or host’s public suffix equals host, then return null.
- Let trailingDot be "." if host ends with "."; otherwise the empty string.
- Let registrableDomain be the registrable domain determined by running the Public Suffix List algorithm with host as domain. [PSL]
- Assert: registrableDomain is an ASCII string that does not end with ".".
- Return registrableDomain and trailingDot concatenated.
Host input | Public suffix | Registrable domain |
---|---|---|
com | com | null |
example.com | com | example.com |
www.example.com | com | example.com |
sub.www.example.com | com | example.com |
EXAMPLE.COM | com | example.com |
example.com. | com. | example.com. |
github.io | github.io | null |
whatwg.github.io | github.io | whatwg.github.io |
إختبار | xn--kgbechtv | null |
example.إختبار | xn--kgbechtv | example.xn--kgbechtv |
sub.example.إختبار | xn--kgbechtv | example.xn--kgbechtv |
[2001:0db8:85a3:0000:0000:8a2e:0370:7334] | null | null |
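The trailing-dot handling in the two algorithms above can be sketched in Python. This is a simplification under loud assumptions: `TOY_SUFFIXES` is a hypothetical three-entry excerpt rather than the real Public Suffix List, wildcard and exception rules are not modelled, and the input is assumed to already be a domain (an ASCII, lowercased string), so the "return null if host is not a domain" step is left out.

```python
# ASSUMPTION: tiny hypothetical excerpt standing in for the real list.
TOY_SUFFIXES = {"com", "io", "github.io"}


def _split_trailing_dot(domain: str):
    if domain.endswith("."):
        return domain[:-1], "."
    return domain, ""


def public_suffix(domain: str) -> str:
    name, trailing_dot = _split_trailing_dot(domain)
    labels = name.split(".")
    # Try the longest candidate first so "github.io" wins over "io".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in TOY_SUFFIXES:
            return candidate + trailing_dot
    # No rule matched: fall back to the last label (the PSL default rule "*").
    return labels[-1] + trailing_dot


def registrable_domain(domain: str):
    suffix = public_suffix(domain)
    if suffix == domain:
        return None
    name, trailing_dot = _split_trailing_dot(domain)
    suffix_name, _ = _split_trailing_dot(suffix)
    suffix_label_count = suffix_name.count(".") + 1
    # Public suffix plus the one domain label preceding it.
    labels = name.split(".")
    return ".".join(labels[-(suffix_label_count + 1):]) + trailing_dot
```

With this toy list, `registrable_domain("example.com.")` returns "example.com." and `registrable_domain("github.io")` returns `None`, in line with the table above.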
Specifications should prefer the origin concept for security decisions. The notions of "public suffix" and "registrable domain" cannot be relied upon to provide a hard security boundary, as the public suffix list will diverge from client to client. Specifications which ignore this advice are encouraged to carefully consider whether URLs' schemes ought to be incorporated into any decisions made, i.e., whether to use the same site or schemelessly same site concepts.
3.3. IDNA
The domain to ASCII algorithm, given a string domain and a boolean beStrict, runs these steps:
- Let result be the result of running Unicode ToASCII with domain_name set to domain, CheckHyphens set to beStrict, CheckBidi set to true, CheckJoiners set to true, UseSTD3ASCIIRules set to beStrict, Transitional_Processing set to false, VerifyDnsLength set to beStrict, and IgnoreInvalidPunycode set to false. [UTS46]
  If beStrict is false, domain is an ASCII string, and strictly splitting domain on U+002E (.) does not produce any item that starts with an ASCII case-insensitive match for "xn--", this step is equivalent to ASCII lowercasing domain.
- If result is a failure value, domain-to-ASCII validation error, return failure.
- If beStrict is false:
  - If result is the empty string, domain-to-ASCII validation error, return failure.
  - If result contains a forbidden domain code point, domain-invalid-code-point validation error, return failure.
    Due to web compatibility and compatibility with non-DNS-based systems the forbidden domain code points are a subset of those disallowed when UseSTD3ASCIIRules is true. See also issue #397.
- Assert: result is not the empty string and does not contain a forbidden domain code point.
  Unicode IDNA Compatibility Processing guarantees this holds when beStrict is true. [UTS46]
- Return result.
This document and the web platform at large use Unicode IDNA Compatibility Processing and not IDNA2008. For instance, ☕.example becomes xn--53h.example and not failure. [UTS46] [RFC5890]
The domain to Unicode algorithm, given a domain domain and a boolean beStrict, runs these steps:
- Let result be the result of running Unicode ToUnicode with domain_name set to domain, CheckHyphens set to beStrict, CheckBidi set to true, CheckJoiners set to true, UseSTD3ASCIIRules set to beStrict, Transitional_Processing set to false, and IgnoreInvalidPunycode set to false. [UTS46]
- Signify domain-to-Unicode validation errors for any returned errors, and then return result.
3.4. Host writing
A valid host string must be a valid domain string, a valid IPv4-address string, or: U+005B ([), followed by a valid IPv6-address string, followed by U+005D (]).
A string input is a valid domain if these steps return true:
- Let domain be the result of running domain to ASCII with input and true.
- Return false if domain is failure; otherwise true.
Ideally we define this in terms of a sequence of code points that make up a valid domain rather than through a whack-a-mole: issue 245.
A valid domain string must be a string that is a valid domain.
A valid IPv4-address string must be four shortest possible strings of ASCII digits, representing a decimal number in the range 0 to 255, inclusive, separated from each other by U+002E (.).
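The valid IPv4-address string definition above is strict enough to check directly. Below is a minimal Python sketch; the function name is an illustrative assumption, and note that this checks validity of the written form only, which is narrower than what the IPv4 parser accepts.

```python
ASCII_DIGITS = "0123456789"


def is_valid_ipv4_address_string(text: str) -> bool:
    parts = text.split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # Each part is one to three ASCII digits.
        if not 1 <= len(part) <= 3:
            return False
        if any(ch not in ASCII_DIGITS for ch in part):
            return False
        # "Shortest possible" rules out leading zeros such as "01".
        if len(part) > 1 and part[0] == "0":
            return False
        if int(part) > 255:
            return False
    return True
```

Under this check "127.0.0.1" is valid, while "127.0.0.01", "256.0.0.1", and "0xffffffff" are not.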
A valid IPv6-address string is defined in the "Text Representation of Addresses" chapter of IP Version 6 Addressing Architecture. [RFC4291]
A valid opaque-host string must be one of the following:
- one or more URL units excluding forbidden host code points
- U+005B ([), followed by a valid IPv6-address string, followed by U+005D (]).
This is not part of the definition of valid host string as it requires context to be distinguished.
3.5. Host parsing
The host parser takes a scalar value string input with an optional boolean isOpaque (default false), and then runs these steps. They return failure or a host.
- If input starts with U+005B ([), then:
  - If input does not end with U+005D (]), IPv6-unclosed validation error, return failure.
  - Return the result of IPv6 parsing input with its leading U+005B ([) and trailing U+005D (]) removed.
- If isOpaque is true, then return the result of opaque-host parsing input.
- Assert: input is not the empty string.
- Let domain be the result of running UTF-8 decode without BOM on the percent-decoding of input.
  Alternatively UTF-8 decode without BOM or fail can be used, coupled with an early return for failure, as domain to ASCII fails on U+FFFD (�).
- Let asciiDomain be the result of running domain to ASCII with domain and false.
- If asciiDomain is failure, then return failure.
- If asciiDomain ends in a number, then return the result of IPv4 parsing asciiDomain.
- Return asciiDomain.
The ends in a number checker takes an ASCII string input and then runs these steps. They return a boolean.
- Let parts be the result of strictly splitting input on U+002E (.).
- If the last item in parts is the empty string, then:
- Let last be the last item in parts.
- If last is non-empty and contains only ASCII digits, then return true.
  The erroneous input "09" will be caught by the IPv4 parser at a later stage.
- If parsing last as an IPv4 number does not return failure, then return true.
  This is equivalent to checking that last is "0X" or "0x", followed by zero or more ASCII hex digits.
- Return false.
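The ends in a number checker can be sketched in Python as follows. Two things here are assumptions rather than text from this section: the handling of a trailing empty item (the sketch returns false when it is the only item and otherwise drops it before continuing), and the use of the "0x"/"0X" equivalence noted above in place of invoking an IPv4 number parser. The function name is illustrative.

```python
ASCII_DIGITS = "0123456789"
ASCII_HEX_DIGITS = "0123456789abcdefABCDEF"


def ends_in_a_number(text: str) -> bool:
    # str.split with an explicit separator keeps empty items (strict split).
    parts = text.split(".")
    if parts[-1] == "":
        # ASSUMPTION: sub-steps for a trailing empty item are not listed
        # above; this sketch ignores one trailing dot and rejects "".
        if len(parts) == 1:
            return False
        parts.pop()
    last = parts[-1]
    if last != "" and all(ch in ASCII_DIGITS for ch in last):
        return True
    # Equivalent check per the note: "0x" or "0X" plus zero or more hex digits.
    if last[:2] in ("0x", "0X") and all(ch in ASCII_HEX_DIGITS for ch in last[2:]):
        return True
    return False
```

This explains why "example.255" and "09" are handed to the IPv4 parser (and then fail there), while "example.com" is returned as a domain.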
The IPv4 parser takes an ASCII string input and then runs these steps. They return failure or an IPv4 address.
The IPv4 parser is not to be invoked directly. Instead check that the return value of the host parser is an IPv4 address.
-
Let parts be the result of strictly splitting input on U+002E (.).
-
If the last item in parts is the empty string, then:
-
If parts’s size is greater than 4, IPv4-too-many-parts validation error, return failure.
-
Let numbers be an empty list.