Encoding

Living Standard — Last Updated

Abstract

The Encoding Standard defines encodings and their JavaScript API.

1. Preface

The UTF-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore, for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the UTF-8 encoding.

The other (legacy) encodings have been defined to some extent in the past. However, user agents have not always implemented them in the same way, have not always used the same labels, and often differ in dealing with undefined and former proprietary areas of encodings. This specification addresses those gaps so that new user agents do not have to reverse engineer encoding implementations and existing user agents can converge.

In particular, this specification defines all those encodings, their algorithms to go from bytes to scalar values and back, and their canonical names and identifying labels. This specification also defines an API to expose part of the encoding algorithms to JavaScript.

User agents have also significantly deviated from the labels listed in the IANA Character Sets registry. To stop spreading legacy encodings further, this specification is exhaustive about the aforementioned details and therefore has no need for the registry. In particular, this specification does not provide a mechanism for extending any aspect of encodings.

2. Security background

There is a set of encoding security issues when the producer and consumer do not agree on the encoding in use, or on the way a given encoding is to be implemented. For instance, an attack was reported in 2011 where a Shift_JIS leading byte 0x82 was used to “mask” a 0x22 trailing byte in a JSON resource of which an attacker could control some field. The producer did not see the problem even though this is an illegal byte combination. The consumer decoded it as a single U+FFFD (�) and therefore changed the overall interpretation as U+0022 (") is an important delimiter. Decoders of encodings that use multiple bytes for scalar values now require that in case of an illegal byte combination, a scalar value in the range U+0000 to U+007F, inclusive, cannot be “masked”. For the aforementioned sequence the output would be U+FFFD U+0022. (As an unfortunate exception to this, the gb18030 decoder will “mask” up to one such byte at end-of-queue.)
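The difference between a "masking" and a non-masking decoder can be illustrated with a deliberately simplified sketch. This is not the Shift_JIS decoder defined by this specification: the lead and trail byte ranges are reduced, valid pairs are emitted as a placeholder instead of being looked up in an index, and the names toy_decode and mask_ascii are hypothetical.

```python
REPLACEMENT = "\uFFFD"

def is_lead(byte):
    # Simplified assumption: only 0x81 to 0x9F start a two-byte sequence.
    return 0x81 <= byte <= 0x9F

def is_trail(byte):
    # Simplified assumption: trail bytes are 0x40 to 0x7E or 0x80 to 0xFC.
    return 0x40 <= byte <= 0x7E or 0x80 <= byte <= 0xFC

def toy_decode(data, mask_ascii):
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        i += 1
        if byte <= 0x7F:
            out.append(chr(byte))
        elif is_lead(byte) and i < len(data):
            trail = data[i]
            if is_trail(trail):
                out.append("<pair>")  # placeholder; no index lookup here
                i += 1
            else:
                out.append(REPLACEMENT)
                # A masking decoder swallows the trail byte unconditionally.
                # A non-masking decoder leaves an ASCII trail byte in place
                # so that it is decoded on its own in the next iteration.
                if mask_ascii or trail > 0x7F:
                    i += 1
        else:
            out.append(REPLACEMENT)
    return "".join(out)

attack = bytes([0x22, 0x82, 0x22])  # '"', lead byte, '"'
masked = toy_decode(attack, mask_ascii=True)
unmasked = toy_decode(attack, mask_ascii=False)
```

With masking, the closing U+0022 disappears from the output; without it, the delimiter survives after the U+FFFD.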

This is a larger issue for encodings that map anything that is an ASCII byte to something that is not an ASCII code point, when there is no leading byte present. These are “ASCII-incompatible” encodings and other than ISO-2022-JP and UTF-16BE/LE, which are unfortunately required due to deployed content, they are not supported. (Investigation is ongoing whether more labels of other such encodings can be mapped to the replacement encoding, rather than the unknown encoding fallback.) An example attack is injecting carefully crafted content into a resource and then encouraging the user to override the encoding, resulting in, e.g., script execution.

Encoders used by URLs found in HTML and HTML’s form feature can also result in slight information loss when an encoding is used that cannot represent all scalar values. E.g., when a resource uses the windows-1252 encoding a server will not be able to distinguish between an end user entering “💩” and “&#128169;” into a form.
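The information loss can be reproduced outside a browser. The sketch below assumes that Python's cp1252 codec with the xmlcharrefreplace error handler is an adequate stand-in for the form-submission behavior described above; it is not the encoder defined by this specification.

```python
# U+1F4A9 cannot be represented in windows-1252, so it is replaced by a
# decimal numeric character reference.
typed_emoji = "\U0001F4A9".encode("cp1252", errors="xmlcharrefreplace")

# A user who literally types the nine characters "&#128169;".
typed_reference = "&#128169;".encode("cp1252")

indistinguishable = typed_emoji == typed_reference
```

Both inputs produce the same bytes, so the receiving server cannot tell which one the user entered.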

The problems outlined here go away when exclusively using UTF-8, which is one of the many reasons UTF-8 is now the mandatory encoding for all things.

See also the Browser UI chapter.

3. Terminology

This specification depends on the Infra Standard. [INFRA]

Hexadecimal numbers are prefixed with "0x".

In equations, all numbers are integers, addition is represented by "+", subtraction by "−", multiplication by "×", integer division by "/" (returns the quotient), modulo by "%" (returns the remainder of an integer division), logical left shifts by "<<", logical right shifts by ">>", bitwise AND by "&", and bitwise OR by "|".

For logical right shifts, operands must have at least twenty-one bits of precision.
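As a worked example of this notation, the sketch below computes the four UTF-8 bytes of a scalar value above U+FFFF using only shifts, bitwise AND, and bitwise OR. The function name utf8_four_bytes is hypothetical. In Python, the integer division written "/" here corresponds to the // operator, and 0x10FFFF, the largest scalar value, occupies exactly twenty-one bits.

```python
def utf8_four_bytes(code_point):
    # Only scalar values in the four-byte UTF-8 range are handled here.
    assert 0x10000 <= code_point <= 0x10FFFF
    return bytes([
        0xF0 | (code_point >> 18),           # top 3 bits
        0x80 | ((code_point >> 12) & 0x3F),  # next 6 bits
        0x80 | ((code_point >> 6) & 0x3F),   # next 6 bits
        0x80 | (code_point & 0x3F),          # lowest 6 bits
    ])

pile_of_poo = utf8_four_bytes(0x1F4A9)
bits_needed = (0x10FFFF).bit_length()
quotient, remainder = 1000 // 188, 1000 % 188
```

For U+1F4A9 the result is the byte sequence 0xF0 0x9F 0x92 0xA9.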


An I/O queue is a type of list with items of a particular type (i.e., bytes or scalar values). End-of-queue is a special item that can be present in I/O queues of any type and it signifies that there are no more items in the queue.

There are two ways to use an I/O queue: in immediate mode, to represent I/O data stored in memory, and in streaming mode, to represent data coming in from the network. Immediate queues have end-of-queue as their last item, whereas streaming queues need not have it, and so their read operation might block.

It is expected that streaming I/O queues will be created empty, and that new items will be pushed to them as data comes in from the network. When the underlying network stream closes, an end-of-queue item is to be pushed into the queue.

Since reading from a streaming I/O queue might block, streaming I/O queues are not to be used from an event loop. They are to be used in parallel instead.
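A streaming I/O queue with a blocking read might be sketched as follows. This is an illustration under stated assumptions, not a prescribed implementation: the class name StreamingQueue is hypothetical, a threading.Condition stands in for "wait", a worker thread stands in for "in parallel", and push is simplified to a plain append.

```python
import threading

END_OF_QUEUE = object()  # sentinel standing in for end-of-queue

class StreamingQueue:
    def __init__(self):
        self._items = []
        self._cond = threading.Condition()

    def push(self, item):
        # Simplified: appends without any special end-of-queue handling.
        with self._cond:
            self._items.append(item)
            self._cond.notify_all()

    def read(self):
        with self._cond:
            while not self._items:   # blocks until an item arrives
                self._cond.wait()
            if self._items[0] is END_OF_QUEUE:
                return END_OF_QUEUE  # end-of-queue is never removed
            return self._items.pop(0)

def consume(queue, sink):
    # Runs off the main thread, since read() may block.
    while True:
        item = queue.read()
        if item is END_OF_QUEUE:
            break
        sink.append(item)

queue = StreamingQueue()
received = []
worker = threading.Thread(target=consume, args=(queue, received))
worker.start()
for byte in b"\xF0\x9F\x92\xA9":  # data "arriving from the network"
    queue.push(byte)
queue.push(END_OF_QUEUE)          # the underlying stream closed
worker.join()
```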

To read an item from an I/O queue ioQueue, run these steps:

  1. If ioQueue is empty, then wait until its size is at least 1.

  2. If ioQueue[0] is end-of-queue, then return end-of-queue.

  3. Remove ioQueue[0] and return it.
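For an immediate-mode queue these steps reduce to a few lines. The sketch assumes a Python list as the queue and a sentinel object named END_OF_QUEUE; step 1 is omitted because an immediate queue always holds at least end-of-queue.

```python
END_OF_QUEUE = object()  # sentinel standing in for end-of-queue

def read_item(io_queue):
    # Step 2: end-of-queue is returned but stays in the queue.
    if io_queue[0] is END_OF_QUEUE:
        return END_OF_QUEUE
    # Step 3: remove and return the first item.
    return io_queue.pop(0)

queue = [0x41, 0x42, END_OF_QUEUE]
results = [read_item(queue) for _ in range(4)]
```

Reading past the last real item keeps returning end-of-queue, and the queue is left holding only that item.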

To read a number number of items from an I/O queue ioQueue, run these steps:

  1. Let readItems be « ».

  2. Perform the following step number times:

    1. Append to readItems the result of reading an item from ioQueue.

  3. Remove end-of-queue from readItems.

  4. Return readItems.
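A minimal sketch of reading several items, again assuming an immediate-mode queue stored in a Python list with a hypothetical END_OF_QUEUE sentinel:

```python
END_OF_QUEUE = object()  # sentinel standing in for end-of-queue

def read_item(io_queue):
    if io_queue[0] is END_OF_QUEUE:
        return END_OF_QUEUE
    return io_queue.pop(0)

def read_items(io_queue, number):
    # Steps 1 and 2: read `number` times, collecting every result.
    read = [read_item(io_queue) for _ in range(number)]
    # Step 3: drop any end-of-queue items that were collected.
    return [item for item in read if item is not END_OF_QUEUE]

queue = [0x41, 0x42, END_OF_QUEUE]
got = read_items(queue, 5)
```

Asking for more items than are available returns a shorter list rather than failing.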

To peek a number number of items from an I/O queue ioQueue, run these steps:

  1. Wait until either ioQueue’s size is equal to or greater than number, or ioQueue contains end-of-queue, whichever comes first.

  2. Let prefix be « ».

  3. For each n in the range 0 to number − 1, inclusive:

    1. If ioQueue[n] is end-of-queue, break.

    2. Otherwise, append ioQueue[n] to prefix.

  4. Return prefix.
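Peeking can be sketched in the same style. The code assumes an immediate-mode queue (so the wait in step 1 is unnecessary) and indexes from 0, matching ioQueue[0] in the read steps, so that the first item peeked is the one a subsequent read would return. The byte values are only an illustration of looking ahead for a three-byte prefix.

```python
END_OF_QUEUE = object()  # sentinel standing in for end-of-queue

def peek(io_queue, number):
    prefix = []
    for n in range(number):
        if io_queue[n] is END_OF_QUEUE:
            break  # fewer than `number` items are available
        prefix.append(io_queue[n])
    return prefix

queue = [0xEF, 0xBB, 0xBF, 0x41, END_OF_QUEUE]
first_three = peek(queue, 3)
short_prefix = peek([0xEF, END_OF_QUEUE], 3)
```

Unlike reading, peeking leaves the queue unchanged.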

To push an item item to an I/O queue ioQueue, run these steps:

  1. If the last item in ioQueue is end-of-queue:

    1. If item is end-of-queue, do nothing.

    2. Otherwise, insert item before the last item in ioQueue.

  2. Otherwise, append item to ioQueue.

To push a sequence of items to an I/O queue ioQueue is to push each item in the sequence to ioQueue, in the given order.
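The push steps keep end-of-queue as the last item no matter when it was pushed. A minimal sketch, assuming a Python list as the queue and a hypothetical END_OF_QUEUE sentinel:

```python
END_OF_QUEUE = object()  # sentinel standing in for end-of-queue

def push(io_queue, item):
    if io_queue and io_queue[-1] is END_OF_QUEUE:
        if item is END_OF_QUEUE:
            return  # a second end-of-queue is ignored
        # Keep end-of-queue last by inserting just before it.
        io_queue.insert(len(io_queue) - 1, item)
    else:
        io_queue.append(item)

def push_sequence(io_queue, items):
    for item in items:
        push(io_queue, item)

queue = []
push_sequence(queue, [0x41, 0x42])
push(queue, END_OF_QUEUE)
push(queue, END_OF_QUEUE)
push(queue, 0x43)
```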

To restore an item other than end-of-queue to an I/O queue, perform the list prepend operation. To restore a list of items excluding end-of-queue to an I/O queue, insert those items, in the given order, before the first item in the queue.
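Restoring is a prepend that preserves the order of the restored items. The sketch below assumes a Python list as the queue; the names restore_item and restore_items are hypothetical.

```python
END_OF_QUEUE = object()  # sentinel standing in for end-of-queue

def restore_item(io_queue, item):
    assert item is not END_OF_QUEUE
    io_queue.insert(0, item)

def restore_items(io_queue, items):
    assert all(item is not END_OF_QUEUE for item in items)
    # Slice assignment inserts the items, in order, before the first item.
    io_queue[0:0] = items

queue = [0x92, 0xA9, END_OF_QUEUE]
restore_items(queue, [0xF0, 0x9F])
```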

Inserting the bytes « 0xF0, 0x9F » in an I/O queue « 0x92, 0xA9, end-of-queue » results in an I/O queue « 0xF0, 0x9F, 0x92, 0xA9,