1. Preface
The UTF-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore, for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the UTF-8 encoding.
The other (legacy) encodings have been defined to some extent in the past. However, user agents have not always implemented them in the same way, have not always used the same labels, and often differ in dealing with undefined and former proprietary areas of encodings. This specification addresses those gaps so that new user agents do not have to reverse engineer encoding implementations and existing user agents can converge.
In particular, this specification defines all those encodings, their algorithms to go from bytes to scalar values and back, and their canonical names and identifying labels. This specification also defines an API to expose part of the encoding algorithms to JavaScript.
User agents have also significantly deviated from the labels listed in the IANA Character Sets registry. To stop spreading legacy encodings further, this specification is exhaustive about the aforementioned details and therefore has no need for the registry. In particular, this specification does not provide a mechanism for extending any aspect of encodings.
2. Security background
There is a set of encoding security issues when the producer and consumer do not agree on the encoding in use, or on the way a given encoding is to be implemented. For instance, an attack was reported in 2011 where a Shift_JIS leading byte 0x82 was used to “mask” a 0x22 trailing byte in a JSON resource of which an attacker could control some field. The producer did not see the problem even though this is an illegal byte combination. The consumer decoded it as a single U+FFFD (�) and therefore changed the overall interpretation as U+0022 (") is an important delimiter. Decoders of encodings that use multiple bytes for scalar values now require that in case of an illegal byte combination, a scalar value in the range U+0000 to U+007F, inclusive, cannot be “masked”. For the aforementioned sequence the output would be U+FFFD U+0022. (As an unfortunate exception to this, the gb18030 decoder will “mask” up to one such byte at end-of-queue.)
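The "masking" rule can be illustrated with a minimal sketch. This is a toy double-byte decoder, not the Shift_JIS decoder defined by this specification: the lead and trail byte ranges are simplified assumptions, valid pairs are emitted as a placeholder instead of being looked up in an index, and the function name is hypothetical. It only shows that an ASCII byte following a lead byte is processed again rather than swallowed.

```python
REPLACEMENT = "\uFFFD"

def decode_toy_double_byte(data: bytes) -> str:
    """Toy decoder: an invalid ASCII trail byte is never 'masked'."""
    out = []
    lead = None
    for byte in data:
        if lead is not None:
            lead = None
            if 0x40 <= byte <= 0x7E or 0x80 <= byte <= 0xFC:
                # Assumed valid trail range; a real decoder consults an index here.
                out.append("?")
                continue
            out.append(REPLACEMENT)
            if byte > 0x7F:
                continue
            # ASCII byte: fall through so it is handled as a fresh byte.
        if byte <= 0x7F:
            out.append(chr(byte))
        elif 0x81 <= byte <= 0x9F:
            lead = byte  # assumed lead byte range
        else:
            out.append(REPLACEMENT)
    if lead is not None:
        out.append(REPLACEMENT)  # lead byte at end of input
    return "".join(out)

masked_attempt = decode_toy_double_byte(b'{"a":"\x82"}')
```

Here the closing quotation mark after 0x82 survives, so the JSON delimiter structure is not altered by the stray lead byte.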
This is a larger issue for encodings that map anything that is an ASCII byte to something that is not an ASCII code point, when there is no leading byte present. These are “ASCII-incompatible” encodings and other than ISO-2022-JP and UTF-16BE/LE, which are unfortunately required due to deployed content, they are not supported. (Investigation is ongoing whether more labels of other such encodings can be mapped to the replacement encoding, rather than the unknown encoding fallback.) An example attack is injecting carefully crafted content into a resource and then encouraging the user to override the encoding, resulting in, e.g., script execution.
Encoders used by URLs found in HTML and HTML’s form feature can also result in slight information loss when an encoding is used that cannot represent all scalar values. E.g., when a resource uses the windows-1252 encoding a server will not be able to distinguish between an end user entering “💩” and “&#128169;” into a form.
The problems outlined here go away when exclusively using UTF-8, which is one of the many reasons why UTF-8 is now the mandatory encoding for all things.
See also the Browser UI chapter.
3. Terminology
This specification depends on the Infra Standard. [INFRA]
Hexadecimal numbers are prefixed with "0x".
In equations, all numbers are integers, addition is represented by "+", subtraction by "−", multiplication by "×", integer division by "/" (returns the quotient), modulo by "%" (returns the remainder of an integer division), logical left shifts by "<<", logical right shifts by ">>", bitwise AND by "&", and bitwise OR by "|".
For logical right shifts operands must have at least twenty-one bits precision.
An I/O queue is a type of list with items of a particular type (i.e., bytes or scalar values). End-of-queue is a special item that can be present in I/O queues of any type and it signifies that there are no more items in the queue.
There are two ways to use an I/O queue: in immediate mode, to represent I/O data stored in memory, and in streaming mode, to represent data coming in from the network. Immediate queues have end-of-queue as their last item, whereas streaming queues need not have it, and so their read operation might block.
It is expected that streaming I/O queues will be created empty, and that new items will be pushed to them as data comes in from the network. When the underlying network stream closes, an end-of-queue item is to be pushed into the queue.
Since reading from a streaming I/O queue might block, streaming I/O queues are not to be used from an event loop. They are to be used in parallel instead.
To read an item from an I/O queue ioQueue, run these steps:
- If ioQueue is empty, then wait until its size is at least 1.
- If ioQueue[0] is end-of-queue, then return end-of-queue.
- Remove ioQueue[0] and return it.
To read a number number of items from ioQueue, run these steps:
- Let readItems be « ».
- Perform the following step number times:
  - Append to readItems the result of reading an item from ioQueue.
- Remove end-of-queue from readItems.
- Return readItems.
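The two read operations above can be sketched for an immediate-mode queue as follows. The sentinel `END_OF_QUEUE`, the function names, and the use of `collections.deque` are assumptions made for illustration; a streaming queue would wait where this sketch raises.

```python
from collections import deque

END_OF_QUEUE = object()  # sentinel standing in for the end-of-queue item

def read_item(io_queue):
    """Read one item; end-of-queue is returned but never removed."""
    if not io_queue:
        # A streaming queue would block here until an item arrives.
        raise BlockingIOError("queue is empty")
    if io_queue[0] is END_OF_QUEUE:
        return END_OF_QUEUE
    return io_queue.popleft()

def read_items(io_queue, number):
    """Read up to `number` items, dropping any end-of-queue results."""
    read = [read_item(io_queue) for _ in range(number)]
    return [item for item in read if item is not END_OF_QUEUE]

queue = deque([0xF0, 0x9F, END_OF_QUEUE])
first_three = read_items(queue, 3)  # only two real items are available
```

Asking for more items than are available simply yields the shorter list, and end-of-queue stays in the queue for later reads.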
To peek a number number of items from an I/O queue ioQueue, run these steps:
- Wait until either ioQueue’s size is equal to or greater than number, or ioQueue contains end-of-queue, whichever comes first.
- Let prefix be « ».
- For each n in the range 0 to number − 1, inclusive:
  - If ioQueue[n] is end-of-queue, break.
  - Otherwise, append ioQueue[n] to prefix.
- Return prefix.
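A minimal sketch of peeking on an immediate-mode queue follows. It uses zero-based positions, so the first item is `io_queue[0]` as in the read steps; the sentinel and function name are assumptions, and the wait is replaced by an exception.

```python
END_OF_QUEUE = object()  # sentinel standing in for the end-of-queue item

def peek(io_queue, number):
    """Return up to `number` leading items without removing them."""
    if len(io_queue) < number and END_OF_QUEUE not in io_queue:
        # A streaming queue would wait here for more data.
        raise BlockingIOError("not enough items yet")
    prefix = []
    for item in io_queue[:number]:
        if item is END_OF_QUEUE:
            break
        prefix.append(item)
    return prefix

with_bom = [0xEF, 0xBB, 0xBF, 0x41, END_OF_QUEUE]
short = [0x41, END_OF_QUEUE]
```

Peeking three bytes from `with_bom` returns the three BOM bytes and leaves the queue untouched; peeking three bytes from `short` returns just the one available byte.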
To push an item item to an I/O queue ioQueue, run these steps:
- If the last item in ioQueue is end-of-queue:
  - If item is end-of-queue, do nothing.
  - Otherwise, insert item before the last item in ioQueue.
- Otherwise, append item to ioQueue.
To push a sequence of items to an I/O queue ioQueue is to push each item in the sequence to ioQueue, in the given order.
To restore an item other than end-of-queue to an I/O queue, perform the list prepend operation. To restore a list of items excluding end-of-queue to an I/O queue, insert those items, in the given order, before the first item in the queue.
Inserting the bytes « 0xF0, 0x9F » in an I/O queue « 0x92, 0xA9, end-of-queue », results in an I/O queue « 0xF0, 0x9F, 0x92, 0xA9, end-of-queue ». The next item to be read would be 0xF0.
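The push and restore operations can be sketched on a plain Python list. The names are hypothetical, and the sketch assumes that an item pushed onto a queue that already ends with end-of-queue is inserted just before it, so that end-of-queue stays last.

```python
END_OF_QUEUE = object()  # sentinel standing in for the end-of-queue item

def push(io_queue, item):
    """Append an item while keeping end-of-queue, if present, as the last item."""
    if io_queue and io_queue[-1] is END_OF_QUEUE:
        if item is END_OF_QUEUE:
            return  # a second end-of-queue is ignored
        io_queue.insert(len(io_queue) - 1, item)  # assumption: insert before end-of-queue
    else:
        io_queue.append(item)

def restore(io_queue, items):
    """Put items back at the front of the queue, in the given order."""
    assert END_OF_QUEUE not in items
    io_queue[:0] = items

queue = [0x92, 0xA9, END_OF_QUEUE]
restore(queue, [0xF0, 0x9F])
after_restore = list(queue)
push(queue, 0x41)
```

After the restore, the next item to be read is 0xF0, matching the example above.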
To convert an I/O queue ioQueue into a list, string, or byte sequence, return the result of reading an indefinite number of items from ioQueue.
To convert a list, string, or byte sequence input into an I/O queue, run these steps:
- Assert: input is not a list or it does not contain end-of-queue.
- Return an I/O queue containing the items in input, in order, followed by end-of-queue.
The Infra standard is expected to define some infrastructure around type conversions. See whatwg/infra issue #319. [INFRA]
I/O queues are defined as lists, not queues, because they feature a restore operation. However, this restore operation is an internal detail of the algorithms in this specification, and is not to be used by other standards. Implementations are free to find alternative ways to implement such algorithms, as detailed in Implementation considerations.
To obtain a scalar value from surrogates, given a leading surrogate leading and a trailing surrogate trailing, return 0x10000 + ((leading − 0xD800) << 10) + (trailing − 0xDC00).
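As a worked instance of this formula, the following sketch combines a surrogate pair into a scalar value. The function name and the range assertions are illustrative additions.

```python
def scalar_from_surrogates(leading: int, trailing: int) -> int:
    """Combine a leading and a trailing surrogate into one scalar value."""
    assert 0xD800 <= leading <= 0xDBFF, "not a leading surrogate"
    assert 0xDC00 <= trailing <= 0xDFFF, "not a trailing surrogate"
    return 0x10000 + ((leading - 0xD800) << 10) + (trailing - 0xDC00)

# 0xD83D 0xDCA9 is the UTF-16 surrogate pair for U+1F4A9.
pile = scalar_from_surrogates(0xD83D, 0xDCA9)
```

Step by step: 0xD83D − 0xD800 is 0x3D, shifted left by 10 gives 0xF400; 0xDCA9 − 0xDC00 is 0xA9; and 0x10000 + 0xF400 + 0xA9 is 0x1F4A9.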
To create a Uint8Array object, given an I/O queue ioQueue and a realm realm:
- Let bytes be the result of converting ioQueue into a byte sequence.
- Return the result of creating a Uint8Array object from bytes in realm.
4. Encodings
An encoding defines a mapping from a scalar value sequence to a byte sequence (and vice versa). Each encoding has a name, and one or more labels.
This specification defines three encodings with the same names as encoding schemes defined in the Unicode standard: UTF-8, UTF-16LE, and UTF-16BE. The encodings differ from the encoding schemes by byte order mark (also known as BOM) handling not being part of the encodings themselves and instead being part of wrapper algorithms in this specification, whereas byte order mark handling is part of the definition of the encoding schemes in the Unicode Standard. UTF-8 used together with the UTF-8 decode algorithm matches the encoding scheme of the same name. This specification does not provide wrapper algorithms that would combine with UTF-16LE and UTF-16BE to match the similarly-named encoding schemes. [UNICODE]
4.1. Encoders and decoders
Each encoding has an associated decoder and most of them have an associated encoder. Instances of decoders and encoders have a handler algorithm and might also have state. A handler algorithm takes an input I/O queue and an item, and returns finished, one or more items, error optionally with a code point, or continue.
The replacement and UTF-16BE/LE encodings have no encoder.
An error mode as used below is "replacement" or "fatal" for a decoder and "fatal" or "html" for an encoder.
An XML processor would set error mode to "fatal". [XML]
"html" exists as error mode due to HTML forms requiring a non-terminating legacy encoder. The "html" error mode causes a sequence to be emitted that cannot be distinguished from legitimate input and can therefore lead to silent data loss. Developers are strongly encouraged to use the UTF-8 encoding to prevent this from happening. [HTML]
To process a queue given an encoding’s decoder or encoder instance encoderDecoder, I/O queue input, I/O queue output, and error mode mode:
- While true:
  - Let result be the result of processing an item with the result of reading from input, encoderDecoder, input, output, and mode.
  - If result is not continue, then return result.
To process an item given an item item, encoding’s encoder or decoder instance encoderDecoder, I/O queue input, I/O queue output, and error mode mode:
- Assert: encoderDecoder is not an encoder instance or mode is not "replacement".
- Assert: encoderDecoder is not a decoder instance or mode is not "html".
- Assert: encoderDecoder is not an encoder instance or item is not a surrogate.
- Let result be the result of running encoderDecoder’s handler on input and item.
- If result is finished:
  - Push end-of-queue to output.
  - Return result.
- Otherwise, if result is one or more items:
  - Assert: encoderDecoder is not a decoder instance or result does not contain any surrogates.
  - Push result to output.
- Otherwise, if result is an error, switch on mode and run the associated steps:
  - "replacement": Push U+FFFD (�) to output.
  - "html": Push 0x26 (&), 0x23 (#), followed by the shortest sequence of 0x30 (0) to 0x39 (9), inclusive, representing result’s code point’s value in base ten, followed by 0x3B (;) to output.
  - "fatal": Return result.
- Return continue.
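The two algorithms above can be sketched as a small driver loop. Everything named here is an assumption for illustration: the `Error` class, the string constants, and in particular `ascii_encoder_handler`, which is a toy handler that is not one of the encoders defined by this specification. Queues are plain lists in immediate mode, and the assertions of the algorithm are omitted.

```python
FINISHED, CONTINUE = "finished", "continue"
END_OF_QUEUE = object()  # sentinel standing in for the end-of-queue item

class Error:
    """An error result, optionally carrying the offending code point."""
    def __init__(self, code_point=None):
        self.code_point = code_point

def ascii_encoder_handler(io_queue, item):
    """Toy encoder handler (hypothetical): ASCII only, anything else is an error."""
    if item is END_OF_QUEUE:
        return FINISHED
    if ord(item) <= 0x7F:
        return [ord(item)]
    return Error(ord(item))

def process_item(item, handler, io_queue, output, mode):
    result = handler(io_queue, item)
    if result == FINISHED:
        output.append(END_OF_QUEUE)
        return result
    if isinstance(result, list):
        output.extend(result)
    elif isinstance(result, Error):
        if mode == "replacement":
            output.append("\uFFFD")
        elif mode == "html":
            # Decimal numeric character reference, e.g. b"&#128169;".
            output.extend(b"&#%d;" % result.code_point)
        elif mode == "fatal":
            return result
    return CONTINUE

def process_queue(handler, io_queue, output, mode):
    while True:
        # Reading: end-of-queue is returned but stays in the queue.
        item = END_OF_QUEUE if io_queue[0] is END_OF_QUEUE else io_queue.pop(0)
        result = process_item(item, handler, io_queue, output, mode)
        if result != CONTINUE:
            return result

html_out = []
html_result = process_queue(
    ascii_encoder_handler, list("a\U0001F4A9") + [END_OF_QUEUE], html_out, "html")

fatal_out = []
fatal_result = process_queue(
    ascii_encoder_handler, list("a\U0001F4A9") + [END_OF_QUEUE], fatal_out, "fatal")
```

In "html" mode the unencodable scalar value becomes the bytes of `&#128169;`, which a consumer cannot tell apart from a user who typed those nine characters. In "fatal" mode processing stops at the first error and no end-of-queue is pushed to the output.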
4.2. Names and labels
The table below lists all encodings and their labels user agents must support. User agents must not support any other encodings or labels.
For each encoding, ASCII-lowercasing its name yields one of its labels.
Authors must use the UTF-8 encoding and must use its (ASCII case-insensitive) "utf-8" label to identify it.
New protocols and formats, as well as existing formats deployed in new contexts, must use the UTF-8 encoding exclusively. If these protocols and formats need to expose the encoding’s name or label, they must expose it as "utf-8".
To get an encoding from a string label, run these steps:
- Remove any leading and trailing ASCII whitespace from label.
- If label is an ASCII case-insensitive match for any of the labels listed in the table below, then return the corresponding encoding; otherwise return failure.
This is a more basic and restrictive algorithm of mapping labels to encodings than section 1.4 of Unicode Technical Standard #22 prescribes, as that is necessary to be compatible with deployed content.
Name | Labels |
---|---|
The Encoding | |
UTF-8 | "unicode-1-1-utf-8", "unicode11utf8", "unicode20utf8", "utf-8", "utf8", "x-unicode20utf8" |
Legacy single-byte encodings | |
IBM866 | "866", "cp866", "csibm866", "ibm866" |
ISO-8859-2 | "csisolatin2", "iso-8859-2", "iso-ir-101", "iso8859-2", "iso88592", "iso_8859-2", "iso_8859-2:1987", "l2", "latin2" |
ISO-8859-3 | "csisolatin3", "iso-8859-3", "iso-ir-109", "iso8859-3", "iso88593", "iso_8859-3", "iso_8859-3:1988", "l3", "latin3" |
ISO-8859-4 | "csisolatin4", "iso-8859-4", "iso-ir-110", "iso8859-4", "iso88594", "iso_8859-4", "iso_8859-4:1988", "l4", "latin4" |
ISO-8859-5 | "csisolatincyrillic", "cyrillic", "iso-8859-5", "iso-ir-144", "iso8859-5", "iso88595", "iso_8859-5", "iso_8859-5:1988" |
ISO-8859-6 | "arabic", "asmo-708", "csiso88596e", "csiso88596i", "csisolatinarabic", "ecma-114", "iso-8859-6", "iso-8859-6-e", "iso-8859-6-i", "iso-ir-127", "iso8859-6", "iso88596", "iso_8859-6", "iso_8859-6:1987" |
ISO-8859-7 | "csisolatingreek", "ecma-118", "elot_928", "greek", "greek8", "iso-8859-7", "iso-ir-126", "iso8859-7", "iso88597", "iso_8859-7", "iso_8859-7:1987", "sun_eu_greek" |
ISO-8859-8 | "csiso88598e", "csisolatinhebrew", "hebrew", "iso-8859-8", "iso-8859-8-e", "iso-ir-138", "iso8859-8", "iso88598", "iso_8859-8", "iso_8859-8:1988", "visual" |
ISO-8859-8-I | "csiso88598i", "iso-8859-8-i", "logical" |
ISO-8859-10 | "csisolatin6", "iso-8859-10", "iso-ir-157", "iso8859-10", "iso885910", "l6", "latin6" |
ISO-8859-13 | "iso-8859-13", "iso8859-13", "iso885913" |
ISO-8859-14 | "iso-8859-14", "iso8859-14", "iso885914" |
ISO-8859-15 | "csisolatin9", "iso-8859-15", "iso8859-15", "iso885915", "iso_8859-15", "l9" |
ISO-8859-16 | "iso-8859-16" |
KOI8-R | "cskoi8r", "koi", "koi8", "koi8-r", "koi8_r" |
KOI8-U | "koi8-ru", "koi8-u" |
macintosh | "csmacintosh", "mac", "macintosh", "x-mac-roman" |
windows-874 | "dos-874", "iso-8859-11", "iso8859-11", "iso885911", "tis-620", "windows-874" |
windows-1250 | "cp1250", "windows-1250", "x-cp1250" |
windows-1251 | "cp1251", "windows-1251", "x-cp1251" |
windows-1252 (see below for the relationship to historical "Latin1" and "ASCII" concepts) | "ansi_x3.4-1968", "ascii", "cp1252", "cp819", "csisolatin1", "ibm819", "iso-8859-1", "iso-ir-100", "iso8859-1", "iso88591", "iso_8859-1", "iso_8859-1:1987", "l1", "latin1", "us-ascii", "windows-1252", "x-cp1252" |
windows-1253 | "cp1253", "windows-1253", "x-cp1253" |
windows-1254 | "cp1254", "csisolatin5", "iso-8859-9", "iso-ir-148", "iso8859-9", "iso88599", "iso_8859-9", "iso_8859-9:1989", "l5", "latin5", "windows-1254", "x-cp1254" |
windows-1255 | "cp1255", "windows-1255", "x-cp1255" |
windows-1256 | "cp1256", "windows-1256", "x-cp1256" |
windows-1257 | "cp1257", "windows-1257", "x-cp1257" |
windows-1258 | "cp1258", "windows-1258", "x-cp1258" |
x-mac-cyrillic | "x-mac-cyrillic", "x-mac-ukrainian" |
Legacy multi-byte Chinese (simplified) encodings | |
GBK | "chinese", "csgb2312", "csiso58gb231280", "gb2312", "gb_2312", "gb_2312-80", "gbk", "iso-ir-58", "x-gbk" |
gb18030 | "gb18030" |
Legacy multi-byte Chinese (traditional) encodings | |
Big5 | "big5", "big5-hkscs", "cn-big5", "csbig5", "x-x-big5" |
Legacy multi-byte Japanese encodings | |
EUC-JP | "cseucpkdfmtjapanese", "euc-jp", "x-euc-jp" |
ISO-2022-JP | "csiso2022jp", "iso-2022-jp" |
Shift_JIS | "csshiftjis", "ms932", "ms_kanji", "shift-jis", "shift_jis", "sjis", "windows-31j", "x-sjis" |
Legacy multi-byte Korean encodings | |
EUC-KR | "cseuckr", "csksc56011987", "euc-kr", "iso-ir-149", "korean", "ks_c_5601-1987", "ks_c_5601-1989", "ksc5601", "ksc_5601", "windows-949" |
Legacy miscellaneous encodings | |
replacement | "csiso2022kr", "hz-gb-2312", "iso-2022-cn", "iso-2022-cn-ext", "iso-2022-kr", "replacement" |
UTF-16BE | "unicodefffe", "utf-16be" |
UTF-16LE | "csunicode", "iso-10646-ucs-2", "ucs-2", "unicode", "unicodefeff", "utf-16", "utf-16le" |
x-user-defined | "x-user-defined" |
All encodings and their labels are also available as a non-normative encodings.json resource.
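The steps to get an encoding from a label can be sketched with a small excerpt of the table. The `LABELS` dictionary holds only a handful of rows for illustration, the function names are hypothetical, and `None` stands in for failure. Note that whitespace stripping and lowercasing are deliberately limited to ASCII, which Python's argument-less `str.strip()` and `str.lower()` are not.

```python
ASCII_WHITESPACE = "\t\n\x0c\r "  # TAB, LF, FF, CR, SPACE

# Excerpt of the label table; the full table has many more entries.
LABELS = {
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
    "latin1": "windows-1252",
    "ascii": "windows-1252",
    "iso-8859-1": "windows-1252",
    "sjis": "Shift_JIS",
    "utf-16": "UTF-16LE",
}

def ascii_lowercase(text: str) -> str:
    """Lowercase only A to Z, leaving every other code point alone."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in text)

def get_encoding(label: str):
    """Return the encoding name for a label, or None for failure."""
    label = label.strip(ASCII_WHITESPACE)
    return LABELS.get(ascii_lowercase(label))
```

For example, `get_encoding(" UTF-8\n")` and `get_encoding("utf8")` both yield UTF-8, while a label preceded by U+00A0 fails because that code point is not ASCII whitespace.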
The set of supported encodings is primarily based on the intersection of the sets supported by major browser engines when the development of this standard started, while removing encodings that were rarely used legitimately but that could be used in attacks. The inclusion of some encodings is questionable in the light of anecdotal evidence of the level of use by existing Web content. That is, while they have been broadly supported by browsers, it is unclear if they are broadly used by Web content. However, an effort has not been made to eagerly remove single-byte encodings that were broadly supported by browsers or are part of the ISO 8859 series. In particular, the necessity of the inclusion of IBM866, macintosh, x-mac-cyrillic, ISO-8859-3, ISO-8859-10, ISO-8859-14, and ISO-8859-16 is doubtful for the purpose of supporting existing content, but there are no plans to remove these.
The windows-1252 encoding has various labels, such as "latin1", "iso-8859-1", and "ascii", which have historically been confusing for developers. On the web, and in any software that seeks to be web-compatible by implementing this standard, these are synonyms: "latin1" and "ascii" are just labels for windows-1252, and any software following this standard will, for example, decode 0x80 as U+20AC (€) when asked for the "Latin1" or "ASCII" decoding of that byte.
Software that does not follow this standard does not always give the same answers. The root of this is that the original document that specified Latin1 (ISO/IEC 8859-1) did not provide any mappings for bytes in the inclusive ranges 0x00 to 0x1F or 0x7F to 0x9F. Similarly, the original documents that specified ASCII (ISO/IEC 646, among others) did not provide any mappings for bytes in the inclusive range 0x80 to 0xFF. This means different software has chosen different code point mappings for those bytes when asked to use Latin1 or ASCII encodings. Web browsers and browser-compatible software have chosen to map those bytes according to windows-1252, which is a superset of both, and this choice was codified in this standard. Other software throws errors, or uses isomorphic decoding, or other mappings. [ISO8859-1] [ISO646]
As such, implementers and developers need to be careful whenever they are using libraries which expose APIs in terms of "Latin1" or "ASCII". It’s very possible such libraries will not give answers in line with this standard, if they have chosen other behaviors for the bytes which were left undefined in the original specifications.
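Python's built-in codecs are one concrete example of a library that does not follow this standard's label mapping. The sketch below decodes the single byte 0x80 three ways. Python's "cp1252" codec is used here as the closest stand-in for windows-1252, although it is itself not identical to the encoding defined by this standard: for instance, it rejects byte 0x81.

```python
data = b"\x80"

as_cp1252 = data.decode("cp1252")    # what this standard yields for "latin1" and "ascii"
as_latin_1 = data.decode("latin-1")  # Python maps the byte to U+0080 instead

try:
    data.decode("ascii")
    ascii_rejected = False
except UnicodeDecodeError:
    ascii_rejected = True            # Python's ASCII codec throws on bytes above 0x7F

try:
    b"\x81".decode("cp1252")
    cp1252_rejects_0x81 = False
except UnicodeDecodeError:
    cp1252_rejects_0x81 = True       # undefined in Python's cp1252 table
```

Code that needs web-compatible results therefore cannot simply pass a label from web content to such a library.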
4.3. Output encodings
To get an output encoding from an encoding encoding, run these steps:
- If encoding is replacement or UTF-16BE/LE, then return UTF-8.
- Return encoding.
The get an output encoding algorithm is useful for URL parsing and HTML form submission, which both need exactly this.
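A minimal sketch of this algorithm, assuming encodings are represented by their names as strings (a hypothetical representation):

```python
def get_output_encoding(encoding: str) -> str:
    """Map encodings that have no encoder to UTF-8; keep all others."""
    if encoding in ("replacement", "UTF-16BE", "UTF-16LE"):
        return "UTF-8"
    return encoding
```

The three names singled out are exactly the encodings that have no encoder, so a form submitted from a UTF-16LE document is encoded as UTF-8.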
5. Indexes
Most legacy encodings make use of an index. An index is an ordered list of entries, each entry consisting of a pointer and a corresponding code point. Within an index pointers are unique and code points can be duplicated.
An efficient implementation likely has two indexes per encoding: one optimized for its decoder and one for its encoder.
To find the pointers and their corresponding code points in an index, let lines be the result of splitting the resource’s contents on U+000A LF. Then remove each item in lines that is the empty string or starts with U+0023 (#). Then the pointers and their corresponding code points are found by splitting each item in lines on U+0009 TAB. The first subitem is the pointer (as a decimal number) and the second is the corresponding code point (as a hexadecimal number). Other subitems are not relevant.
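The parsing rules above can be sketched as follows. `SAMPLE` is a made-up two-entry resource written for this example, not an excerpt of a real index file, and `parse_index` is a hypothetical name.

```python
# Hypothetical sample contents; real index resources are much longer.
SAMPLE = "# sample comment line\n\n0\t0x0080\t(ignored)\n1\t0x0081\t(ignored)\n"

def parse_index(text: str):
    """Return a list of (pointer, code point) pairs from an index resource."""
    entries = []
    for line in text.split("\n"):
        if line == "" or line.startswith("#"):
            continue  # skip empty lines and comments
        subitems = line.split("\t")
        pointer = int(subitems[0])         # decimal number
        code_point = int(subitems[1], 16)  # hexadecimal; a "0x" prefix is accepted
        entries.append((pointer, code_point))
    return entries

sample_entries = parse_index(SAMPLE)
```

Subitems after the second one are ignored, as the text above specifies.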
To signify changes an index includes an Identifier and a Date. If an Identifier has changed, so has the index.
The index code point for pointer in index is the code point corresponding to pointer in index, or null if pointer is not in index.
The index pointer for codePoint in index is the first pointer corresponding to codePoint in index, or null if codePoint is not in index.
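The two lookups can be sketched over a list of (pointer, code point) pairs. `TOY_INDEX` contains invented values, chosen only to show that pointers are unique while code points can repeat. The linear scans are for clarity; an efficient implementation would more likely use one lookup table per direction.

```python
# Invented entries: pointer 0 and pointer 5 share a code point.
TOY_INDEX = [(0, 0x4E00), (1, 0x4E8C), (5, 0x4E00)]

def index_code_point(pointer, index):
    """Code point for pointer, or None if pointer is not in index."""
    for p, code_point in index:
        if p == pointer:
            return code_point
    return None

def index_pointer(code_point, index):
    """First pointer for code_point, or None if code_point is not in index."""
    for p, cp in index:
        if cp == code_point:
            return p
    return None
```

With these entries, looking up code point 0x4E00 yields pointer 0 rather than 5, because the first matching pointer wins.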
There is a non-normative visualization for each index other than index gb18030 ranges and index ISO-2022-JP katakana. index jis0208 also has an alternative Shift_JIS visualization. Additionally, there is a visualization of the Basic Multilingual Plane coverage of each index other than index gb18030 ranges and index ISO-2022-JP katakana.
The legend for the visualizations is:
- Unmapped
- Two bytes in UTF-8
- Two bytes in UTF-8, code point follows immediately the code point of previous pointer
- Three bytes in UTF-8 (non-PUA)
- Three bytes in UTF-8 (non-PUA), code point follows immediately the code point of previous pointer
- Private Use
- Private Use, code point follows immediately the code point of previous pointer
- Four bytes in UTF-8
- Four bytes in UTF-8, code point follows immediately the code point of previous pointer
- Duplicate code point already mapped at an earlier index
- CJK Compatibility Ideograph
- CJK Unified Ideographs Extension A
These are the indexes defined by this specification, excluding the single-byte indexes, which have their own table:
Index | Resource | Visualization | BMP coverage | Notes |
---|---|---|---|---|
index Big5 | index-big5.txt | index Big5 visualization | index Big5 BMP coverage | This matches the Big5 standard in combination with the Hong Kong Supplementary Character Set and other common extensions. |
index EUC-KR | index-euc-kr.txt | index EUC-KR visualization | index EUC-KR BMP coverage | This matches the KS X 1001 standard and the Unified Hangul Code, more commonly known together as Windows Codepage 949. It covers the Hangul Syllables block of Unicode in its entirety. The Hangul block whose top left corner in the visualization is at pointer 9026 is in the Unicode order. Taken separately, the rest of the Hangul syllables in this index are in the Unicode order, too. |
index gb18030 | index-gb18030.txt | index gb18030 visualization | index gb18030 BMP coverage | This matches the GB18030-2022 standard for code points encoded as two bytes, except for 0xA3 0xA0 which maps to U+3000 IDEOGRAPHIC SPACE to be compatible with deployed content. This index covers the CJK Unified Ideographs block of Unicode in its entirety. Entries from that block that are above or to the left of (the first) U+3000 in the visualization are in the Unicode order. |
index gb18030 ranges | index-gb18030-ranges.txt | | | This index works differently from all others. Listing all code points would result in over a million items whereas they can be represented neatly in 207 ranges combined with trivial limit checks. It therefore only superficially matches the GB18030-2000 standard for code points encoded as four bytes. The change for the GB18030-2005 revision is handled inline by the index gb18030 ranges code point and index gb18030 ranges pointer algorithms below that accompany this index. And the changes for the GB18030-2022 revision are handled differently again to not further increase the number of byte sequences mapping to Private Use code points. The relevant Private Use code points are mapped in the gb18030 encoder directly through a side table to preserve compatibility with how they were mapped before. |
index jis0208 | index-jis0208.txt | index jis0208 visualization, Shift_JIS visualization | index jis0208 BMP coverage | This is the JIS X 0208 standard including formerly proprietary extensions from IBM and NEC. |
index jis0212 | index-jis0212.txt | index jis0212 visualization | index jis0212 BMP coverage | This is the JIS X 0212 standard. It is only used by the EUC-JP decoder due to lack of widespread support elsewhere. |
index ISO-2022-JP katakana | index-iso-2022-jp-katakana.txt | | | This maps halfwidth to fullwidth katakana as per Unicode Normalization Form KC, except that U+FF9E (゙) and U+FF9F (゚) map to U+309B (゛) and U+309C (゜) rather than U+3099 (◌゙) and U+309A (◌゚). It is only used by the ISO-2022-JP encoder. [UNICODE] |
The index gb18030 ranges code point for pointer is the return value of these steps:
- If pointer is greater than 39419 and less than 189000, or pointer is greater than 1237575, then return null.
- If pointer is 7457, then return code point U+E7C7.
- Let offset be the last pointer in index gb18030 ranges that is less than or equal to pointer and let codePointOffset be its corresponding code point.
- Return a code point whose value is codePointOffset + pointer − offset.
The index gb18030 ranges pointer for codePoint is the return value of these steps:
- If codePoint is U+E7C7, then return pointer 7457.
- Let offset be the last code point in index gb18030 ranges that is less than or equal to codePoint and let pointerOffset be its corresponding pointer.
- Return a pointer whose value is pointerOffset + codePoint − offset.
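Both range algorithms can be sketched with a binary search over the range starts. `RANGES` below is a three-entry table chosen for illustration only; the real index has 207 ranges, and its values are to be taken from index-gb18030-ranges.txt rather than from this sketch. The sketch also assumes that code points ascend together with pointers, so the same search works in both directions.

```python
from bisect import bisect_right

# Illustrative (pointer, code point) range starts, NOT the full index.
RANGES = [(0, 0x0080), (36, 0x00A5), (189000, 0x10000)]
POINTERS = [pointer for pointer, _ in RANGES]
CODE_POINTS = [code_point for _, code_point in RANGES]

def ranges_code_point(pointer: int):
    """Index gb18030 ranges code point for pointer, or None for null."""
    if 39419 < pointer < 189000 or pointer > 1237575:
        return None
    if pointer == 7457:
        return 0xE7C7
    i = bisect_right(POINTERS, pointer) - 1  # last range start <= pointer
    return CODE_POINTS[i] + pointer - POINTERS[i]

def ranges_pointer(code_point: int) -> int:
    """Index gb18030 ranges pointer for code_point."""
    if code_point == 0xE7C7:
        return 7457
    i = bisect_right(CODE_POINTS, code_point) - 1  # last range start <= code_point
    return POINTERS[i] + code_point - CODE_POINTS[i]
```

The upper limit works out as expected: pointer 1237575 lies 1048575 past the range starting at 189000, giving 0x10000 + 1048575 = 0x10FFFF, the last code point.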
The index Shift_JIS pointer for codePoint is the return value of these steps:
- Let index be index jis0208 excluding all entries whose pointer is in the range 8272 to 8835, inclusive.
  The index jis0208 contains duplicate code points so the exclusion of these entries causes later code points to be used.
- Return the index pointer for codePoint in index.
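A sketch of these steps over a list of (pointer, code point) pairs follows. `TOY_JIS0208` uses invented entries, not real index jis0208 data; it only demonstrates how skipping the excluded pointer range makes a later duplicate win.

```python
# Invented entries: code point 0x2252 appears inside and after the excluded range.
TOY_JIS0208 = [(100, 0x3042), (8300, 0x2252), (10744, 0x2252)]

def index_shift_jis_pointer(code_point, jis0208):
    """First pointer for code_point, ignoring pointers 8272 to 8835 inclusive."""
    for pointer, cp in jis0208:
        if 8272 <= pointer <= 8835:
            continue  # excluded entries are never returned
        if cp == code_point:
            return pointer
    return None
```

For 0x2252 the entry at pointer 8300 is skipped, so 10744 is returned.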
The index Big5 pointer for codePoint is the return value of these steps:
- Let index be index Big5 excluding all entries whose pointer is less than (0xA1 − 0x81) × 157.
  Avoid returning Hong Kong Supplementary Character Set extensions literally.
- If codePoint is U+2550 (═), U+255E (╞), U+2561 (╡), U+256A (╪), U+5341 (十), or U+5345 (卅), then return the last pointer corresponding to codePoint in index.
  There are other duplicate code points, but for those the first pointer is to be used.
- Return the index pointer for codePoint in index.
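These steps can be sketched as follows. `TOY_BIG5` contains invented entries rather than real index Big5 data, and the constant and function names are hypothetical. The threshold evaluates to (0xA1 − 0x81) × 157 = 32 × 157 = 5024.

```python
BIG5_MIN_POINTER = (0xA1 - 0x81) * 157  # 5024; entries below this are excluded
PREFER_LAST = {0x2550, 0x255E, 0x2561, 0x256A, 0x5341, 0x5345}

# Invented entries, ordered by pointer, with duplicate code points.
TOY_BIG5 = [(1000, 0x4E00), (6000, 0x5341), (6100, 0x4E00),
            (6200, 0x4E00), (7000, 0x5341)]

def index_big5_pointer(code_point, big5):
    """Pointer for code_point, or None, following the Big5 duplicate rules."""
    matches = [pointer for pointer, cp in big5
               if pointer >= BIG5_MIN_POINTER and cp == code_point]
    if not matches:
        return None
    # Six code points use their last pointer; all others use the first.
    return matches[-1] if code_point in PREFER_LAST else matches[0]
```

With the toy data, U+5341 resolves to its last pointer (7000), while 0x4E00 resolves to 6100, the first pointer at or above the threshold.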
All indexes are also available as a non-normative indexes.json resource. (Index gb18030 ranges has a slightly different format here, to be able to represent ranges.)
6. Hooks for standards
The algorithms defined below (UTF-8 decode, UTF-8 decode without BOM, UTF-8 decode without BOM or fail, and UTF-8 encode) are intended for usage by other standards.
For decoding, UTF-8 decode is to be used by new formats. For identifiers or byte sequences within a format or protocol, use UTF-8 decode without BOM or UTF-8 decode without BOM or fail.
For encoding, UTF-8 encode is to be used.
Standards are to ensure that the input I/O queues they pass to UTF-8 encode (as well as the legacy encode) are effectively I/O queues of scalar values, i.e., they contain no surrogates.
These hooks (as well as decode and encode) will block until the input I/O queue has been consumed in its entirety. In order to use the output tokens as they are pushed into the stream, callers are to invoke the hooks with an empty output I/O queue and read from it in parallel. Note that some care is needed when using UTF-8 decode without BOM or fail, as any error found during decoding will prevent the end-of-queue item from ever being pushed into the output I/O queue.
To UTF-8 decode an I/O queue of bytes ioQueue given an optional I/O queue of scalar values output (default « »), run these steps:
- Let buffer be the result of peeking three bytes from ioQueue, converted to a byte sequence.
- If buffer is 0xEF 0xBB 0xBF, then read three bytes from ioQueue. (Do nothing with those bytes.)
- Process a queue with an instance of UTF-8’s decoder, ioQueue, output, and "replacement".
- Return output.
To UTF-8 decode without BOM an I/O queue of bytes ioQueue given an optional I/O queue of scalar values output (default « »), run these steps:
- Process a queue with an instance of UTF-8’s decoder, ioQueue, output, and "replacement".
- Return output.
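The difference between the two hooks above is only the BOM check, which the following sketch makes visible. It works on complete byte strings instead of I/O queues and uses Python's built-in "utf-8" codec with `errors="replace"` as a stand-in for this specification's UTF-8 decoder; that the two place U+FFFD identically for every malformed input is an assumption this sketch does not verify.

```python
UTF8_BOM = b"\xEF\xBB\xBF"

def utf8_decode(data: bytes) -> str:
    """Drop one leading BOM if present, then decode with replacement."""
    if data[:3] == UTF8_BOM:
        data = data[3:]
    return data.decode("utf-8", errors="replace")

def utf8_decode_without_bom(data: bytes) -> str:
    """Decode with replacement; a leading BOM is kept as U+FEFF."""
    return data.decode("utf-8", errors="replace")

stripped = utf8_decode(b"\xEF\xBB\xBFhi")
kept = utf8_decode_without_bom(b"\xEF\xBB\xBFhi")
```

Here `stripped` is "hi", whereas `kept` starts with U+FEFF, which matters for identifiers and other byte sequences inside a format where a BOM is not expected.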
To UTF-8 decode without BOM or fail an I/O queue of bytes ioQueue given an optional I/O queue of scalar values output (default « »), run these steps:
-
Let potentialError be the result of processing a queue with an instance of UTF-8’s