Glossary of Unicode Terms

This glossary is updated periodically to stay synchronized with changes to various standards maintained by the Unicode Consortium.

See About Unicode Terminology for translations of various terms. There is also an FAQ section on the website.

Abjad. A writing system in which only consonants are indicated. The term “abjad” is derived from the first four letters of the traditional order of the Arabic script: alef, beh, jeem, dal. (See Section 6.1, Writing Systems.)

Abstract Character. A unit of information used for the organization, control, or representation of textual data. (See definition D7 in Section 3.4, Characters and Encoding.)

Abstract Character Sequence. An ordered sequence of one or more abstract characters. (See definition D8 in Section 3.4, Characters and Encoding.)

Abugida. A writing system in which consonants are indicated by the base letters that have an inherent vowel, and in which other vowels are indicated by additional distinguishing marks of some kind modifying the base letter. The term “abugida” is derived from the first four letters of the Ethiopic script in the Semitic order: alf, bet, gaml, dant. (See Section 6.1, Writing Systems.)

Accent Mark. A mark placed above, below, or to the side of a character to alter its phonetic value. (See also diacritic.)

Acrophonic. Denoting letters or numbers by the first letter of their name. For example, the Greek acrophonic numerals are variant forms of such initial letters.

Aksara. (1) In Sanskrit grammar, the term for “letter” in general, as opposed to consonant (vyanjana) or vowel (svara). Derived from the first and last letters of the traditional ordering of Sanskrit letters—“a” and “ksha”. (2) More generally, in Indic writing systems, aksara refers to an orthographic syllable.

Algorithm. A term used in a broad sense in the Unicode Standard, to mean the logical description of a process used to achieve a specified result. This does not require the actual procedure described in the algorithm to be followed; any implementation is conformant as long as the results are the same.

Alphabet. A writing system in which both consonants and vowels are indicated. The term “alphabet” is derived from the first two letters of the Greek script: alpha, beta. (See Section 6.1, Writing Systems.)

Alphabetic Property. Informative property of the primary units of alphabets and/or syllabaries. (See Section 4.10, Letters, Alphabetic, and Ideographic.)

Alphabetic Sorting. (See collation.)

AMTRA. Acronym for Arabic Mark Transient Reordering Algorithm. (See Unicode Standard Annex #53, “Unicode Arabic Mark Rendering.”)

Annotation. The association of secondary textual content with a point or range of the primary text. (The value of a particular annotation is considered to be a part of the “content” of the text. Typical examples include glossing, citations, exemplification, Japanese yomi, and so on.)

ANSI. (1) The American National Standards Institute. (2) The Microsoft collective name for all Windows code pages. Sometimes used specifically for code page 1252, which is a superset of ISO/IEC 8859-1.

Apparatus Criticus. Collection of conventions used by editors to annotate and comment on text.

Arabic Digits. The term "Arabic digits" may mean either the digits in the Arabic script (see Arabic-Indic digits) or the ordinary ASCII digits in contrast to Roman numerals (see European digits). When the term "Arabic digits" is used in Unicode specifications, it means Arabic-Indic digits. See Terminology for Digits for additional information on terminology related to digits.

Arabic-Indic Digits. Forms of decimal digits used in most parts of the Arabic world (for instance, U+0660, U+0661, U+0662, U+0663). Although European digits (1, 2, 3,…) derive historically from these forms, they are visually distinct and are coded separately. (Arabic-Indic digits are sometimes called Indic numerals; however, this nomenclature leads to confusion with the digits currently used with the scripts of India.) Variant forms of Arabic-Indic digits used chiefly in Iran and Pakistan are referred to as Eastern Arabic-Indic digits. (See Section 9.2, Arabic.) See Terminology for Digits for additional information on terminology related to digits.
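As a minimal sketch, assuming Python and its standard unicodedata module (neither is part of this glossary), the separately coded digit forms can be shown to carry the same numeric values:

```python
import unicodedata

arabic_indic = "\u0660\u0661\u0662\u0663"          # U+0660..U+0663
eastern_arabic_indic = "\u06F0\u06F1\u06F2\u06F3"  # U+06F0..U+06F3
european = "0123"                                  # U+0030..U+0033

# Each digit character has a decimal digit value, whatever its script.
values = [unicodedata.digit(ch) for ch in arabic_indic]

# Python's int() accepts decimal digit characters from any script.
as_int = int(arabic_indic)
same_value = int(arabic_indic) == int(eastern_arabic_indic) == int(european)

# The characters themselves remain distinct code points.
distinct = arabic_indic != european
```

The variable names are illustrative only.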

ASCII. (1) The American Standard Code for Information Interchange, a 7-bit coded character set for information interchange. It is the U.S. national variant of ISO/IEC 646 and is formally the U.S. standard ANSI X3.4. It was proposed by ANSI in 1963 and finalized in 1968. (2) The set of 128 Unicode characters from U+0000 to U+007F, including control codes as well as graphic characters. (3) ASCII has been incorrectly used to refer to various 8-bit character encodings that include ASCII characters in the first 128 code points.

ASCII digits. The digit characters U+0030 to U+0039. Also known as European digits. See Terminology for Digits for additional information on terminology related to digits.

Assigned Character. A code point that is assigned to an abstract character. This refers to graphic, format, control, and private-use characters that have been encoded in the Unicode Standard. (See Section 2.4, Code Points and Characters.)

Assigned Code Point. (See designated code point.)

Atomic Character. A character that is not decomposable. (See decomposable character.)

Base Character. Any graphic character except for those with the General Category of Combining Mark (M). (See definition D51 in Section 3.6, Combination.) In a combining character sequence, the base character is the initial character, which the combining marks are applied to.

Basic Multilingual Plane. Plane 0, abbreviated as BMP.

Bicameral. A script that distinguishes between two cases. (See case.) Most often used in the context of Latin-based alphabets of Europe and elsewhere in the world.

Bidi. Abbreviation of bidirectional, in reference to mixed left-to-right and right-to-left text.

Bidirectional Display. The process or result of mixing left-to-right text and right-to-left text in a single line. (See Unicode Standard Annex #9, “Unicode Bidirectional Algorithm.”)

Big-endian. A computer architecture that stores multiple-byte numerical values with the most significant byte (MSB) values first.

Binary Files. Files containing nontextual information.

Block. A grouping of characters within the Unicode encoding space used for organizing code charts. Each block is a uniquely named, continuous, non-overlapping range of code points, containing a multiple of 16 code points, and starting at a location that is a multiple of 16. A block may contain unassigned code points, which are reserved.

BMP. Acronym for Basic Multilingual Plane.

BMP Character. A Unicode encoded character having a BMP code point. (See supplementary character.)

BMP Code Point. A Unicode code point between U+0000 and U+FFFF. (See supplementary code point.)
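A minimal Python sketch of the range test, where the helper names plane_of and is_bmp_code_point are hypothetical and chosen only for illustration:

```python
MAX_CODE_POINT = 0x10FFFF  # upper bound of the Unicode codespace


def plane_of(cp: int) -> int:
    """Return the plane number (0..16) of a code point."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("value is outside the Unicode codespace")
    return cp >> 16  # each plane holds 0x10000 code points


def is_bmp_code_point(cp: int) -> bool:
    """True for U+0000..U+FFFF (Plane 0), False for supplementary code points."""
    return plane_of(cp) == 0
```

For example, is_bmp_code_point(0x20AC) is true, while U+1F600 lies in Plane 1 and is therefore a supplementary code point.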

BNF. Acronym for Backus-Naur Form, a formal meta-syntax for describing context-free syntaxes. (For details, see Appendix A, Notational Conventions.)

BOCU-1. Acronym for Binary Ordered Compression for Unicode. A Unicode compression scheme that is MIME-compatible (directly usable for e-mail) and preserves binary order, which is useful for databases and sorted lists.

BOM. Acronym for byte order mark.

Bopomofo. An alphabetic script used primarily in the Republic of China (Taiwan) to write the sounds of Mandarin Chinese and some other dialects. Each symbol corresponds to either the syllable-initial or syllable-final sounds; it is therefore a subsyllabic script in its primary usage. The name is derived from the names of its first four elements. More properly known as zhuyin zimu or zhuyin fuhao in Mandarin Chinese.

Boustrophedon. A pattern of writing seen in some ancient manuscripts and inscriptions, where alternate lines of text are laid out in opposite directions, and where right-to-left lines generally use glyphs mirrored from their left-to-right forms. Literally, “as the ox turns,” referring to the plowing of a field.

Braille. A writing system using a series of raised dots to be read with the fingers by people who are blind or whose eyesight is not sufficient for reading printed material. (See Section 21.1, Braille.)

Braille Pattern. One of the 64 (for six-dot Braille) or 256 (for eight-dot Braille) possible tangible dot combinations.

Byte. (1) The minimal unit of addressable storage for a particular computer architecture. (2) An octet. Note that many early computer architectures used bytes larger than 8 bits in size, but the industry has now standardized almost uniformly on 8-bit bytes. The Unicode Standard follows the current industry practice in equating the term byte with octet and using the more familiar term byte in all contexts. (See octet.)

Byte Order Mark. The Unicode character U+FEFF when used to indicate the byte order of a text. (See Section 2.13, Special Characters and Noncharacters, and Section 23.8, Specials.)
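The following sketch, which assumes Python's standard codecs module, illustrates how U+FEFF at the start of UTF-16 data lets a decoder determine the byte order:

```python
import codecs

text = "A"

# Serialize the same text in both byte orders, prefixed by U+FEFF.
with_bom_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
with_bom_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Python's generic "utf-16" decoder inspects the initial BOM to pick
# the byte order, then drops it from the decoded text.
decoded_be = with_bom_be.decode("utf-16")
decoded_le = with_bom_le.decode("utf-16")

# The BOM constants are simply U+FEFF serialized in each byte order.
bom_is_feff = "\uFEFF".encode("utf-16-be") == codecs.BOM_UTF16_BE
```

Both byte sequences decode to the same one-character string, even though their bytes differ.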

Byte Serialization. The order of a series of bytes determined by a computer architecture.

Byte-Swapped. Reversal of the order of a sequence of bytes.

Camelcase. A casing convention for compound terms or identifiers, in which the letters are mostly lowercased, but component words or abbreviations may be capitalized. For example, "ThreeWordTerm" or "threeWordTerm".

Canonical. (1) Conforming to the general rules for encoding—that is, not compressed, compacted, or in any other form specified by a higher protocol. (2) Characteristic of a normative mapping and form of equivalence specified in Chapter 3, Conformance.

Canonical Composition. A step in the algorithm for Unicode Normalization Forms, during which decomposed sequences are replaced by primary composites, where possible. (See definition D115 in Section 3.11, Normalization Forms.)

Canonical Decomposable Character. A character that is not identical to its canonical decomposition. (See definition D69 in Section 3.7, Decomposition.)

Canonical Decomposition. Mapping to an inherently equivalent sequence—for example, mapping ä to a + combining umlaut. (For a full, formal definition, see definition D68 in Section 3.7, Decomposition.)

Canonical Equivalence. The relation between two character sequences whose full canonical decompositions are identical. (See definition D70 in Section 3.7, Decomposition.)

Canonical Equivalent. Two character sequences are said to be canonical equivalents if their full canonical decompositions are identical. (See definition D70 in Section 3.7, Decomposition.)
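A minimal sketch of canonical decomposition and canonical equivalence, assuming Python's standard unicodedata module; the helper canonically_equivalent is a hypothetical name:

```python
import unicodedata

precomposed = "\u00E4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # a + COMBINING DIAERESIS

# NFD applies the full canonical decomposition.
nfd_of_precomposed = unicodedata.normalize("NFD", precomposed)


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings by their full canonical decompositions."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```

The two strings differ code point by code point, yet canonically_equivalent(precomposed, decomposed) is true.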

Canonical Ordering. The order of a combining character sequence that results from the application of the Canonical Ordering Algorithm, a step in the process of normalization of strings. (See definition D109 in Section 3.11, Normalization Forms.)
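As an illustration, assuming Python's standard unicodedata module, normalization reorders two combining marks so that their Canonical_Combining_Class values are in ascending order:

```python
import unicodedata

dot_above = "\u0307"  # COMBINING DOT ABOVE, ccc = 230
dot_below = "\u0323"  # COMBINING DOT BELOW, ccc = 220

# The marks are typed in an order that is not canonical.
unordered = "q" + dot_above + dot_below

# Normalization applies canonical ordering to the sequence.
ordered = unicodedata.normalize("NFD", unordered)

# Combining classes of the result, in ascending order after the base.
classes = [unicodedata.combining(ch) for ch in ordered]
```

Marks with the same combining class are never reordered relative to each other; this sketch shows only the case of differing classes.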

Cantillation Mark. A mark that is used to indicate how a text is to be chanted or sung.

Capital Letter. Synonym for uppercase letter. (See case.)

Case. (1) Feature of certain alphabets where the letters have two distinct forms. These variants, which may differ markedly in shape and size, are called the uppercase letter (also known as capital or majuscule) and the lowercase letter (also known as small or minuscule). (2) Normative property of characters, consisting of uppercase, lowercase, and titlecase (Lu, Ll, and Lt). (See Section 4.2, Case.)

Case Folding. The mapping of strings to a particular case form, to facilitate searching and sorting of text. Case foldings may be simple, when the case mappings are required not to change the length of the strings to compare, or full, when the case mappings may change the length of the strings to compare. (See Section 3.13.3, Default Case Folding.)
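A minimal sketch of full case folding, assuming Python, whose str.casefold() method may change the length of a string; the helper caseless_equal is a hypothetical name, and this sketch does not add the normalization step that canonical caseless matching would need:

```python
word = "Stra\u00DFe"  # "Straße", with LATIN SMALL LETTER SHARP S

# Full case folding maps the sharp s to "ss", lengthening the string.
folded = word.casefold()

# Lowercasing alone leaves the sharp s untouched.
lowered = word.lower()


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after full case folding."""
    return a.casefold() == b.casefold()
```

Here caseless_equal("STRASSE", word) is true, whereas comparing the lowercased forms would report a mismatch.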

Case Mapping. The association of the uppercase, lowercase, and titlecase forms of a letter. (See Section 5.18, Case Mappings.)

Case-Ignorable. A character C is defined to be case-ignorable if C has the value MidLetter (ML), MidNumLet (MB), or Single_Quote (SQ) for the Word_Break property or its General_Category is one of Nonspacing_Mark (Mn), Enclosing_Mark (Me), Format (Cf), Modifier_Letter (Lm), or Modifier_Symbol (Sk). (See definition D136 in Section 3.13, Default Case Algorithms.)

Case-Ignorable Sequence. A sequence of zero or more case-ignorable characters. (See definition D137 in Section 3.13, Default Case Algorithms.)

CCC. Short name for the Canonical_Combining_Class property, usually lowercased: ccc.

CCS. (1) Acronym for coded character set. (2) Also used as an acronym for combining character sequence.

Cedilla. A mark originally placed beneath the letter c in French, Portuguese, and Spanish to indicate that the letter is to be pronounced as an s, as in façade. Obsolete Spanish diminutive of ceda, the letter z.

CEF. Acronym for character encoding form.

CES. Acronym for character encoding scheme.

Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding. (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]

Character Block. (See block.)

Character Class. A set of characters sharing a particular set of properties.

Character Encoding Form. Mapping from a character set definition to the actual code units used to represent the data.

Character Encoding Scheme. A character encoding form plus byte serialization. There are seven character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
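The following sketch, which assumes Python's built-in codecs, serializes one character under several of these schemes to show how the byte order differs:

```python
euro = "\u20AC"  # EURO SIGN

# Hex dump of the byte serialization under each explicit-order scheme.
serialized = {
    name: euro.encode(name).hex()
    for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
}
```

Python's codec names "utf-16" and "utf-32", without a byte-order suffix, prepend a byte order mark when encoding, so they are left out of this comparison.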

Character Entity. Expression of the form &amp; for "&" or &nbsp; for the no-break space. These are found in markup language files like HTML or XML. There are also numerically defined character entities. (See also character escape.)

Character Escape. A numerical expression of the form \uXXXX, \xXXXX or &#xXXXX; where X is a hex digit, or &#dddd; where d is a decimal digit. These are found in programming source code or markup language files (such as HTML or XML).
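As a small illustration, assuming Python and its standard html module, the different escape notations all denote the same character. Escape syntax varies by language: in Python, \x takes exactly two hex digits and \u exactly four.

```python
import html

e_acute = "\u00E9"  # source-code escape for LATIN SMALL LETTER E WITH ACUTE

# Markup-style numeric escapes, resolved by html.unescape().
from_hex_ref = html.unescape("&#xE9;")
from_dec_ref = html.unescape("&#233;")

# Named character entities are resolved the same way.
from_named = html.unescape("&eacute;")
ampersand = html.unescape("&amp;")
no_break_space = html.unescape("&nbsp;")
```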

Character Name. A unique string used to identify each abstract character encoded in the standard. (See definition D4 in Section 3.3, Semantics.)
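A minimal sketch, assuming Python's standard unicodedata module, of mapping between a character and its name in both directions:

```python
import unicodedata

char = "\u00E4"

# From character to its unique name.
name = unicodedata.name(char)

# From name back to the character.
looked_up = unicodedata.lookup("LATIN SMALL LETTER A WITH DIAERESIS")

# Python string literals also accept names directly.
literal = "\N{LATIN SMALL LETTER A WITH DIAERESIS}"
```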

Character Name Alias. An additional unique string identifier, other than the character name, associated with an encoded character in the standard. (See definition D5 in Section 3.3, Semantics.)

Character Properties. A set of property names and property values associated with individual characters. (See Chapter 4, Character Properties.)

Character Repertoire. The collection of characters included in a character set.

Character Sequence. Synonym for abstract character sequence.

Character Set. A collection of elements used to represent textual information.

Charset. (See coded character set.)

Chillu. Abbreviation for chilaaksharam (singular) (cillakṣaram). Refers to any of a set of sonorant consonants in Malayalam, when appearing in syllable-final position with no inherent vowel.

Choseong. A sequence of one or more leading consonants in Korean.

Chu Hán. The name for Han characters used in Vietnam; derived from hànzì.

Chu Nôm. A demotic script of Vietnam developed from components of Han characters. Its creators used methods similar to those used by the Chinese in creating Han characters.

CJK. Acronym for Chinese, Japanese, and Korean. A variant, CJKV, means Chinese, Japanese, Korean, and Vietnamese.

CJK Unified Ideograph. A Han character that has undergone the process of Han unification (conducted primarily by the Ideographic Research Group) and been encoded as a single ideograph with one or more clearly identified CJK source mappings. CJK unified ideographs have no decomposition mappings, and the set of them in the Unicode Standard is normatively specified by the Unified_Ideograph property.

CLDR. (See Unicode Common Locale Data Repository.)

Coded Character. (See encoded character.)

Coded Character Representation. Synonym for coded character sequence.

Coded Character Sequence. An ordered sequence of one or more code points. Normally, this consists of a sequence of encoded characters, but it may also include noncharacters or reserved code points. (See definition D12 in Section 3.4, Characters and Encoding.)

Coded Character Set. A character set in which each character is assigned a numeric code point. Frequently abbreviated as character set, charset, or code set; the acronym CCS is also used.

Code Page. A coded character set, often referring to a coded character set used by a personal computer—for example, PC code page 437, the default coded character set used by the U.S. English version of the DOS operating system.

Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.

Code Point Type. Any of the seven fundamental classes of code points in the standard: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved. (See definition D10a in Section 3.4, Characters and Encoding.)

Code Position. Synonym for code point. Used in ISO character encoding standards.

Code Set. (See coded character set.)

Codespace. (1) A range of numerical values available for encoding characters. (2) For the Unicode Standard, a range of integers from 0 to 10FFFF₁₆. (See definition D9 in Section 3.4, Characters and Encoding.)

Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. (See definition D77 in Section 3.9, Unicode Encoding Forms.)
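The sketch below, assuming Python and a hypothetical helper named code_unit_counts, counts how many code units the same text occupies in each encoding form:

```python
def code_unit_counts(text: str) -> dict:
    """Count the code units of text in each Unicode encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),           # 8-bit code units
        "UTF-16": len(text.encode("utf-16-be")) // 2,  # 16-bit code units
        "UTF-32": len(text.encode("utf-32-be")) // 4,  # 32-bit code units
    }


# Three code points: U+0061, U+00E9, and the supplementary U+1F600.
counts = code_unit_counts("a\u00E9\U0001F600")
```

The supplementary character takes four UTF-8 code units, two UTF-16 code units (a surrogate pair), and one UTF-32 code unit.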

Code Value. Obsolete synonym for code unit.

Codomain. For a mapping, the codomain is the set of code points or sequences that it maps to, while the domain is the set of values that are mapped. For example, a canonical decomposition is a mapping from a set of code points to a set of sequences; the codomain is the set of canonical equivalent mappings. (See also domain.)

Collation. The process of ordering units of textual information. Collation is usually specific to a particular language. Also known as alphabetizing or alphabetic sorting. Unicode Technical Standard #10, “Unicode Collation Algorithm,” defines a complete, unambiguous, specified ordering for all characters in the Unicode Standard.

Combining Character. A character with the General Category of Combining Mark (M). (See definition D52 in Section 3.6, Combination.) (See also nonspacing mark.)

Combining Character Sequence. A maximal character sequence consisting of either a base character followed by a sequence of one or more characters where each is a combining character, zero width joiner, or zero width non-joiner; or a sequence of one or more characters where each is a combining character, zero width joiner, or zero width non-joiner. (See definition D56 in Section 3.6, Combination.)

Combining Class. A numeric value in the range 0..254 given to each Unicode code point, formally defined as the property Canonical_Combining_Class. (See definition D104 in Section 3.11, Normalization Forms.)

Combining Mark. A commonly used synonym for combining character.

Compatibility. (1) Consistency with existing practice or preexisting character encoding standards. (2) Characteristic of a normative mapping and form of equivalence specified in Section 3.7, Decomposition.

Compatibility Character. A character that would not have been encoded except for compatibility and round-trip convertibility with other standards. (See Section 2.3, Compatibility Characters.)

Compatibility Composite Character. Synonym for compatibility decomposable character.

Compatibility Decomposable Character. A character whose compatibility decomposition is not identical to its canonical decomposition. (See definition D66 in Section 3.7, Decomposition.)

Compatibility Decomposition. Mapping to a roughly equivalent sequence that may differ in style. (For a full, formal definition, see definition D65 in Section 3.7, Decomposition.)

Compatibility Equivalence. The relation between two character sequences whose full compatibility decompositions are identical. (See definition D67 in Section 3.7, Decomposition.)

Compatibility Equivalent. Two character sequences are said to be compatibility equivalents if their full compatibility decompositions are identical. (See definition D67 in Section 3.7, Decomposition.)
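A minimal sketch contrasting compatibility with canonical decomposition, assuming Python's standard unicodedata module; compatibility_equivalent is a hypothetical helper name:

```python
import unicodedata

ligature = "\uFB01"  # LATIN SMALL LIGATURE FI

# The ligature has no canonical decomposition, so NFD leaves it alone.
canonical = unicodedata.normalize("NFD", ligature)

# Its compatibility decomposition is the two letters "f" and "i".
compatibility = unicodedata.normalize("NFKD", ligature)


def compatibility_equivalent(a: str, b: str) -> bool:
    """Compare two strings by their full compatibility decompositions."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)
```

The ligature and the string "fi" are compatibility equivalents but not canonical equivalents.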

Compatibility Ideograph. A Han character encoded for compatibility with some East Asian character encoding, but which is not encoded as a CJK unified ideograph. Instead, each compatibility ideograph has a canonical decomposition mapping to a particular CJK unified ideograph.
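As an illustration, assuming Python's standard unicodedata module, the canonical mapping of a compatibility ideograph means that every normalization form replaces it with its unified counterpart:

```python
import unicodedata

compat_ideograph = "\uF900"  # CJK COMPATIBILITY IDEOGRAPH-F900
unified = "\u8C48"           # the corresponding CJK unified ideograph

# The raw decomposition mapping carries no <compat> tag: it is canonical.
mapping = unicodedata.decomposition(compat_ideograph)

# Canonical normalization (NFC or NFD) therefore replaces the character.
nfc = unicodedata.normalize("NFC", compat_ideograph)
nfd = unicodedata.normalize("NFD", compat_ideograph)
```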

Compatibility Precomposed Character. Synonym for compatibility decomposable character.

Compatibility Variant. A character that generally can be remapped to another character without loss of information other than formatting.

Composite Character. (See decomposable character.)

Composite Character Sequence. (See combining character sequence.)

Composition Exclusion. A Canonical Decomposable Character which has the property value Composition_Exclusion=True. (Used in the definition of Unicode Normalization Forms.) (See definition D112 in Section 3.11, Normalization Forms.)
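The effect of a composition exclusion can be sketched as follows, assuming Python's standard unicodedata module: an excluded character is decomposed by NFC and never recomposed, unlike an ordinary precomposed letter.

```python
import unicodedata

# DEVANAGARI LETTER QA is a composition exclusion.
qa = "\u0958"
qa_nfd = unicodedata.normalize("NFD", qa)  # KA + NUKTA
qa_nfc = unicodedata.normalize("NFC", qa)  # stays decomposed

# An ordinary precomposed letter survives NFC unchanged.
a_umlaut = "\u00E4"
a_umlaut_nfc = unicodedata.normalize("NFC", a_umlaut)
```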

Conformance. Adherence to a specified set of criteria for use of a standard. (See Chapter 3, Conformance.)

Confusable. Of similar or identical appearance. When referring to characters in strings, the appearance of confusable characters can make different identifiers hard or impossible to distinguish. (See also Unicode Technical Standard #39, "Unicode Security Mechanisms".)

Conjunct Form. A ligated form representing a consonant conjunct.

Consonant Cluster. A sequence of two or more consonantal sounds. Depending on the writing system, a consonant cluster may be represented by a single character or by a sequence of characters. (Contrast digraph.)

Consonant Conjunct. A sequence of two or more adjacent consonantal letterforms, consisting of a sequence of one or more dead consonants followed by a normal, live consonant letter. A consonant conjunct may be ligated into a single conjunct form, or it may be represented by graphically separable parts, such as subscripted forms of the consonant letters. Consonant conjuncts are associated with the Brahmi family of Indic scripts. (See Section 12.1, Devanagari.)

Contextual Variant. A text element can have a presentation form that depends on the textual context in which it is rendered. This presentation form is known as a contextual variant.

Contributory Property. A simple property defined merely to make the statement of a rule defining a derived property more compact or general. (See definition D35a in Section 3.5, Properties.)

Control Codes. The 65 characters in the ranges U+0000..U+001F and U+007F..U+009F. Also known as control characters.
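A short check of the count and ranges, assuming Python's standard unicodedata module, where the General Category value Cc denotes control characters:

```python
import unicodedata

# U+0000..U+001F (32 code points) and U+007F..U+009F (33 code points).
control_codes = [
    cp
    for block in (range(0x0000, 0x0020), range(0x007F, 0x00A0))
    for cp in block
]

# Every code point in the two ranges has General Category Cc.
all_are_cc = all(
    unicodedata.category(chr(cp)) == "Cc" for cp in control_codes
)
```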

Core Specification. The central part of the Unicode Standard: the portion which, up until Version 5.0, was published as a separate book. Starting with Version 5.2, this part of the standard has been published online only, rather than as a book. The core specification consists of the general introduction and framework for the standard, the formal conformance requirements, many implementation guidelines, and extensive chapters providing information about all the encoded characters, organized by script or by significant classes of characters. Formally, a version of the Unicode Standard is defined by an edition of this core specification, together with the Code Charts, Unicode Standard Annexes, and the Unicode Character Database.

Cursive. Writing where the letters of a word are connected.

Dasia. Greek term for rough breathing mark, used in polytonic Greek character names.

DBCS. Acronym for double-byte character set.

Dead Consonant. An Indic consonant character followed by a virama character. This sequence indicates that the consonant has lost its inherent vowel. (See Section 12.1, Devanagari.)

Decimal Digits. Digits that can be used to form decimal-radix numbers.

Decomposable Character. A character that is equivalent to a sequence of one or more other characters, according to the decomposition mappings found in the Unicode Character Database, and those described in Section 3.12, Conjoining Jamo Behavior. It may also be known as a precomposed character or a composite character. (See definition D63 in Section 3.7, Decomposition.)

Decomposition. (1) The process of separating or analyzing a text element into component units. These component units may not have any functional status, but may be simply formal units—that is, abstract shapes. (2) A sequence of one or more characters that is equivalent to a decomposable character. (See definition D64 in Section 3.7, Decomposition.)

Decomposition Mapping. A mapping from a character to a sequence of one or more characters that is a canonical or compatibility equivalent and that is listed in the character names list or described in Section 3.12, Conjoining Jamo Behavior. (See definition D62 in Section 3.7, Decomposition.)

Default Ignorable. Default ignorable code points are those that should be ignored by default in rendering unless explicitly supported. They have no visible glyph or advance width in and of themselves, although they may affect the display, positioning, or adornment of adjacent or surrounding characters. (See Section 5.21, Ignoring Characters in Processing.)

Defective Combining Character Sequence. A combining character sequence that does not start with a base character. (See definition D57 in Section 3.6, Combination.)
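A simplified sketch of the test, assuming Python's standard unicodedata module and assuming the input is already a single combining character sequence; is_defective is a hypothetical helper name:

```python
import unicodedata

ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


def is_defective(sequence: str) -> bool:
    """True if a combining character sequence lacks an initial base character.

    Simplification: only the first character is inspected, and the
    caller is assumed to pass exactly one combining character sequence.
    """
    if not sequence:
        return False
    first = sequence[0]
    # General Category M* (Mn, Mc, Me) marks a combining character.
    return unicodedata.category(first).startswith("M") or first in (ZWJ, ZWNJ)
```

For example, a lone combining diaeresis followed by a letter is defective, while "a" followed by the diaeresis is not.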

Demotic Script. (1) A script or a form of a script used to write the vernacular or common speech of some language community. (2) A simplified form of the ancient Egyptian hieratic writing.

Dependent Vowel. A symbol or sign that represents a vowel and that is attached or combined with another symbol, usually one that represents a consonant. For example, in writing systems based on Arabic, Hebrew, and Indic scripts, vowels are normally represented as dependent vowel signs.

Deprecated. Of a coded character or a character property, strongly discouraged from use. (Not the same as obsolete.)

Deprecated Character. A coded character whose use is strongly discouraged. Such characters are retained in the standard indefinitely, but should not be used. (See definition D13 in Section 3.4, Characters and Encoding.)

Designated Code Point. Any code point that has either been assigned to an abstract character (assigned characters) or that has otherwise been given a normative function by the standard (surrogate code points and noncharacters). This definition excludes reserved code points. Also known as assigned code point. (See Section 2.4, Code Points and Characters.)

Deterministic Comparison. A string comparison in which strings that do not have identical contents will compare as unequal. There are two main varieties, depending on the sense of "identical": (a) binary equality, or (b) canonical equivalence. This is a property of the comparison mechanism, and not of the sorting algorithm. Also known as stable (or semi-stable) comparison.

Deterministic Sort. A sort algorithm which returns exactly the same output each time it is applied to the same input. This is a property of the sorting algorithm, and not of the comparison mechanism. For example, a randomized Quicksort (which picks a random element as the pivot element, for optimal performance) is not deterministic. Multiprocessor implementations of a sort algorithm may also not be deterministic.

Diacritic. (1) A mark applied or attached to a symbol to create a new symbol that represents a modified or new value. (2) A mark applied to a symbol irrespective of whether it changes the value of that symbol. In the latter case, the diacritic usually represents an independent value (for example, an accent, tone, or some other linguistic information). Also called diacritical mark or diacritical. (See also combining character and nonspacing mark.)

Diaeresis. Two horizontal dots over a letter, as in naïve. The diaeresis is not distinguished from the umlaut in the Unicode character encoding. (See umlaut.)

Dialytika. Greek term for diaeresis or trema, used in Greek character names.

Digits. (See Arabic digits, European digits, and Indic digits.) See Terminology for Digits for additional information on terminology related to digits.

Digraph. A pair of signs or symbols (two graphs), which together represent a single sound or a single linguistic unit. The English writing system employs many digraphs (for example, th, ch, sh, qu, and so on). The same two symbols may not always be interpreted as a digraph (for example, cathode versus cathouse). When three signs are so combined, they are called a trigraph. More than three are usually called an n-graph.

Dingbats. Typographical symbols and ornaments.

Diphthong. A pair of vowels that are considered a single vowel for the purpose of phonemic distinction. One of the two vowels is more prominent than the other. In writing systems, diphthongs are sometimes written with one symbol and sometimes with more than one symbol (for example, with a digraph).

Direction. (See paragraph direction.)

Directionality Property. A property of every graphic character that determines its horizontal ordering as specified in Unicode Standard Annex #9, “Unicode Bidirectional Algorithm.” (See Section 4.4, Directionality.)

Display Cell. A rectangular region on a display device within which one or more glyphs are imaged.

Display Order. The order of glyphs presented in text rendering. (See logical order and Section 2.2, Unicode Design Principles.)

Domain. (1) For a mapping, the domain is the set of code points or sequences that are mapped, while the codomain is the set of values they are mapped to. For example, a canonical decomposition is a mapping from a set of code points to a set of sequences; the domain is the entire Unicode codespace. (See also codomain.) (2) A realm of administrative autonomy, authority or control in the Internet, identified by a domain name.

Domain Name. The part of a network address that identifies it as belonging to a particular domain. (Oxford Languages definition.) A domain name is a string of characters. The rules for how Unicode characters can be used in domain names are the concern of IDNA and of UTS #46, Unicode IDNA Compatibility Processing.

Double-Byte Character Set. One of a number of character sets defined for representing Chinese, Japanese, or Korean text (for example, JIS X 0208-1990). These character sets are often encoded in such a way as to allow double-byte character encodings to be mixed with single-byte character encodings. Abbreviated DBCS. (See also multibyte character set.)

Ductility. The ability of a cursive font to stretch or compress the connective baseline to effect text justification.

Dynamic Composition. Creation of composite forms such as accented letters or Hangul syllables from a sequence of characters.

EBCDIC. Acronym for Extended Binary-Coded Decimal Interchange Code. A group of coded character sets used on mainframes that consist of 8-bit coded characters. EBCDIC coded character sets reserve the first 64 code points (x00 to x3F) for control codes, and reserve the range x41 to xFE for graphic characters. The English alphabetic characters are in discontinuous segments with uppercase at xC1 to xC9, xD1 to xD9, xE2 to xE9, and lowercase at x81 to x89, x91 to x99, xA2 to xA9.

ECCS. Acronym for extended combining character sequence.

EGC. Acronym for extended grapheme cluster.

Embedding. A concept relevant to bidirectional behavior. (See Unicode Standard Annex #9, “Unicode Bidirectional Algorithm,” for detailed terminology and definitions.)

Emoji. (1) The Japanese word for "pictograph." (2) Certain pictographic and other symbols encoded in the Unicode Standard that are commonly given a colorful or playful presentation when displayed on devices. Many of the emoji in Unicode were originally encoded for compatibility with Japanese telephone symbol sets. (3) Colorful or playful symbols which are not encoded as characters but which are widely implemented as graphics. (See pictograph.)

Emoticon. A symbol added to text to express emotional affect or reaction—for example, sadness, happiness, joking intent, sarcasm, and so forth. Emoticons are often expressed by a conventional kind of "ASCII art," using sequences of punctuation and other symbols to portray likenesses of facial expressions. In Western contexts these are often turned sideways, as :-) to express a happy face; in East Asian contexts other conventions often portray a facial expression without turning, as ^-^. Rendering systems often recognize conventional emoticon sequences and display them as colorful or even animated glyphs in text. There is also a set of dedicated pictographic symbols—mostly representing different facial expressions—encoded as characters in the Unicode Standard. (See pictograph.)

Encapsulated Text. (1) Plain text surrounded by formatting information. (2) Text recoded to pass through narrow transmission channels or to match communication protocols.

Enclosing Mark. A nonspacing mark with the General Category of Enclosing Mark (Me). (See definition D54 in Section 3.6, Combination.) Enclosing marks are a subclass of nonspacing marks that surround a base character, rather than merely being placed over, under, or through it.

Encoded Character. An association (or mapping) between an abstract character and a code point. (See definition D11 in Section 3.4, Characters and Encoding.) By itself, an abstract character has no numerical value, but the process of “encoding a character” associates a particular code point with a particular abstract character, thereby resulting in an “encoded character.”

Encoding Form. (See character encoding form.)

Encoding Scheme. (See character encoding scheme.)

Equivalence. In the context of text processing, the process or result of establishing whether two text elements are identical in some respect.

Equivalent Sequence. (See