[Unicode] Technical Reports
 

Unicode® Technical Standard #10

Unicode Collation Algorithm

Version 17.0.0
Editors Ken Whistler ([email protected]), Markus Scherer ([email protected])
Date 2025-09-03
This Version https://www.unicode.org/reports/tr10/tr10-53.html
Previous Version https://www.unicode.org/reports/tr10/tr10-51.html
Latest Version https://www.unicode.org/reports/tr10/
Latest Proposed Update https://www.unicode.org/reports/tr10/proposed.html
Revision 53

 

Summary

This report is the specification of the Unicode Collation Algorithm (UCA), which details how to compare two Unicode strings while remaining conformant to the requirements of the Unicode Standard. The UCA also supplies the Default Unicode Collation Element Table (DUCET) as the data specifying the default collation order for all Unicode characters.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

Contents




1 Introduction

Collation is the general term for the process and function of determining the sorting order of strings of characters. It is a key function in computer systems; whenever a list of strings is presented to users, they are likely to want it in a sorted order so that they can easily and reliably find individual strings. Thus it is widely used in user interfaces. It is also crucial for databases, both in sorting records and in selecting sets of records with fields within given bounds.

Collation varies according to language and culture: Germans, French and Swedes sort the same characters differently. It may also vary by specific application: even within the same language, dictionaries may sort differently than phonebooks or book indices. For non-alphabetic scripts such as East Asian ideographs, collation can be either phonetic or based on the appearance of the character. Collation can also be customized according to user preference, such as ignoring punctuation or not, putting uppercase before lowercase (or vice versa), and so on. Linguistically correct searching needs to use the same mechanisms: just as "ä" and "æ" sort as if they were the same base letter in Swedish, a loose search should pick up words with either one of them.

Collation implementations must deal with the complex linguistic conventions for ordering text in specific languages, and provide for common customizations based on user preferences. Furthermore, algorithms that allow for good performance are crucial for any collation mechanisms to be accepted in the marketplace.

Table 1 shows some examples of cases where sort order differs by language, usage, or another customization.

Table 1. Example Differences

Language Swedish: z < ö
German: ö < z
Usage German Dictionary: of < öf
German Phonebook: öf < of
Customizations Upper-First A < a
Lower-First a < A

Languages vary regarding which types of comparisons to use (and in which order they are to be applied), and in what constitutes a fundamental element for sorting. For example, Swedish treats ä as an individual letter, sorting it after z in the alphabet; German, however, sorts it either like ae or like other accented forms of a, thus following a. In Slovak, the digraph ch sorts as if it were a separate letter after h. Examples from other languages and scripts abound. Languages whose writing systems use uppercase and lowercase typically ignore the differences in case, unless there are no other differences in the text.

It is important to ensure that collation meets user expectations as fully as possible. For example, in the majority of Latin languages, ø sorts as an accented variant of o, meaning that most users would expect ø alongside o. However, a few languages, such as Norwegian and Danish, sort ø as a unique element after z. Sorting "Søren" after "Sylt" in a long list, as would be expected in Norwegian or Danish, will cause problems if the user expects ø as a variant of o. A user will look for "Søren" between "Sorem" and "Soret", not see it in the selection, and assume the string is missing, confused because it was sorted in a completely different location. In matching, the same can occur, which can cause significant problems for software customers; for example, in a database selection the user may not realize what records are missing. See Section 1.5, Other Applications of Collation.

With Unicode applications widely deployed, multilingual data is the rule, not the exception. Furthermore, it is increasingly common to see users with many different sorting expectations accessing the data. For example, a French company with customers all over Europe will include names from many different languages. If a Swedish employee at this French company accesses the data from a Swedish company location, the customer names need to show up in the order that meets this employee's expectations—that is, in a Swedish order—even though there will be many different accented characters that do not normally appear in Swedish text.

For scripts and characters not used in a particular language, explicit rules may not exist. For example, Swedish and French have clearly specified, distinct rules for sorting ä (either after z or as an accented character with a secondary difference from a), but neither defines the ordering of characters such as Ж, ש, ♫, ∞, ◊, or ⌂.

1.1 Multi-Level Comparison

To address the complexities of language-sensitive sorting, a multilevel comparison algorithm is employed. In comparing two words, the most important feature is the identity of the base letters—for example, the difference between an A and a B. Accent differences are typically ignored, if the base letters differ. Case differences (uppercase versus lowercase) are typically ignored, if the base letters or their accents differ. Treatment of punctuation varies. In some situations a punctuation character is treated like a base letter. In other situations, it should be ignored if there are any base, accent, or case differences. There may also be a final, tie-breaking level (called an identical level), whereby if there are no other differences at all in the string, the (normalized) code point order is used.

Table 2. Comparison Levels

Level Description Examples
L1 Base characters role < roles < rule
L2 Accents role < rôle < roles
L3 Case/Variants role < Role < rôle
L4 Punctuation role < role < Role
Ln Identical role < role < role

The examples in Table 2 are in English; the description of the levels may correspond to different writing system features in other languages. In each example, for levels L2 through Ln, the differences on that level (indicated by the underlined characters) are swamped by the stronger-level differences (indicated by the blue text). For example, the L2 example shows that difference between an o and an accented ô is swamped by an L1 difference (the presence or absence of an s). In the last example, the □ represents a format character, which is otherwise completely ignorable.

The primary level (L1) is for the basic sorting of the text, and the non-primary levels (L2..Ln) are for adjusting string weights for other linguistic elements in the writing system that are important to users in ordering, but less important than the order of the basic sorting. In practice, fewer levels may be needed, depending on user preferences or customizations.

1.1.1 Collation Order and Code Chart Order

Many people expect the characters in their language to be in the "correct" order in the Unicode code charts. Because collation varies by language and not just by script, it is not possible to arrange the encoding for characters so that simple binary string comparison produces the desired collation order for all languages. Because multi-level sorting is a requirement, it is not even possible to arrange the encoding for characters so that simple binary string comparison produces the desired collation order for any particular language. Separate data tables are required for correct sorting order. For more information on tailorings for different languages, see [CLDR].

The basic principle to remember is: The position of characters in the Unicode code charts does not specify their sort order.

1.2 Canonical Equivalence

There are many cases in Unicode where two sequences of characters are canonically equivalent: the sequences represent essentially the same text, but with different actual sequences. For more information, see [UAX15].

Sequences that are canonically equivalent must sort the same. Table 3 gives some examples of canonically equivalent sequences. For example, the angstrom sign was encoded for compatibility, and is canonically equivalent to an A-ring. The latter is also equivalent to the decomposed sequence of A plus the combining ring character. The order of certain combining marks is also irrelevant in many cases, so such sequences must also be sorted the same, as shown in the second example. The third example shows a composed character that can be decomposed in four different ways, all of which are canonically equivalent.

Table 3. Canonical Equivalence

1 Å U+212B ANGSTROM SIGN
Å U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
A ◌̊ U+0041 LATIN CAPITAL LETTER A, U+030A COMBINING RING ABOVE
 
2 x ◌̛ ◌̣ U+0078 LATIN SMALL LETTER X, U+031B COMBINING HORN, U+0323 COMBINING DOT BELOW
x ◌̣ ◌̛ U+0078 LATIN SMALL LETTER X, U+0323 COMBINING DOT BELOW, U+031B COMBINING HORN
 
3 U+1EF1 LATIN SMALL LETTER U WITH HORN AND DOT BELOW
ụ ◌̛ U+1EE5 LATIN SMALL LETTER U WITH DOT BELOW, U+031B COMBINING HORN
u ◌̛ ◌̣ U+0075 LATIN SMALL LETTER U, U+031B COMBINING HORN, U+0323 COMBINING DOT BELOW
ư ◌̣ U+01B0 LATIN SMALL LETTER U WITH HORN, U+0323 COMBINING DOT BELOW
u ◌̣ ◌̛ U+0075 LATIN SMALL LETTER U, U+0323 COMBINING DOT BELOW, U+031B COMBINING HORN

1.3 Contextual Sensitivity

There are additional complications in certain languages, where the comparison is context sensitive and depends on more than just single characters compared directly against one another, as shown in Table 4.

The first example of such a complication consists of contractions, where two (or more) characters sort as if they were a single base letter. In the table below, CH acts like a single letter sorted after C.

The second example consists of expansions, where a single character sorts as if it were a sequence of two (or more) characters. In the table below, an Œ ligature sorts as if it were the sequence of O + E.

Both contractions and expansions can be combined: that is, two (or more) characters may sort as if they were a different sequence of two (or more) characters. In the third example, for Japanese, a length mark sorts with only a tertiary difference from the vowel of the previous syllable: as an A after KA and as an I after KI.

Table 4. Context Sensitivity

Contractions  H < Z, but
CH > CZ
Expansions OE < Œ < OF
Both カー < カア, but
キー > キア

Some languages have additional oddities in the way they sort. Normally, all differences in sorting are assessed from the start to the end of the string. If all of the base letters are the same, the first accent difference determines the final order. In row 1 of Table 5, the first accent difference is on the o, so that is what determines the order. In some French dictionary ordering traditions, however, it is the last accent difference that determines the order, as shown in row 2.

Table 5. Backward Accent Ordering

Normal Accent Ordering cote < coté < côte < cô
Backward Accent Ordering cote < côte < coté < côté

1.4 Customization

In practice, there are additional features of collation that users need to control. These are expressed in user-interfaces and eventually in APIs. Other customizations or user preferences include the following:

Phonetic sorting of Han characters requires use of either a lookup dictionary of words or, more typically, special construction of programs or databases to maintain an associated phonetic spelling for the words in the text.

1.5 Other Applications of Collation

The same principles about collation behavior apply to realms beyond sorting. In particular, searching should behave consistently with sorting. For example, if two letters are treated as identical base letters for sorting, then those letters should also be treated as identical for searching. The ability to set the maximal strength level is very important for searching.

Selection is the process of using the comparisons between the endpoints of a range, as when using a SELECT command in a database query. It is crucial that the range returned be correct according to the user's expectations. For example, if a German businessman making a database selection to sum up revenue in each of the cities from O... to P... for planning purposes does not realize that all cities starting with Ö were excluded because the query selection was using a Swedish collation, he will be one very unhappy customer.

A sequence of characters considered a unit in collation, such as ch in Slovak, represents a collation grapheme cluster. For applications of this concept, see Unicode Technical Standard #18, "Unicode Regular Expressions" [UTS18]. For more information on grapheme clusters, see Unicode Standard Annex #29, "Unicode Text Segmentation" [UAX29].

1.6 Merging Sort Keys

Sort keys may need to be merged. For example, the simplest way to sort a database according to two fields is to sort field by field, sequentially. This gives the results in column one in Table 6. (The examples in this table are ordered using the Shifted option for handling variable collation elements such as the space character; see Section 4 Variable Weighting for details.) All the levels in Field 1 are compared first, and then all the levels in Field 2. The problem with this approach is that high-level differences in the second field are swamped by minute differences in the first field, which results in unexpected ordering for the first names.

Table 6. Merged Fields

Sequential Weak First Merged
F1L1, F1L2, F1L3,
F2L1, F2L2, F2L3
F1L1,
F2L1, F2L2, F2L3
F1L1, F2L1,
F1L2, F2L2,
F1L3, F2L3
di Silva Fred
di Silva John
diSilva Fred
diSilva John
disílva Fred
disílva John
disílva Fred
diSilva Fred
di Silva Fred
di Silva John
diSilva John
disílva John
di Silva Fred
diSilva Fred
disílva Fred
di Silva John
diSilva John
disílva John

A second way to do the sorting is to ignore all but base-level differences in the sorting of the first field. This gives the results in the second column. The first names are all in the right order, but the problem is now that the first field is not correctly ordered except by the base character level.

The correct way to sort two fields is to merge the fields, as shown in the "Merged" column. Using this technique, all differences in the fields are taken into account, and the levels are considered uniformly. Accents in all fields are ignored if there are any base character differences in any of the field, and case in all fields is ignored if there are accent or base character differences in any of the fields.

1.7 Performance

Collation is one of the most performance-critical features in a system. Consider the number of comparison operations that are involved in sorting or searching large databases, for example. Most production implementations will use a number of optimizations to speed up string comparison.

Strings are often preprocessed into sort keys, so that multiple comparisons operations are much faster. With this mechanism, a collation engine generates a sort key from any given string. The binary comparison of two sort keys yields the same result (less, equal, or greater) as the collation engine would return for a comparison of the original strings. Thus, for a given collation C and any two strings A and B:

A ≤ B according to C if and only if sortkey(C, A) ≤ sortkey(C, B)

However, simple string comparison is faster for any individual comparison, because the generation of a sort key requires processing an entire string, while differences in most string comparisons are found before all the characters are processed. Typically, there is a considerable difference in performance, with simple string comparison being about 5 to 10 times faster than generating sort keys and then using a binary comparison.

Sort keys, on the other hand, can be much faster for multiple comparisons. Because binary comparison is much faster than string comparison, it is faster to use sort keys whenever there will be more than about 10 comparisons per string, if the system can afford the storage.

1.8 What Collation is Not

There are a number of common expectations about and misperceptions of collation. This section points out many things that collation is not and cannot be.

Collation is not aligned with character sets or repertoires of characters.

Swedish and German share most of the same characters, for example, but have very different sorting orders.

Collation is not code point (binary) order.

A simple example of this is the fact that capital Z comes before lowercase a in the code charts. As noted earlier, beginners may complain that a particular Unicode character is “not in the right place in the code chart.” That is a misunderstanding of the role of the character encoding in collation. While the Unicode Standard does not gratuitously place characters such that the binary ordering is odd, the only way to get the linguistically-correct order is to use a language-sensitive collation, not a binary ordering.

Collation is not a property of strings.

In a list of cities, with each city correctly tagged with its language, a German user will expect to see all of the cities sorted according to German order, and will not expect to see a word with ö appear after z, simply because the city has a Swedish name. As in the earlier example, it is crucially important that if a German businessman makes a database selection, such as to sum up revenue in each of of the cities from O... to P... for planning purposes, cities starting with Ö not be excluded.

Collation order is not preserved under concatenation or substring operations, in general.

For example, the fact that x is less than y does not mean that x + z is less than y + z, because characters may form contractions across the substring or concatenation boundaries. In summary:

x < y does not imply that xz < yz
x < y does not imply that zx < zy
xz < yz does not imply that x < y
zx < zy does not imply that x < y
 

Collation order is not preserved when comparing sort keys generated from different collation sequences.

Remember that sort keys are a preprocessing of strings according to a given set of collation features. Different features result in different binary sequences. For example, if there are two collations, F and G, where F is a French collation, and G is a German phonebook ordering, then:

Collation order is not a stable sort.

Stability is a property of a sort algorithm, not of a collation sequence.

Stable Sort

A stable sort is one where two records with a field that compares as equal will retain their order if sorted according to that field. This is a property of the sorting algorithm, not of the comparison mechanism. For example, a bubble sort is stable, while a Quicksort is not. This is a useful property, but cannot be accomplished by modifications to the comparison mechanism or tailorings. See also Appendix A, Deterministic Sorting.

Deterministic Comparison

A deterministic comparison is different. It is a comparison in which strings that are not canonical equivalents will not be judged to be equal. This is a property of the comparison, not of the sorting algorithm. This is not a particularly useful property—its implementation also requires extra processing in string comparison or an extra level in sort keys, and thus may degrade performance to little purpose. However, if a deterministic comparison is required, the specified mechanism is to append the NFD form of the original string after the sort key, in Section 7.3, Form Sort Keys. See also Appendix A, Deterministic Sorting.

A deterministic comparison is also sometimes referred to as a stable (or semi-stable) comparison. Those terms are not to be preferred, because they tend to be confused with stable sort.

Collation order is not fixed.

Over time, collation order will vary: there may be fixes needed as more information becomes available about languages; there may be new government or industry standards for the language that require changes; and finally, new characters added to the Unicode Standard will interleave with the previously-defined ones. This means that collations must be carefully versioned.

1.9 The Unicode Collation Algorithm

The Unicode Collation Algorithm (UCA) details how to compare two Unicode strings while remaining conformant to the requirements of the Unicode Standard. This standard includes the Default Unicode Collation Element Table (DUCET), which is data specifying the default collation order for all Unicode characters, and the CLDR root collation element table that is based on the DUCET. This table is designed so that it can be tailored to meet the requirements of different languages and customizations.

Briefly stated, the Unicode Collation Algorithm takes an input Unicode string and a Collation Element Table, containing mapping data for characters. It produces a sort key, which is an array of unsigned 16-bit integers. Two or more sort keys so produced can then be binary-compared to give the correct comparison between the strings for which they were generated.

The Unicode Collation Algorithm assumes multiple-level key weighting, along the lines widely implemented in IBM technology, and as described in the Canadian sorting standard [CanStd] and the International String Ordering standard [ISO14651].

By default, the algorithm makes use of three fully-customizable levels. For the Latin script, these levels correspond roughly to:

  1. alphabetic ordering
  2. diacritic ordering
  3. case ordering.

A final level may be used for tie-breaking between strings not otherwise distinguished.

This design allows implementations to produce culturally acceptable collation, with a minimal burden on memory requirements and performance. In particular, it is possible to construct Collation Element Tables that use 32 bits of collation data for most characters.

Implementations of the Unicode Collation Algorithm are not limited to supporting only three levels. They are free to support a fully customizable 4th level (or more levels), as long as they can produce the same results as the basic algorithm, given the right Collation Element Tables. For example, an application which uses the algorithm, but which must treat some collection of special characters as ignorable at the first three levels and must have those specials collate in non-Unicode order (for example to emulate an existing EBCDIC-based collation), may choose to have a fully customizable 4th level. The downside of this choice is that such an application will require more storage, both for the Collation Element Table and in constructed sort keys.

The Collation Element Table may be tailored to produce particular culturally required orderings for different languages or locales. As in the algorithm itself, the tailoring can provide full customization for three (or more) levels.

1.9.1 Goals

The algorithm is designed to satisfy the following goals:

  1. A complete, unambiguous, specified ordering for all characters in Unicode.
  2. A complete resolution of the handling of canonical and compatibility equivalences as relates to the default ordering.
  3. A complete specification of the meaning and assignment of collation levels, including whether a character is ignorable by default in collation.
  4. A complete specification of the rules for using the level weights to determine the default collation order of strings of arbitrary length.
  5. Allowance for override mechanisms (tailoring) to create language-specific orderings. Tailoring can be provided by any well-defined syntax that takes the default ordering and produces another well-formed ordering.
  6. An algorithm that can be efficiently implemented, in terms of both performance and memory requirements.

Given the standard ordering and the tailoring for any particular language, any two companies or individuals—with their own proprietary implementations—can take any arbitrary Unicode input and produce exactly the same ordering of two strings. In addition, when given an appropriate tailoring this algorithm can pass the Canadian and ISO 14651 benchmarks ([CanStd], [ISO14651]).

Note: The Default Unicode Collation Element Table does not explicitly list weights for all assigned Unicode characters. However, the algorithm is well defined over all Unicode code points. See Section 10.1.2, Unassigned and Other Code Points.

1.9.2 Non-Goals

The Default Unicode Collation Element Table (DUCET) explicitly does not provide for the following features:

  1. Reversibility: from a Collation Element one is not guaranteed to be able to recover the original character.
  2. Numeric formatting: numbers composed of a string of digits or other numerics will not necessarily sort in numerical order.
  3. API: no particular API is specified or required for the algorithm.
  4. Title sorting: removing articles such as a and the during bibliographic sorting is not provided.
  5. Stability of binary sort key values between versions: weights in the DUCET may change between versions. For more information, see "Collation order is not a stable sort" in Section 1.8, What Collation is Not.
  6. Linguistic applicability: to meet most user expectations, a linguistic tailoring is needed. For more information, see Section 8, Tailoring.

The feature of linguistic applicability deserves further discussion. DUCET does not and cannot actually provide linguistically correct sorting for every language without further tailoring. That would be impossible, due to conflicting requirements for ordering different languages that share the same script. It is not even possible in the specialized cases where a script may be predominantly used by a single language, because of the limitations of the DUCET table design and because of the requirement to minimize implementation overhead for all users of DUCET.

Instead, the goal of DUCET is to provide a reasonable default ordering for all scripts that are not tailored. Any characters used in the language of primary interest for collation are expected to be tailored to meet all the appropriate linguistic requirements for that language. For example, for a user interested primarily in the Malayalam language, DUCET would be tailored to get all details correct for the expected Malayalam collation order, while leaving other characters (Greek, Cyrillic, Han, and so forth) in the default order, because the order of those other characters is not of primary concern. Conversely, a user interested primarily in the Greek language would use a Greek-specific tailoring, while leaving the Malayalam (and other) characters in their default order in the table.

2 Conformance

The Unicode Collation Algorithm does not restrict the many different ways in which implementations can compare strings. However, any Unicode-conformant implementation that purports to implement the Unicode Collation Algorithm must do so as described in this document.

A conformance test for the UCA is available. See Section 12.2, Conformance Tests.

The Unicode Collation Algorithm is a logical specification. Implementations are free to change any part of the algorithm as long as any two strings compared by the implementation are ordered the same as they would be by the algorithm as specified. Implementations may also use a different format for the data in the Default Unicode Collation Element Table. The sort key is a logical intermediate object: if an implementation produces the same results in comparison of strings, the sort keys can differ in format from what is specified in this document. (See Section 9, Implementation Notes.)

2.1 Basic Conformance Requirements

The conformance requirements of the Unicode Collation Algorithm are as follows:

UTS10-C1. For a given Unicode Collation Element Table, a conformant implementation shall replicate the same comparisons of strings as those produced by Section 7, Main Algorithm.

In particular, a conformant implementation must be able to compare any two canonical-equivalent strings as being equal, for all Unicode characters supported by that implementation.

UTS10-C2. A conformant implementation shall support at least three levels of collation.

A conformant implementation is only required to implement three levels. However, it may implement four (or more) levels if desired.

UTS10-C3. A conformant implementation that supports any of the following features: backward levels, variable weighting, and semi-stability (S3.10), shall do so in accordance with this specification.

A conformant implementation is not required to support these features; however, if it does, it must interpret them properly. If an implementation intends to support the Canadian standard [CanStd] then it should implement a backwards secondary level.

UTS10-C4. An implementation that claims to conform to the UCA must specify the UCA version it conforms to.

The version number of this document is synchronized with the version of the Unicode Standard which specifies the repertoire covered.

UTS10-C5. An implementation claiming conformance to Matching and Searching according to UTS #10 shall meet the requirements described in Section 11, Searching and Matching.

2.2 Additional Conformance Requirements

If a conformant implementation compares strings in a legacy character set, it must provide the same results as if those strings had been transcoded to Unicode. The implementation should specify the conversion table and transcoding mechanism.

A claim of conformance to C6 (UCA parametric tailoring) from earlier versions of the Unicode Collation Algorithm is to be interpreted as a claim of conformance to LDML parametric tailoring. See Setting Options in [UTS35Collation].

An implementation that supports a parametric reordering which is not based on CLDR should specify the reordering groups.

3 Definitions and Notation

The Unicode Collation Algorithm depends on the concept of mapping characters in Unicode strings to sequences of collation weights called sort keys. Those sort keys, in turn, can be directly compared to determine the relative order of the strings. This section provides precise definitions of the special terminology used in the algorithm and its intermediate steps, along with explanation of the notation used in examples and in the discussion of the algorithm.

3.1 Collation Weights, Elements, and Levels

The basic concepts of collation weights, collation elements, and collation levels are defined here first, as all other aspects of the Unicode Collation Algorithm depend fundamentally on those concepts.

UTS10-D1. Collation Weight: A non-negative integer used in the UCA to establish a means for systematic comparison of constructed sort keys.

UTS10-D2. Collation Element: An ordered list of collation weights.

UTS10-D3. Collation Level: The position of a collation weight in a collation element.

In other words, the collation level refers to the first position, second position, and so forth, in a collation element. The collation level can also be used to refer collectively to all the weights at the same relative position in a sequence of collation elements.

Unless otherwise noted, all weights used in the example collation elements in this specification are displayed in hexadecimal format. Collation elements are shown in square brackets, with the collation weights for each level separated by dots for clarity. For example:

[.06D9.0020.0002]

UTS10-D4. Primary Weight: The first collation weight in a collation element.

A primary weight is also called the Level 1 weight. Level 1 is also abbreviated as L1.

UTS10-D5. Secondary Weight: The second collation weight in a collation element.

A secondary weight is also called the Level 2 weight. Level 2 is also abbreviated as L2.

UTS10-D6. Tertiary Weight: The third collation weight in a collation element.

A tertiary weight is also called the Level 3 weight. Level 3 is also abbreviated as L3.

UTS10-D7. Quaternary Weight: The fourth collation weight in a collation element.

A quaternary weight is also called the Level 4 weight. Level 4 is also abbreviated as L4.

In principle, collation levels can extend past Level 4 to add additional levels, but the specification of the Unicode Collation Algorithm does not require defining more levels. In some special cases, such as support of Japanese collation, an implementation may need to define additional levels.

For convenience, this specification uses subscripted numbers after the symbol referring to a particular collation element to refer to the collation weights of that collation element at designated levels. Thus, for a collation element X, X1 refers to the primary weight, X2 refers to the secondary weight, X3 refers to the tertiary weight, and X4 refers to the quaternary weight.

3.2 Ignorables

UTS10-D8. Ignorable Weight: A collation weight whose value is zero.

In the 4-digit hexadecimal format used in this specification, ignorable weights are expressed as "0000".

Ignorable weights are passed over by the rules that construct sort keys from sequences of collation elements. Thus, their presence in collation elements does not impact the comparison of strings using the resulting sort keys. The judicious assignment of ignorable weights in collation elements is an important concept for the UCA.

The following definitions specify collation elements which have particular patterns of ignorable weights in them.

UTS10-D9. Primary Collation Element: A collation element whose Level 1 weight is not an ignorable weight.

UTS10-D10. Secondary Collation Element: A collation element whose Level 1 weight is an ignorable weight but whose Level 2 weight is not an ignorable weight.

UTS10-D11. Tertiary Collation Element: A collation element whose Level 1 and Level 2 weights are ignorable weights but whose Level 3 weight is not an ignorable weight.

UTS10-D12. Quaternary Collation Element: A collation element whose Level 1, Level 2, and Level 3 weights are ignorable weights but whose Level 4 weight is not an ignorable weight.

UTS10-D13. Completely Ignorable Collation Element: A collation element which has ignorable weights at all levels.

UTS10-D14. Ignorable Collation Element: A collation element which is not a primary collation element.

The term ignorable collation element is a convenient cover term for any type of collation element which has a zero primary weight. It includes secondary, tertiary, quaternary, and completely ignorable collation elements. In contrast, a primary collation element, which by definition does not have a zero primary weight, can also be referred to as a non-ignorable collation element.

UTS10-D15. Level N Ignorable: A collation element which has an ignorable weight at level N, but not at level N+1.

This concept is useful for parameterized expressions with weight level as a parameter. For example "Level 1 ignorable" is a synonym for a secondary collation element. This alternate terminology is generally avoided in this specification, however, because of the potential for confusion.

UTS10-D16. Variable Collation Element: A primary collation element with a low (but non-zero) value for its primary weight.

Low primary weights are generally reserved for punctuation and symbols, to enable special handling of those kinds of characters. Variable collation elements are subject to special rules when constructing sort keys. See Section 4, Variable Weighting. In the Default Unicode Collation Element Table [Allkeys] the primary weights of all variable collation elements are prefixed with an asterisk instead of a dot, so that they can be clearly identified.

The relationship between these terms for patterns of ignorable weights in collation elements, together with schematic examples of the corresponding collation elements, is shown in the following table, constructed on the assumption that collation elements have four collation levels. Note that quaternary collation elements have the same schematic pattern of weights as variable collation elements which have been shifted.

Schematic Example Main Term General Type Level Notation
[.nnnn.nnnn.nnnn.nnnn] Primary Collation Element Non-ignorable Level 0 Ignorable
[*nnnn.nnnn.nnnn.nnnn] Variable Collation Element (not shifted)
[.0000.nnnn.nnnn.nnnn] Secondary Collation Element Ignorable Level 1 Ignorable
[.0000.0000.nnnn.nnnn] Tertiary Collation Element Level 2 Ignorable
[.0000.0000.0000.nnnn] Quaternary Collation Element Level 3 Ignorable
Variable Collation Element (shifted)
[.0000.0000.0000.0000] Completely Ignorable Collation Element Level 4 Ignorable

For collation element tables, such as DUCET, which only use three levels of collation weights, the terminology would adjust as follows, and similar adjustments can be made to express collation element tables using more than four weight levels.

Schematic Example Main Term General Type Level Notation
[.nnnn.nnnn.nnnn] Primary Collation Element Non-ignorable Level 0 Ignorable
[*nnnn.nnnn.nnnn] Variable Collation Element (not shifted)
[.0000.nnnn.nnnn] Secondary Collation Element Ignorable Level 1 Ignorable
[.0000.0000.nnnn] Tertiary Collation Element Level 2 Ignorable
[.0000.0000.0000] Completely Ignorable Collation Element Level 3 Ignorable

3.3 Mappings

An important feature of the Unicode Collation Algorithm is the systematic mapping of Unicode characters (in Unicode strings) to sequences of collation elements, for the purpose of comparing those strings. The sequence of collation elements is then converted into a sort key, suitable for direct comparison. This section defines the various types of collation element mappings discussed in the specification of the algorithm.

UTS10-D17. Collation Element Mapping: A mapping from one (or more) Unicode characters to one (or more) collation elements.

Effectively, a given collation element table defines a mathematical function. It is instantiated as a list of collation element mappings. Collectively, the input for those mappings, which generally consists of all Unicode code points plus some well-defined set of short sequences of Unicode characters, constitutes the domain of the function. Collectively, the output for those mappings consists of a defined list of collation element sequences; the set of which constitutes the codomain of the function. And the collation element table itself constitutes the graph of the function.

Note: For formal completeness, a collation element mapping is usually defined to include in its domain all Unicode code points, including noncharacters and unassigned, reserved code points. The Unicode Collation Algorithm specifies the mapping for unassigned code points, as well as some ranges of assigned characters, by means of implicit weights. See Section 10, Weight Derivation for details. However, because specific collation element tables such as the Default Unicode Collation Element Table (DUCET) generally only contain a list of collation element mappings for assigned characters, and maps those assigned characters to collation elements with explicit weights, the definitions in this section are simplified by referring to the input values just in terms of "Unicode characters".

Collation element mappings are divided into subtypes, based on a distinction between whether the input of the mapping constitutes a single Unicode character or a sequence of Unicode characters, and a separate distinction between whether the output of the mapping constitutes a single collation element or a sequence of collation elements.

UTS10-D18. Simple Mapping: A collation element mapping from one Unicode character to one collation element.

UTS10-D19. Expansion: A collation element mapping from one Unicode character to a sequence of more than one collation element.

UTS10-D20. Many-to-One Mapping: A collation element mapping from more than one Unicode character to one collation element.

UTS10-D21. Many-to-Many Mapping: A collation element mapping from more than one Unicode character to a sequence of more than one collation element.

UTS10-D22. Contraction: Either a many-to-one mapping or a many-to-many mapping.

Both many-to-many mappings and many-to-one mappings are referred to as contractions in the discussion of the Unicode Collation Algorithm, even though many-to-many mappings often do not actually shorten anything. The main issue for implementations is that for both many-to-one mappings and many-to-many mappings, the weighting algorithm must first identify a sequence of characters in the input string and "contract" them together as a unit for weight lookup in the table. The identified unit may then be mapped to any number of collation elements. Contractions pose particular issues for implementations, because all eligible contraction targets must be identified first, before the application of simple mappings, so that processing for simple mappings does not bleed away the context needed to correctly identify the contractions.

3.3.1 Simple Mappings

Most of the mappings in a collation element table are simple mappings.

The following table shows several instances of simple mappings. These simple mappings are used in the examples illustrating the algorithm.

Character Collation Element Name
0300 "`" [.0000.0021.0002] COMBINING GRAVE ACCENT
0061 "a" [.1C47.0020.0002] LATIN SMALL LETTER A
0062 "b" [.1C60.0020.0002] LATIN SMALL LETTER B
0063 "c" [.1C7A.0020.0002] LATIN SMALL LETTER C
0043 "C" [.1C7A.0020.0008] LATIN CAPITAL LETTER C
0064 "d" [.1C8F.0020.0002] LATIN SMALL LETTER D
0065 "e" [.1CAA.0020.0002] LATIN SMALL LETTER E

3.3.2 Expansions

The following table shows an example of an expansion. The first two rows show simple mappings for "a" and "e". The third row shows the single Unicode character æ is mapped to a sequence of collation elements, rather than a single collation element.

Character Collation Elements Name
0061 "a" [.1C47.0020.0002] LATIN SMALL LETTER A
0065 "e" [.1CAA.0020.0002] LATIN SMALL LETTER E
00E6 "æ" [.1C47.0020.0004][.0000.0110.0004][.1CAA.0020.0004] LATIN SMALL LETTER AE

In the row with the expansion for "æ", the two underlined primary weights have the same values as the primary weights for the simple mappings for "a" and "e", respectively. This is the basis for establishing a primary equivalence between "æ" and the sequence "ae".

3.3.3 Contractions

Similarly, where the sequence ch is treated as a single digraph letter, as for instance in Slovak, it is represented as a mapping from two characters to a single collation element, as shown in the following example:

Character Collation Element Name
0063 0068 "ch" [.1D19.0020.0002] LATIN SMALL LETTER C, LATIN SMALL LETTER H

In this example, the collation element [.1D19.0020.0002] has a primary weight one greater than the primary weight for the letter h by itself, so that the sequence ch will collate after h and before i. This example shows the result of a tailoring of collation element mappings to weight sequences of letters as a single unit.

3.3.4 Many-to-Many Mappings

In some cases a sequence of two or more characters is mapped to a sequence of two or more collation elements. For example, this technique is used in the Default Unicode Collation Element Table [Allkeys] to handle weighting of rearranged sequences of Thai or Lao left-side-vowel + consonant. See Section 6.1.1, Rearrangement and Contractions.

Certain characters may both expand and contract. See Section 1.3, Contextual Sensitivity.

3.4 Collation Element Tables

UTS10-D23. Collation Element Table: A table of collation element mappings.

The basic idea of a collation element table is that it contains the collation weight information necessary to construct sort keys for Unicode strings.

UTS10-D24. Explicit Weight Mapping: A mapping to one (or more) collation elements which is explicitly listed in a collation element table.

UTS10-D25. Implicit Weight Mapping: A mapping to one (or more) collation elements which is not explicitly listed in a collation element table, but which is instead derived by rule.

The convention used by the Unicode Collation Algorithm is that the mapping for any character which is not listed explicitly in a given collation element table is instead determined by the implicit weight derivation rules. This convention extends to all unassigned code points, so that all Unicode strings can have determinant sort keys constructed for them. See Section 10, Weight Derivation for the rules governing the assignment of implicit weights.

Implementations can produce the same result using various representations of weights. In particular, while the Default Unicode Collation Element Table [Allkeys] stores weights of all levels using 16-bit integers, and such weights are shown in examples in this document, other implementations may choose to store weights in larger or smaller integer units, and may store weights of different levels in integer units of different sizes. See Section 9, Implementation Notes.

The specific collation weight values shown in examples are illustrative only; they may not match the weights in the latest Default Unicode Collation Element Table [Allkeys].

UTS10-D26. Minimum Weight at a Level: The least weight in any collation element in a given collation element table, at a specified level.

The minimum weight at level n is abbreviated with the notation: MINn.

UTS10-D27. Maximum Weight at a Level: The greatest weight in any collation element in a given collation element table, at a specified level.

The maximum weight at level n is abbreviated with the notation: MAXn.

3.5 Input Matching

The Unicode Collation Algorithm involves a step where the content of an input string is matched up, piece by piece, against the mappings in the collation element table, to produce an array of collation elements, which in turn is processed further into a sort key. This section provides definitions for terms relevant to that input matching process.

UTS10-D28. Input Match: An association between a sequence of one or more Unicode characters in the input string to a collation element mapping in the collation element table, where the sequence from the input string exactly matches the Unicode characters specified for that mapping.

The exact rules for processing an input string systematically are specified in Section 7.2, Produce Collation Element Arrays. Those rules determine how to walk down an input string, to pull out the candidate character (or sequence of characters) for input matching, in order to find each successive relevant collation element.

UTS10-D29. Single Character Match: An input match for a single character.

UTS10-D30. Contraction Match: An input match for a sequence of characters.

UTS10-D31. Contiguous Match: A contraction match for a contiguous sequence of characters.

UTS10-D32. Discontiguous Match: A contraction match for a discontiguous sequence of characters.

The concept of a discontiguous match is necessary for the UCA, because the algorithm requires identical outcomes for canonically equivalent input sequences. UCA specifies that an input string shall first be normalized into NFD. (Normalize Each String) However, normalization of a combining character sequence can, in principle, separate a combining mark from a base character for which a contraction mapping has been defined, as shown in the following example:

<a, combining dot below, combining ring above>

<a, combining ring above, combining dot below>.

In this example, the first line, with the dot below occurring ahead of the ring above, is normalized to NFD; however, the sequence shown in the second line is canonically equivalent. Now, if a contraction has been defined in a collation element table for the sequence <a, combining ring above>, as might be the case for a Danish string ordering, for which "å" sorts after "z", then that sequence must be found and a weight matching be done, even in the case where the match has become discontiguous as a result of the string normalization. See Section 7.2, Produce Collation Element Array, for details about how the UCA specifies that discontiguous matching be done.

UTS10-D33. Non-Starter: An assigned character with Canonical_Combining_Class ≠ 0.

This definition of non-starter is based on the definition of starter. See D107 in [Unicode]. By the definition of Canonical_Combining_Class, a non-starter must be a combining mark; however, not all combining marks are non-starters, because many combining marks have Canonical_Combining_Class = 0. A non-starter cannot be an unassigned code point: all unassigned code points are starters, because their Canonical_Combining_Class value defaults to 0.

UTS10-D34. Blocking Context: The presence of a character B between two characters C1 and C2, where ccc(B) = 0 or ccc(B) ≥ ccc(C2).

The notation ccc(B) is an abbreviation for "the Canonical_Combining_Class value of B".

UTS10-D35. Unblocked Non-Starter: A non-starter C2 which is not in a blocking context with respect to a preceding character C1 in a string.

In the context <C1 ... B ... C2>, if there is no intervening character B which meets the criterion for being a blocking context, and if C2 is a non-starter, then it is also an unblocked non-starter.

The concept of an unblocked non-starter is pertinent to the determination of sequences which constitute a discontiguous match. The process to find a longest match can continue skipping over subsequent unblocked non-starters in an attempt to find the longest discontiguous match. The search for such a match will terminate at the first non-starter in a blocked context, or alternatively, at the first starter encountered during the lookahead. The details of this discontiguous matching are spelled out in Section 7.2, Produce Collation Element Array.

A sequence of characters which otherwise would result in a contraction match can be made to sort as separate characters by inserting, someplace within the sequence, a starter that maps to a completely ignorable collation element. By definition this creates a blocking context, even though the completely ignorable collation element would not otherwise affect the assigned collation weights. There are two characters, U+00AD SOFT HYPHEN and U+034F COMBINING GRAPHEME JOINER, that are particularly useful for this purpose. These can be used to separate sequences of characters that would normally be weighted as units, such as Slovak "ch" or Danish "aa". See Section 8.3, Use of Combining Grapheme Joiner.

3.6 Sort Keys

This section provides definitions for sort key and for other concepts used in the assembly of sort keys. See also Section 9.7, Incremental Comparison for a discussion of implementation tradeoffs which impact performance of string comparisons.

UTS10-D36. Sort Key: An array of non-negative integers, associated with an input string, systematically constructed by extraction of collation weights from a collation element array, and suitable for binary comparison.

The Unicode Collation Algorithm maps input strings, which are very complicated to compare accurately, down to these simple arrays of integers called sort keys, which can then be compared very reliably and quickly with a low-level memory comparison operation, such as memcmp().

The discussion and examples in this specification use the same 16-bit hexadecimal notation for the integers in a sort key as is used for the collation weights. There is, however, no requirement that integers in a sort key be stored with the same integral width as collation weights stored in a collation element table. The only practical requirement is that all of the integers in a given sort key (and any sort key to be compared with it) be stored with the same integral width. That means that they constitute simple arrays, amenable to efficient, low-level array processing.

UTS10-D37. Level Separator: A low integer weight used in the construction of sort keys to separate collation weights extracted from different levels in the collation element array.

The UCA uses the value zero (0000) for the level separator, to guarantee that the level separator has a lower value than any of the actual collation weights appended to the sort key from the collation element array. Implementations can, however, use a non-zero value, as long as that value is lower than the minimum weight at that level.

3.7 Comparison

This section provides the definitions and notational conventions used in describing the comparison of collation elements and of strings.

3.7.1 Equality

UTS10-D38. Two collation elements are primary equal if and only if the primary weight of each collation element is equal.

UTS10-D39. Two collation elements are secondary equal if and only if the secondary weight of each collation element is equal and the collation elements are primary equal.

UTS10-D40. Two collation elements are tertiary equal if and only if the tertiary weight of each collation element is equal and the collation elements are secondary equal.

UTS10-D41. Two collation elements are quaternary equal if and only if the quaternary weight of each collation element is equal and the collation elements are tertiary equal.

3.7.2 Inequality

UTS10-D42. A collation element X is primary less than collation element Y if and only if the primary weight of X is less than the primary weight of Y.

UTS10-D43. A collation element X is secondary less than collation element Y if and only if

UTS10-D44. A collation element X is tertiary less than collation element Y if and only if

UTS10-D45. A collation element X is quaternary less than collation element Y if and only if

Other inequality comparison relations can be defined by corollary, based on the definitions of the equality and/or less than relations, on a level-by-level basis. For example, a collation element X is primary greater than collation element Y if and only if Y is primary less than X. A collation element X is primary less than or equal to collation element Y if and only if (X is primary less than Y) or (X is primary equal to Y). And so forth for other levels.

3.7.3 Notation for Collation Element Comparison

This specification uses a notation with subscripted numbers following the equals sign to represent the level-specific equality relations, as shown in Table 7.

Table 7. Equals Notation

Notation Reading Meaning
X =1 Y X is primary equal to Y X1 = Y1
X =2 Y X is secondary equal to Y X2 = Y2 and X =1 Y
X =3 Y X is tertiary equal to Y X3 = Y3 and X =2 Y
X =4 Y X is quaternary equal to Y X4 = Y4 and X =3 Y

Similarly, subscripted numbers following the less than sign are used to indicate the level-specific less than comparison relations, as shown in Table 8.

Table 8. Less Than Notation

Notation Reading Meaning
X <1 Y X is primary less than Y X1 < Y1
X <2 Y X is secondary less than Y X <1 Y or (X =1 Y and X2 < Y2)
X <3 Y X is tertiary less than Y X <2 Y or (X =2 Y and X3 < Y3)
X <4 Y X is quaternary less than Y X <3 Y or (X =3 Y and X4 < Y4)

Other symbols for inequality relations are given their customary definitions in terms of the notational conventions just described:

3.7.4 Notation for String Comparison

This notation for collation elements is also adapted to refer to ordering between strings, as shown in Table 9, where A and B refer to two strings.

Table 9. Notation for String Ordering

Notation Meaning
A <2 B A is less than B, and there is a primary or secondary difference between them
A <2 B and A =1 B A is less than B, but there is only a secondary difference between them
A ≡ B A and B are equivalent (equal at all levels) according to a given Collation Element Table
A = B A and B are bit-for-bit identical

Where only plain text ASCII characters are available the fallback notation in Table 10 may be used.

Table 10. Fallback Notation

Notation Fallback
X <n Y X <[n] Y
Xn X[n]
X ≤n Y X <=[n] Y
A ≡ B A =[a] B

3.8