Technical Reports
| Version | Unicode 17.0.0 |
| Editors | Josh Hadley ([email protected]) |
| Date | 2025-08-17 |
| This Version | https://www.unicode.org/reports/tr29/tr29-47.html |
| Previous Version | https://www.unicode.org/reports/tr29/tr29-45.html |
| Latest Version | https://www.unicode.org/reports/tr29/ |
| Latest Proposed Update | https://www.unicode.org/reports/tr29/proposed.html |
| Revision | 47 |
This annex describes guidelines for determining default segmentation boundaries between certain significant text elements: grapheme clusters (“user-perceived characters”), words, and sentences. For line boundaries, see [UAX14].
This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.
A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this annex is found in Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.” For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions]. For any errata which may apply to this annex, see [Errata].
This annex describes guidelines for determining default boundaries between certain significant text elements: user-perceived characters, words, and sentences. The process of boundary determination is also called segmentation.
A string of Unicode-encoded text often needs to be broken up into text elements programmatically. Common examples of text elements include what users think of as characters, words, lines (more precisely, where line breaks are allowed), and sentences. The precise determination of text elements may vary according to orthographic conventions for a given script or language. The goal of matching user perceptions cannot always be met exactly because the text alone does not always contain enough information to unambiguously decide boundaries. For example, the period (U+002E FULL STOP) is used ambiguously, sometimes for end-of-sentence purposes, sometimes for abbreviations, and sometimes for numbers. In most cases, however, programmatic text boundaries can match user perceptions quite closely, although sometimes the best that can be done is not to surprise the user.
Rather than concentrate on algorithmically searching for text elements (often called segments), a simpler and more useful computation instead detects the boundaries (or breaks) between those text elements. The determination of those boundaries is often critical to performance, so it is important to be able to make such a determination as quickly as possible. (For a general discussion of text elements, see Chapter 2, General Structure, of [Unicode].)
The default boundary determination mechanism specified in this annex provides a straightforward and efficient way to determine some of the most significant boundaries in text: user-perceived characters, words, and sentences. Boundaries used in line breaking (also called word wrapping) are defined in [UAX14].
The sheer number of characters in the Unicode Standard, together with its representational power, place requirements on both the specification of text element boundaries and the underlying implementation. The specification needs to allow the designation of large sets of characters sharing the same characteristics (for example, uppercase letters), while the implementation must provide quick access and matches to those large sets. The mechanism also must handle special features of the Unicode Standard, such as nonspacing marks and conjoining jamos.
The default boundary determination builds upon the uniform character representation of the Unicode Standard, while handling the large number of characters and special features such as nonspacing marks and conjoining jamos in an effective manner. As this mechanism lends itself to a completely data-driven implementation, it can be tailored to particular orthographic conventions or user preferences without recoding.
As in other Unicode algorithms, these specifications provide a logical description of the processes: implementations can achieve the same results without using code or data that follows these rules step-by-step. In particular, many production-grade implementations will use a state-table approach. In that case, the performance does not depend on the complexity or number of rules. Rather, performance is only affected by the number of characters that may match after the boundary position in a rule that applies.
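The state-table approach can be sketched as follows. This is a hypothetical toy covering only the CR, LF, and Control behavior (rules GB1–GB5), not the full rule set; all names are inventions for illustration. The point is that once the rules are compiled into a table, the per-offset cost is a single lookup, independent of how many rules built the table.

```python
# Toy sketch of a table-driven segmenter: the pairwise rules GB3-GB5
# compiled into a break table, with GB1/GB2 handled at the ends.
CR, LF, CONTROL, OTHER = range(4)

def gcb_class(ch: str) -> int:
    # Illustrative classification; a real implementation reads the
    # Grapheme_Cluster_Break property from the UCD data files.
    if ch == "\r":
        return CR
    if ch == "\n":
        return LF
    if ord(ch) < 0x20 or ch == "\x7f":
        return CONTROL
    return OTHER

# break_table[prev][next]: True means a boundary is allowed (÷).
break_table = [
    # next:  CR    LF     CONTROL OTHER
    [True,  False, True,  True],   # prev CR: GB3 keeps CR x LF together
    [True,  True,  True,  True],   # prev LF: GB4, break after controls
    [True,  True,  True,  True],   # prev Control: GB4
    [True,  True,  True,  True],   # prev Other: GB5 / "otherwise break" (toy)
]

def boundaries(text: str):
    """Yield every offset where a boundary is allowed; GB1 and GB2
    contribute the start and end of non-empty text."""
    if not text:
        return
    yield 0
    for i in range(1, len(text)):
        if break_table[gcb_class(text[i - 1])][gcb_class(text[i])]:
            yield i
    yield len(text)
```

For example, `list(boundaries("a\r\nb"))` yields `[0, 1, 3, 4]`: the CR LF pair stays unbroken.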
A boundary specification summarizes boundary property values used in that specification, then lists the rules for boundary determinations in terms of those property values. The summary is provided as a list, where each element of the list is one of the following:
- A literal character
- A range of literal characters
- All characters satisfying a given condition, using properties defined in the Unicode Character Database [UCD]
- Boolean combinations of the above
- Two special identifiers, sot and eot, standing for start of text and end of text, respectively
For example, the following is such a list:
General_Category = Line_Separator, or
General_Category = Paragraph_Separator, or
General_Category = Control, or
General_Category = Format
and not U+000D CARRIAGE RETURN (CR)
and not U+000A LINE FEED (LF)
and not U+200C ZERO WIDTH NON-JOINER (ZWNJ)
and not U+200D ZERO WIDTH JOINER (ZWJ)
In the table assigning the boundary property values, all of the values are intended to be disjoint except for the special value Any. In case of conflict, rows higher in the table have precedence in terms of assigning property values to characters. Data files containing explicit assignments of the property values are found in [Props].
Boundary determination is specified in terms of an ordered list of rules, indicating the status of a boundary position. The rules are numbered for reference and are applied in sequence to determine whether there is a boundary at any given offset. That is, there is an implicit “otherwise” at the front of each rule following the first. The rules are processed from top to bottom. As soon as a rule matches and produces a boundary status (boundary or no boundary) for that offset, the process is terminated.
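The first-match-wins cascade described above can be illustrated with a hypothetical sketch (the rule objects and names here are inventions, not part of the specification): each rule either produces a boundary status or passes, and the first rule with an opinion decides.

```python
# Illustrative rule cascade: each rule inspects an offset and answers
# "÷" (boundary), "×" (no boundary), or None (no opinion). The first
# non-None answer wins, mirroring the implicit "otherwise" chaining.
def apply_rules(rules, text, offset):
    for rule in rules:
        verdict = rule(text, offset)
        if verdict is not None:
            return verdict
    return "÷"  # final catch-all, like GB999 "otherwise, break everywhere"

# Two toy rules modeled on GB3 and GB4:
gb3 = lambda t, i: "×" if t[i - 1] == "\r" and t[i] == "\n" else None
gb4 = lambda t, i: "÷" if t[i - 1] in "\r\n" else None

# GB3 is consulted first, so CR followed by LF reports no boundary:
assert apply_rules([gb3, gb4], "\r\n", 1) == "×"
```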
Each rule consists of a left side, a boundary symbol (see Table 1), and a right side. Either of the sides can be empty. The left and right sides use the boundary property values in regular expressions. The regular expression syntax used is a simplified version of the format supplied in Unicode Technical Standard #18, Unicode Regular Expressions [UTS18].
Table 1. Boundary Symbols
| ÷ | Boundary (allow break here) |
| × | No boundary (do not allow break here) |
| → | Treat whatever is on the left side as if it were what is on the right side |
An open-box symbol (“␣”) is used to indicate a space in examples.
These rules are constrained in three ways, to make implementations significantly simpler and more efficient. These constraints have not been found to be limitations for natural language use. In particular, the rules are formulated so that they can be efficiently implemented, such as with a deterministic finite-state machine based on a small number of property values.
There are many different ways to divide text elements corresponding to user-perceived characters, words, and sentences, and the Unicode Standard does not restrict the ways in which implementations can produce these divisions. However, it does provide conformance clauses to enable implementations to clearly describe their behavior in relation to the default behavior.
UAX29-C1. Extended Grapheme Cluster Boundaries: An implementation shall choose either UAX29-C1-1 or UAX29-C1-2 to determine whether an offset within a sequence of characters is an extended grapheme cluster boundary.
UAX29-C1-1. Use the property values defined in the Unicode Character Database [UCD] and the extended rules in Section 3.1 Grapheme Cluster Boundary Rules to determine the boundaries.
The default grapheme clusters are also known as extended grapheme clusters.
UAX29-C1-2. Declare the use of a profile of UAX29-C1-1, and define that profile with a precise specification of any changes in property values or rules and/or provide a description of programmatic overrides to the behavior of UAX29-C1-1.
Legacy grapheme clusters are such a profile.
UAX29-C2. Word Boundaries: An implementation shall choose either UAX29-C2-1 or UAX29-C2-2 to determine whether an offset within a sequence of characters is a word boundary.
UAX29-C2-1. Use the property values defined in the Unicode Character Database [UCD] and the rules in Section 4.1 Default Word Boundary Specification to determine the boundaries.
UAX29-C2-2. Declare the use of a profile of UAX29-C2-1, and define that profile with a precise specification of any changes in property values or rules and/or provide a description of programmatic overrides to the behavior of UAX29-C2-1.
UAX29-C3. Sentence Boundaries: An implementation shall choose either UAX29-C3-1 or UAX29-C3-2 to determine whether an offset within a sequence of characters is a sentence boundary.
UAX29-C3-1. Use the property values defined in the Unicode Character Database [UCD] and the rules in Section 5.1 Default Sentence Boundary Specification to determine the boundaries.
UAX29-C3-2. Declare the use of a profile of UAX29-C3-1, and define that profile with a precise specification of any changes in property values or rules and/or provide a description of programmatic overrides to the behavior of UAX29-C3-1.
This specification defines default mechanisms; more sophisticated implementations can and should tailor them for particular locales or environments and, for the purpose of claiming conformance, document the tailoring in the form of a profile. For example, reliable detection of word boundaries in languages such as Thai, Lao, Chinese, or Japanese requires the use of dictionary lookup or other mechanisms, analogous to English hyphenation. An implementation therefore may need to provide means for a programmatic override of the default mechanisms described in this annex. Note that a profile can both add and remove boundary positions, compared to the results specified by UAX29-C1-1, UAX29-C2-1, or UAX29-C3-1.
Notes:
- Locale-sensitive boundary specifications, including boundary suppressions, can be expressed in LDML [UTS35]. Some profiles are available in the Common Locale Data Repository [CLDR].
- Some changes to rules and data are needed for best segmentation behavior of additional emoji zwj sequences [UTS51]. Implementations are strongly encouraged to use the extended text segmentation rules in the latest version of CLDR.
To maintain canonical equivalence, all of the following specifications are defined on text normalized in form NFD, as defined in Unicode Standard Annex #15, “Unicode Normalization Forms” [UAX15]. Boundaries never occur within a combining character sequence or conjoining sequence, so the boundaries within non-NFD text can be derived from corresponding boundaries in the NFD form of that text. For convenience, the default rules have been written so that they can be applied directly to non-NFD text and yield equivalent results. (This may not be the case with tailored default rules.) For more information, see Section 6, Implementation Notes.
A single Unicode code point is often, but not always, the same as a basic unit of a writing system for a language, or what a typical user might think of as a “character”. There are many cases where such a basic unit is made up of multiple Unicode code points. To avoid ambiguity with the term character as defined for encoding purposes, it can be useful to speak of a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet it is actually represented by two Unicode code points.
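The “G” + grave-accent example can be verified with the standard Python `unicodedata` module:

```python
import unicodedata

s = "G\u0300"  # U+0047 LATIN CAPITAL LETTER G + U+0300 COMBINING GRAVE ACCENT
assert len(s) == 2                           # two code points...
assert unicodedata.combining("\u0300") > 0   # ...the second is a combining mark
# A grapheme-cluster-aware count would report one user-perceived character.
```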
The notion of user-perceived character is not always an unambiguous concept for a given writing system: it may differ based on language, script style, or even based on context, for the same user. Drop-caps and initialisms, text selection, or "character" counting for text size limits are all contexts in which the basic unit may be defined differently.
In implementations, the notion of user-perceived characters corresponds to the concept of grapheme clusters. They are a best-effort approximation that can be determined programmatically and unambiguously. The definition of grapheme clusters attempts to achieve uniformity across all human text without requiring language or font metadata about that text. As an approximation, it may not cover all potential types of user-perceived characters, and it may have suboptimal behavior in some scripts where further metadata is needed, or where a different notion of user-perceived character is preferred. Such special cases may require a customization of the algorithm, while the generic case continues to be supported by the standard algorithm.
As far as a user is concerned, the underlying representation of text is not important, but it is important that an editing interface present a uniform implementation of what the user thinks of as characters. Grapheme clusters can be treated as units, by default, for processes such as the formatting of drop caps, as well as the implementation of text selection, arrow key movement, forward deletion, and so forth. For example, when a grapheme cluster is represented internally by a character sequence consisting of base character + accents, then using the right arrow key would skip from the start of the base character to the end of the last accent.
Grapheme cluster boundaries are also important for collation, regular expressions, UI interactions, segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text. Word boundaries, line boundaries, and sentence boundaries should not occur within a grapheme cluster: in other words, a grapheme cluster should be an atomic unit with respect to the process of determining these other boundaries.
This document defines a default specification for grapheme clusters. It may be customized for particular languages, operations, or other situations. For example, arrow key movement could be tailored by language, or could use knowledge specific to particular fonts to move in a more granular manner, in circumstances where it would be useful to edit individual components. This could apply, for example, to the complex editorial requirements for the Northern Thai script Tai Tham (Lanna). Similarly, editing a grapheme cluster element by element may be preferable in some circumstances. For example, on a given system the backspace key might delete by code point, while the delete key may delete an entire cluster.
Moreover, there is not a one-to-one relationship between grapheme clusters and keys on a keyboard. A single key on a keyboard may correspond to a whole grapheme cluster, a part of a grapheme cluster, or a sequence of more than one grapheme cluster.
Grapheme clusters can only provide an approximation of where to put cursors. Detailed cursor placement depends on the text editing framework. The text editing framework determines where the edges of glyphs are, and how they correspond to the underlying characters, based on information supplied by the lower-level text rendering engine and font. For example, the text editing framework must know if a digraph is represented as a single glyph in the font, and therefore may not be able to position a cursor at the proper position separating its two components. That framework must also be able to determine display representation in cases where two glyphs overlap—this is true generally when a character is displayed together with a subsequent nonspacing mark, but must also be determined in detail for complex script rendering. Grapheme cluster boundaries can therefore supply only an approximate guide for cursor placement, based on least-common-denominator fonts for the script.
In those relatively rare circumstances where programmers need to supply end users with user-perceived character counts, the counts should correspond to the number of segments delimited by grapheme cluster boundaries. Grapheme clusters may also be used in searching and matching; for more information, see Unicode Technical Standard #10, “Unicode Collation Algorithm” [UTS10], and Unicode Technical Standard #18, “Unicode Regular Expressions” [UTS18].
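A simple sketch of such a count, assuming only stdlib `unicodedata` (the function name and the simplification are this example's own, not part of the specification):

```python
import unicodedata

def approx_user_char_count(text: str) -> int:
    """Crude approximation of a grapheme cluster count: count code
    points that are not combining marks. This ignores Hangul jamo,
    ZWJ sequences, regional indicators, and spacing marks, so a real
    implementation should apply the full UAX #29 rules (for example
    via ICU, or the third-party `regex` module's \\X)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)
```

With this sketch, `approx_user_char_count("G\u0300")` is 1, matching the user-perceived count rather than the code point count of 2.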
The Unicode Standard provides a default algorithm for determining grapheme cluster boundaries; the default grapheme clusters are also known as extended grapheme clusters. For backwards compatibility with earlier versions of this specification, the Standard also defines and maintains a profile for legacy grapheme clusters.
These algorithms can be adapted to produce tailored grapheme clusters for specific locales or other customizations, such as the contractions used in collation tailoring tables. In Table 1a are some examples of the differences between these concepts. The tailored examples are only for illustration: what constitutes a grapheme cluster will depend on the customizations used by the particular tailoring in question.
Table 1a. Sample Grapheme Clusters
| Ex | Characters | Comments |
|---|---|---|
| Grapheme clusters (both legacy and extended) | | |
| g̈ | 0067 ( g ) LATIN SMALL LETTER G + 0308 ( ◌̈ ) COMBINING DIAERESIS | combining character sequences |
| 각 | AC01 ( 각 ) HANGUL SYLLABLE GAG | Hangul syllables such as gag (which may be a single character, or a sequence of conjoining jamos) |
| 각 | 1100 ( ᄀ ) HANGUL CHOSEONG KIYEOK + 1161 ( ᅡ ) HANGUL JUNGSEONG A + 11A8 ( ᆨ ) HANGUL JONGSEONG KIYEOK | |
| ก | 0E01 ( ก ) THAI CHARACTER KO KAI | Thai ko |
| Extended grapheme clusters | | |
| நி | 0BA8 ( ந ) TAMIL LETTER NA + 0BBF ( ி ) TAMIL VOWEL SIGN I | Tamil ni |
| เ | 0E40 ( เ ) THAI CHARACTER SARA E | Thai e |
| กำ | 0E01 ( ก ) THAI CHARACTER KO KAI + 0E33 ( ำ ) THAI CHARACTER SARA AM | Thai kam |
| षि | 0937 ( ष ) DEVANAGARI LETTER SSA + 093F ( ि ) DEVANAGARI VOWEL SIGN I | Devanagari ssi |
| क्षि | 0915 ( क ) DEVANAGARI LETTER KA + 094D ( ् ) DEVANAGARI SIGN VIRAMA + 0937 ( ष ) DEVANAGARI LETTER SSA + 093F ( ि ) DEVANAGARI VOWEL SIGN I | Devanagari kshi |
| Legacy grapheme clusters | | |
| ำ | 0E33 ( ำ ) THAI CHARACTER SARA AM | Thai am |
| ष | 0937 ( ष ) DEVANAGARI LETTER SSA | Devanagari ssa |
| ि | 093F ( ि ) DEVANAGARI VOWEL SIGN I | Devanagari i |
| Possible tailored grapheme clusters in a profile | | |
| ch | 0063 ( c ) LATIN SMALL LETTER C + 0068 ( h ) LATIN SMALL LETTER H | Slovak ch digraph |
| kʷ | 006B ( k ) LATIN SMALL LETTER K + 02B7 ( ʷ ) MODIFIER LETTER SMALL W | sequence with modifier letter |
See also: Where is my Character?, and the UCD file NamedSequences.txt [Data34].
A legacy grapheme cluster is defined as a base (such as A or カ) followed by zero or more continuing characters. One way to think of this is as a sequence of characters that form a “stack”.
The base can be a single character, or any sequence of Hangul Jamo characters that form a Hangul syllable, as defined by D133 in The Unicode Standard, or a pair of Regional_Indicator (RI) characters. For more information about RI characters, see [UTS51].
The continuing characters include nonspacing marks, the Join_Controls (U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER) used in Indic languages, and a few spacing combining marks needed to ensure canonical equivalence. There are cases in Bangla, Khmer, Malayalam, and Odia in which a ZWNJ occurs after a consonant and before a virama or other combining mark. These cases should not provide an opportunity for a grapheme cluster break; therefore, ZWNJ has been included in the Extend class. Additional cases need to be added for completeness, so that any string of text can be divided into a sequence of grapheme clusters. Some of these may be degenerate cases, such as a control code or an isolated combining mark.
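That a conjoining jamo sequence forms a single base can be checked with normalization: the three-jamo sequence below is canonically equivalent to one precomposed syllable, so the Hangul rules keep it in one cluster.

```python
import unicodedata

# U+1100 CHOSEONG KIYEOK + U+1161 JUNGSEONG A + U+11A8 JONGSEONG KIYEOK
jamos = "\u1100\u1161\u11A8"
assert unicodedata.normalize("NFC", jamos) == "\uAC01"  # HANGUL SYLLABLE GAG
# Rules GB6-GB8 suppress boundaries inside such sequences, so the jamo
# form and the precomposed form contain the same (single) cluster.
```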
An extended grapheme cluster is the same as a legacy grapheme cluster, with the addition of some other characters. The continuing characters are extended to include all spacing combining marks, such as the spacing (but dependent) vowel signs in Indic scripts. For example, this includes U+093F ( ि ) DEVANAGARI VOWEL SIGN I. The extended grapheme clusters should be used in implementations in preference to legacy grapheme clusters, because they provide better results for Indic scripts such as Tamil or Devanagari in which editing by orthographic syllable is typically preferred. For scripts such as Thai, Lao, and certain other Southeast Asian scripts, editing by visual unit is typically preferred, so for those scripts the behavior of extended grapheme clusters is similar to (but not identical to) the behavior of legacy grapheme clusters.
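The difference between the two variants can be illustrated with the Devanagari example above; only the property check here is executable, while the cluster results are stated in comments.

```python
import unicodedata

ssi = "\u0937\u093F"  # DEVANAGARI LETTER SSA + DEVANAGARI VOWEL SIGN I
# U+093F is a spacing combining mark (General_Category = Mc). A legacy
# grapheme cluster boundary falls before it, splitting the syllable in
# two; extended rule GB9a suppresses that boundary, keeping one cluster.
assert unicodedata.category("\u093F") == "Mc"
assert unicodedata.category("\u0937") == "Lo"
```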
For the rules defining the boundaries for grapheme clusters, see Section 3.1. For more information on the composition of Hangul syllables, see Chapter 3, Conformance, of [Unicode].
A key feature of Unicode grapheme clusters (both legacy and extended) is that they remain unchanged across all canonically equivalent forms of the underlying text. Thus the boundaries remain unchanged whether the text is in NFC or NFD. Using a grapheme cluster as the fundamental unit of matching thus provides a very clear and easily explained basis for canonically equivalent matching. This is important for applications from searching to regular expressions.
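The invariance under canonical equivalence can be demonstrated with a pair of normalized forms:

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "e\u0301")  # one code point: U+00E9
nfd = unicodedata.normalize("NFD", "\u00E9")   # two: e + combining acute
assert (len(nfc), len(nfd)) == (1, 2)
# The two strings are canonically equivalent, so their grapheme cluster
# boundaries are identical: each contains exactly one cluster.
```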
Another key feature is that default Unicode grapheme clusters are atomic units with respect to the process of determining the default Unicode word and sentence boundaries. They are usually—but not always—atomic units with respect to line boundaries: there are exceptions due to the special handling of spaces. For more information, see Section 9.2, Legacy Support for Space Character as Base for Combining Marks, in [UAX14].
Grapheme clusters can be tailored to meet further requirements. Such tailoring is permitted, but the possible rules are outside of the scope of this document. One example of such a tailoring would be for the aksaras, or orthographic syllables, used in many Indic scripts. Aksaras usually consist of a consonant, sometimes with an inherent vowel and sometimes followed by an explicit, dependent vowel whose rendering may end up on any side of the consonant letter base. Extended grapheme clusters include such simple combinations.
However, aksaras may also include one or more additional consonants, typically with a virama (halant) character between each pair of consonants in the sequence. Some consonant cluster aksaras are not incorporated into the default rules for extended grapheme clusters, in part because not all such sequences are considered to be single “characters” by users. Another reason is that additional changes to the rules are made when new information becomes available. Indic scripts vary considerably in how they handle the rendering of such aksaras—in some cases stacking them up into combined forms known as consonant conjuncts, and in other cases stringing them out horizontally, with visible renditions of the halant on each consonant in the sequence. There is even greater variability in how the typical liquid consonants (or “medials”), ya, ra, la, and wa, are handled for display in combinations in aksaras. So tailorings for aksaras may need to be script-, language-, font-, or context-specific to be useful.
Note: Font-based information may be required to determine the appropriate unit to use for UI purposes, such as identification of boundaries for first-letter paragraph styling. For example, such a unit could be a ligature formed of two grapheme clusters, such as لا (Arabic lam + alef).
The Unicode specification of grapheme clusters allows for more sophisticated profiles where appropriate. Such definitions may more precisely match the user expectations within individual languages for given processes. For example, “ch” may be considered a grapheme cluster in Slovak, for processes such as collation. The default definitions are, however, designed to provide a much more accurate match to overall user expectations for what the user perceives as characters than is provided by individual Unicode code points.
Note: The term cluster is used to emphasize that the term grapheme is used differently in linguistics.
Display of Grapheme Clusters. Grapheme clusters are not the same as ligatures. For example, the grapheme cluster “ch” in Slovak is not normally a ligature and, conversely, the ligature “fi” is not a grapheme cluster. Default grapheme clusters do not necessarily reflect text display. For example, the sequence <f, i> may be displayed as a single glyph on the screen, but would still be two grapheme clusters.
For information on the matching of grapheme clusters with regular expressions, see Unicode Technical Standard #18, “Unicode Regular Expressions” [UTS18].
Degenerate Cases. The default specifications are designed to be simple to implement, and provide an algorithmic determination of grapheme clusters. However, they do not have to cover edge cases that will not occur in practice. For the purpose of segmentation, they may also include degenerate cases that are not thought of as grapheme clusters, such as an isolated control character or combining mark. In this, they differ from the combining character sequences and extended combining character sequences defined in [Unicode]. In addition, Unassigned (Cn) code points and Private_Use (Co) characters are given property values that anticipate potential usage.
Combining Character Sequences and Grapheme Clusters. For comparison, Table 1b shows the relationship between combining character sequences and grapheme clusters, using regex notation. Note that given alternates (X|Y), the first match is taken. The simple identifiers starting with lowercase are variables that are defined in Table 1c; those starting with uppercase letters are Grapheme_Cluster_Break Property Values defined in Table 2.
Table 1b. Combining Character Sequences and Grapheme Clusters
| Term | Regex | Notes |
|---|---|---|
| combining character sequence | ccs-base? ccs-extend+ | A single base character is not a combining character sequence. However, a single combining mark is a (degenerate) combining character sequence. |
| extended combining character sequence | extended_base? ccs-extend+ | extended_base includes Hangul Syllables |
| legacy grapheme cluster | crlf | Control | legacy-core legacy-postcore* | A single base character is a grapheme cluster. Degenerate cases include any isolated non-base characters, and non-base characters like controls. |
| extended grapheme cluster | crlf | Control | precore* core postcore* | Extended grapheme clusters add prepending and spacing marks. |
Table 1b uses several symbols defined in Table 1c. Square brackets and \p{...} are used to indicate sets of characters, using the normal UnicodeSet notation.
Table 1c. Regex Definitions
ccs-base := |
[\p{L}\p{N}\p{P}\p{S}\p{Zs}] |
ccs-extend := |
[\p{M}\p{Join_Control}] |
extended_base := |
ccs-base |
crlf := |
CR LF | CR | LF |
legacy-core := |
hangul-syllable |
legacy-postcore := |
[Extend ZWJ] |
core := |
hangul-syllable |
postcore := |
[Extend ZWJ SpacingMark]
|
precore := |
Prepend |
RI-Sequence := |
RI RI |
hangul-syllable := |
L* (V+ | LV V* | LVT) T* |
xpicto-sequence := |
\p{Extended_Pictographic}
(Extend*
ZWJ \p{Extended_Pictographic})*
|
conjunctCluster := |
\p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+ |
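The conjunctCluster pattern can be approximated with stdlib `re` by hard-coding a few InCB assignments. The character classes below are simplified stand-ins (the authoritative InCB assignments are in the UCD's DerivedCoreProperties.txt), and the pattern omits the optional InCB=Extend marks:

```python
import re

# Simplified stand-ins for the InCB property classes (illustrative only):
CONSONANT = "[\u0915-\u0939]"  # main Devanagari consonants (InCB=Consonant)
LINKER = "\u094D"              # DEVANAGARI SIGN VIRAMA (InCB=Linker)

# Simplified conjunctCluster: consonant (linker consonant)+
conjunct = re.compile(f"{CONSONANT}(?:{LINKER}{CONSONANT})+")

assert conjunct.fullmatch("\u0915\u094D\u0937")   # ka + virama + ssa (ksha)
assert not conjunct.fullmatch("\u0915\u0937")     # no linker, not a conjunct
```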
The following is a general specification for grapheme cluster boundaries—language-specific rules in [CLDR] should be used where available.
The Grapheme_Cluster_Break property value assignments are explicitly listed in the corresponding data file in [Props]. The values in that file are the normative property values.
For illustration, property values are summarized in Table 2; the lists of characters shown there are not exhaustive.
Table 2. Grapheme_Cluster_Break Property Values
| Value | Summary List of Characters |
|---|---|
| CR | U+000D CARRIAGE RETURN (CR) |
| LF | U+000A LINE FEED (LF) |
| Control | General_Category = Line_Separator, or General_Category = Paragraph_Separator, or General_Category = Control, or General_Category = Unassigned and Default_Ignorable_Code_Point, or General_Category = Format and not U+000D CARRIAGE RETURN and not U+000A LINE FEED and not U+200C ZERO WIDTH NON-JOINER (ZWNJ) and not U+200D ZERO WIDTH JOINER (ZWJ) and not Prepended_Concatenation_Mark = Yes |
| Extend | Grapheme_Extend = Yes, or Emoji_Modifier=Yes This includes: General_Category = Nonspacing_Mark General_Category = Enclosing_Mark U+200C ZERO WIDTH NON-JOINER plus a few General_Category = Spacing_Mark needed for canonical equivalence. |
| ZWJ | U+200D ZERO WIDTH JOINER |
| Regional_Indicator (RI) | Regional_Indicator = Yes This consists of the range: U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A ..U+1F1FF REGIONAL INDICATOR SYMBOL LETTER Z |
| Prepend | Indic_Syllabic_Category = Consonant_Preceding_Repha, or Indic_Syllabic_Category = Consonant_Prefixed, or Prepended_Concatenation_Mark = Yes |
| SpacingMark | Grapheme_Cluster_Break ≠ Extend, and General_Category = Spacing_Mark, or any of the following (which have General_Category = Other_Letter): U+0E33 ( ำ ) THAI CHARACTER SARA AM U+0EB3 ( ຳ ) LAO VOWEL SIGN AM Exceptions: The following (which have General_Category = Spacing_Mark and would otherwise be included) are specifically excluded: U+102B ( ါ ) MYANMAR VOWEL SIGN TALL AA U+102C ( ာ ) MYANMAR VOWEL SIGN AA U+1038 ( း ) MYANMAR SIGN VISARGA U+1062 ( ၢ ) MYANMAR VOWEL SIGN SGAW KAREN EU ..U+1064 ( ၤ ) MYANMAR TONE MARK SGAW KAREN KE PHO U+1067 ( ၧ ) MYANMAR VOWEL SIGN WESTERN PWO KAREN EU ..U+106D ( ၭ ) MYANMAR SIGN WESTERN PWO KAREN TONE-5 U+1083 ( ႃ ) MYANMAR VOWEL SIGN SHAN AA U+1087 ( ႇ ) MYANMAR SIGN SHAN TONE-2 ..U+108C ( ႌ ) MYANMAR SIGN SHAN COUNCIL TONE-3 U+108F ( ႏ ) MYANMAR SIGN RUMAI PALAUNG TONE-5 U+109A ( ႚ ) MYANMAR SIGN KHAMTI TONE-1 ..U+109C ( ႜ ) MYANMAR VOWEL SIGN AITON A U+1A61 ( ᩡ ) TAI THAM VOWEL SIGN A U+1A63 ( ᩣ ) TAI THAM VOWEL SIGN AA U+1A64 ( ᩤ ) TAI THAM VOWEL SIGN TALL AA U+AA7B ( ꩻ ) MYANMAR SIGN PAO KAREN TONE U+AA7D ( ꩽ ) MYANMAR SIGN TAI LAING TONE-5 U+11720 ( 𑜠 ) AHOM VOWEL SIGN A U+11721 ( 𑜡 ) AHOM VOWEL SIGN AA |
| L | Hangul_Syllable_Type=L, such as: U+1100 ( ᄀ ) HANGUL CHOSEONG KIYEOK U+115F ( ᅟ ) HANGUL CHOSEONG FILLER U+A960 ( ꥠ ) HANGUL CHOSEONG TIKEUT-MIEUM U+A97C ( ꥼ ) HANGUL CHOSEONG SSANGYEORINHIEUH |
| V | Hangul_Syllable_Type=V, such as: U+1160 ( ᅠ ) HANGUL JUNGSEONG FILLER U+11A2 ( ᆢ ) HANGUL JUNGSEONG SSANGARAEA U+D7B0 ( ힰ ) HANGUL JUNGSEONG O-YEO U+D7C6 ( ퟆ ) HANGUL JUNGSEONG ARAEA-E, and: U+16D63 KIRAT RAI VOWEL SIGN AA U+16D67 KIRAT RAI VOWEL SIGN E ..U+16D6A KIRAT RAI VOWEL SIGN AU |
| T | Hangul_Syllable_Type=T, such as: U+11A8 ( ᆨ ) HANGUL JONGSEONG KIYEOK U+11F9 ( ᇹ ) HANGUL JONGSEONG YEORINHIEUH U+D7CB ( ퟋ ) HANGUL JONGSEONG NIEUN-RIEUL U+D7FB ( ퟻ ) HANGUL JONGSEONG PHIEUPH-THIEUTH |
| LV | Hangul_Syllable_Type=LV, that is: U+AC00 ( 가 ) HANGUL SYLLABLE GA U+AC1C ( 개 ) HANGUL SYLLABLE GAE U+AC38 ( 갸 ) HANGUL SYLLABLE GYA ... |
| LVT | Hangul_Syllable_Type=LVT, that is: U+AC01 ( 각 ) HANGUL SYLLABLE GAG U+AC02 ( 갂 ) HANGUL SYLLABLE GAGG U+AC03 ( 갃 ) HANGUL SYLLABLE GAGS U+AC04 ( 간 ) HANGUL SYLLABLE GAN ... |
| E_Base | This value is obsolete and unused. |
| E_Modifier | This value is obsolete and unused. |
| Glue_After_Zwj | This value is obsolete and unused. |
| E_Base_GAZ (EBG) | This value is obsolete and unused. |
| Any | This is not a property value; it is used in the rules to represent any code point. |
The same rules are used for the two variants of grapheme clusters, except for rules GB9a, GB9b, and GB9c. The following table shows the differences, which are also marked on the rules themselves. The extended rules are recommended, except where the legacy variant is required for a specific environment.
| Grapheme Cluster Variant | Includes | Excludes |
|---|---|---|
| LG: legacy grapheme clusters | | GB9a, GB9b, GB9c |
| EG: extended grapheme clusters | GB9a, GB9b, GB9c | |
When citing the Unicode definition of grapheme clusters, it must be clear which of the two alternatives is being specified: extended versus legacy.
| Break at the start and end of text, unless the text is empty. | |||
| GB1 | sot | ÷ | Any |
| GB2 | Any | ÷ | eot |
| Do not break between a CR and LF. Otherwise, break before and after controls. | |||
| GB3 | CR | × | LF |
| GB4 | (Control | CR | LF) | ÷ | |
| GB5 | ÷ | (Control | CR | LF) | |
| Do not break Hangul syllable or other conjoining sequences. | |||
| GB6 | L | × | (L | V | LV | LVT) |
| GB7 | (LV | V) | × | (V | T) |
| GB8 | (LVT | T) | × | T |
| Do not break before extending characters or ZWJ. | |||
| GB9 | | × | (Extend | ZWJ) |
| The GB9a and GB9b rules only apply to extended grapheme clusters: Do not break before SpacingMarks, or after Prepend characters. | |||
| GB9a | | × | SpacingMark |
| GB9b | Prepend | × | |
| The GB9c rule only applies to extended grapheme clusters: Do not break within certain combinations with Indic_Conjunct_Break (InCB) = Linker. | |||
| GB9c | \p{InCB=Consonant} [ \p{InCB=Extend} \p{InCB=Linker} ]* \p{InCB=Linker} [ \p{InCB=Extend} \p{InCB=Linker} ]* | × | \p{InCB=Consonant} |
| Do not break within emoji modifier sequences or emoji zwj sequences. | |||
| GB11 | \p{Extended_Pictographic} Extend* ZWJ | × | \p{Extended_Pictographic} |
| Do not break within emoji flag sequences. That is, do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point. | |||
| GB12 | sot (RI RI)* RI | × | RI |
| GB13 | [^RI] (RI RI)* RI | × | RI |
| Otherwise, break everywhere. | |||
| GB999 | Any | ÷ | Any |
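The parity condition behind the regional indicator rules can be sketched as follows. The helper names are inventions for illustration, and the splitter handles only a pure run of RI symbols, not the full boundary algorithm:

```python
# Sketch of the RI parity logic behind GB12/GB13: within a run of
# regional indicator symbols, boundaries fall only after an even count,
# so consecutive flags split into two-character pairs.
RI_BASE = 0x1F1E6  # U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A

def flag(cc: str) -> str:
    """Build an emoji flag sequence from a two-letter region code."""
    return "".join(chr(RI_BASE + ord(c) - ord("A")) for c in cc.upper())

def split_ri_run(ri_run: str):
    """Split a run consisting only of RI symbols into clusters: pairs,
    plus a possible lone trailing RI."""
    return [ri_run[i:i + 2] for i in range(0, len(ri_run), 2)]

run = flag("US") + flag("FR")  # four RI code points, two flags
assert split_ri_run(run) == [flag("US"), flag("FR")]
```

Without the parity condition, a naive segmenter could pair the second RI of one flag with the first RI of the next, displaying the wrong flags.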