Unicode® Standard Annex #44

Unicode Character Database

Version	Unicode 17.0.0
Editors	Ken Whistler
Date	2025-08-27
This Version	https://www.unicode.org/reports/tr44/tr44-36.html
Previous Version	https://www.unicode.org/reports/tr44/tr44-34.html
Latest Version	https://www.unicode.org/reports/tr44/
Latest Proposed Update	https://www.unicode.org/reports/tr44/proposed.html
Revision	36

Summary

This annex provides the core documentation for the Unicode Character Database (UCD). It describes the layout and organization of the Unicode Character Database and how it specifies the formal definitions of the Unicode Character Properties.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this annex is found in Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.” For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions]. For any errata which may apply to this annex, see [Errata].

1 Introduction
2 Conformance
- 2.1 Simple and Derived Properties
- 2.2 Use of Default Values
- 2.3 Stability of Releases
3 Documentation
- 3.1 Character Properties in the Standard
- 3.2 The Character Property Model
- 3.3 NamesList.html
- 3.4 StandardizedVariants.html
- 3.5 Emoji Variation Sequences
- 3.6 Unihan and UAX #38
- 3.7 UTC-Source Ideographs and UAX #45
- 3.8 Data File Comments
- 3.9 Obsolete Documentation Files
4 UCD Files
- 4.1 Directory Structure
- 4.2 File Format Conventions
- 4.3 File List
- 4.4 Zipped Files
- 4.5 UCD in XML
5 Properties
- 5.1 Property Index
- 5.2 About the Property Table
- 5.3 Property Definitions
- 5.4 Derived Extracted Properties
- 5.5 Contributory Properties
- 5.6 Case and Case Mapping
- 5.7 Property Value Lists
- 5.8 Property and Property Value Aliases
- 5.9 Matching Rules
- 5.10 Invariants
- 5.11 Validation
- 5.12 Deprecation
- 5.13 Property APIs
- 5.14 Character Age
6 Test Files
- 6.1 NormalizationTest.txt
- 6.2 Segmentation Test Files and Documentation
- 6.3 Bidirectional Test Files
7 UCD Change History
Acknowledgments
References
Modifications

Note: the information in this annex is not intended as an exhaustive description of the use and interpretation of Unicode character properties and behavior. It must be used in conjunction with the data in the other files in the Unicode Character Database, and relies on the notation and definitions supplied in The Unicode Standard. All chapter references are to Version 17.0.0 of the standard unless otherwise indicated.

1 Introduction

The Unicode Standard is far more than a simple encoding of characters. The standard also associates a rich set of semantics with each encoded character—properties that are required for interoperability and correct behavior in implementations, as well as for Unicode conformance. These semantics are cataloged in the Unicode Character Database (UCD), a collection of data files which contain the Unicode character code points and character names. The data files define the Unicode character properties and mappings between Unicode characters (such as case mappings).

This annex describes the UCD and provides a guide to the various documentation files associated with it. Additional information about character properties and their use is contained in the Unicode Standard and its annexes. In particular, implementers should familiarize themselves with the formal definitions and conformance requirements for properties detailed in Section 3.5, Properties in [Unicode] and with the material in Chapter 4, Character Properties in [Unicode]. Additional discussion about the Unicode character property model can be found in [UTR23].

The latest version of the UCD is always located on the Unicode website at:

https://www.unicode.org/Public/UCD/latest/

The specific files for the UCD associated with this version of the Unicode Standard (17.0.0) are located at:

https://www.unicode.org/Public/17.0.0/

Stable, archived versions of the UCD associated with all earlier versions of the Unicode Standard can be accessed from:

https://www.unicode.org/ucd/

For a description of the changes in the UCD for this version and earlier versions, see the UCD Change History.

2 Conformance

The Unicode Character Database is an integral part of the Unicode Standard.

The UCD contains normative property and mapping information required for implementation of various Unicode algorithms such as the Unicode Bidirectional Algorithm, Unicode Normalization, and Unicode Casefolding. The data files also contain additional informative and provisional character property information.

Each specification of a Unicode algorithm, whether specified in the text of [Unicode] or in one of the Unicode Standard Annexes, designates which data file(s) in the UCD are needed to provide normative property information required by that algorithm.

For information on the meaning and application of the terms normative, informative, contributory, and provisional, see Section 3.5, Properties in [Unicode].

For information about the applicable terms of use for the UCD, see the Unicode Terms of Use.

2.1 Simple and Derived Properties

2.1.1 Simple Properties

Some character properties in the UCD are simple properties. This status has no bearing on whether or not the properties are normative, but merely indicates that their values are not derived from some combination of other properties.

2.1.2 Derived Properties

Other character properties are derived. This means that their values are derived by rule from some other combination of properties. Generally such rules are stated as set operations, and may or may not include explicit exception lists for individual characters.

Certain simple properties are defined merely to make the statement of the rule defining a derived property more compact or general. Such properties are known as contributory properties. Sometimes these contributory properties are defined to encapsulate the messiness inherent in exception lists. At other times, a contributory property may be defined to help stabilize the definition of an important derived property which is subject to stability guarantees.

Derived character properties are not considered second-class citizens among Unicode character properties. They are defined to make implementation of important algorithms easier to state. Included among the first-class derived properties important for such implementations are: Uppercase, Lowercase, XID_Start, XID_Continue, Math, and Default_Ignorable_Code_Point, all defined in DerivedCoreProperties.txt, as well as derived properties for the optimization of normalization, defined in DerivedNormalizationProps.txt.

Implementations should simply use the derived properties, and should not try to rederive them from lists of simple properties and collections of rules, because of the chances for error and divergence when doing so.

Definitions of property derivations are provided for information only, typically in comment fields in the data files. Such definitions may be refactored, refined, or corrected over time. These definitions are presented in a modified set notation, expressed as set additions and/or subtractions of various other property values. For example:

# Derived Property: ID_Start
#  Characters that can start an identifier.
#  Generated from:
#      Lu + Ll + Lt + Lm + Lo + Nl
#    + Other_ID_Start
#    - Pattern_Syntax
#    - Pattern_White_Space

When interpreting definitions of derived properties of this sort, keep in mind that set subtraction is not a commutative operation. Thus "Lo + Lm - Pattern_Syntax" defines a different set than "Lo - Pattern_Syntax + Lm". The order of property set operations stated in the definitions affects the composition of the derived set.
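The effect of operation order can be seen in a minimal Python sketch. The sets below are tiny, made-up stand-ins for the real property sets (the member strings are placeholders, not actual characters), and `apply_rule` is a hypothetical helper that applies the additions and subtractions strictly from left to right.

```python
def apply_rule(steps):
    """Apply (sign, members) steps from left to right, starting from the empty set."""
    result = set()
    for sign, members in steps:
        if sign == "+":
            result |= members
        elif sign == "-":
            result -= members
        else:
            raise ValueError(f"unknown operator: {sign!r}")
    return result

# Toy stand-ins for property sets; these are NOT real property data.
Lo = {"x1", "x2"}
Lm = {"x3", "x4"}
Pattern_Syntax = {"x2", "x3"}

# "Lo + Lm - Pattern_Syntax"
first = apply_rule([("+", Lo), ("+", Lm), ("-", Pattern_Syntax)])
# "Lo - Pattern_Syntax + Lm"
second = apply_rule([("+", Lo), ("-", Pattern_Syntax), ("+", Lm)])

print(sorted(first))   # ['x1', 'x4']
print(sorted(second))  # ['x1', 'x3', 'x4']
```

In the second ordering, the placeholder "x3" survives because Lm is added after the subtraction has already been applied.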

If there is any mismatch between the definition of a derived property as listed in DerivedCoreProperties.txt or similar data files in the UCD, and the definition of a derived property as a set definition rule, the explicit listing in the data file should always be taken as the normative definition of the property. As described in Stability of Releases, the property listing in the data files for any given version of the standard will never change for that version.

2.1.3 Properties Dependent on External Specifications

In limited cases, a Unicode character property defined in the Unicode Character Database may have an external dependency on another specification which is not a part of the Unicode Standard, and whose data is not formally part of the UCD. In such cases, version stability for the UCD is attained by requiring that dependency to be based on a known, published version of the external specification.

Starting with Version 10.0 of the UCD and continuing through Version 12.1, the clear example of such an external dependency was the derivation of some segmentation-related character properties, in part based on emoji properties associated with UTS #51, "Unicode Emoji" [UTS51]. The details of the derivation were described in the respective annexes, [UAX14] and [UAX29], as well as in the documentation portions of the associated UCD property files. See [Data14] and [Props]. The version of UTS #51 used for those segmentation properties in each of the relevant versions of the UCD was clearly identified in those annexes and data files. Starting with Version 13.0 of the UCD, however, the emoji properties which the UCD previously depended on have been formally incorporated into the UCD, so that they no longer constitute an external dependency.

An external dependency may impact either a simple or a derived property.

2.2 Use of Default Values

Unicode character properties have default values. Default values are the value or values that a character property takes for an unassigned code point, or in some instances, for designated subranges of code points, whether assigned or unassigned. For example, the default value of a binary Unicode character property is always "N".

For the formal discussion of default values, see D26 in Section 3.5, Properties in [Unicode]. For conventions related to default values in various data files of the UCD and for documentation regarding the particular default values of individual Unicode character properties, see Default Values.
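As an illustration only, a lookup for a binary property can fall back to the default value whenever a code point has no explicit entry. The names `EXPLICIT_YES` and `binary_property` and the table contents below are hypothetical; the sketch is not based on any particular data file.

```python
# Hypothetical explicit listing for some binary property:
# the code points whose value is "Y". Everything else takes the default.
EXPLICIT_YES = {0x0041, 0x0042}

def binary_property(code_point, explicit_yes=EXPLICIT_YES):
    """Return "Y" for listed code points and the default "N" for all others."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return "Y" if code_point in explicit_yes else "N"

print(binary_property(0x0041))  # listed explicitly
print(binary_property(0x0378))  # no explicit entry, so the default applies
```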

2.3 Stability of Releases

Just as for the Unicode Standard as a whole, each version of the UCD, once published, is absolutely stable and will never change. Each released version is archived in a directory on the Unicode website, with a directory number associated with that version. URLs pointing to that version's directory are also stable and will be maintained in perpetuity.

Any errors discovered for a released version of the UCD are noted in [Errata], and if appropriate will be corrected in a subsequent version of the UCD.

Stability guarantees constraining how Unicode character properties can (or cannot) change between releases of the UCD are documented in the Unicode Consortium Stability Policies [Stability].

2.3.1 Changes to Properties Between Releases

Updates to character properties in the Unicode Character Database may be required for any of three reasons:

To cover new characters added to the standard
To add new character properties to the standard
To change the assigned values for a property for some characters already in the standard

While the Unicode Consortium endeavors to keep the values of all character properties as stable as possible between versions, occasionally circumstances may arise which require changing them. In particular, as less well-documented scripts, such as those for minority languages, or historic scripts are added to the standard, the exact character properties and behavior may not fully be known when the script is first encoded. The properties for some of these characters may change as further information becomes available or as implementations turn up problems in the initial property assignments. As far as possible, any readjustment of property values based on growing implementation experience is made to be compatible with established practice.

All changes to normative or informative property values, to the status or type of a property, or to property or property value aliases, must be approved by an explicit decision taken by the Unicode Technical Committee. Changes to provisional property values are subject to less stringent oversight.

Occasionally, a character property value is changed to prevent incorrect generalizations about a character's use based on its nominal property values. For example, U+200B ZERO WIDTH SPACE was originally classified as a space character (General_Category=Zs), but it was reclassified as a Format character (General_Category=Cf) to clearly distinguish it from space characters in its function as a format control for line breaking.

There is no guarantee that a particular value for an enumerated property will actually have characters associated with it. Also, because of changes in property value assignments between versions of the standard, a property value that once had characters associated with it may later have none. Such conditions and changes are rare, but implementations must not assume that all property values are associated with non-null sets of characters. For example, currently the special Script property value Katakana_Or_Hiragana has no characters associated with it.
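One way to avoid that assumption, sketched here in Python, is to build the value-to-characters map from the list of known property values rather than from the values that happen to occur in the data. The helper name `group_by_value` is hypothetical, and the two sample assignments are only a small excerpt, not full Script data.

```python
def group_by_value(assignments, known_values):
    """Map every known property value to its set of code points, possibly empty."""
    groups = {value: set() for value in known_values}
    for code_point, value in assignments:
        groups.setdefault(value, set()).add(code_point)
    return groups

# Tiny sample: HIRAGANA LETTER A and KATAKANA LETTER A.
sample = [(0x3042, "Hiragana"), (0x30A2, "Katakana")]
values = ["Hiragana", "Katakana", "Katakana_Or_Hiragana"]

groups = group_by_value(sample, values)
for value in values:
    print(value, len(groups[value]))
```

Code that later iterates over `groups` then handles an empty set like any other, instead of failing on a missing key.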

2.3.2 Obsolete Properties

An obsolete property is one whose original use case no longer exists. The original use case may have been overtaken by other developments, or the property may have been supplanted by a different property, and so forth. For example, the ISO_Comment property was once used to keep track of annotations for characters used in the production of name lists for ISO/IEC 10646 code charts. As of Unicode 5.2.0 that functionality was dropped, and so the property became obsolete, and its value is now defaulted to the null string for all Unicode code points.

An obsolete property is never removed from the UCD.

Obsolete properties are not recommended for use in APIs.

2.3.3 Deprecated Properties

Formally declaring a property to be deprecated is an indication that the property is no longer recommended for use, perhaps because its original intent has been replaced by another property or because its specification was somehow defective. The general practice of the UTC is to deprecate properties that have become obsolete, although there may be exceptions. See also the discussion of Deprecation.

A deprecated property is never removed from the UCD.

Deprecated properties are not recommended for use in APIs.

Table 1 lists the properties that are formally deprecated as of this version of the Unicode Standard.

Table 1. Deprecated Properties

Property Name	Deprecation Version	Reason
Grapheme_Link	5.0.0	Duplication of ccc=9
Hyphen	6.0.0	Supplanted by Line_Break property values
ISO_Comment	6.0.0	No longer needed for chart generation; otherwise not useful
Expands_On_NFC	6.0.0	Less useful than UTF-specific calculations
Expands_On_NFD	6.0.0	Less useful than UTF-specific calculations
Expands_On_NFKC	6.0.0	Less useful than UTF-specific calculations
Expands_On_NFKD	6.0.0	Less useful than UTF-specific calculations
FC_NFKC_Closure	6.0.0	Supplanted in usage by NFKC_Casefold; otherwise not useful

2.3.4 Stabilized Properties

A stabilized property is one for which the Unicode Technical Committee has declared that it will no longer actively maintain the property or extend it for newly encoded characters. The property values of a stabilized property are frozen as of a particular release of the standard.

The stabilization of a property does not indicate that the property should or should not be used. For example, if the property references a subset of characters that is unaffected by future additions to the repertoire, it may be stabilized without becoming useless. An example of a property which could be stabilized without becoming useless is ASCII_Hex_Digit, as no more such digits would ever be added to the standard.

A stabilized property is never removed from the UCD.

Table 2 lists the properties that are formally stabilized as of this version of the Unicode Standard.

Table 2. Stabilized Properties

Property Name	Stabilization Version
Hyphen	4.0.0
ISO_Comment	6.0.0

2.3.5 Provisional Properties

A provisional property has no stability guarantees. It may be changed arbitrarily or may be removed altogether. Table 9, Property Table does not list any provisional properties; however, [UAX38] documents a large number of provisional properties specified in the Unihan Database. Provisional properties are used to collect various information about Han characters, for review and testing. On occasion, a provisional property's status may change to informative or normative, in which case it then becomes subject to the same stability guarantees as other properties.

A provisional property may be removed in any subsequent version of the UCD.

Provisional properties are not recommended for use in APIs.

3 Documentation

This annex provides the core documentation for the UCD, but additional information about character properties is available in other parts of the standard and in additional documentation files contained within the UCD.

3.1 Character Properties in the Standard

The formal definitions related to character properties used by the Unicode Standard are documented in Section 3.5, Properties in [Unicode]. Understanding those definitions and related terminology is essential to the appropriate use of Unicode character properties.

See Section 4.1, Unicode Character Database, in [Unicode] for a general discussion of the UCD and its use in defining properties. The rest of Chapter 4 provides important explanations regarding the meaning and use of various normative character properties.

3.2 The Character Property Model

For a general discussion of the property model which underlies the definitions associated with the UCD, see Unicode Technical Report #23, "The Unicode Character Property Model" [UTR23]. That technical report is informative, but over the years various content from it has been incorporated into normative portions of the Unicode Standard, particularly for the definitions in Chapter 3.

UTR #23 presents the important distinction between properties defined for strings (in contrast to properties defined for characters or code points) and character properties that have values that are strings. The latter are referred to as string-valued properties in UTR #23 and in this annex. UTR #23 also discusses string functions and their relation to character properties.

3.3 NamesList.html

NamesList.html formally describes the format of the NamesList.txt data file in BNF. That data file is used to drive the PDF formatting of the Unicode code charts and names list. See also Section 24.1, Character Names List, in [Unicode] for a detailed discussion of the conventions used in the Unicode names list as formatted for the online code charts.

3.4 StandardizedVariants.html

StandardizedVariants.html has been obsoleted as of Version 9.0 of the UCD. This file formerly documented standardized variants, showing a representative glyph for each. It was closely tied to the data file, StandardizedVariants.txt, which defines those sequences normatively.

The function of StandardizedVariants.html to show representative glyphs for standardized variants has been superseded. There are now better means of illustrating the glyphs. Many standardized variation sequences are shown in the Unicode code charts directly, in summary sections at the ends of the names list for any block which contains them. Glyphs for standardized variants of CJK compatibility ideographs are also shown directly in the Unicode code charts.

3.5 Emoji Variation Sequences

Emoji variation sequences are a special class of variation sequences involving emoji characters. They are divided into two subtypes: an emoji presentation sequence, consisting of an emoji character base followed by the variation selector U+FE0F, and a text presentation sequence, consisting of an emoji character base followed by the variation selector U+FE0E. Such sequences come in pairs: the text presentation sequence shown with a black and white presentation, as seen in the Unicode code charts, and the emoji presentation sequence shown with a colorful icon, as usually seen in implementations on mobile devices and elsewhere.

Starting with Version 9.0.0, the following page in the Unicode emoji subsite area shows appropriate representative glyphs for all emoji variation sequences, with separate columns for text presentation sequences and for emoji presentation sequences:

https://www.unicode.org/emoji/charts/emoji-variants.html

The data file which defines the exact list of emoji variation sequences is emoji-variation-sequences.txt. That file is maintained in the UCD, but emoji variation sequences are documented in Unicode Technical Standard #51, Unicode Emoji [UTS51].
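The structure of the two subtypes can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch does not check whether a given base character is actually listed in emoji-variation-sequences.txt; that file remains the authority on which sequences are defined.

```python
VS15 = "\uFE0E"  # VARIATION SELECTOR-15: requests text presentation
VS16 = "\uFE0F"  # VARIATION SELECTOR-16: requests emoji presentation

def text_presentation_sequence(base):
    """Return the base character followed by U+FE0E."""
    return base + VS15

def emoji_presentation_sequence(base):
    """Return the base character followed by U+FE0F."""
    return base + VS16

def code_points(text):
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

base = "\u2764"  # HEAVY BLACK HEART
print(code_points(text_presentation_sequence(base)))   # 2764 FE0E
print(code_points(emoji_presentation_sequence(base)))  # 2764 FE0F
```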

3.6 Unihan and UAX #38

Unicode Standard Annex #38, "Unicode Han Database (Unihan)" [UAX38] describes the format and content of the Unihan Database [Unihan], which collects together all property information for CJK unified ideographs. That annex also specifies in detail which of the Unihan character properties are normative, informative, or provisional.

The Unihan Database contains extensive and detailed mapping information for CJK unified ideographs encoded in the Unicode Standard, but it is aimed only at those ideographs, not at other characters used in the East Asian context in general. In contrast, East Asian legacy character sets, including important commercial and national character set standards, contain many non-CJK characters. As a result, the Unihan Database must be supplemented from other sources to establish mapping tables for those character sets.

The majority of the content of the Unihan Database is released for each version of the Unicode Standard as a collection of Unihan data files in the UCD. Because of their large size, these data files are released only as a zipped file, Unihan.zip. The details of the particular data files in Unihan.zip and the CJK properties each one contains are provided in [UAX38]. For versions of the UCD prior to Version 5.2.0, all of the CJK properties were listed together in a very large, single file, Unihan.txt.

3.7 UTC-Source Ideographs and UAX #45

Unicode Standard Annex #45, "U-Source Ideographs" [UAX45] describes the format of USourceData.txt, which lists all of the information for UTC-Source ideographs.

3.8 Data File Comments

In addition to the specific documentation files for the UCD, individual data files often contain extensive header comments describing their content and any special conventions used in the data.

In some instances, individual property definition sections also contain comments with information about how the property may be derived. Such comments are informative; while they are intended to convey the intent of the derivation, in case of any mismatch between a statement of a derivation in a comment field and the actual listing of the derived property, the list is considered to be definitive. See Simple and Derived Properties.

3.9 Obsolete Documentation Files

UCD.html was formerly the primary documentation file for the UCD. As of Version 5.2.0, its content has been wholly incorporated into this document.

Unihan.html was formerly the primary documentation file for the Unihan Database. As of Version 5.1.0, its content has been wholly incorporated into [UAX38].

Versions of the Unicode Standard prior to Version 4.0.0 contained small, focused documentation files, UnicodeCharacterDatabase.html, PropList.html, and DerivedProperties.html, which were later consolidated into UCD.html.

StandardizedVariants.html has been obsoleted as of Version 9.0.0. See Section 3.4, StandardizedVariants.html.

4 UCD Files

The heart of the UCD consists of the data files themselves. This section describes the directory structure for the UCD and the format conventions for the data files, and provides documentation for data files not documented elsewhere in this annex.

4.1 Directory Structure

Each version of the UCD is released in a separate, numbered directory under the Public directory on the Unicode website. The content of that directory is complete for that release. It is also stable—once released, it will be archived permanently in that directory, unchanged, at a stable URL.

The specific files for the UCD associated with this version of the Unicode Standard (17.0.0) are located at:

https://www.unicode.org/Public/17.0.0/

The UCD data files proper are located under the ucd/ subdirectory. Other data files and charts associated with a release of the Unicode Standard are located in other subdirectories. For details regarding the data files for other UTSes synchronized with each release of the Unicode Standard, see [UTS10], [UTS39], [UTS46], and [UTS51].

The latest released version of the UCD is always accessible via the following stable URL:

https://www.unicode.org/Public/UCD/latest/

A draft version of the UCD under development for a subsequent release is always accessible via the following stable URL:

https://www.unicode.org/Public/draft/

Prior to Version 6.3.0, access to the latest released version of the UCD was via the following stable URL:

https://www.unicode.org/Public/UNIDATA/

That "UNIDATA" URL will be maintained, but is no longer recommended, because it points to the ucd subdirectory of the latest release, rather than to the parent directory for the release. The "UNIDATA" naming convention is also very old, and does not follow the directory naming conventions currently used for other data releases in the Public directory on the Unicode website.

4.1.1 UCD Files Proper

The UCD proper is located in the ucd subdirectory of the numbered version directory. That directory contains all of the documentation files and most of the data files for the UCD, including some data files for derived properties.

Although all UCD data files are version-specific for a release and most contain internal date and version stamps, the file names of the released data files do not differ from version to version. When linking to a version-specific data file, the version will be indicated by the version number of the directory for the release.

All files for derived extracted properties are in the extracted subdirectory of the ucd subdirectory. See Derived Extracted Properties for documentation regarding those data files and their content.

A number of auxiliary properties are specified in files in the auxiliary subdirectory of the ucd subdirectory. It contains data files specifying properties associated with Unicode Standard Annex #29, "Unicode Text Segmentation" [UAX29] and with Unicode Standard Annex #14, "Unicode Line Breaking Algorithm" [UAX14], as well as test data for those algorithms. See Segmentation Test Files and Documentation for more information about the test data.

Certain data files associated with emoji properties are maintained in the emoji subdirectory of the ucd subdirectory. Those data files define the simple character properties associated with emoji characters, as well as the emoji variation sequences. Other data files associated with emoji, including those which define the RGI ("recommended for general interchange") sets of various types of emoji sequences, as well as emoji test data, are maintained elsewhere, and are not considered formally a part of the UCD. See [UTS51] for documentation regarding those data files and their content.
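Because file names stay the same from release to release, a version-specific link can be assembled from the version directory alone. The following Python sketch assumes the directory layout described above, which applies to Version 4.1.0 and later; the function name `ucd_file_url` is hypothetical, and the sketch does not verify that the requested file exists in the given release.

```python
BASE = "https://www.unicode.org/Public"
UCD_SUBDIRECTORIES = ("extracted", "auxiliary", "emoji")

def ucd_file_url(version, filename, subdirectory=None):
    """Build the URL of a data file under the ucd directory of a numbered release."""
    parts = [BASE, version, "ucd"]
    if subdirectory is not None:
        if subdirectory not in UCD_SUBDIRECTORIES:
            raise ValueError(f"unexpected ucd subdirectory: {subdirectory!r}")
        parts.append(subdirectory)
    parts.append(filename)
    return "/".join(parts)

print(ucd_file_url("17.0.0", "UnicodeData.txt"))
print(ucd_file_url("17.0.0", "DerivedGeneralCategory.txt", "extracted"))
```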

4.1.2 UCD XML Files

The XML version of the UCD is located in the ucdxml subdirectory of the numbered version directory. See the UCD in XML for more details.

4.1.3 Charts

The code charts specific to a version of Unicode are archived as a single large PDF file in the charts subdirectory of the numbered version directory. See the readme.txt in that subdirectory and the general web page explaining the Unicode Code Charts for more details.

4.1.4 Beta Review Considerations

Prior to the formal release of a version of the UCD, draft files are made available for review in a subdirectory named draft, under the /Public directory on the Unicode server. The files in this directory may include temporary files, including documentation of differences between draft versions. The number of reviews is not fixed—a beta review will always take place, but an alpha review is optional.

Notices contained in a ReadMe.txt file in the draft directory during the beta review period also make it clear that that directory contains preliminary material under review, rather than a final, stable release.

4.1.5 File Directory Differences for Early Releases

The UCD in XML was introduced in Version 5.1.0, so UCD directories prior to that do not contain the ucdxml subdirectory.

UCD directories prior to Version 13.0.0 do not contain the emoji subdirectory.

UCD directories prior to Version 4.1.0 do not contain the auxiliary subdirectory.

UCD directories prior to Version 3.2.0 do not contain the extracted subdirectory.

The general structure of the file directory for a released version of the UCD described above applies to Versions 4.1.0 and later. Prior to Version 4.1.0, versions of the UCD were not self-contained, complete sets of data files for that version, but instead only contained any new data files or any data files which had changed since the prior release.

Because of this, the property files for a given version prior to Version 4.1.0 can be spread over several directories. Consult the component listings at Enumerated Versions to find out which files in which directories comprise a complete set of data files for that version.

The directory naming conventions and the file naming conventions also differed prior to Version 4.1.0. So, for example, Version 4.0.0 of the UCD is contained in a directory named 4.0-Update, and Version 4.0.1 of the UCD in a directory named 4.0-Update1. Furthermore, for these earlier versions, the data file names do contain explicit version numbers.

4.2 File Format Conventions

Files in the UCD use the format conventions described in this section, unless otherwise specified.

4.2.1 Data Fields

Each line of data consists of fields separated by semicolons. The fields are numbered starting with zero.
The first field (0) of each line in the Unicode Character Database files represents a code point or range. The remaining fields (1..n) are properties associated with that code point.
Leading and trailing spaces within a field are not significant. However, no leading or trailing spaces are allowed in any field of UnicodeData.txt.
The Unihan data files [Unihan] in the UCD have a separate format, using tab characters instead of semicolons to separate fields. See [UAX38] for the detailed specification of the format of the Unihan data files. The data files TangutSources.txt and NushuSources.txt also use this format.
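A minimal Python sketch of these field conventions follows. The helper name `split_fields` is hypothetical, and the sketch assumes that any comment has already been removed from the line. The tab-separated sample line is given for illustration of the Unihan-style layout only.

```python
def split_fields(line, separator=";"):
    """Split one data line into fields, dropping insignificant surrounding spaces."""
    return [field.strip() for field in line.split(separator)]

# Semicolon-separated line; field 0 is the code point or range.
fields = split_fields("0000..007F; Basic Latin")
print(fields)  # ['0000..007F', 'Basic Latin']

# Tab-separated files can reuse the same helper with a different separator.
print(split_fields("U+4E00\tkTotalStrokes\t1", separator="\t"))
```

Stripping is harmless for UnicodeData.txt, whose fields contain no leading or trailing spaces in the first place.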

4.2.2 Code Points and Sequences

Code points are expressed as hexadecimal numbers with four to six digits. (See Appendix A, Notational Conventions in [Unicode] for a full, formal definition of this convention.) They are written without the "U+" prefix in all data files except the Unihan data files. The Unihan data files use the "U+" prefix for all Unicode code points, to distinguish them from other decimal and hexadecimal numerical references occurring in their data fields.
When a data field contains a sequence of code points, spaces separate the code points.
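A field holding one or more code points can be converted to integers along the following lines. This is only a sketch: `parse_code_points` and `to_string` are hypothetical helper names, and the optional "U+" handling is included solely to accommodate the Unihan convention mentioned above.

```python
def parse_code_points(field: str) -> list[int]:
    """Convert a space-separated list of hexadecimal code points to integers."""
    result = []
    for token in field.split():
        # The "U+" prefix occurs only in the Unihan data files.
        if token.startswith("U+"):
            token = token[2:]
        result.append(int(token, 16))
    return result


def to_string(code_points: list[int]) -> str:
    """Build the string that a code point sequence represents."""
    return "".join(chr(cp) for cp in code_points)


sequence = parse_code_points("0020 0304")
print([f"{cp:04X}" for cp in sequence])
```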

4.2.3 Code Point Ranges

A range of code points is specified by the form "X..Y".
Each code point in a range has the associated property value specified on that line of the data file. For example (from Blocks.txt):
```
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
```
For backward compatibility, ranges in the file UnicodeData.txt are specified by entries for the start and end characters of the range, rather than by the form "X..Y". The start character is indicated by a range identifier, followed by a comma and the string "First", in angle brackets. This entry takes the place of a regular character name in field 1 for that line. The end character is indicated on the next line with the same range identifier, followed by a comma and the string "Last", in angle brackets:
```
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FEF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
```
For character ranges using this convention, the names of all characters in the range are algorithmically derivable. See Section 4.8, Name in [Unicode] for more information on derivation of character names for such ranges.
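Both range conventions can be handled with a small amount of code. The sketch below is illustrative only: `parse_range` and `unicode_data_ranges` are hypothetical names, and the second function assumes that every "First" entry is immediately followed by its matching "Last" entry, as described above.

```python
def parse_range(field: str) -> range:
    """Parse 'X' or 'X..Y' (hexadecimal) into an inclusive code point range."""
    first, _, last = field.strip().partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start
    return range(start, end + 1)


def unicode_data_ranges(lines):
    """Yield (range, fields) pairs from UnicodeData.txt-style lines."""
    pending_start = None
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 2:
            continue  # skip blank lines
        cp, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            pending_start = cp  # remember the start; wait for the Last entry
        elif name.endswith(", Last>"):
            if pending_start is None:
                raise ValueError(f"Last entry without First: {line!r}")
            yield range(pending_start, cp + 1), fields
            pending_start = None
        else:
            yield range(cp, cp + 1), fields


print(len(parse_range("0000..007F")))
```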

4.2.4 Comments

U+0023 NUMBER SIGN ("#") is used to indicate comments: all characters from the number sign to the end of the line are considered part of the comment, and are disregarded when parsing data.
In many files, the comments on data lines use a common format, as illustrated here (from Scripts.txt):
```
09B2          ; Bengali # Lo       BENGALI LETTER LA
```
The first part of a comment using this common format is the General_Category value, provided for information. This is followed by the character name for the code point in the first field (0).
The printing of the General_Category value is suppressed in instances where it would be redundant, as for DerivedGeneralCategory.txt, in which the property value in the data field is already the General_Category value.
The symbol "L&" indicates characters of General_Category Lu, Ll, or Lt (uppercase, lowercase, or titlecase letter). For example:
```
0386          ; Greek # L&       GREEK CAPITAL LETTER ALPHA WITH TONOS
```
L& as used in these comments is an alias for the derived LC value (cased letter) for the General_Category property, as documented in PropertyValueAliases.txt.
When the data line contains a range of code points, this common format for a comment also indicates a range of character names, separated by "..", as illustrated here (from DerivedNumericType.txt):
```
00BC..00BE    ; Numeric # No   [3] VULGAR FRACTION ONE QUARTER..VULGAR FRACTION THREE QUARTERS
```
Normally, consecutive characters with the same property value would be represented by a single code point range. In data files using this comment convention, such ranges are subdivided so that all characters in a range also have the same General_Category value (or LC). While this convention results in more ranges than are strictly necessary, it makes the contents of the ranges clearer.
When a code point range occurs, the number of items in the range is included in the comment (in square brackets), immediately following the General_Category value.
The comments are purely informational, and may change format or be omitted in the future. They should not be parsed for content. However, see Section 4.2.10 @missing Conventions.

4.2.5 Code Point Labels

Surrogate code points, private-use characters, control codes, noncharacters, and unassigned code points have no names. When such code points are listed in the data files, for example to list their General_Category values, the comments use code point labels instead of character names. For example (from DerivedCoreProperties.txt):
```
2065          ; Default_Ignorable_Code_Point # Cn       <reserved-2065>
```
Although code point labels are not formally character names and are not considered values of the Name property for characters, they are designed to be maintained as unique values within the namespace for Unicode character names. Hence, implementations can safely use them as identifiers for code points without overlap with actual character names.
Code point labels use one of the tags as documented in Section 4.8, Name in [Unicode] and as shown in Table 3, followed by "-" and the code point expressed in hexadecimal. The entire label is then enclosed in angle brackets when listed in data files of the UCD.

Table 3. Code Point Label Tags

Tag	General_Category	Note
reserved	Cn	Noncharacter_Code_Point=F
noncharacter	Cn	Noncharacter_Code_Point=T
control	Cc
private-use	Co
surrogate	Cs
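The label construction in Table 3 can be sketched as follows. The function `code_point_label` is a hypothetical helper; it takes the General_Category and Noncharacter_Code_Point values as inputs rather than looking them up, and assumes the usual four-to-six-digit uppercase hexadecimal notation.

```python
# Tags for the General_Category values that map to exactly one tag (Table 3).
LABEL_TAGS = {"Cc": "control", "Co": "private-use", "Cs": "surrogate"}


def code_point_label(cp: int, general_category: str, noncharacter: bool = False) -> str:
    """Return the code point label for a code point that has no name."""
    if general_category == "Cn":
        # Cn is split by the Noncharacter_Code_Point property.
        tag = "noncharacter" if noncharacter else "reserved"
    else:
        tag = LABEL_TAGS[general_category]
    return f"<{tag}-{cp:04X}>"


print(code_point_label(0x2065, "Cn"))
```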

4.2.6 Multiple Properties in One Data File

When a file contains the specification for multiple properties, the second field specifies the name of the property and the third field specifies the property value. For example (from DerivedNormalizationProps.txt):
```
03D2  ; FC_NFKC; 03C5           # L&  GREEK UPSILON WITH HOOK SYMBOL
03D3  ; FC_NFKC; 03CD           # L&  GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
```

4.2.7 Binary Property Values

For binary properties, the second field specifies the name of the applicable property, with the implied value of the property being "True". Only the ranges of characters with the binary property value of "Y" (= True) are listed. For example (from PropList.txt):
```
1680       ; White_Space # Zs      OGHAM SPACE MARK
2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
```
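Because only the code points with the value True are listed, a binary property can be loaded as a set of code points, with membership standing for "Y" and absence for the default "N". The following sketch assumes PropList.txt-style lines with exactly two fields; `load_binary_properties` is a hypothetical name.

```python
from collections import defaultdict


def load_binary_properties(lines):
    """Map each binary property name to the set of code points where it is True."""
    props = defaultdict(set)
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the informative comment
        if not data:
            continue
        cp_field, prop = (field.strip() for field in data.split(";"))
        first, _, last = cp_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        props[prop].update(range(start, end + 1))
    return props


sample = [
    "1680       ; White_Space # Zs      OGHAM SPACE MARK",
    "2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE",
]
white_space = load_binary_properties(sample)["White_Space"]
print(0x2003 in white_space, 0x0041 in white_space)
```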

4.2.8 Multiple Values for Properties

When a data file defines a property which may take multiple values for a single code point, the multiple values are expressed in a space-delimited list. For example (from ScriptExtensions.txt):
```
0640          ; Adlm Arab Mand Mani Phlp Rohg Sogd Syrc # Lm       ARABIC TATWEEL
```
In some cases—but not all—the order of multiple elements in a space-delimited list may be significant. When the order of multiple elements is significant, it is documented along with the property itself. For example (from Unihan_Readings.txt), for the tag kMandarin, when there are two values for a code point, the first value is used to indicate a preferred pronunciation for zh-Hans (CN) and the second a preferred pronunciation for zh-Hant (TW).
For further discussion, see Section 5.7.6 Properties Whose Values Are Sets of Values.
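Since the order of elements may be significant for some properties, a parser can keep multi-valued fields as ordered lists and leave any conversion to a set to the caller. A minimal sketch, with the hypothetical name `parse_multi_value` and assuming a two-field line:

```python
def parse_multi_value(line: str) -> tuple[str, list[str]]:
    """Return (code point field, ordered list of values) for a two-field line."""
    data = line.split("#", 1)[0]          # drop the informative comment
    cp_field, values = data.split(";")    # assumes exactly two fields
    # str.split() with no argument splits on runs of spaces and keeps order.
    return cp_field.strip(), values.split()


cp_field, scripts = parse_multi_value(
    "0640          ; Adlm Arab Mand Mani Phlp Rohg Sogd Syrc # Lm       ARABIC TATWEEL"
)
print(cp_field, scripts)
```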

4.2.9 Default Values

Entries for a code point may be omitted in a data file if the code point has a default value for the property in question.
For most string-valued properties, including the definition of foldings and mappings, the default value is the code point of the character itself.
For some string-valued properties which define a property that applies primarily to a small, defined set of code points, the default value is <none>, which is interpreted to mean that no value is defined. (This contrasts with specification of an actual value consisting of an empty string. See Section 4.2.11 Empty Fields.) Current examples include Bidi_Paired_Bracket, as well as some Unihan-related properties.
For miscellaneous properties which take strings as values, such as the Unicode Name property, the default value is an empty string.
For binary properties except for Extended_Pictographic, the default value is always "N" (= False) and is always omitted.
For enumerated and catalog properties, the default value is listed in a comment. For example (from Scripts.txt):
```
#  All code points not explicitly listed for Script
#  have the value Unknown (Zzzz).
```
A few properties of the enumerated type have multiple default values. In those cases, comments in the file explain the code point ranges for applicable values. See also Table 4.
Default values are also listed in specially-formatted comment lines, using the keyword "@missing". Parsers which extract and process these lines can algorithmically determine the default values for all code points. See Section 4.2.10 @missing Conventions for details about the syntax and use of these lines.
Because of the legacy format constraints for UnicodeData.txt, that file contains no specific information about default values for properties. The default values for fields in UnicodeData.txt are documented in Table 4 below if they cannot be derived from the general rules about default values for properties.
The file ArabicShaping.txt is also exceptional, because it omits the listing of many characters whose property value (jt=T) can be derived by rule. Adding an "@missing" line to that file would result in the wrong interpretation of Joining_Type values for omitted characters. The full explicit listing of Joining_Type values and the correct "@missing" line for the default Joining_Type value (jt=U) can be found in the file DerivedJoiningType.txt instead. The values of Joining_Type listed in DerivedJoiningType.txt should be taken as definitive, because of the difficulty of deriving the correct values for all characters based only on the entries in ArabicShaping.txt.

Default values for common catalog, enumeration, and numeric properties are listed in Table 4, along with the exceptional binary property, Extended_Pictographic. Further explanation is provided below the table, in those cases where the default values are complex, as indicated in the third column.

Table 4. Default Values for Properties

Property Name	Default Value(s)	Complex?
Age	Unassigned (= NA)	No
Bidi_Class	L, AL, R, BN, ET	Yes
Block	No_Block	No
Canonical_Combining_Class	Not_Reordered (= 0)	No
Decomposition_Type	None	No
East_Asian_Width	Neutral (= N), Wide (= W)	Yes
Extended_Pictographic	N (= False), Y (= True)	Yes
General_Category	Cn	No
Line_Break	Unknown (= XX), ID, PR	Yes
Numeric_Type	None	No
Numeric_Value	NaN	No
Script	Unknown (= Zzzz)	No
Vertical_Orientation	Rotated (= R), Upright (= U)	Yes

4.2.9.1 Complex Default Values

Complex default values are those which take multiple values, contingent on code point ranges or other conditions. Complex default values other than those specified in the "@missing" line are explicitly listed in the relevant property file, except for instances noted in this section. This means that a parser extracting property values from the UCD should never encounter an ambiguous condition for which the default value of a property for a particular code point is unclear.

Bidi_Class:
See Unicode Standard Annex #9, "Unicode Bidirectional Algorithm" [UAX9] and DerivedBidiClass.txt for full details.
East_Asian_Width:
This property defaults to Neutral for most code points, but defaults to Wide for unassigned code points in blocks associated with CJK ideographs. See Unicode Standard Annex #11, "East Asian Width" [UAX11] and EastAsianWidth.txt for documentation of the default values and DerivedEastAsianWidth.txt for the full listing of values.
Line_Break:
This property defaults to Unknown for most code points, but defaults to ID for unassigned code points in blocks associated with CJK ideographs, and in blocks in the ranges U+1F000..U+1FAFF and U+1FC00..U+1FFFD. The property defaults to PR for unassigned code points in the Currency Symbols block. See Unicode Standard Annex #14, "Unicode Line Breaking Algorithm" [UAX14] and LineBreak.txt for documentation of the default values and DerivedLineBreak.txt for the full listing of values.
Extended_Pictographic:
This property defaults to N (= False) for most code points, but defaults to Y (= True) for unassigned code points in blocks in the ranges U+1F000..U+1FAFF and U+1FC00..U+1FFFD. Those ranges are correlated with the ranges associated with default values for the Line_Break property, and have the same rationale. They help future-proof the behavior of Unicode segmentation algorithms for code point ranges most likely to be used for future assignment of new emoji characters.
Vertical_Orientation:
This property defaults to Rotated (R) for most code points, but defaults to Upright (U) for unassigned code points in blocks associated with scripts that are themselves predominantly Upright, in blocks for some notational systems, and in blocks predominantly associated with pictographic symbols and emoji. See Unicode Standard Annex #50, "Unicode Vertical Text Layout" [UAX50] and VerticalOrientation.txt for full details.

4.2.10 @missing Conventions

Specially-formatted comment lines with the keyword "@missing" are used to define default property values for ranges of code points not explicitly listed in a data file. These lines follow regular conventions that make them machine-readable.

An @missing line starts with the comment character "#", followed by a space, then the "@missing" keyword, followed by a colon, another space, a code point range, and a semicolon. Then the line typically continues with a semicolon-delimited list of one or more default property values. For example:

# @missing: 0000..10FFFF; Unknown

In general, the code point range and semicolon-delimited list follow the same syntactic conventions as the data file in which the @missing line occurs, so that any parser which interprets that data file can easily be adapted to also parse and interpret an @missing line to pick up default property values for code points.

@missing lines are also supplied for many properties in the file PropertyValueAliases.txt. In this case, because there are many @missing lines in that single data file, each @missing line in that file uses the syntactic pattern code_point_range; property_name; default_prop_val.

An @missing line is never provided for a binary property, because the default value for binary properties is always "N" and need not be defined redundantly for each binary property.

Because of the addition of property names when @missing lines are included in PropertyValueAliases.txt, there are currently two syntactic patterns used for @missing lines, as summarized schematically below:

1. code_point_range; default_prop_val
2. code_point_range; property_name; default_prop_val

In this schematic representation, "default_prop_val" stands in for either an explicit property value or for a special tag such as <none> or <script>.

Pattern #1 is used in most primary and derived UCD files. For example:

# @missing: 0000..10FFFF; <none>

Pattern #2 is used in PropertyValueAliases.txt and in DerivedNormalizationProps.txt, both of which contain values associated with many properties. For example:

# @missing: 0000..10FFFF; NFD_QC; Yes

The special tag values which may occur in the default_prop_val field in an @missing line are interpreted as follows:

Tag	Interpretation
<none>	no value is defined
<code point>	the string representation of the code point value
<script>	the value equal to the Script property value for this code point

Starting with Version 15.0, some data files in the UCD may contain multiple @missing lines defined for the same property. When multiple @missing lines are defined this way, they are to be interpreted as follows: each successive @missing line overrides, within its own code point range, the values given by all previous @missing lines. This convention allows a generic default value to be specified first for the entire Unicode code point range, followed by other specific default values for more constrained, specific sub-ranges. This enables an easy-to-understand and easy-to-maintain way of handling complex default values, as for the Bidi_Class or Line_Break properties. (See Section 4.2.9.1 Complex Default Values.) The following simple example for East_Asian_Width, extracted from DerivedEastAsianWidth.txt, illustrates this mechanism:

# @missing: 0000..10FFFF; Neutral
# @missing: 3400..4DBF; Wide
# @missing: 4E00..9FFF; Wide
# @missing: F900..FAFF; Wide
# @missing: 20000..2FFFD; Wide
# @missing: 30000..3FFFD; Wide

Implementation of parsing for multiple @missing lines for a single property is straightforward. Each time an @missing line is encountered, simply assign the given default value to the specified range. With this strategy, each successive @missing line will automatically override any prior assigned values for a given sub-range.
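The override strategy can be sketched in Python as follows. The names `parse_missing` and `default_value` are hypothetical, and the sketch makes a simplifying assumption: each @missing line either carries a single default value (pattern 1) or a property name followed by a single default value (pattern 2). Instead of filling a table for each range, it scans the lines in order and lets the last matching line win, which gives the same result.

```python
import re

MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-Fa-f]{4,6})\.\.([0-9A-Fa-f]{4,6})\s*;\s*(.*)$"
)


def parse_missing(line: str):
    """Return (start, end, [fields]) for an @missing line, else None."""
    match = MISSING_RE.match(line.strip())
    if match is None:
        return None
    start, end = int(match.group(1), 16), int(match.group(2), 16)
    return start, end, [field.strip() for field in match.group(3).split(";")]


def default_value(cp: int, lines, prop=None):
    """Return the default for cp; later @missing lines override earlier ones."""
    result = None
    for line in lines:
        parsed = parse_missing(line)
        if parsed is None:
            continue
        start, end, fields = parsed
        if len(fields) == 2 and fields[0] != prop:
            continue  # pattern 2 line for a different property
        if start <= cp <= end:
            result = fields[-1]
    return result


east_asian_width = [
    "# @missing: 0000..10FFFF; Neutral",
    "# @missing: 3400..4DBF; Wide",
    "# @missing: 4E00..9FFF; Wide",
    "# @missing: F900..FAFF; Wide",
    "# @missing: 20000..2FFFD; Wide",
    "# @missing: 30000..3FFFD; Wide",
]
print(default_value(0x0041, east_asian_width), default_value(0x4E00, east_asian_width))
```

A production parser would also need to interpret the special tags such as <none>, <code point>, and <script> rather than returning them as literal strings.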

4.2.11 Empty Fields

The data file UnicodeData.txt defines many property values in each record. When a field in a data line for a code point is empty, that indicates that the property takes the default value for that code point. For example:

0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;

In that data line, the empty numeric fields indicate that the value of Numeric_Value for U+0022 is NaN and that the value of Numeric_Type is None. The empty case mapping fields indicate that the value of Simple_Uppercase_Mapping for U+0022 takes the default value, namely the code point itself, and so forth.

The interpretation of empty fields in other data files of the UCD differs. In the case of data files which define string-valued properties, the omission of an entry for a code point indicates that the property takes the default value for that code point. However, if there is an entry for a code point, but the property value field for that entry is empty, that indicates that the property value is an explicit empty string (""). For example, the derived property NFKC_Casefold may map a code point to a sequence of code points, to a single different code point, to the same single code point, or to no code point at all (an empty string). See the following entries from the data file DerivedNormalizationProps.txt:

00AA          ; NFKC_CF; 0061           # Lo       FEMININE ORDINAL INDICATOR
00AD          ; NFKC_CF;                # Cf       SOFT HYPHEN
00AF          ; NFKC_CF; 0020 0304      # Sk       MACRON

The empty field for U+00AD indicates that the property NFKC_Casefold maps SOFT HYPHEN to an empty string. By contrast, the absence of the entry for U+00AE in the data file indicates that the property NFKC_Casefold maps U+00AE REGISTERED SIGN to itself—the default value.
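The distinction between an omitted entry and an explicit empty field maps naturally onto a dictionary lookup with a fallback. The sketch below is illustrative: `load_nfkc_cf` and `nfkc_casefold` are hypothetical names, the lookup is for single code points only, and lines for other properties in the file are simply skipped.

```python
def load_nfkc_cf(lines) -> dict[int, str]:
    """Collect explicit NFKC_CF entries; an empty field becomes an empty string."""
    mapping = {}
    for line in lines:
        data = line.split("#", 1)[0]
        if not data.strip():
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) != 3 or fields[1] != "NFKC_CF":
            continue  # a line for some other property
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        value = "".join(chr(int(cp, 16)) for cp in fields[2].split())
        for cp in range(start, end + 1):
            mapping[cp] = value
    return mapping


def nfkc_casefold(cp: int, mapping: dict[int, str]) -> str:
    """Omitted entries take the default value: the code point itself."""
    return mapping.get(cp, chr(cp))


sample = [
    "00AA          ; NFKC_CF; 0061           # Lo       FEMININE ORDINAL INDICATOR",
    "00AD          ; NFKC_CF;                # Cf       SOFT HYPHEN",
    "00AF          ; NFKC_CF; 0020 0304      # Sk       MACRON",
]
table = load_nfkc_cf(sample)
print(repr(nfkc_casefold(0x00AD, table)), repr(nfkc_casefold(0x00AE, table)))
```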

4.2.12 Text Encoding

The data files use UTF-8. Unless otherwise noted, non-ASCII characters only appear in comments.
The Unihan data files [Unihan] in the UCD make extensive use of UTF-8 in data fields. (See [UAX38] for details.)
For legacy reasons, NamesList.txt was exceptional; it was encoded in Latin-1 prior to Unicode 6.2. For Unicode 6.2 and later, the encoding is UTF-8. See NamesList.html.
Segmentation test data files, such as WordBreakTest.txt, make use of non-ASCII (UTF-8) characters as delimiters for data fields.

4.2.13 Line Termination

All data files in the UCD use LF line termination (not CRLF line termination). When copied to different systems, these line endings may be automatically changed to use the native line termination conventions for that system. Make sure your editor (or parser) can deal with the line termination style in the local copy of the data files.
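One way to make a parser indifferent to the local line termination style is to decode the file as UTF-8 with newline translation enabled. The following sketch uses Python's standard `io.TextIOWrapper` with `newline=None`, which translates CRLF and CR to LF on input; `read_ucd_lines` is a hypothetical helper name.

```python
import io


def read_ucd_lines(raw: bytes) -> list[str]:
    """Decode UTF-8 data and return lines with any line terminator removed."""
    # newline=None enables universal newlines: "\r\n" and "\r" become "\n".
    with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline=None) as text:
        return [line.rstrip("\n") for line in text]


print(read_ucd_lines(b"0041;A\r\n0042;B\n"))
```

Opening a file on disk with `open(path, encoding="utf-8")` applies the same newline translation by default.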

4.2.14 Other Conventions

In some test data files, segments of the test data are distinguished by a line starting with an "@" sign. For example (from NormalizationTest.txt):
```
@Part1 # Character by character test
```

4.2.15 Other File Formats

The data format for Unihan data files and for TangutSources.txt and NushuSources.txt in the UCD differs from the standard format. See the discussion of Unihan and UAX #38 earlier in this annex for more information.
The format for NamesList.txt, which documents the Unicode names list and which is used programmatically to drive the formatting program for Unicode code charts, also differs significantly from regular UCD data files. See NamesList.html.
Index.txt is another exception. It uses a tab-delimited format, with field 0 consisting of an index entry string, and field 1 a code point. Index.txt is used to maintain the Unicode Character Name Index.
The various segmentation test data files make use of "#" to delimit comments, but have distinct conventions for their data fields. See the documentation in their header sections for details of the data field formats for those files.
The XML version of the UCD has its own file format conventions. In those files, "#" is used to stand for the code point in algorithmically derivable character names such as CJK UNIFIED IDEOGRAPH-4E00 or TANGUT IDEOGRAPH-17000, so as to allow for name sharing in more compact representations of the data. See Unicode Standard Annex #42, "Unicode Character Database in XML" [UAX42] for details.

4.3 File List

The exact list of files associated with any particular version of the UCD is available on the Unicode website by referring to the component listings at Enumerated Versions.

The majority of the data files in the UCD provide specifications of character properties for Unicode characters. Those files and their contents are documented in detail in the Property Definitions section below.

The data files in the extracted subdirectory constitute reformatted listings of single character properties extracted from UnicodeData.txt or other primary data files. The reformatting is provided to make it easier to see the particular set of characters having certain values for enumerated properties, or to separate the statement of that property from other properties defined together in UnicodeData.txt. These files also include explicit listings of default values for the respective properties. These extracted, derived data files are further documented in the Derived Extracted Properties section below.

The UCD also contains a number of test data files, whose purpose is to provide standard test cases useful in verifying the implementation of complex Unicode algorithms. See the Test Files section below for more documentation.

The remaining files in the Unicode Character Database do not directly specify Unicode character properties. The important files and their functions are listed in Table 5. The Status column indicates whether the file (and its content) is considered Normative, Informative, or Provisional.

Table 5. UCD Files That Do Not Specify Character Properties

File Name	Reference	Status	Description
CJKRadicals.txt	[UAX38]	I	List of Unified CJK Ideographs and CJK Radicals that correspond to specific radical numbers used in the CJK radical stroke counts.
USourceData.txt	[UAX45]	N	The list of formal references for UTC-Source ideographs, together with data regarding their status and sources.
USourceGlyphs.pdf	[UAX45]	I	A table containing a representative glyph for each UTC-Source ideograph.
USourceRSChart.pdf	[UAX45]	I	A radical-stroke index of all the UTC-Source ideographs.
TangutSources.txt	Chapter 18	N	Specifies normative source mappings for Tangut ideographs and components. This data file also includes informative radical-stroke values that are used in the preparation of the code charts for the Tangut blocks. kTGT_MergedSrc: normative source mapping to various Tangut source references; kTGT_RSUnicode: informative radical-stroke value.
NushuSources.txt	Chapter 18	N	Specifies normative source mappings for Nushu ideographs. This data file also includes informative readings for Nushu characters. kNSHU_DubenSrc: normative source mapping to the Nushu Duben; kNSHU_Reading: informative example phonetic reading.
EmojiSources.txt	Chapter 22	N	Specifies source mappings to SJIS values for emoji symbols in the original implementations of these symbols by Japanese telecommunications companies.
Index.txt	Chapter 24	I	Index to Unicode characters.
NamesList.txt	Chapter 24	I	Names list used for production of the code charts, derived from UnicodeData.txt. It contains additional annotations.
NamesList.html	Chapter 24	I	Documents the format of NamesList.txt.
StandardizedVariants.txt	Chapter 23	N	Lists all the standardized variant sequences that have been defined, plus a textual description of their desired appearance.
StandardizedVariants.html	Chapter 23	N	An obsolete derived documentation file.
NamedSequences.txt	[UAX34]	N	Lists the names for all approved named sequences. This is a string-valued property of strings.
NamedSequencesProv.txt	[UAX34]	P	Lists the names for all provisional named sequences. This is a (provisional) string-valued property of strings.
emoji-variation-sequences.txt	[UTS51]	N	Lists all emoji presentation sequences and text presentation sequences involving currently encoded emoji characters.
DoNotEmit.txt	--	I	This file lists characters and sequences that should not ordinarily be emitted, for example, by keyboards and input methods, along with mappings to preferred sequences. (This data is gathered from various sources, including the “Do Not Use” tables in numerous sections of the core specification.)

For more information about these files and their use, see the referenced annexes or chapters of the Unicode Standard, or, in the case of emoji sequences data, [UTS51].

4.4 Zipped Files

Two different zipped files are provided for each version:

Unihan.zip is the zipped version of the very large Unihan data files.
UCD.zip is the zipped version of all of the rest of the UCD data files, excluding the Unihan data files.

This bifurcation allows for better management of downloading version-specific information, because Unihan.zip contains all the pertinent CJK-related property information, while UCD.zip contains all of the rest of the UCD property information, for those who may not need the voluminous CJK data.

Most versions prior to Version 17.0 have copies of the zipped files also posted in versioned subdirectories under the Public/zipped/ directory on the Unicode website. This practice has since been discontinued.

The practice of including a copy of UCD.zip in the main versioned directories for the UCD started with Version 6.1.0.

In versions of the UCD prior to Version 4.1.0, zipped copies of the Unihan data files (which for those versions were released as a single large text file, Unihan.txt) are provided in the same directory as the UCD data files. These zipped files are only posted for versions of the UCD in which Unihan.txt was updated.

4.5 UCD in XML

Starting with Version 5.1.0, a set of XML data files is also released with each version of the UCD. Those data files make it possible to import and process the UCD property data using standard XML parsing tools, instead of the specialized parsing required for the various individual data files of the UCD.

4.5.1 UAX #42

Unicode Standard Annex #42, "Unicode Character Database in XML" [UAX42] defines an XML schema which is used to incorporate all of the Unicode character property information into the XML version of the UCD. See that annex for details of the schema and conventions regarding the grouping of property values for more compact representations.

4.5.2 XML File List

The XML version of the UCD is contained in the ucdxml subdirectory of the UCD. The files are all zipped. The list of files is shown in Table 6.

Table 6. XML File List

File Name	CJK	non-CJK
ucd.all.flat.zip	x	x
ucd.all.grouped.zip	x	x
ucd.nounihan.flat.zip		x
ucd.nounihan.grouped.zip		x
ucd.unihan.flat.zip	x
ucd.unihan.grouped.zip	x

The "flat" file versions simply list all attributes with no particular compression. The "grouped" file versions apply the grouping mechanism described in [UAX42] to cut down on the size of the data files.

5 Properties

This section documents the Unicode character properties, relating them in detail to the particular UCD data files in which they are specified. For enumerated properties in particular, this section also documents the actual values which those properties can have.

5.1 Property Index

Table 7 provides a summary list of the Unicode character properties, excluding most of those specific to the Unihan data files [Unihan]. For a comparable index of CJK character properties, see Unicode Standard Annex #38, "Unicode Han Database (Unihan)" [UAX38].

The properties are roughly organized into groups based on their usage. This grouping is primarily for documentation convenience and, except for contributory properties, has no normative implications. Contributory properties are shown in this index with a gray background, to better distinguish them visually from ordinary (simple or derived) properties. Deprecated and obsolete properties and other properties not recommended for support in public property APIs are also shown with a gray background. The link on each property leads to its description in Table 9, Property Table. Any property marked as deprecated in this index is also automatically considered obsolete.

Table 7. Property Index by Scope of Use

Group	Properties
General	Name, Name_Alias, Block, Age, General_Category, Script, Script_Extensions, White_Space, Alphabetic, Hangul_Syllable_Type, Noncharacter_Code_Point, Default_Ignorable_Code_Point, Deprecated, Logical_Order_Exception, Variation_Selector
Case	Uppercase, Lowercase, Lowercase_Mapping, Titlecase_Mapping, Uppercase_Mapping, Case_Folding, Simple_Lowercase_Mapping, Simple_Titlecase_Mapping, Simple_Uppercase_Mapping, Simple_Case_Folding, Soft_Dotted, Cased, Case_Ignorable, Changes_When_Lowercased, Changes_When_Uppercased, Changes_When_Titlecased, Changes_When_Casefolded, Changes_When_Casemapped
Emoji	Emoji, Emoji_Presentation, Emoji_Modifier, Emoji_Modifier_Base, Emoji_Component, Extended_Pictographic
Hieroglyphic	kEH_HG, kEH_IFAO, kEH_JSesh, kEH_Cat, kEH_Desc, kEH_NoMirror, kEH_NoRotate
Numeric	Numeric_Value, Numeric_Type, Hex_Digit, ASCII_Hex_Digit
Normalization	Canonical_Combining_Class, Decomposition_Mapping, Composition_Exclusion, Full_Composition_Exclusion, Decomposition_Type, FC_NFKC_Closure (deprecated), NFC_Quick_Check, NFKC_Quick_Check, NFD_Quick_Check, NFKD_Quick_Check, Expands_On_NFC (deprecated), Expands_On_NFD (deprecated), Expands_On_NFKC (deprecated), Expands_On_NFKD (deprecated), NFKC_Casefold, Changes_When_NFKC_Casefolded, NFKC_Simple_Casefold
Shaping and Rendering	Join_Control, Joining_Group, Joining_Type, Modifier_Combining_Mark, Vertical_Orientation, East_Asian_Width, Prepended_Concatenation_Mark
Bidirectional	Bidi_Class, Bidi_Control, Bidi_Mirrored, Bidi_Mirroring_Glyph, Bidi_Paired_Bracket, Bidi_Paired_Bracket_Type
Identifiers	ID_Continue, ID_Start, XID_Continue, XID_Start, ID_Compat_Math_Continue, ID_Compat_Math_Start, Pattern_Syntax, Pattern_White_Space
Segmentation	Line_Break, Grapheme_Cluster_Break, Sentence_Break, Word_Break
CJK	Ideographic, Unified_Ideograph, Radical, IDS_Unary_Operator, IDS_Binary_Operator, IDS_Trinary_Operator, Unicode_Radical_Stroke, Equivalent_Unified_Ideograph
Miscellaneous	Math, Quotation_Mark, Dash, Hyphen (deprecated, stabilized), Sentence_Terminal, Terminal_Punctuation, Diacritic, Extender, Grapheme_Base, Grapheme_Extend, Grapheme_Link (deprecated), Unicode_1_Name (obsolete), ISO_Comment (deprecated, stabilized), Regional_Indicator, Indic_Conjunct_Break, Indic_Positional_Category, Indic_Syllabic_Category
Contributory Properties	Other_Alphabetic, Other_Default_Ignorable_Code_Point, Other_Grapheme_Extend, Other_ID_Start, Other_ID_Continue, Other_Lowercase, Other_Math, Other_Uppercase, Jamo_Short_Name

5.2 About the Property Table

Table 9, Property Table specifies the list of character properties defined in the UCD. That table is divided into separate sections for each data file in the UCD. Data files which define a single property or a small number of properties are listed first, followed by the data files which define a large number of properties: DerivedCoreProperties.txt, DerivedNormalizationProps.txt, PropList.txt, UnicodeData.txt, and emoji-data.txt. In some instances for these files defining many properties, the entries in the property table are grouped by type, for clarity in presentation, rather than being listed alphabetically.

In Table 9, Property Table, each property is described as follows:

First Column. This column contains the name of each of the character properties specified in the respective data file. Any special status for a property, such as whether it is obsolete, deprecated, or stabilized, is also indicated in the first column.

Second Column. This column indicates the type of the property, according to the key in Table 8.

Table 8. Property Type Key

Property Type	Symbol	Examples
Catalog	C	Age, Block
Enumeration	E	Joining_Type, Line_Break
Binary	B	Uppercase, White_Space
String-valued	S	Uppercase_Mapping, Case_Folding
Numeric	N	Numeric_Value
Miscellaneous	M	Name, Jamo_Short_Name

Catalog properties have enumerated values which are expected to be regularly extended in successive versions of the Unicode Standard. This distinguishes them from Enumeration properties.
Enumeration properties have enumerated values which constitute a logical partition space; new values will generally not be added to them in successive versions of the standard.
Binary properties are a special case of Enumeration properties, which have exactly two values: Yes and No (or True and False).
String-valued properties are typically mappings from a Unicode code point to another Unicode code point or sequence of Unicode code points; examples include case mappings and decomposition mappings.
Properties of strings are properties defined for strings; in other words, their domain is a set of strings rather than a set of characters or code points. Properties of strings are sometimes called "string properties" for short. For example, the file NamedSequences.txt defines names (which are themselves string values) for a certain set of specific character sequences. Properties of strings are not explicitly listed for the UCD in the Property Table, and hence are given no specific type symbol in the Property Type Key.
Numeric properties specify the actual numeric values for digits and other characters associated with numbers in some way.
Miscellaneous properties are those properties that do not fit neatly into the other property categories; they currently include character names, comments about characters, the Script_Extensions property, and the Unicode_Radical_Stroke property (a combination of numeric values) documented in Unicode Standard Annex #38, "Unicode Han Database (Unihan)" [UAX38].

For a more complete discussion of types of character properties, including formal definitions, see Unicode Technical Report 23, "The Unicode Character Property Model" [