CSS Text Module Level 3

1. Introduction

Tests

The test coverage information in this specification covers wpt/css/css-text/ and subdirectories, as well as those tests in wpt/css/CSS2/ and subdirectories that relate to this specification.

This module describes the typesetting controls of CSS; that is, the features of CSS that control the translation of source text to formatted, line-wrapped text. Various CSS properties provide control over case transformation, white space collapsing, text wrapping, line breaking rules and hyphenation, alignment and justification, spacing, and indentation.

Note: Font selection is covered in the CSS Fonts Module. [CSS-FONTS-3]

Features for decorating text, such as underlines, emphasis marks, and shadows (previously part of this module), are covered in the CSS Text Decoration Module. [CSS-TEXT-DECOR-3]

Bidirectional and vertical text are addressed in the CSS Writing Modes Module. [CSS-WRITING-MODES-4]

Further information about the typesetting requirements of various languages and writing systems around the world can be found in the Internationalization Working Group’s Language Enablement Index. [TYPOGRAPHY]

1.1. Module Interactions

Tests

Tests not needed for this section.

This module, together with the CSS Text Decoration Module, replaces and extends the text-level features defined in Cascading Style Sheets Level 2 chapter 16. [CSS-TEXT-DECOR-3] [CSS2]

In addition to the terms defined below, other terminology and concepts used in this specification are defined in Cascading Style Sheets Level 2 and the CSS Writing Modes Module. [CSS2] [CSS-WRITING-MODES-4]

1.2. Value Definitions

Tests

Tests not really needed for this section; could possibly test that css-wide keywords apply to every property.

This specification follows the CSS property definition conventions from [CSS2] using the value definition syntax from [CSS-VALUES-3]. Value types not defined in this specification are defined in CSS Values & Units [CSS-VALUES-3]. Combination with other CSS modules may expand the definitions of these value types.

In addition to the property-specific values listed in their definitions, all properties defined in this specification also accept the CSS-wide keywords as their property value. For readability they have not been repeated explicitly.
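As a brief illustration, the CSS-wide keywords can be written wherever a property-specific value of this module would go. The selectors below are hypothetical and serve only to show the syntax.

```css
/* Hypothetical selectors; inherit, initial, and unset are CSS-wide keywords. */
blockquote { text-transform: inherit; } /* take the parent's computed value */
.reset     { white-space: initial; }    /* back to the initial value, normal */
.plain     { text-transform: unset; }   /* inherited property, so acts as inherit */
```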

1.3. Languages and Typesetting

Tests

Tests not needed for this section: these are definitions, they get tested through their application, not by themselves.

Authors should accurately language-tag their content for the best typographic behavior.

Many typographic effects vary by linguistic context. Language and writing system conventions can affect line breaking, hyphenation, justification, glyph selection, and many other typographic effects. In CSS, language-specific typographic tailorings are only applied when the content language is known (declared). Therefore, higher quality typography requires authors to communicate to the UA the correct linguistic context of the text in the document.

The content language of an element is the (human) language the element is declared to be in, according to the rules of the document language. Note that it is possible for the content language of an element to be unknown—e.g. untagged content, or content in a document language that does not have a language-tagging facility, is considered to have an unknown content language.

Note: Authors can declare the content language using the global lang attribute in HTML or the universal xml:lang attribute in XML. See the rules for determining the content language of an HTML element in HTML, and the rules for determining the content language of an XML element in XML 1.0. [HTML] [XML10]

The content language an element is declared to be in also identifies the specific written form of that language used in that element, known as the content writing system. Depending on the document language’s facilities for identifying the content language, this information can be explicit or implied. See the normative Appendix F: Identifying the Content Writing System.

Note: Some languages have more than one writing system tradition; in other cases a language can be transliterated into a foreign writing system. Authors should subtag such cases so that the UA can adapt appropriately.

For example, Korean (ko) can be written in Hangul (-Hang), Hanja (-Hani), or a combination (-Kore). Historical documents written solely in Hanja do not use word spaces and are formatted more like modern Chinese than modern Korean. In other words, for typographic purposes ko-Hani behaves more like zh-Hant than ko (ko-Kore).

As another example, Japanese (ja) is typically written in a combination (-Japn) of Hiragana (-Hira), Katakana (-Kana), and Kanji (-Hani). However, it can also be “romanized” into Latin (-Latn) for special purposes like language-learning textbooks, in which case it should be formatted more like English than Japanese.

As a third example, contemporary Mongolian is written in two scripts: Cyrillic (-Cyrl, officially used in Mongolia) and Mongolian (-Mong, more common in Inner Mongolia, part of China). These have very different formatting requirements, with Cyrillic behaving similarly to Latin and Greek, and Mongolian deriving from both Arabic and Chinese writing conventions.
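The following is a minimal sketch of how script subtags let style rules distinguish writing systems of the same language. It assumes the content is tagged as shown in the comments and uses the text-align and text-justify properties, which are defined elsewhere in CSS rather than in this section. The UA's own automatic tailoring is separate from any author rules like these.

```css
/* Assumes markup tagged like <p lang="ko">…</p> and <p lang="ko-Hani">…</p>. */

/* Modern Korean uses word spaces, so justification can expand them. */
:lang(ko) { text-align: justify; text-justify: inter-word; }

/* Hanja-only text has no word spaces; distribute space between characters.
   This rule comes later with equal specificity, so it overrides the one above. */
:lang(ko-Hani) { text-justify: inter-character; }
```

Because :lang(ko) also matches ko-Hani content, the more specific rule has to come second.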

1.4. Characters and Letters

Tests

For the most part, tests not really needed for this section: these are definitions, they get tested through their applications, not by themselves. The few testable assertions that are made have coverage.

Possible additions:

turning the content of example 1 into tests (first, check that it’s not already done).

The basic unit of typesetting is the character. However, because writing systems are not always as simple as the basic English alphabet, what a character actually is depends on the context in which the term is used. For example, in Hangul (the Korean writing system), each square representation of a syllable (e.g. 한=Han) can be considered a character. However, the square symbol is really composed of multiple letters each representing a phoneme (e.g. ㅎ=h, ㅏ=a, ㄴ=n) and these also could each be considered a character.

A basic unit of computer text encoding, for any given encoding, is also called a character, and depending on the encoding, a single encoding character might correspond to the entire pre-composed syllabic character (e.g. 한), to the individual phonemic character (e.g. ㅎ), or to smaller units such as a base letterform (e.g. ㅇ) and any combining marks that vary it (e.g. extra strokes that represent aspiration).

In turn, a single encoding character can be represented in the data stream as one or more bytes; and in programming environments one byte is sometimes also called a character.

Therefore the term character is fairly ambiguous where technical precision is required.

For text layout, we will refer to the typographic character unit as the basic unit of text. Even within the realm of text layout, the relevant character unit depends on the operation. For example, line-breaking and letter-spacing will segment a sequence of Thai characters that include U+0E33 ำ THAI CHARACTER SARA AM differently; or the behavior of a conjunct consonant in a script such as Devanagari may depend on the font in use. So the typographic character represents a unit of the writing system—such as a Latin alphabetic letter (including its diacritics), Hangul syllable, Chinese ideographic character, Myanmar syllable cluster—that is indivisible with respect to a particular typographic operation (line-breaking, first-letter effects, tracking, justification, vertical arrangement, etc.).

Unicode Standard Annex #29: Text Segmentation defines a unit called the grapheme cluster which approximates the typographic character. [UAX29] A UA must use the extended grapheme cluster (not legacy grapheme cluster), as defined in UAX29, as the basis for its typographic character unit. However, the UA should tailor the definitions as required by typographic tradition since the default rules are not always appropriate or ideal—and is expected to tailor them differently depending on the operation as needed.

Note: The rules for such tailorings are out of scope for CSS.

The following are some examples of typographic character unit tailorings required by standard typesetting practice:

In some scripts such as Myanmar or Devanagari, the typographic character unit for both justification and line-breaking is an entire syllable, which can include more than one Unicode grapheme cluster. [UAX29]
In other scripts such as Thai or Lao, even though for line-breaking the typographic character matches Unicode’s default grapheme clusters, for letter-spacing the relevant unit is less than a Unicode grapheme cluster, and may require decomposition or other substitutions before spacing can be inserted. [UAX29]
For instance, to properly letter-space the Thai word คำ (U+0E04 + U+0E33), the U+0E33 needs to be decomposed into U+0E4D + U+0E32, and then the extra letter-space inserted before the U+0E32: คํ า.

A slightly more complex example is น้ำ (U+0E19 + U+0E49 + U+0E33). In this case, normal Thai shaping will first decompose the U+0E33 into U+0E4D + U+0E32 and then swap the U+0E4D with the U+0E49, giving U+0E19 + U+0E4D + U+0E49 + U+0E32. As before the extra letter-space is then inserted before the U+0E32: นํ้ า.
Vertical typesetting can also require tailoring. For example, when typesetting upright text, Tibetan tsek and shad marks are kept with the preceding grapheme cluster, rather than treated as an independent typographic character unit. [CSS-WRITING-MODES-4]
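From the author's side, the Thai letter-spacing tailoring described above needs nothing beyond an ordinary declaration. The sketch below assumes a hypothetical class name and markup; the decomposition and the placement of the extra space are left to the UA.

```css
/* Assumes markup such as <span lang="th" class="tracked">คำ</span>. */
/* The author only requests the spacing. Decomposing U+0E33 into
   U+0E4D + U+0E32 and inserting the space before U+0E32 is the UA's job. */
.tracked:lang(th) { letter-spacing: 0.25em; }
```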

A typographic letter unit (or letter for the purpose of this specification) is a typographic character unit belonging to one of the Letter or Number general categories. See Appendix E: Characters and Properties for how to determine the Unicode properties of a typographic character unit.

The rendering characteristics of a typographic character unit divided by an element boundary are undefined. Ideally each component should be rendered according to the formatting requirements of its respective element’s properties while maintaining correct shaping and positioning of the typographic character unit as a whole. However, depending on the nature of the formatting differences between its parts and the capabilities of the font technology in use, this is not always possible. Therefore such a typographic character unit may be rendered as belonging to either side of the boundary, or as some approximation of belonging to both. Authors are forewarned that dividing grapheme clusters or ligatures by element boundaries may give inconsistent or undesired results.

1.5. Text Processing

Tests

This section has adequate coverage. Exhaustive coverage is unrealistic, since this section is effectively a dependency on all of Unicode. Some tests are nonetheless provided for key functionality (such as the effect of certain control characters on Arabic shaping).

CSS is built on Unicode. [UNICODE] UAs that support Unicode must adhere to all normative requirements of the Unicode Core Standard, except where explicitly overridden by CSS. UAs implemented on the basis of a non-Unicode text encoding model are still expected to fulfill the same text handling requirements by assuming an appropriate mapping and analogous behavior.

For the purpose of determining adjacency for text processing (such as white space processing, text transformation, line-breaking, etc.), and thus in general within this specification, intervening inline box boundaries and out-of-flow elements must be ignored. With respect to text shaping, however, see § 7.3 Shaping Across Element Boundaries.

2. Transforming Text

Tests

This section and its subsections have good test coverage overall, and very good i18n coverage in particular.

Missing tests:

no test of Animation type.
Applies to text

Possible additions:

An automated test for plain text copy&paste not applying transforms. Not clear such an automated test is possible, but it would be nice to have one if it were.

2.1. Case Transforms: the text-transform property

Name:	text-transform
Value:	none | [capitalize | uppercase | lowercase ] || full-width || full-size-kana
Initial:	none
Applies to:	text
Inherited:	yes
Percentages:	n/a
Computed value:	specified keyword
Canonical order:	n/a
Animation type:	discrete

Tests

c545-txttrans-000.xht (live test) (source)

This property transforms text for styling purposes. It has no effect on the underlying content, and must not affect the content of a plain text copy & paste operation.

Tests

text-transform-upperlower-105.html (live test) (source)

Authors must not rely on text-transform for semantic purposes; rather the correct casing and semantics should be encoded in the source document text and markup.

Tests

text-transform-copy-paste-001-manual.html (manual test) (source)

Values have the following meanings:

none

No effects.

Tests

text-transform-none-001.xht (live test) (source)

text-transform-004.xht (live test) (source)

capitalize

Puts the first typographic letter unit of each word, if lowercase, in titlecase; other characters are unaffected.

uppercase

Puts all letters in uppercase.

lowercase

Puts all letters in lowercase.

full-width

Puts all typographic character units in full-width form. If a character does not have a corresponding full-width form, it is left as is. This value is typically used to typeset Latin letters and digits as if they were ideographic characters.

full-size-kana

Converts all small Kana characters to the equivalent full-size Kana. This value is typically used for ruby annotation text, where authors may want all small Kana to be drawn as large Kana to compensate for legibility issues at the small font sizes typically used in ruby.

The following example converts the ASCII characters used in abbreviations in Japanese text to their full-width variants so that they lay out and line break like ideographs:

abbr:lang(ja) { text-transform: full-width; }

Note: The purpose of text-transform is to allow for presentational casing transformations without affecting the semantics of the document. Note in particular that text-transform casing operations are lossy, and can distort the meaning of a text. While accessibility interfaces may wish to convey the apparent casing of the rendered text to the user, the transformed text cannot be relied on to accurately represent the underlying meaning of the document.

In this example, the first line of text is rendered in uppercase as a visual effect.

section > p:first-of-type::first-line {
  text-transform: uppercase;
}

This effect cannot be written into the source document because the position of the line break depends on layout. Moreover, the capitalization does not reflect a semantic distinction and is not intended to affect the paragraph’s reading; therefore it belongs in the presentation layer.

In this example, the ruby annotations, which are half the size of the main paragraph text, are transformed to use regular-size kana in place of small kana.

rt { font-size: 50%; text-transform: full-size-kana; }
:is(h1, h2, h3, h4) rt { text-transform: none; /* unset for large text*/ }

Note that while this makes such letters easier to see at small type sizes, the transformation distorts the text: the reader needs to mentally substitute small kana in the appropriate places—not unlike reading a Latin inscription where all “U”s look like “V”s.

For example, if text-transform: full-size-kana were applied to the following source, the annotation would read “じゆう” (jiyū), which means “liberty”, instead of “じゅう” (jū), which means “ten”, the correct reading and meaning for the annotated “十”.

<ruby>十<rt>じゅう</ruby>

2.1.1. Mapping Rules

For capitalize, what constitutes a “word” is UA-dependent; [UAX29] is suggested (but not required) for determining such word boundaries. Out-of-flow elements and inline element boundaries must not introduce a text-transform word boundary and must be ignored when determining such word boundaries.

Note: Authors cannot depend on capitalize to follow language-specific titlecasing conventions (such as skipping articles in English).
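The word-boundary rule for capitalize can be sketched as follows. The class name and markup are hypothetical, and the renderings in the comments assume a UA that finds word boundaries at the spaces, as [UAX29] would.

```css
/* Assumes markup such as <p class="title">foo<b>bar</b> baz</p>. */
.title { text-transform: capitalize; }

/* The <b> boundary is not a word boundary, so the expected rendering is
   "Foobar Baz", not "FooBar Baz". */

/* No language-specific titlecasing is applied:
   "the art of war" is expected to render as "The Art Of War". */
```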

The UA must use the full case mappings for Unicode characters, including any conditional casing rules, as defined in the Default Case Algorithms section of The Unicode Standard. [UNICODE] If (and only if) the content language of the element is, according to the rules of the document language, known, then any appropriate language-specific rules must be applied as well. These minimally include, but are not limited to, the language-specific rules in Unicode’s SpecialCasing.txt.
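As a small illustration of why full case mappings matter (as opposed to simple one-to-one mappings), consider uppercasing German text. The class name and markup below are hypothetical.

```css
/* Assumes markup such as <span class="caps">straße</span>. */
.caps { text-transform: uppercase; }

/* Under the full case mappings, U+00DF ß uppercases to "SS", so the expected
   rendering is "STRASSE": one source character becomes two. */
```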

For example, in Turkish there are two “i”s, one with a dot—“İ” and “i”—and one without—“I” and “ı”. Thus the usual case mappings between “I” and “i” are replaced with a different set of mappings to their respective dotless/dotted counterparts, which do not exist in English. This mapping must only take effect if the content language is Turkish written in its modern Latin-based writing system (or another Turkic language that uses Turkish casing rules); in other languages, the usual mapping of “I” and “i” is required. This rule is thus conditionally defined in Unicode’s SpecialCasing.txt file.
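The Turkish rule above might play out as in the following sketch, where the same declaration yields different results depending on the declared content language. The class name and markup are hypothetical.

```css
.caps { text-transform: uppercase; }

/* <p lang="tr" class="caps">istanbul</p>  expected rendering: İSTANBUL */
/* <p lang="en" class="caps">istanbul</p>  expected rendering: ISTANBUL */
/* <p class="caps">istanbul</p>            content language unknown, so the
                                           default mapping applies: ISTANBUL */
```

This is one reason accurate language tagging matters: without lang="tr", the UA has no basis for applying the Turkish mapping.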

The definition of full-width and half-width forms can be found in Unicode Standard Annex #11: East Asian Width. [UAX11] The mapping to full-width form is defined by taking code points with the <wide> or the <narrow> tag in their Decomposition_Mapping in Unicode Standard Annex #44: Unicode Character Database. [UAX44] For the <narrow> tag, the mapping is from the code point to the decomposition (minus <narrow> tag), and for the <wide> tag, the mapping is from the decomposition (minus the <wide> tag) back to the original code point.
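To make the two directions of the mapping concrete, here is a sketch with one character of each kind. The class name and markup are hypothetical; the code points are taken from the Unicode Character Database.

```css
/* Assumes markup such as <span class="fw">Aｶ</span>. */
.fw { text-transform: full-width; }

/* <wide> case: U+FF21 Ａ decomposes as <wide> U+0041, so the mapping runs
   from the decomposition back to the original: A (U+0041) maps to Ａ (U+FF21). */

/* <narrow> case: U+FF76 ｶ decomposes as <narrow> U+30AB, so the mapping runs
   from the code point to its decomposition: ｶ (U+FF76) maps to カ (U+30AB). */
```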

The mappings for small Kana to full-size Kana are defined in Appendix G: Small Kana Mappings.

2.1.2. Order of Operations

When multiple values are specified and therefore multiple transformations need to be applied, they are applied in the following order:

1. capitalize, uppercase, and lowercase
2. full-width
3. full-size-kana

Tests

text-transform-multiple-001.html (live test) (source)

Text transformation happens after § 4.1.1 Phase I: Collapsing and Transformation but before § 4.1.2 Phase II: Trimming and Positioning. This means that full-width only transforms spaces (U+0020) to U+3000 IDEOGRAPHIC SPACE within preserved white space.

Note: As defined in Appendix A: Text Processing Order of Operations, transforming text affects line-breaking and other formatting operations.
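A minimal sketch of combined transforms, assuming hypothetical class names and markup: the case mapping is applied before the full-width mapping, and spaces are converted only where white space is preserved.

```css
/* Assumes markup such as <span class="label">abc</span>. */
.label { text-transform: uppercase full-width; }
/* Case mapping runs first ("abc" to "ABC"), then the full-width mapping,
   so the expected rendering is "ＡＢＣ" (U+FF21 U+FF22 U+FF23). */

/* With preserved white space, U+0020 is also mapped to U+3000. */
.label-pre { white-space: pre; text-transform: full-width; }
/* With collapsible white space (white-space: normal), spaces are left as-is. */
```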

3. White Space and Wrapping: the white-space property

Tests

This section has good overall test coverage, particularly through tests for the White Space Processing section and its subsections.

Missing tests:

No test of Animation type

Name:	white-space
Value:	normal | pre | nowrap | pre-wrap | break-spaces | pre-line
Initial:	normal
Applies to:	text
Inherited:	yes
Percentages:	n/a
Computed value:	specified keyword
Canonical order:	n/a
Animation type:	discrete

Tests

c562-white-sp-000.xht (live test) (source)

This property specifies two things:

whether and how white space is collapsed
whether lines may wrap at unforced soft wrap opportunities
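As a rough sketch, these two axes combine as follows for four of the values defined below. The class names are hypothetical.

```css
.prose { white-space: normal;   } /* collapses white space; lines may wrap */
.code  { white-space: pre;      } /* preserves white space; breaks only at forced line breaks */
.label { white-space: nowrap;   } /* collapses white space; no wrapping */
.log   { white-space: pre-wrap; } /* preserves white space; lines may wrap */
```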

Values have the following meanings, which must be interpreted according to the White Space Processing and Line Breaking rules:

normal

This value directs user agents to collapse sequences of white space into a single character (or in some cases, no character). Lines may wrap at allowed soft wrap opportunities, as determined by the line-breaking rules in effect, in order to minimize inline-axis overflow.

pre

This value prevents user agents from collapsing sequences of white space. Segment breaks such as line feeds are preserved as forced line breaks. Lines only break at forced line breaks; content that does not fit within the block container overflows it.

Tests

white-space-pre-001.xht (live test) (source)
white-space-pre-002.xht (live test) (source)
white-space-pre-005.xht (live test) (source)
white-space-pre-006.xht (live test) (source)
white-space-pre-007.xht (visual test) (source)
white-space-pre-element-001.xht (live test) (source)

nowrap

Like normal, this value collapses white space; but like pre, it does not allow wrapping.

pre-wrap

Like pre, this value preserves white space; but like normal, it allows wrapping.

break-spaces

The behavior is identical to that of pre-wrap, except that:
