std.uni

The std.uni module provides an implementation of fundamental Unicode algorithms and data structures. It does not include UTF encoding and decoding primitives; for those, see std.utf.decode and std.utf.encode in std.utf.

All primitives listed operate on Unicode characters and sets of characters. For functions which operate on ASCII characters and ignore Unicode characters, see std.ascii. For definitions of Unicode character, code point and other terms used throughout this module, see the terminology section below.

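The focus of this module is the core needs of developing Unicode-aware applications. To that effect it provides the following optimized primitives:

  • Character classification (isAlpha, isWhite and the other is* functions).
  • Case-insensitive string comparison (sicmp, icmp).
  • Conversion of text to any of the four normalization forms via normalize.
  • Decoding (decodeGrapheme) and iteration (byGrapheme, graphemeStride) by user-perceived characters, that is, by grapheme clusters.
  • Decomposition and composition of individual characters according to canonical or compatibility rules (compose, decompose), including the Hangul-specific composeJamo and decomposeHangul.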

It's recognized that an application may need further enhancements and extensions, such as less commonly known algorithms, or tailoring of existing ones for region-specific needs. To help users build any extra functionality beyond the core primitives, the module provides:

  • CodepointSet, a type for easy manipulation of sets of characters. Besides the typical set algebra it provides an unusual feature: a D source code generator for detection of code points in this set. This is a boon for meta-programming parser frameworks, and is used internally to power classification in small sets like isWhite.
  • A way to construct optimal packed multi-stage tables, also known as a special case of Trie. The functions codepointTrie and codepointSetTrie construct custom tries that map dchar to a value. The end result is a fast and predictable O(1) lookup that powers functions like isAlpha and combiningClass, but for user-defined data sets.
  • A useful technique for Unicode-aware parsers that perform character classification of encoded code points is to avoid unnecessary decoding at all costs. utfMatcher provides an improvement over the usual workflow of decode-classify-process by combining the decoding and classification steps. By extracting the necessary bits directly from encoded code units, matchers achieve significant performance improvements. See MatcherConcept for the common interface of UTF matchers.
  • Generally useful building blocks for customized normalization: combiningClass for querying combining class and allowedIn for testing the Quick_Check property of a given normalization form.
  • Access to a large selection of commonly used sets of code points. Supported sets include Script, Block and General Category. The exact contents of a set can be observed in the CLDR utility, on the property index page of the Unicode website. See unicode for easy and (optionally) compile-time checked set queries.

Synopsis

import std.algorithm.searching : find;
import std.uni;

void main()
{
    // initialize code point sets using script/block or property name;
    // 'set' then contains code points from both scripts
    auto set = unicode("Cyrillic") | unicode("Armenian");
    // same thing but simpler and checked at compile-time
    auto ascii = unicode.ASCII;
    auto currency = unicode.Currency_Symbol;

    // easy set ops
    auto a = set & ascii;
    assert(a.empty); // as it has no intersection with ascii
    a = set | ascii;
    auto b = currency - a; // subtract all ASCII, Cyrillic and Armenian

    // some properties of code point sets
    assert(b.length > 45); // 46 items in Unicode 6.1, even more in 6.2
    // testing presence of a code point in a set
    // is just fine, it is O(logN)
    assert(!b['$']);
    assert(!b['\u058F']); // Armenian dram sign
    assert(b['¥']);

    // building fast lookup tables, these guarantee O(1) complexity
    // 1-level Trie lookup table is essentially a huge bit-set ~262Kb
    auto oneTrie = toTrie!1(b);
    // 2-level is far more compact but typically slightly slower
    auto twoTrie = toTrie!2(b);
    // 3-level is even smaller, and a bit slower yet
    auto threeTrie = toTrie!3(b);
    assert(oneTrie['£']);
    assert(twoTrie['£']);
    assert(threeTrie['£']);

    // build the trie with the most sensible trie level
    // and bind it as a functor
    auto cyrillicOrArmenian = toDelegate(set);
    auto balance = find!(cyrillicOrArmenian)("Hello ընկեր!");
    assert(balance == "ընկեր!");
    // compatible with bool delegate(dchar)
    bool delegate(dchar) bindIt = cyrillicOrArmenian;

    // Normalization
    string s = "Plain ascii (and not only), is always normalized!";
    assert(s is normalize(s)); // is the same string

    string nonS = "A\u0308ffin"; // 'A' followed by a combining diaeresis
    auto nS = normalize(nonS); // to NFC, the W3C endorsed standard
    assert(nS == "Äffin");
    assert(nS != nonS);
    string composed = "Äffin";

    assert(normalize!NFD(composed) == "A\u0308ffin");
    // to NFKD, compatibility decomposition useful for fuzzy matching/searching
    assert(normalize!NFKD("2¹⁰") == "210");
}

Terminology

The following is a list of important Unicode notions and definitions. Any conventions used specifically in this module alone are marked as such. The descriptions are based on the formal definition as found in chapter three of The Unicode Standard Core Specification.

Abstract character
A unit of information used for the organization, control, or representation of textual data. Note that:

  • When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, visual).
  • An abstract character has no concrete form and should not be confused with a glyph.
  • An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a Grapheme.
  • The abstract characters encoded (see Encoded character) are known as Unicode abstract characters.
  • Abstract characters not directly encoded by the Unicode Standard can often be represented by the use of combining character sequences.

Canonical decomposition
The decomposition of a character or character sequence that results from recursively applying the canonical mappings found in the Unicode Character Database and those described in Conjoining Jamo Behavior (section 12 of Unicode Conformance).

Canonical composition
The precise definition of canonical composition is the algorithm specified in Unicode Conformance section 11. Informally, it is the process that does the reverse of canonical decomposition, with the addition of certain rules that, for example, prevent legacy characters from appearing in the composed result.

Canonical equivalent
Two character sequences are said to be canonical equivalents if their full canonical decompositions are identical.

Character
The meaning typically differs by context. For the purpose of this documentation the term character implies encoded character, that is, a code point having an assigned abstract character (a symbolic meaning).

Code point
Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF (hex). Not all code points are assigned to encoded characters.

Code unit
The minimal bit combination that can represent a unit of encoded text for processing or interchange. Depending on the encoding this could be: 8-bit code units in UTF-8 (char), 16-bit code units in UTF-16 (wchar), and 32-bit code units in UTF-32 (dchar). Note that in UTF-32, a code unit is a code point and is represented by the D dchar type.

Combining character
A character with the General Category of Combining Mark (M).

  • All characters with non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero combining class.
  • These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.

Combining class
A numerical value used by the Unicode Canonical Ordering Algorithm to determine which sequences of combining marks are to be considered canonically equivalent and which are not.

Compatibility decomposition
The decomposition of a character or character sequence that results from recursively applying both the compatibility mappings and the canonical mappings found in the Unicode Character Database, and those described in Conjoining Jamo Behavior, until no characters can be further decomposed.

Compatibility equivalent
Two character sequences are said to be compatibility equivalents if their full compatibility decompositions are identical.

Encoded character
An association (or mapping) between an abstract character and a code point.

Glyph
The actual, concrete image of a glyph representation having been rasterized or otherwise imaged onto some display surface.

Grapheme base
A character with the property Grapheme_Base, or any standard Korean syllable block.

Grapheme cluster
Defined as the text between grapheme boundaries as specified by Unicode Standard Annex #29, Unicode text segmentation. Important general properties of a grapheme:

  • The grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.
  • A grapheme cluster typically starts with a grapheme base and then extends across any subsequent sequence of nonspacing marks. A grapheme cluster is most directly relevant to text rendering and processes such as cursor placement and text selection in editing, but may also be relevant to comparison and searching.
  • For many processes, a grapheme cluster behaves as if it were a single character with the same properties as its grapheme base. Effectively, nonspacing marks apply graphically to the base, but do not change its properties.

Nonspacing mark
A combining character with the General Category of Nonspacing Mark (Mn) or Enclosing Mark (Me).

Spacing mark
A combining character that is not a nonspacing mark.
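The distinction between code units, code points and grapheme clusters can be checked directly. The following is a minimal sketch, not part of the module; the helper name graphemeCount is hypothetical and introduced only for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme, combiningClass, isMark;

/// Hypothetical helper: number of grapheme clusters in `s`.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT
    string  s8  = "e\u0301";
    wstring s16 = "e\u0301"w;
    dstring s32 = "e\u0301"d;

    // the number of code units depends on the encoding
    assert(s8.length == 3);  // 1 + 2 UTF-8 code units
    assert(s16.length == 2); // 2 UTF-16 code units
    assert(s32.length == 2); // in UTF-32 a code unit is a code point

    // two code points, but one user-perceived character
    assert(graphemeCount(s8) == 1);

    // U+0301 is a combining character with a non-zero combining class
    assert(isMark('\u0301'));
    assert(combiningClass('\u0301') == 230);
    assert(combiningClass('e') == 0);
}
```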

Normalization

The concepts of canonical equivalent or compatibility equivalent characters in the Unicode Standard make it necessary to have a full, formal definition of equivalence for Unicode strings. String equivalence is determined by a process called normalization, whereby strings are converted into forms which can be compared directly for identity. This is the primary goal of the normalization process; see the function normalize to convert into any of the four defined forms.

A very important attribute of the Unicode Normalization Forms is that they must remain stable between versions of the Unicode Standard. A Unicode string normalized to a particular Unicode Normalization Form in one version of the standard is guaranteed to remain in that Normalization Form for implementations of future versions of the standard.

The Unicode Standard specifies four normalization forms. Informally, two of these forms are defined by maximal decomposition of equivalent sequences, and two of these forms are defined by maximal composition of equivalent sequences.

The choice of the normalization form depends on the particular use case. NFC is the best form for general text, since it's more compatible with strings converted from legacy encodings. NFKC is the preferred form for identifiers, especially where there are security concerns. NFD and NFKD are the most useful for internal processing.
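As an illustration of comparing strings for equivalence after normalization, here is a small sketch. The helper canonicallyEqual is a hypothetical name, not a function of this module, and the choice of NFD for the comparison is an assumption; any single form applied to both sides would serve.

```d
import std.uni;

/// Hypothetical helper: true if `a` and `b` are canonically equivalent.
bool canonicallyEqual(string a, string b)
{
    return normalize!NFD(a) == normalize!NFD(b);
}

void main()
{
    string composed = "\u00C5";    // LATIN CAPITAL LETTER A WITH RING ABOVE
    string decomposed = "A\u030A"; // 'A' + COMBINING RING ABOVE

    assert(composed != decomposed); // different code point sequences
    assert(canonicallyEqual(composed, decomposed));
    assert(normalize!NFC(decomposed) == composed);

    // U+FB01 (the "fi" ligature) is only compatibility-equivalent to "fi"
    assert(!canonicallyEqual("\uFB01", "fi"));
    assert(normalize!NFKC("\uFB01") == "fi");
}
```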

Construction of lookup tables

The Unicode standard describes a set of algorithms that depend on having the ability to quickly look up various properties of a code point. Given the codespace of about 1 million code points, it is not a trivial task to provide a space-efficient solution for the multitude of properties.

Common approaches such as hash-tables or binary search over sorted code point intervals (as in InversionList) are insufficient. Hash-tables have an enormous memory footprint and binary search over intervals is not fast enough for some heavy-duty algorithms.

The recommended solution (see Unicode Implementation Guidelines) is using multi-stage tables that are an implementation of the Trie data structure with integer keys and a fixed number of stages. For the remainder of the section this will be called a fixed trie. The following describes a particular implementation that is aimed at speed of access at the expense of ideal size savings.

Taking a 2-level Trie as an example, the principle of operation is as follows. Split the number of bits in a key (code point, 21 bits) into 2 components (e.g. 13 and 8). The first is the number of bits in the index of the trie and the other is the number of bits in each page of the trie. The layout of the trie is then an array of size 2^^bits-of-index followed by an array of memory chunks of size 2^^bits-of-page/bits-per-element.

The number of pages is variable (but not less than 1), unlike the number of entries in the index. Every slot of the index has to contain the number of a page that is present. The lookup is then just a couple of operations: slice the upper bits, look up the index for these, take the page at this index and use the lower bits as an offset within this page.

auto elemsPerPage = (2 ^^ bits_per_page) / Value.sizeOfInBits;
pages[index[n >> bits_per_page]][n & (elemsPerPage - 1)];

If elemsPerPage is a power of 2, the whole process is a handful of simple instructions and 2 array reads. Subsequent levels of the trie are introduced by recursing on this notion: the index array is treated as values. The number of bits in the index is then again split into 2 parts, with pages over 'current-index' and the new 'upper-index'.

For completeness a level 1 trie is simply an array. The current implementation takes advantage of bit-packing values when the range is known to be limited in advance (such as bool). See also BitPacked for enforcing it manually. The major size advantage however comes from the fact that multiple identical pages on every level are merged by construction.
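To make the two-stage layout and the merging of identical pages concrete, here is a hand-rolled sketch. It is not the module's implementation: the names TwoStage and buildTwoStage are hypothetical, the page size of 256 code points is an arbitrary choice, and values are stored as plain bool without bit-packing.

```d
import std.uni : isAlpha;

/// Hypothetical two-stage table over the code space, 256 code points per page.
struct TwoStage
{
    enum pageBits = 8;
    enum pageSize = 1 << pageBits;

    ushort[] index; // one slot per 256-code-point page of the code space
    bool[][] pages; // only the distinct pages are stored

    bool opIndex(dchar ch) const
    {
        // upper bits select the page, lower bits the offset within it
        return pages[index[ch >> pageBits]][ch & (pageSize - 1)];
    }
}

/// Builds the table by evaluating `pred` for every code point once.
TwoStage buildTwoStage(scope bool delegate(dchar) pred)
{
    TwoStage t;
    ushort[immutable(bool)[]] seen; // page contents -> page number
    for (uint base = 0; base <= 0x10FFFF; base += TwoStage.pageSize)
    {
        auto page = new bool[TwoStage.pageSize];
        foreach (i, ref slot; page)
            slot = pred(cast(dchar)(base + i));
        auto key = page.idup;
        if (auto known = key in seen)
            t.index ~= *known; // reuse an identical page
        else
        {
            immutable id = cast(ushort) t.pages.length;
            seen[key] = id;
            t.pages ~= page;
            t.index ~= id;
        }
    }
    return t;
}

void main()
{
    auto table = buildTwoStage((dchar c) => isAlpha(c));
    assert(table['я'] && table['A']);
    assert(!table['5'] && !table['\u2028']);
    // merging leaves far fewer pages than index slots
    assert(table.index.length == (0x110000 >> TwoStage.pageBits));
    assert(table.pages.length < table.index.length);
}
```

Large stretches of the code space share the same page contents (for example, all-false pages for unassigned ranges), which is why the page count stays well below the index size.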

The process of constructing a trie is more involved and is hidden from the user in the form of the convenience functions codepointTrie, codepointSetTrie and the even more convenient toTrie. In general a set or built-in associative array with dchar keys can be turned into a trie. The trie object in this module is read-only (immutable); it's effectively frozen after construction.
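A brief sketch of the convenience functions in use follows. The 13/8 bit split passed to codepointSetTrie is only an example chosen so that the sizes add up to 21 bits; it is not claimed to be the split toTrie picks internally.

```d
import std.uni : codepointSetTrie, toTrie, unicode;

void main()
{
    auto set = unicode.Cyrillic;

    // let the library choose the bit split for a 2-level trie
    auto chosen = toTrie!2(set);
    // or spell the split out: 13 index bits + 8 page bits = 21 bits
    auto manual = codepointSetTrie!(13, 8)(set);

    // both tries answer membership queries exactly like the set itself
    foreach (dchar ch; "Ж z 5 \u0500"d)
    {
        assert(chosen[ch] == set[ch]);
        assert(manual[ch] == set[ch]);
    }
}
```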

Unicode properties

This is a full list of Unicode properties accessible through unicode with specific helpers per category nested within. Consult the CLDR utility when in doubt about the contents of a particular set.

General category sets listed below are only accessible with the unicode shorthand accessor.

General category
Abb. | Long form | Abb. | Long form | Abb. | Long form
L | Letter | Cn | Unassigned | Po | Other_Punctuation
Ll | Lowercase_Letter | Co | Private_Use | Ps | Open_Punctuation
Lm | Modifier_Letter | Cs | Surrogate | S | Symbol
Lo | Other_Letter | N | Number | Sc | Currency_Symbol
Lt | Titlecase_Letter | Nd | Decimal_Number | Sk | Modifier_Symbol
Lu | Uppercase_Letter | Nl | Letter_Number | Sm | Math_Symbol
M | Mark | No | Other_Number | So | Other_Symbol
Mc | Spacing_Mark | P | Punctuation | Z | Separator
Me | Enclosing_Mark | Pc | Connector_Punctuation | Zl | Line_Separator
Mn | Nonspacing_Mark | Pd | Dash_Punctuation | Zp | Paragraph_Separator
C | Other | Pe | Close_Punctuation | Zs | Space_Separator
Cc | Control | Pf | Final_Punctuation | - | Any
Cf | Format | Pi | Initial_Punctuation | - | ASCII

Sets for other commonly useful properties that are accessible with unicode:

Common binary properties
Name | Name | Name
Alphabetic | Ideographic | Other_Uppercase
ASCII_Hex_Digit | IDS_Binary_Operator | Pattern_Syntax
Bidi_Control | ID_Start | Pattern_White_Space
Cased | IDS_Trinary_Operator | Quotation_Mark
Case_Ignorable | Join_Control | Radical
Dash | Logical_Order_Exception | Soft_Dotted
Default_Ignorable_Code_Point | Lowercase | STerm
Deprecated | Math | Terminal_Punctuation
Diacritic | Noncharacter_Code_Point | Unified_Ideograph
Extender | Other_Alphabetic | Uppercase
Grapheme_Base | Other_Default_Ignorable_Code_Point | Variation_Selector
Grapheme_Extend | Other_Grapheme_Extend | White_Space
Grapheme_Link | Other_ID_Continue | XID_Continue
Hex_Digit | Other_ID_Start | XID_Start
Hyphen | Other_Lowercase
ID_Continue | Other_Math
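The general category and binary property names above can be used as follows. This is a small sketch of typical queries; the particular characters tested are arbitrary examples.

```d
import std.uni : unicode;

void main()
{
    // abbreviation and long form name the same set
    assert(unicode.Lu == unicode.Uppercase_Letter);
    assert(unicode.Lu['A'] && !unicode.Lu['a']);
    assert(unicode.Nd['7']);
    assert(unicode.Sc['$']);

    // binary properties are looked up the same way
    assert(unicode.White_Space[' ']);
    assert(unicode.Alphabetic['я']);
    assert(unicode.ASCII_Hex_Digit['f'] && !unicode.ASCII_Hex_Digit['g']);

    // run-time lookup by name yields the same set
    assert(unicode("Uppercase_Letter") == unicode.Lu);
}
```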

Below is the table with block names accepted by unicode.block. Note that the shorthand version unicode requires "In" to be prepended to the names of blocks so as to disambiguate scripts and blocks.

Blocks
Aegean Numbers | Ethiopic Extended | Mongolian
Alchemical Symbols | Ethiopic Extended-A | Musical Symbols
Alphabetic Presentation Forms | Ethiopic Supplement | Myanmar
Ancient Greek Musical Notation | General Punctuation | Myanmar Extended-A
Ancient Greek Numbers | Geometric Shapes | New Tai Lue
Ancient Symbols | Georgian | NKo
Arabic | Georgian Supplement | Number Forms
Arabic Extended-A | Glagolitic | Ogham
Arabic Mathematical Alphabetic Symbols | Gothic | Ol Chiki
Arabic Presentation Forms-A | Greek and Coptic | Old Italic
Arabic Presentation Forms-B | Greek Extended | Old Persian
Arabic Supplement | Gujarati | Old South Arabian
Armenian | Gurmukhi | Old Turkic
Arrows | Halfwidth and Fullwidth Forms | Optical Character Recognition
Avestan | Hangul Compatibility Jamo | Oriya
Balinese | Hangul Jamo | Osmanya
Bamum | Hangul Jamo Extended-A | Phags-pa
Bamum Supplement | Hangul Jamo Extended-B | Phaistos Disc
Basic Latin | Hangul Syllables | Phoenician
Batak | Hanunoo | Phonetic Extensions
Bengali | Hebrew | Phonetic Extensions Supplement
Block Elements | High Private Use Surrogates | Playing Cards
Bopomofo | High Surrogates | Private Use Area
Bopomofo Extended | Hiragana | Rejang
Box Drawing | Ideographic Description Characters | Rumi Numeral Symbols
Brahmi | Imperial Aramaic | Runic
Braille Patterns | Inscriptional Pahlavi | Samaritan
Buginese | Inscriptional Parthian | Saurashtra
Buhid | IPA Extensions | Sharada
Byzantine Musical Symbols | Javanese | Shavian
Carian | Kaithi | Sinhala
Chakma | Kana Supplement | Small Form Variants
Cham | Kanbun | Sora Sompeng
Cherokee | Kangxi Radicals | Spacing Modifier Letters
CJK Compatibility | Kannada | Specials
CJK Compatibility Forms | Katakana | Sundanese
CJK Compatibility Ideographs | Katakana Phonetic Extensions | Sundanese Supplement
CJK Compatibility Ideographs Supplement | Kayah Li | Superscripts and Subscripts
CJK Radicals Supplement | Kharoshthi | Supplemental Arrows-A
CJK Strokes | Khmer | Supplemental Arrows-B
CJK Symbols and Punctuation | Khmer Symbols | Supplemental Mathematical Operators
CJK Unified Ideographs | Lao | Supplemental Punctuation
CJK Unified Ideographs Extension A | Latin-1 Supplement | Supplementary Private Use Area-A
CJK Unified Ideographs Extension B | Latin Extended-A | Supplementary Private Use Area-B
CJK Unified Ideographs Extension C | Latin Extended Additional | Syloti Nagri
CJK Unified Ideographs Extension D | Latin Extended-B | Syriac
Combining Diacritical Marks | Latin Extended-C | Tagalog
Combining Diacritical Marks for Symbols | Latin Extended-D | Tagbanwa
Combining Diacritical Marks Supplement | Lepcha | Tags
Combining Half Marks | Letterlike Symbols | Tai Le
Common Indic Number Forms | Limbu | Tai Tham
Control Pictures | Linear B Ideograms | Tai Viet
Coptic | Linear B Syllabary | Tai Xuan Jing Symbols
Counting Rod Numerals | Lisu | Takri
Cuneiform | Low Surrogates | Tamil
Cuneiform Numbers and Punctuation | Lycian | Telugu
Currency Symbols | Lydian | Thaana
Cypriot Syllabary | Mahjong Tiles | Thai
Cyrillic | Malayalam | Tibetan
Cyrillic Extended-A | Mandaic | Tifinagh
Cyrillic Extended-B | Mathematical Alphanumeric Symbols | Transport And Map Symbols
Cyrillic Supplement | Mathematical Operators | Ugaritic
Deseret | Meetei Mayek | Unified Canadian Aboriginal Syllabics
Devanagari | Meetei Mayek Extensions | Unified Canadian Aboriginal Syllabics Extended
Devanagari Extended | Meroitic Cursive | Vai
Dingbats | Meroitic Hieroglyphs | Variation Selectors
Domino Tiles | Miao | Variation Selectors Supplement
Egyptian Hieroglyphs | Miscellaneous Mathematical Symbols-A | Vedic Extensions
Emoticons | Miscellaneous Mathematical Symbols-B | Vertical Forms
Enclosed Alphanumerics | Miscellaneous Symbols | Yijing Hexagram Symbols
Enclosed Alphanumeric Supplement | Miscellaneous Symbols and Arrows | Yi Radicals
Enclosed CJK Letters and Months | Miscellaneous Symbols And Pictographs | Yi Syllables
Enclosed Ideographic Supplement | Miscellaneous Technical
Ethiopic | Modifier Tone Letters
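The difference between a block and a script of the same name, and the role of the "In" prefix, can be seen in a short sketch. The code point U+0500 is picked here only as an example of a Cyrillic-script letter that lies outside the Cyrillic block.

```d
import std.uni : unicode;

void main()
{
    auto block = unicode.block.Cyrillic;   // the block U+0400 to U+04FF
    auto script = unicode.script.Cyrillic; // the Cyrillic script

    // with the shorthand, the "In" prefix selects the block
    assert(unicode.InCyrillic == block);
    assert(unicode.Cyrillic == script);

    // U+0500 belongs to the Cyrillic Supplement block,
    // yet it is a letter of the Cyrillic script
    assert(!block['\u0500']);
    assert(script['\u0500']);

    assert(unicode.block.Greek_and_Coptic == unicode.InGreek_and_Coptic);
}
```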

Below is the table with script names accepted by unicode.script and by the shorthand version unicode:

Scripts
Arabic | Hanunoo | Old_Italic
Armenian | Hebrew | Old_Persian
Avestan | Hiragana | Old_South_Arabian
Balinese | Imperial_Aramaic | Old_Turkic
Bamum | Inherited | Oriya
Batak | Inscriptional_Pahlavi | Osmanya
Bengali | Inscriptional_Parthian | Phags_Pa
Bopomofo | Javanese | Phoenician
Brahmi | Kaithi | Rejang
Braille | Kannada | Runic
Buginese | Katakana | Samaritan
Buhid | Kayah_Li | Saurashtra
Canadian_Aboriginal | Kharoshthi | Sharada
Carian | Khmer | Shavian
Chakma | Lao | Sinhala
Cham | Latin | Sora_Sompeng
Cherokee | Lepcha | Sundanese
Common | Limbu | Syloti_Nagri
Coptic | Linear_B | Syriac
Cuneiform | Lisu | Tagalog
Cypriot | Lycian | Tagbanwa
Cyrillic | Lydian | Tai_Le
Deseret | Malayalam | Tai_Tham
Devanagari | Mandaic | Tai_Viet
Egyptian_Hieroglyphs | Meetei_Mayek | Takri
Ethiopic | Meroitic_Cursive | Tamil
Georgian | Meroitic_Hieroglyphs | Telugu
Glagolitic | Miao | Thaana
Gothic | Mongolian | Thai
Greek | Myanmar | Tibetan
Gujarati | New_Tai_Lue | Tifinagh
Gurmukhi | Nko | Ugaritic
Han | Ogham | Vai
Hangul | Ol_Chiki | Yi

Below is the table of names accepted by unicode.hangulSyllableType.

Hangul syllable type
Abb. | Long form
L | Leading_Jamo
LV | LV_Syllable
LVT | LVT_Syllable
T | Trailing_Jamo
V | Vowel_Jamo
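A short sketch of querying Hangul syllable types follows; the sample code points are chosen for illustration. Note that L here is the syllable type, not the general category Letter.

```d
import std.uni : unicode;

void main()
{
    auto leading = unicode.hangulSyllableType.L;
    assert(leading == unicode.hangulSyllableType("L"));

    assert(leading['\u1100']);  // a leading consonant jamo
    assert(!leading['\u1161']); // a vowel jamo
    assert(unicode.hangulSyllableType.V['\u1161']);
    assert(unicode.hangulSyllableType.LV['\uAC00']);  // syllable without trailing jamo
    assert(unicode.hangulSyllableType.LVT['\uAC01']); // syllable with trailing jamo

    // the general category L (Letter) is a different set
    assert(unicode.L['A'] && !leading['A']);
}
```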

References: ASCII Table, Wikipedia, The Unicode Consortium, Unicode normalization forms, Unicode text segmentation, Unicode Implementation Guidelines, Unicode Conformance

Trademarks: Unicode(tm) is a trademark of Unicode, Inc.

Members

Aliases

CodepointSet
alias CodepointSet = InversionList!GcPolicy

The recommended default type for sets of code points. For details, see the current implementation: InversionList.
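A minimal sketch of building a CodepointSet from half-open intervals and inspecting it follows. The helper name bounds is hypothetical and exists only to flatten the intervals for the assertion.

```d
import std.uni : CodepointSet;

/// Hypothetical helper: flattens a set into its interval bounds a0, b0, a1, b1, ...
uint[] bounds(CodepointSet set)
{
    uint[] result;
    foreach (iv; set.byInterval)
    {
        result ~= iv.a;
        result ~= iv.b;
    }
    return result;
}

void main()
{
    // constructor arguments are pairs a, b meaning the interval [a, b)
    auto set = CodepointSet('a', 'e', 'x', 'z' + 1);
    assert(set['a'] && set['d'] && !set['e']);
    assert(set.length == 7); // a b c d x y z

    // set algebra produces new sets
    auto more = set | CodepointSet('0', '9' + 1);
    assert(more['5'] && more['y']);
    assert((more - set).length == 10);

    uint[] expected = [97, 101, 120, 123]; // 'a', 'e', 'x', 'z' + 1
    assert(bounds(set) == expected);
}
```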

Enums

Canonical
anonymous enum Canonical

Shorthand aliases for character decomposition type, passed as a template parameter to decompose.

NormalizationForm
enum NormalizationForm

Enumeration type for normalization forms, passed as template parameter for functions like normalize.

UnicodeDecomposition
enum UnicodeDecomposition

Unicode character decomposition type.

isUtfMatcher
eponymous template isUtfMatcher(M, C)

Tests if M is a UTF Matcher for ranges of C.

Functions

allowedIn
bool allowedIn(dchar ch)

Tests if dchar ch is always allowed (Quick_Check=YES) in normalization form norm.
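For illustration, a few quick checks with allowedIn, where the normalization form is passed as a template argument. The sample characters are arbitrary.

```d
import std.uni;

void main()
{
    // ASCII and Cyrillic letters are allowed in every form
    assert(allowedIn!NFC('Z') && allowedIn!NFD('Z'));
    assert(allowedIn!NFC('я') && allowedIn!NFKD('я'));

    // U+00C5 has a canonical decomposition: fine in NFC, never in NFD
    assert(allowedIn!NFC('\u00C5'));
    assert(!allowedIn!NFD('\u00C5'));
}
```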

asCapitalized
auto asCapitalized(Range str)

Capitalize an input range or string, meaning convert the first character to upper case and subsequent characters to lower case.

asLowerCase
auto asLowerCase(Range str)
asUpperCase
auto asUpperCase(Range str)

Convert an input range or a string to upper or lower case.
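The lazy case-conversion ranges can be compared or materialized as in the following sketch, which uses std.array.array to collect the result into a dchar array.

```d
import std.algorithm.comparison : equal;
import std.array : array;
import std.uni : asCapitalized, asLowerCase, asUpperCase;

void main()
{
    // the results are lazy ranges, so compare them element-wise
    assert("hEllo".asUpperCase.equal("HELLO"));
    assert("hEllo".asLowerCase.equal("hello"));
    assert("hEllo".asCapitalized.equal("Hello"));

    // collect the code points when an array is needed
    auto shout = "привет".asUpperCase.array;
    assert(shout == "ПРИВЕТ"d);
}
```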

byCodePoint
auto byCodePoint(Range range)

Lazily transform a range of Graphemes to a range of code points.

byGrapheme
auto byGrapheme(Range range)

Iterate a string by Grapheme.
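One use of byGrapheme together with byCodePoint is reversing text without separating combining marks from their base characters. The helper name reverseByGrapheme below is hypothetical.

```d
import std.array : array;
import std.conv : text;
import std.range : retro, walkLength;
import std.uni : byCodePoint, byGrapheme;

/// Hypothetical helper: reverses `s` one grapheme cluster at a time.
string reverseByGrapheme(string s)
{
    return s.byGrapheme.array.retro.byCodePoint.text;
}

void main()
{
    string s = "noe\u0308l"; // "noël" written as e + combining diaeresis
    assert(s.walkLength == 5);            // 5 code points
    assert(s.byGrapheme.walkLength == 4); // 4 grapheme clusters

    // the diaeresis stays attached to its 'e'
    assert(reverseByGrapheme(s) == "le\u0308on");
}
```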

combiningClass
ubyte combiningClass(dchar ch)

Returns the combining class of ch.

compose
dchar compose(dchar first, dchar second)

Try to canonically compose 2 characters. Returns the composed character if they do compose and dchar.init otherwise.

composeJamo
dchar composeJamo(dchar lead, dchar vowel, dchar trailing)

Try to compose hangul syllable out of a leading consonant (lead), a vowel and optional trailing consonant jamos.

decodeGrapheme
Grapheme decodeGrapheme(Input inp)

Reads one full grapheme cluster from an input range of dchar inp.

decompose
Grapheme decompose(dchar ch)

Returns a full Canonical (by default) or Compatibility decomposition of character ch. If no decomposition is available, returns a Grapheme with ch itself.

decomposeHangul
Grapheme decomposeHangul(dchar ch)

Decomposes a Hangul syllable. If ch is not a composed syllable then this function returns Grapheme containing only ch as is.
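The composition and decomposition functions above can be exercised together. This sketch uses a handful of sample characters; Grapheme values are sliced with [] to compare their code points.

```d
import std.algorithm.comparison : equal;
import std.uni;

void main()
{
    // canonical composition of a base letter and a combining mark
    assert(compose('A', '\u0308') == '\u00C4'); // A with diaeresis
    assert(compose('A', 'B') == dchar.init);    // does not compose

    // canonical and compatibility decomposition
    assert(decompose('\u0124')[].equal("H\u0302"));         // H with circumflex
    assert(decompose!Compatibility('\u00B9')[].equal("1")); // superscript one
    assert(decompose('x')[].equal("x"));                    // nothing to decompose

    // Hangul: leading consonant + vowel (+ optional trailing consonant)
    assert(composeJamo('\u1111', '\u1171', '\u11B6') == '\uD4DB');
    assert(composeJamo('\u1111', '\u1171') == '\uD4CC');
    assert(decomposeHangul('\uD4DB')[].equal("\u1111\u1171\u11B6"));
}
```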

graphemeStride
size_t graphemeStride(C[] input, size_t index)

Computes the length of grapheme cluster starting at index. Both the resulting length and the index are measured in code units.
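As a sketch, graphemeStride can be used to slice a string on grapheme boundaries without decoding it into Grapheme objects. The helper firstGrapheme is a hypothetical name and assumes a non-empty input.

```d
import std.uni : graphemeStride;

/// Hypothetical helper: the first grapheme cluster of a non-empty string.
string firstGrapheme(string s)
{
    return s[0 .. graphemeStride(s, 0)];
}

void main()
{
    string city = "A\u030Arhus"; // 'A' + combining ring above
    assert(graphemeStride(city, 0) == 3); // 1 + 2 UTF-8 code units
    assert(firstGrapheme(city) == "A\u030A");
    assert(graphemeStride(city, 3) == 1); // plain 'r'
}
```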

icmp
int icmp(S1 r1, S2 r2)

Does case-insensitive comparison of r1 and r2. Follows the rules of full case-folding mapping. This includes matching German ß with "ss" as equal, as well as other 1:M code point mappings, unlike sicmp. The cost of icmp being pedantically correct is slightly worse performance.
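The difference between full and simple case folding shows up with 1:M mappings, as in this small sketch that contrasts icmp with sicmp.

```d
import std.uni : icmp, sicmp;

void main()
{
    // both agree when every code point folds to a single code point
    assert(sicmp("Август", "авгусТ") == 0);
    assert(icmp("Август", "авгусТ") == 0);

    // only full case folding matches ß with "ss"
    assert(icmp("Rußland", "Russland") == 0);
    assert(sicmp("Rußland", "Russland") != 0);
}
```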

isAlpha
bool isAlpha(dchar c)

Returns whether c is a Unicode alphabetic character (general Unicode category: Alphabetic).

isAlphaNum
bool isAlphaNum(dchar c)

Returns whether c is a Unicode alphabetic character or number (general Unicode category: Alphabetic, Nd, Nl, No).

isControl
bool isControl(dchar c)

Returns whether c is a Unicode control character (general Unicode category: Cc).

isFormat
bool isFormat(dchar c)

Returns whether c is a Unicode formatting character (general Unicode category: Cf).

isGraphical
bool isGraphical(dchar c)

Returns whether c is a Unicode graphical character (general Unicode category: L, M, N, P, S, Zs).

isLower
bool isLower(dchar c)

Returns whether c is a Unicode lowercase character.

isMark
bool isMark(dchar c)

Returns whether c is a Unicode mark (general Unicode category: Mn, Me, Mc).

isNonCharacter
bool isNonCharacter(dchar c)

Returns whether c is a Unicode non-character, i.e. a code point with no assigned abstract character (general Unicode category: Cn).

isNumber
bool isNumber(dchar c)

Returns whether c is a Unicode numerical character (general Unicode category: Nd, Nl, No).

isPrivateUse
bool isPrivateUse(dchar c)

Returns whether c is a Unicode Private Use code point (general Unicode category: Co).

isPunctuation
bool isPunctuation(dchar c)

Returns whether c is a Unicode punctuation character (general Unicode category: Pd, Ps, Pe, Pc, Po, Pi, Pf).

isSpace
bool isSpace(dchar c)

Returns whether c is a Unicode space character (general Unicode category: Zs). Note: This doesn't include '\n', '\r', '\t' and other non-space characters. For commonly used less strict semantics see isWhite.

isSurrogate
bool isSurrogate(dchar c)

Returns whether c is a Unicode surrogate code point (general Unicode category: Cs).

isSurrogateHi
bool isSurrogateHi(dchar c)

Returns whether c is a Unicode high surrogate (lead surrogate).

isSurrogateLo
bool isSurrogateLo(dchar c)

Returns whether c is a Unicode low surrogate (trail surrogate).

isSymbol
bool isSymbol(dchar c)

Returns whether c is a Unicode symbol character (general Unicode category: Sm, Sc, Sk, So).

isUpper
bool isUpper(dchar c)

Returns whether c is a Unicode uppercase character.

isWhite
bool isWhite(dchar c)

Returns whether c is a Unicode whitespace character (general Unicode category: part of C0 (tab, vertical tab, form feed, carriage return, and linefeed characters), Zs, Zl, Zp, and NEL (U+0085)).
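The classification functions listed above can be tried out with a few sample characters, as in this sketch. It also contrasts the strict isSpace with the more permissive isWhite.

```d
import std.uni;

void main()
{
    assert(isAlpha('я') && !isAlpha('5'));
    assert(isNumber('5') && isNumber('\u00B2')); // superscript two is in No
    assert(isAlphaNum('я') && isAlphaNum('5'));
    assert(isPunctuation('!') && isSymbol('+'));
    assert(isMark('\u0301'));                    // combining acute accent
    assert(isControl('\n') && !isGraphical('\n'));
    assert(isUpper('Я') && isLower('я'));

    // isSpace covers only category Zs
    assert(isSpace(' ') && isSpace('\u00A0'));
    assert(!isSpace('\t') && !isSpace('\n'));

    // isWhite also accepts the usual control characters and separators
    assert(isWhite('\t') && isWhite('\n') && isWhite(' '));
    assert(isWhite(lineSep) && isWhite(paraSep) && isWhite(nelSep));
}
```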

normalize
inout(C)[] normalize(inout(C)[] input)

Returns input string normalized to the chosen form. Form C is used by default.

sicmp
int sicmp(S1 r1, S2 r2)

Does basic case-insensitive comparison of r1 and r2. This function uses simpler comparison rule thus achieving better performance than icmp. However keep in mind the warning below.

toDelegate
auto toDelegate(Set set)

Builds a Trie with typically optimal speed-size trade-off and wraps it into a delegate of the following type: bool delegate(dchar ch).

toLower
dchar toLower(dchar c)

If c is a Unicode uppercase character, then its lowercase equivalent is returned. Otherwise c is returned.

toLower
ElementEncodingType!S[] toLower(S s)

Creates a new array which is identical to s except that all of its characters are converted to lowercase (by performing Unicode lowercase mapping). If none of the characters of s were affected, then s itself is returned if s is a string-like type.

toLowerInPlace
void toLowerInPlace(C[] s)

Converts s to lowercase (by performing Unicode lowercase mapping) in place. For a few characters the string length may increase after the transformation; in such a case the function reallocates exactly once. If s does not have any uppercase characters, then s is unaltered.

toTrie
auto toTrie(Set set)

Convenience function to construct optimal configurations for packed Trie from any set of code points.

toUpper
dchar toUpper(dchar c)

If c is a Unicode lowercase character, then its uppercase equivalent is returned. Otherwise c is returned.

toUpper
ElementEncodingType!S[] toUpper(S s)

Allocates a new array which is identical to s except that all of its characters are converted to uppercase (by performing Unicode uppercase mapping). If none of the characters of s were affected, then s itself is returned if s is a string-like type.

toUpperInPlace
void toUpperInPlace(C[] s)

Converts s to uppercase (by performing Unicode uppercase mapping) in place. For a few characters the string length may increase after the transformation; in such a case the function reallocates exactly once. If s does not have any lowercase characters, then s is unaltered.
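A brief sketch of the case-mapping functions on single characters, on strings, and in place on a mutable buffer follows.

```d
import std.uni : toLower, toLowerInPlace, toUpper, toUpperInPlace;

void main()
{
    // single characters
    assert(toUpper('я') == 'Я');
    assert(toLower('Я') == 'я');
    assert(toLower('5') == '5'); // no case, returned unchanged

    // whole strings
    assert(toUpper("Hello, мир") == "HELLO, МИР");
    assert(toLower("Hello, МИР") == "hello, мир");

    // in-place variants need a mutable array
    char[] buf = "Hello".dup;
    toLowerInPlace(buf);
    assert(buf == "hello");
    toUpperInPlace(buf);
    assert(buf == "HELLO");
}
```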

utfMatcher
auto utfMatcher(Set set)

Constructs a matcher object to classify code points from the set for an encoding that has Char as its code unit.

Structs

CodepointInterval
struct CodepointInterval

The recommended type of std.typecons.Tuple to represent [a, b) intervals of code points, as used in InversionList. Any interval type should pass the isIntegralPair trait.

Grapheme
struct Grapheme

A structure designed to efficiently pack the characters of a grapheme cluster.

InversionList
struct InversionList(SP = GcPolicy)

InversionList is a set of code points represented as an array of open-right [a, b) intervals (see CodepointInterval above). The name comes from the way the representation reads left to right. For instance a set of all values [10, 50), [80, 90), plus a singular value 60 looks like this:

MatcherConcept
struct MatcherConcept

Conceptual type that outlines the common properties of all UTF Matchers.

unicode
struct unicode

A single entry point to look up Unicode code point sets by name or alias of a block, script or general category.

Templates

CodepointSetTrie
template CodepointSetTrie(sizes...)

Type of Trie generated by codepointSetTrie function.

CodepointTrie
template CodepointTrie(T, sizes...)

A slightly more general tool for building fixed Trie for the Unicode data.

codepointSetTrie
template codepointSetTrie(sizes...)

A shorthand for creating a custom multi-level fixed Trie from a CodepointSet. sizes are numbers of bits per level, with the most significant bits used first.

codepointTrie
template codepointTrie(T, sizes...)

A slightly more general tool for building fixed Trie for the Unicode data.

isCodepointSet
template isCodepointSet(T)

Tests if T is some kind of set of code points. Intended for template constraints.

isIntegralPair
template isIntegralPair(T, V = uint)

Tests if T is a pair of integers that implicitly convert to V. The following code must compile for any pair T:

Variables

lineSep
enum dchar lineSep;

Constant code point (0x2028) - line separator.

nelSep
enum dchar nelSep;

Constant code point (0x0085) - next line.

paraSep
enum dchar paraSep;

Constant code point (0x2029) - paragraph separator.

Meta

Standards

Unicode v6.2

Authors

Dmitry Olshansky