std.encoding

Classes and functions for handling and transcoding between various encodings.

For cases where the encoding is known at compile-time, functions are provided for encoding and decoding individual characters, transcoding between strings of different types, and validating and sanitizing strings.

Encodings currently supported are UTF-8, UTF-16, UTF-32, ASCII, ISO-8859-1 (also known as LATIN-1), ISO-8859-2 (LATIN-2), WINDOWS-1250, WINDOWS-1251 and WINDOWS-1252.
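As an illustration of the compile-time functions, the following is a minimal sketch in which the encoding is selected by the element type of each string. The variable names are arbitrary; only isValid, transcode and Latin1String come from this module.

```d
import std.encoding;

void main()
{
    // char[] data is treated as UTF-8; 'é' (U+00E9) takes two code units.
    string utf8 = "caf\u00E9";
    assert(isValid(utf8));
    assert(utf8.length == 5);

    // UTF-8 to UTF-16: the destination is an out parameter.
    wstring utf16;
    transcode(utf8, utf16);
    assert(utf16 == "caf\u00E9"w);

    // UTF-8 to ISO-8859-1, where 'é' is the single byte 0xE9.
    Latin1String latin1;
    transcode(utf8, latin1);
    assert(latin1.length == 4);
    assert(cast(ubyte) latin1[3] == 0xE9);
}
```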


Category	Functions
Decode	codePoints decode decodeReverse safeDecode
Conversion	codeUnits sanitize transcode
Classification	canEncode isValid isValidCodePoint isValidCodeUnit
BOM	BOM BOMSeq getBOM utfBOM
Length & Index	firstSequence encodedLength index lastSequence validLength
Encoding schemes	encodingName EncodingScheme EncodingSchemeASCII EncodingSchemeLatin1 EncodingSchemeLatin2 EncodingSchemeUtf16Native EncodingSchemeUtf32Native EncodingSchemeUtf8 EncodingSchemeWindows1250 EncodingSchemeWindows1251 EncodingSchemeWindows1252
Representation	AsciiChar AsciiString Latin1Char Latin1String Latin2Char Latin2String Windows1250Char Windows1250String Windows1251Char Windows1251String Windows1252Char Windows1252String
Exceptions	INVALID_SEQUENCE EncodingException

For cases where the encoding is not known at compile-time, but is known at run-time, the abstract class EncodingScheme and its subclasses are provided. A run-time encoder/decoder is constructed by name, for example:

auto e = EncodingScheme.create("utf-8");

This library supplies EncodingScheme subclasses for ASCII, ISO-8859-1 (also known as LATIN-1), ISO-8859-2 (LATIN-2), WINDOWS-1250, WINDOWS-1251, WINDOWS-1252, UTF-8, and (on little-endian architectures) UTF-16LE and UTF-32LE; or (on big-endian architectures) UTF-16BE and UTF-32BE.
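A run-time scheme works on raw ubyte buffers rather than typed strings. The sketch below assumes the "utf-8" scheme and uses a deliberately made-up name, "no-such-encoding", to show the failure path; the buffer size and variable names are illustrative.

```d
import std.encoding;

void main()
{
    // Look up a scheme by name at run time.
    EncodingScheme e = EncodingScheme.create("utf-8");

    // Encode U+20AC (EURO SIGN) into a caller-supplied byte buffer.
    ubyte[4] buffer;
    size_t n = e.encode('\u20AC', buffer[]);
    assert(n == 3);
    assert(buffer[0] == 0xE2 && buffer[1] == 0x82 && buffer[2] == 0xAC);

    // decode takes the slice by ref and advances it past the decoded sequence.
    const(ubyte)[] input = buffer[0 .. n];
    dchar c = e.decode(input);
    assert(c == '\u20AC');
    assert(input.length == 0);

    // An unregistered name (hypothetical here) raises EncodingException.
    bool threw = false;
    try
    {
        EncodingScheme.create("no-such-encoding");
    }
    catch (EncodingException)
    {
        threw = true;
    }
    assert(threw);
}
```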

This library provides a mechanism whereby other modules may add EncodingScheme subclasses for any other encoding.

Members

Aliases

AsciiString alias AsciiString = immutable(AsciiChar)[]: Defines an ASCII-encoded string (as an array of immutable(AsciiChar)).
BOMSeq alias BOMSeq = Tuple!(BOM, "schema", ubyte[], "sequence"): The type stored inside bomTable.
Latin1String alias Latin1String = immutable(Latin1Char)[]: Defines a Latin1-encoded string (as an array of immutable(Latin1Char)).
Latin2String alias Latin2String = immutable(Latin2Char)[]: Defines a Latin2-encoded string (as an array of immutable(Latin2Char)).
Windows1250String alias Windows1250String = immutable(Windows1250Char)[]: Defines a Windows1250-encoded string (as an array of immutable(Windows1250Char)).
Windows1251String alias Windows1251String = immutable(Windows1251Char)[]: Defines a Windows1251-encoded string (as an array of immutable(Windows1251Char)).
Windows1252String alias Windows1252String = immutable(Windows1252Char)[]: Defines a Windows1252-encoded string (as an array of immutable(Windows1252Char)).
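Because each string alias is an array of a one-byte character type, raw bytes from a file or socket can be reinterpreted as one of these strings with a cast. The following sketch assumes the bytes are already known to be ISO-8859-1; the sample data and variable names are illustrative.

```d
import std.encoding;

void main()
{
    // Raw ISO-8859-1 bytes for "né" (0xE9 is 'é' in Latin-1).
    immutable(ubyte)[] raw = [0x6E, 0xE9];

    // Reinterpret the bytes as a Latin1String; the cast does not copy or convert.
    Latin1String latin1 = cast(Latin1String) raw;
    assert(isValid(latin1));

    // Convert to UTF-8, where 'é' becomes a two-unit sequence.
    string utf8;
    transcode(latin1, utf8);
    assert(utf8 == "n\u00E9");
}
```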

Classes

EncodingException class EncodingException: The base class for exceptions thrown by this module
EncodingScheme class EncodingScheme: Abstract base class of all encoding schemes
EncodingSchemeASCII class EncodingSchemeASCII: EncodingScheme to handle ASCII
EncodingSchemeLatin1 class EncodingSchemeLatin1: EncodingScheme to handle Latin-1
EncodingSchemeLatin2 class EncodingSchemeLatin2: EncodingScheme to handle Latin-2
EncodingSchemeUtf16Native class EncodingSchemeUtf16Native: EncodingScheme to handle UTF-16 in native byte order
EncodingSchemeUtf32Native class EncodingSchemeUtf32Native: EncodingScheme to handle UTF-32 in native byte order
EncodingSchemeUtf8 class EncodingSchemeUtf8: EncodingScheme to handle UTF-8
EncodingSchemeWindows1250 class EncodingSchemeWindows1250: EncodingScheme to handle Windows-1250
EncodingSchemeWindows1251 class EncodingSchemeWindows1251: EncodingScheme to handle Windows-1251
EncodingSchemeWindows1252 class EncodingSchemeWindows1252: EncodingScheme to handle Windows-1252

Enums

AsciiChar enum AsciiChar: Defines an ASCII-encoded character.
BOM enum BOM: Definitions of common Byte Order Marks. The elements of the enum can be used as indices into bomTable to get the matching BOMSeq.
Latin1Char enum Latin1Char: Defines a Latin1-encoded character.
Latin2Char enum Latin2Char: Defines a Latin2-encoded character.
Windows1250Char enum Windows1250Char: Defines a Windows1250-encoded character.
Windows1251Char enum Windows1251Char: Defines a Windows1251-encoded character.
Windows1252Char enum Windows1252Char: Defines a Windows1252-encoded character.
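The character enums exist so that the encoding can be deduced from the type of a value. A short sketch of how the classification functions depend on that type:

```d
import std.encoding;

void main()
{
    // 0xA0 lies outside the 7-bit ASCII range.
    assert(!isValidCodeUnit(cast(AsciiChar) 0xA0));
    // 0x80 is an assigned code unit in Windows-1250.
    assert( isValidCodeUnit(cast(Windows1250Char) 0x80));

    // U+00A0 (NO-BREAK SPACE) exists in Latin-1 but not in ASCII.
    assert(!canEncode!AsciiChar('\u00A0'));
    assert( canEncode!Latin1Char('\u00A0'));
    assert( canEncode!Latin2Char('A'));
}
```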

Functions

canEncode bool canEncode(dchar c): Returns true iff it is possible to represent the specified code point in the encoding.
codePoints CodePoints!(E) codePoints(immutable(E)[] s): Returns a foreachable struct which can bidirectionally iterate over all code points in a string.
codeUnits CodeUnits!(E) codeUnits(dchar c): Returns a foreachable struct which can bidirectionally iterate over all code units in a code point.
decode dchar decode(S s): Decodes a single code point.
decodeReverse dchar decodeReverse(const(E)[] s): Decodes a single code point from the end of a string.
encode E[] encode(dchar c): Encodes a single code point.
encode size_t encode(dchar c, E[] array): Encodes a single code point into an array.
encode void encode(dchar c, void delegate(E) dg): Encodes a single code point to a delegate.
encode size_t encode(Src[] s, R range): Encodes the contents of s in units of type Tgt, writing the result to an output range.
encodedLength size_t encodedLength(dchar c): Returns the number of code units required to encode a single code point.
firstSequence size_t firstSequence(const(E)[] s): Returns the length of the first encoded sequence.
getBOM immutable(BOMSeq) getBOM(Range input): Returns a BOMSeq for a given input. If no BOM is present, the BOMSeq for BOM.none is returned. The BOM sequence at the beginning of the range will not be consumed from the passed range. If you pass a reference-type range, make sure that save creates a deep copy.
index ptrdiff_t index(const(E)[] s, int n): Returns the array index at which the (n+1)th code point begins.
isValid bool isValid(const(E)[] s): Returns true if the string is encoded correctly.
isValidCodePoint bool isValidCodePoint(dchar c): Returns true if c is a valid code point.
isValidCodeUnit bool isValidCodeUnit(E c): Returns true if the code unit is legal. For example, the byte 0x80 would not be legal in ASCII, because ASCII code units must always be in the range 0x00 to 0x7F.
lastSequence size_t lastSequence(const(E)[] s): Returns the length of the last encoded sequence.
safeDecode dchar safeDecode(S s): Decodes a single code point. The input does not have to be valid.
sanitize immutable(E)[] sanitize(immutable(E)[] s): Sanitizes a string by replacing malformed code unit sequences with valid code unit sequences. The result is guaranteed to be valid for this encoding.
transcode void transcode(Src[] s, Dst[] r): Convert a string from one encoding to another.
validLength size_t validLength(const(E)[] s): Returns the length of the longest possible substring, starting from the first code unit, which is validly encoded.
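The functions above can be combined as in the following sketch, which uses a UTF-8 string throughout. Lengths and indices are counted in code units, not code points; the variable names are illustrative.

```d
import std.encoding;

void main()
{
    string s = "\u20AC100";             // EURO SIGN followed by "100"

    // Length and index helpers count code units.
    assert(firstSequence(s) == 3);       // U+20AC occupies three UTF-8 code units
    assert(index(s, 1) == 3);            // the second code point starts at offset 3
    assert(encodedLength!char('\u20AC') == 3);

    // decode takes its argument by ref and consumes the decoded sequence.
    string rest = s;
    dchar first = decode(rest);
    assert(first == '\u20AC');
    assert(rest == "100");

    // encode returns the code units for a single code point.
    char[] units = encode!char('\u20AC');
    assert(units.length == 3);

    // codePoints iterates a valid string one code point at a time.
    size_t count;
    foreach (c; codePoints(s))
        ++count;
    assert(count == 4);

    // sanitize replaces a malformed sequence with U+FFFD (EF BF BD in UTF-8).
    assert(sanitize("hello \xF0\x80world") == "hello \xEF\xBF\xBDworld");
}
```

Note that decode requires valid, non-empty input; for untrusted data, safeDecode or sanitize is the safer starting point.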

Properties

encodingName string encodingName [@property getter]: Returns the name of an encoding.
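encodingName is parameterized on the character type rather than taking a run-time argument. A brief sketch of the names it returns for some of the built-in types:

```d
import std.encoding;

void main()
{
    assert(encodingName!char == "UTF-8");
    assert(encodingName!wchar == "UTF-16");
    assert(encodingName!dchar == "UTF-32");
    assert(encodingName!AsciiChar == "ASCII");
    assert(encodingName!Latin1Char == "ISO-8859-1");
    assert(encodingName!Windows1252Char == "windows-1252");
}
```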

Variables

INVALID_SEQUENCE enum dchar INVALID_SEQUENCE;: Special value returned by safeDecode
bomTable auto bomTable;: Mapping of a byte sequence to Byte Order Mark (BOM)
utfBOM enum dchar utfBOM;: Constant defining a fully decoded BOM
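The following sketch shows INVALID_SEQUENCE and bomTable in use. The byte arrays are made-up sample data; the schema and sequence field names come from the BOMSeq definition above.

```d
import std.encoding;

void main()
{
    // 0xFF can never appear in UTF-8, so safeDecode reports INVALID_SEQUENCE.
    string bad = "\xFFabc";
    dchar c = safeDecode(bad);
    assert(c == INVALID_SEQUENCE);

    // A UTF-8 byte order mark (EF BB BF) followed by the bytes of "hi".
    ubyte[] data = [0xEF, 0xBB, 0xBF, 0x68, 0x69];
    auto entry = getBOM(data);
    assert(entry.schema == BOM.utf8);
    assert(entry.sequence.length == 3);

    // bomTable is indexed by BOM; each entry carries the raw byte sequence.
    assert(bomTable[BOM.utf8].sequence == entry.sequence);

    // getBOM does not consume the BOM from the input.
    assert(data.length == 5);

    // Input without a BOM yields the BOM.none entry.
    ubyte[] plain = [0x68, 0x69];
    assert(getBOM(plain).schema == BOM.none);
}
```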
