Unicode

Description

static class CodeX.Unicode

This helper class provides low-level functionality for working with the Unicode standard.

Within the scope of this class, the term "character" refers to a 32-bit signed integer value and is interpreted as follows:

Equal to -1:
This value indicates that a character is invalid, undefined or missing.
Less than -1:
A sequence of UTF-16 code units that represents a valid surrogate pair, see CU.
Greater than or equal to 0 and less than SurrogateFirst, or greater than SurrogateLast but not greater than 65535:
A single non-surrogate UTF-16 code unit, which is also a valid Unicode code point.
Greater than or equal to SurrogateFirst and less than or equal to SurrogateLast:
A surrogate UTF-16 code unit, which is not a valid Unicode code point.
Greater than 65535 and less than CodePoints:
A single UTF-32 code unit, which is also a valid Unicode code point.
Greater than or equal to CodePoints:
All values in that range are defined as invalid.
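
The ranges above can be sketched as a small classifier. This is an illustrative Python re-statement of the list, not the library's code: the function name `classify` and the returned labels are hypothetical, and the sketch does not check whether a value below -1 actually holds a well-formed pair.

```python
CODE_POINTS = 0x110000      # 1114112, mirrors the CodePoints constant
SURROGATE_FIRST = 0xD800    # mirrors SurrogateFirst
SURROGATE_LAST = 0xDFFF     # mirrors SurrogateLast


def classify(character: int) -> str:
    """Return a (hypothetical) label for the range a 32-bit signed character falls into."""
    if character == -1:
        return "missing"            # invalid, undefined or missing
    if character < -1:
        return "utf16-pair"         # packed UTF-16 surrogate pair
    if SURROGATE_FIRST <= character <= SURROGATE_LAST:
        return "surrogate"          # single surrogate code unit, not a code point
    if character <= 0xFFFF:
        return "bmp"                # single non-surrogate UTF-16 code unit
    if character < CODE_POINTS:
        return "supplementary"      # single UTF-32 code unit above 65535
    return "out-of-range"           # greater than or equal to CodePoints
```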

The CU and CP methods can be used to convert a character between its representation as a Unicode code point and a sequence of UTF-16 code units. Use the C² method to extract a character from a string, taking into account surrogate pairs. Use the C¹ method for the reverse direction.

Public / Constants

CodePoints

public constant CodePoints → (1114112:int32)

The number of code points in the Unicode code space (0x110000), which spans U+0000 to U+10FFFF.

SurrogateFirst

public constant SurrogateFirst → (0xD800:int32)

The first UTF-16 code unit for surrogates.

SurrogateLast

public constant SurrogateLast → (0xDFFF:int32)

The last UTF-16 code unit for surrogates.

Public / Methods

C

2 overloads

public static method C¹ → (1)

character ⁱⁿ : int32: The character.
returns → string: The string, or null iff character ⁱⁿ is invalid.

Constructs a string for the given character.

[Pure]
public static method C² → (2)

str ⁱⁿ : string: The string value.
index ^opt : int32 = 0: The index into str ⁱⁿ.
returns → int32: The UTF-16 code unit sequence that captures the single code unit or (possibly invalid) surrogate pair at index ^opt. Will be -1 if str ⁱⁿ is null or index ^opt is out of range.

Constructs a UTF-16 code unit sequence from the given string position.

CP

[Pure]
public static method CP → (1)

character ⁱⁿ : int32: The character to convert.
returns → int32: The Unicode code point or -1 iff character ⁱⁿ is invalid or represents a surrogate code unit.

Converts the given character ⁱⁿ to a Unicode code point.

See also: Unicode.CPS

CPS

[Pure]
public static method CPS → (1)

character ⁱⁿ : int32: The character to convert.
returns → int32: The Unicode code point, UTF-16 surrogate code unit or -1 iff character ⁱⁿ is invalid.

Converts the given character ⁱⁿ to a Unicode code point or a UTF-16 surrogate code unit.

See also: Unicode.CP
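
The difference between CP and CPS can be sketched in Python as follows. This is a minimal illustration under assumptions, not the library's implementation: the function names `cp` and `cps` and the helper `_decode` are hypothetical, and the packed pair layout (high surrogate in the lower 16 bits, low surrogate in the upper 16 bits) is taken from the CU description.

```python
def _decode(character: int, keep_surrogates: bool) -> int:
    """Shared helper (hypothetical): map a character to a code point or -1."""
    if character < -1:
        # Negative values other than -1 hold a packed UTF-16 surrogate pair.
        bits = character & 0xFFFFFFFF
        high, low = bits & 0xFFFF, bits >> 16
        if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        return -1  # assumption: a malformed pair counts as invalid
    if character == -1 or character >= 0x110000:
        return -1
    if 0xD800 <= character <= 0xDFFF:
        # A single surrogate code unit is not a code point.
        return character if keep_surrogates else -1
    return character


def cp(character: int) -> int:
    """Sketch of CP: surrogate code units yield -1."""
    return _decode(character, keep_surrogates=False)


def cps(character: int) -> int:
    """Sketch of CPS: surrogate code units are passed through."""
    return _decode(character, keep_surrogates=True)
```

For example, the packed pair 0xDE00D83D (read as a signed 32-bit value) holds the surrogates D83D and DE00, which both sketches map to the code point 0x1F600.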

CU

[Pure]
public static method CU → (1)

character ⁱⁿ : int32: The character to convert.
returns → int32: The UTF-16 code unit sequence, encoded as follows:
1 code unit:
0x0000FFFF : single non-surrogate unit
0xFFFF0000 : not used, set to zero
2 code units:
0x0000FFFF : high-surrogate code unit
0xFFFF0000 : low-surrogate code unit
Will be -1 only if character ⁱⁿ is invalid or a single surrogate code unit. Otherwise, will be negative only if 2 code units are used to form a valid surrogate pair. A non-negative value always represents a valid code point.

Converts the given character ⁱⁿ to a UTF-16 code unit sequence.

See also: Unicode.CUS

CUS

[Pure]
public static method CUS → (1)

character ⁱⁿ : int32: The character to convert.
returns → int32: The UTF-16 code unit sequence, encoded as follows:
1 code unit:
0x0000FFFF : single code unit
0xFFFF0000 : not used, set to zero
2 code units:
0x0000FFFF : high-surrogate code unit
0xFFFF0000 : low-surrogate code unit
Will be -1 only if character ⁱⁿ is invalid. Otherwise, will be negative only if 2 code units are used to form a valid surrogate pair. A non-negative value always represents either a valid code point or a single surrogate code unit.

Converts the given character ⁱⁿ to a UTF-16 code unit sequence.

See also: Unicode.CU

ISO_8859_1_Encode

[Pure]
public static method ISO_8859_1_Encode → (1)

character ⁱⁿ : int32: The character to convert.
returns → int32: The converted ISO-8859-1 code unit or -1 iff character ⁱⁿ cannot be represented in that encoding.

Converts the given character to an ISO-8859-1 code unit.
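
Since ISO-8859-1 covers exactly the code points U+0000 to U+00FF with identical numeric values, the conversion reduces to a range check. A minimal Python sketch, assuming that every other character value (including packed surrogate pairs) is unrepresentable; the function name is hypothetical:

```python
def iso_8859_1_encode(character: int) -> int:
    """Return the ISO-8859-1 code unit for a character, or -1 if it has none."""
    # ISO-8859-1 code units coincide with the first 256 Unicode code points.
    return character if 0 <= character <= 0xFF else -1
```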

S

[Pure]
public static method S → (1)

codeUnit ⁱⁿ : int32: The UTF-16 code unit to check.
returns → bool: true if codeUnit ⁱⁿ is a surrogate, false if not.

Checks whether the given UTF-16 code unit is a surrogate (high or low).

SH

[Pure]
public static method SH → (1)

codeUnit ⁱⁿ : int32: The UTF-16 code unit to check.
returns → bool: true if codeUnit ⁱⁿ is a high surrogate, false if not.

Checks whether the given UTF-16 code unit is a high surrogate.

SL

[Pure]
public static method SL → (1)

codeUnit ⁱⁿ : int32: The UTF-16 code unit to check.
returns → bool: true if codeUnit ⁱⁿ is a low surrogate, false if not.

Checks whether the given UTF-16 code unit is a low surrogate.

SP

[Pure]
public static method SP → (1)

codeUnits ⁱⁿ : int32: The UTF-16 code unit sequence to check.
returns → bool: true if codeUnits ⁱⁿ is a valid surrogate pair, false if not.

Checks whether the given UTF-16 code unit sequence is a valid surrogate pair.
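
The four surrogate checks (S, SH, SL and SP) amount to range tests. The following Python sketch is an illustration, not the library's code: the function names are hypothetical, the high and low surrogate ranges (0xD800 to 0xDBFF and 0xDC00 to 0xDFFF) come from the Unicode standard, and the pair layout is the one described for CU.

```python
def s(code_unit: int) -> bool:
    """Any surrogate: SurrogateFirst..SurrogateLast."""
    return 0xD800 <= code_unit <= 0xDFFF


def sh(code_unit: int) -> bool:
    """High (leading) surrogate."""
    return 0xD800 <= code_unit <= 0xDBFF


def sl(code_unit: int) -> bool:
    """Low (trailing) surrogate."""
    return 0xDC00 <= code_unit <= 0xDFFF


def sp(code_units: int) -> bool:
    """Packed pair: high surrogate in the lower 16 bits, low surrogate in the upper 16 bits."""
    bits = code_units & 0xFFFFFFFF  # view the signed int32 as a bit pattern
    return sh(bits & 0xFFFF) and sl(bits >> 16)
```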

UTF_8_Decode

[Pure]
public static method UTF_8_Decode → (1)

codeUnits ⁱⁿ : int32: The UTF-8 code unit sequence, as returned by UTF_8_Encode.
returns → int32: The code point or -1 iff invalid.

Converts the given UTF-8 code unit sequence to a code point.
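
Decoding a packed sequence can be sketched in Python as follows. This is an illustration under assumptions, not the library's implementation: the function name is hypothetical, the byte layout (first code unit in the lowest byte) is taken from the UTF_8_Encode description, and the sketch assumes that overlong forms, encoded surrogates and non-zero unused bytes all count as invalid. Validation is delegated to Python's strict UTF-8 decoder.

```python
def utf_8_decode(code_units: int) -> int:
    """Return the code point for a packed UTF-8 sequence, or -1 if invalid."""
    # View the signed int32 as four bytes, first code unit first.
    data = (code_units & 0xFFFFFFFF).to_bytes(4, "little")
    first = data[0]
    if first < 0x80:
        count = 1
    elif 0xC2 <= first <= 0xDF:
        count = 2
    elif 0xE0 <= first <= 0xEF:
        count = 3
    elif 0xF0 <= first <= 0xF4:
        count = 4
    else:
        return -1  # continuation byte or byte that never starts a sequence
    if any(data[count:]):
        return -1  # assumption: unused bytes must be zero
    try:
        return ord(data[:count].decode("utf-8"))
    except UnicodeDecodeError:
        return -1
```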

UTF_8_DecodeCount

[Pure]
public static method UTF_8_DecodeCount → (1)

codeUnit ⁱⁿ : int32: The first UTF-8 code unit in the sequence.
returns → int32: The number of code units in the sequence. Will be 0 iff codeUnit ⁱⁿ is invalid.

Returns the number of code units in a sequence that starts with the given code unit.
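
The sequence length follows from the first code unit alone. A minimal Python sketch, assuming the lead-byte ranges of well-formed UTF-8 as given in the Unicode standard (so 0xC0, 0xC1 and 0xF5 to 0xFF are rejected, which the library may or may not do); the function name is hypothetical:

```python
def utf_8_decode_count(code_unit: int) -> int:
    """Return the sequence length implied by a UTF-8 lead byte, or 0 if invalid."""
    if 0x00 <= code_unit <= 0x7F:
        return 1  # ASCII
    if 0xC2 <= code_unit <= 0xDF:
        return 2
    if 0xE0 <= code_unit <= 0xEF:
        return 3
    if 0xF0 <= code_unit <= 0xF4:
        return 4
    return 0      # continuation byte (0x80..0xBF) or never-valid lead byte
```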

UTF_8_Encode

[Pure]
public static method UTF_8_Encode → (1)

codePoint ⁱⁿ : int32: The code point.
returns → int32: The UTF-8 code unit sequence, encoded as follows:
1 code unit:
0x000000FF : first code unit
0xFFFFFF00 : not used, set to zero
2 code units:
0x000000FF : first code unit
0x0000FF00 : second code unit
0xFFFF0000 : not used, set to zero
3 code units:
0x000000FF : first code unit
0x0000FF00 : second code unit
0x00FF0000 : third code unit
0xFF000000 : not used, set to zero
4 code units:
0x000000FF : first code unit
0x0000FF00 : second code unit
0x00FF0000 : third code unit
0xFF000000 : fourth code unit
Will be -1 iff codePoint ⁱⁿ is invalid.

Converts the given code point to a UTF-8 code unit sequence.
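
The packing scheme above can be sketched in Python as follows. This is an illustration, not the library's code: the function name is hypothetical, and surrogate code units are assumed to be invalid input. The encoding itself is delegated to Python's built-in UTF-8 codec; the sketch only shows how the bytes land in the 32-bit result, with the first code unit in the lowest byte. A 4-unit sequence places a continuation byte (0x80 to 0xBF) in the top byte, so its packed value is negative as a signed int32, but it can never equal -1 because 0xFF is not a valid UTF-8 code unit.

```python
def utf_8_encode(code_point: int) -> int:
    """Return the packed UTF-8 sequence for a code point, or -1 if invalid."""
    if not 0 <= code_point < 0x110000 or 0xD800 <= code_point <= 0xDFFF:
        return -1
    data = chr(code_point).encode("utf-8")   # 1 to 4 bytes
    bits = int.from_bytes(data, "little")    # first code unit -> lowest byte
    # Reinterpret the 32-bit pattern as a signed int32.
    return bits - (1 << 32) if bits & 0x80000000 else bits
```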

UTF_8_EncodeCount

[Pure]
public static method UTF_8_EncodeCount → (1)

codePoint ⁱⁿ : int32: The code point.
returns → int32: The number of UTF-8 code units. Will be 0 iff codePoint ⁱⁿ is invalid.

Returns the number of UTF-8 code units for the given code point.
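
The count depends only on the range that the code point falls into. A minimal Python sketch, assuming that surrogate code units and values outside 0 to CodePoints - 1 count as invalid; the function name is hypothetical:

```python
def utf_8_encode_count(code_point: int) -> int:
    """Return the number of UTF-8 code units for a code point, or 0 if invalid."""
    if not 0 <= code_point < 0x110000 or 0xD800 <= code_point <= 0xDFFF:
        return 0
    if code_point < 0x80:
        return 1  # U+0000..U+007F
    if code_point < 0x800:
        return 2  # U+0080..U+07FF
    if code_point < 0x10000:
        return 3  # U+0800..U+FFFF
    return 4      # U+10000..U+10FFFF
```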
