Unicode

Description

static class CodeX.Unicode

This helper class provides low-level functionality for using the Unicode standard.

Within the scope of this class, the term "character" refers to a 32-bit signed integer value and is interpreted as follows:

  1. Equal to -1:
    This value indicates that a character is invalid, undefined or missing.

  2. Less than -1:
    A packed sequence of two UTF-16 code units that represents a valid surrogate pair; see CU for the encoding.

  3. Less than SurrogateFirst or greater than SurrogateLast, but not greater than 65535:
    A single non-surrogate UTF-16 code unit, which is also a valid Unicode code point.

  4. Greater than or equal to SurrogateFirst and less than or equal to SurrogateLast:
    A surrogate UTF-16 code unit, which is not a valid Unicode code point.

  5. Greater than 65535 and less than CodePoints:
    A single UTF-32 code unit, which is also a valid Unicode code point.

  6. Greater than or equal to CodePoints:
    All values in that range are defined as invalid.
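
The six ranges above can be sketched as a small classifier. This is only an illustration in Python, not the class's implementation; the function name `classify` and the returned labels are hypothetical, while the three constants mirror CodePoints, SurrogateFirst and SurrogateLast.

```python
# Constants mirroring the documented values of the class.
SURROGATE_FIRST = 0xD800  # SurrogateFirst
SURROGATE_LAST = 0xDFFF   # SurrogateLast
CODE_POINTS = 0x110000    # CodePoints (1114112)


def classify(character: int) -> str:
    """Name the documented range that a 32-bit signed character value falls into.

    Hypothetical helper for illustration; the labels are not part of the API.
    """
    if character == -1:
        return "invalid"                  # range 1
    if character < -1:
        return "surrogate pair"           # range 2 (packed UTF-16 sequence)
    if SURROGATE_FIRST <= character <= SURROGATE_LAST:
        return "surrogate code unit"      # range 4
    if character <= 0xFFFF:
        return "single UTF-16 code unit"  # range 3
    if character < CODE_POINTS:
        return "single UTF-32 code unit"  # range 5
    return "invalid"                      # range 6
```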

The CU and CP methods convert a character between its representation as a Unicode code point and as a sequence of UTF-16 code units. Use the C2 method to extract a character from a string, taking surrogate pairs into account. Use the C1 method for the reverse direction, that is, to construct a string from a character.

Public / Constants

CodePoints


public constant CodePoints → (1114112:int32)

The number of code points in the Unicode codespace (0x110000). Valid code points range from 0 to CodePoints - 1.

SurrogateFirst


public constant SurrogateFirst → (0xD800:int32)

The first UTF-16 surrogate code unit, which is the lowest high surrogate.

SurrogateLast


public constant SurrogateLast → (0xDFFF:int32)

The last UTF-16 surrogate code unit, which is the highest low surrogate.

Public / Methods

C

2 overloads


public static method C1 → (1)

character in : int32

The character.

returns → string

The string, or null iff character in is invalid.

Constructs a string for the given character.


[Pure]
public static method C2 → (2)

str in : string

The string value.

index opt : int32 = 0

The index into str in.

returns → int32

The UTF-16 code unit sequence that captures the single code unit or (possibly invalid) surrogate pair at index opt. Will be -1 if str in is null or index opt is out of range.

Constructs a UTF-16 code unit sequence from the given string position.

CP


[Pure]
public static method CP → (1)

character in : int32

The character to convert.

returns → int32

The Unicode code point, or -1 iff character in is invalid or represents a surrogate code unit.

Converts the given character in to a Unicode code point.
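
As an illustration of this conversion, the following Python sketch decodes a character into a code point. It assumes the packed layout documented for CU (high surrogate in the low 16 bits, low surrogate in the high 16 bits). The function name `cp_sketch` is hypothetical, and the actual implementation may differ in detail.

```python
def cp_sketch(character: int) -> int:
    """Hypothetical sketch: convert a 32-bit signed character to a code point."""
    if character == -1:
        return -1  # invalid, undefined or missing
    if character < -1:
        # Packed UTF-16 sequence: reinterpret the signed value as 32 raw bits.
        bits = character & 0xFFFFFFFF
        high = bits & 0xFFFF  # low 16 bits hold the high surrogate
        low = bits >> 16      # high 16 bits hold the low surrogate
        if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
            return -1         # not a valid surrogate pair
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    if 0xD800 <= character <= 0xDFFF:
        return -1  # a single surrogate is not a code point
    if character >= 0x110000:
        return -1  # beyond the Unicode codespace
    return character
```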

See also

Unicode.CPS

CPS


[Pure]
public static method CPS → (1)

character in : int32

The character to convert.

returns → int32

The Unicode code point or UTF-16 surrogate code unit, or -1 iff character in is invalid.

Converts the given character in to a Unicode code point or a UTF-16 surrogate code unit.

See also

Unicode.CP

CU


[Pure]
public static method CU → (1)

character in : int32

The character to convert.

returns → int32

The UTF-16 code unit sequence, encoded as follows:

  1 code unit:
    0x0000FFFF : single non-surrogate code unit
    0xFFFF0000 : not used, set to zero

  2 code units:
    0x0000FFFF : high-surrogate code unit
    0xFFFF0000 : low-surrogate code unit

Will be -1 only if character in is invalid or a single surrogate. Otherwise, will be negative only if 2 code units are used to form a valid surrogate pair. A non-negative value always represents a valid code point.

Converts the given character in to a UTF-16 code unit sequence.
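
The packed layout described above can be illustrated with a short Python sketch. It covers only inputs that are plain code points; how the real method treats input that is already a packed sequence is not shown. The names `to_int32` and `cu_sketch` are hypothetical.

```python
def to_int32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def cu_sketch(code_point: int) -> int:
    """Hypothetical sketch: pack a code point into a UTF-16 code unit sequence."""
    if code_point < 0 or code_point > 0x10FFFF:
        return -1  # outside the Unicode codespace
    if 0xD800 <= code_point <= 0xDFFF:
        return -1  # a single surrogate is rejected
    if code_point <= 0xFFFF:
        return code_point  # one code unit, upper 16 bits stay zero
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)   # high surrogate goes into the low 16 bits
    low = 0xDC00 | (offset & 0x3FF)  # low surrogate goes into the high 16 bits
    return to_int32((low << 16) | high)
```

Because every low surrogate has its top bit set, a packed pair is always negative when read as a signed 32-bit value, which is consistent with the sign rules stated above.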

See also

Unicode.CUS

CUS


[Pure]
public static method CUS → (1)

character in : int32

The character to convert.

returns → int32

The UTF-16 code unit sequence, encoded as follows:

  1 code unit:
    0x0000FFFF : single code unit
    0xFFFF0000 : not used, set to zero

  2 code units:
    0x0000FFFF : high-surrogate code unit
    0xFFFF0000 : low-surrogate code unit

Will be -1 only if character in is invalid. Otherwise, will be negative only if 2 code units are used to form a valid surrogate pair. A non-negative value always represents either a valid code point or a single surrogate code unit.

Converts the given character in to a UTF-16 code unit sequence.

See also

Unicode.CU

ISO_8859_1_Encode


[Pure]
public static method ISO_8859_1_Encode → (1)

character in : int32

The character to convert.

returns → int32

The converted ISO-8859-1 code unit, or -1 iff character in cannot be represented in that encoding.

Converts the given character to an ISO-8859-1 code unit.
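
A minimal Python sketch of this conversion follows. It assumes the common mapping in which all 256 ISO-8859-1 byte values correspond one-to-one to the code points U+0000 to U+00FF; whether the method treats the control ranges differently is not stated here. The function name `iso_8859_1_encode_sketch` is hypothetical.

```python
def iso_8859_1_encode_sketch(character: int) -> int:
    """Hypothetical sketch: map a character to an ISO-8859-1 code unit or -1."""
    # Assumption: code points 0x00..0xFF map to the identical byte value.
    # Everything else, including -1 and packed surrogate pairs, has no mapping.
    return character if 0 <= character <= 0xFF else -1
```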

S


[Pure]
public static method S → (1)

codeUnit in : int32

The UTF-16 code unit to check.

returns → bool

true if codeUnit in is a surrogate,
false if not.

Checks whether the given UTF-16 code unit is a surrogate (high or low).

SH


[Pure]
public static method SH → (1)

codeUnit in : int32

The UTF-16 code unit to check.

returns → bool

true if codeUnit in is a high surrogate,
false if not.

Checks whether the given UTF-16 code unit is a high surrogate.

SL


[Pure]
public static method SL → (1)

codeUnit in : int32

The UTF-16 code unit to check.

returns → bool

true if codeUnit in is a low surrogate,
false if not.

Checks whether the given UTF-16 code unit is a low surrogate.
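
The three surrogate checks S, SH and SL amount to range tests on the code unit. The Python sketch below uses the ranges defined by the Unicode standard (high surrogates 0xD800 to 0xDBFF, low surrogates 0xDC00 to 0xDFFF); the function names are hypothetical stand-ins for the methods.

```python
def is_surrogate(code_unit: int) -> bool:
    """Hypothetical stand-in for S: any surrogate, high or low."""
    return 0xD800 <= code_unit <= 0xDFFF


def is_high_surrogate(code_unit: int) -> bool:
    """Hypothetical stand-in for SH: first half of a surrogate pair."""
    return 0xD800 <= code_unit <= 0xDBFF


def is_low_surrogate(code_unit: int) -> bool:
    """Hypothetical stand-in for SL: second half of a surrogate pair."""
    return 0xDC00 <= code_unit <= 0xDFFF
```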

SP


[Pure]
public static method SP → (1)

codeUnits in : int32

The UTF-16 code unit sequence to check.

returns → bool

true if codeUnits in is a valid surrogate pair,
false if not.

Checks whether the given UTF-16 code unit sequence is a valid surrogate pair.
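
As a sketch of this check, the following Python function splits a packed sequence into its two halves, assuming the layout documented for CU (high surrogate in the low 16 bits, low surrogate in the high 16 bits). The function name `is_surrogate_pair` is hypothetical.

```python
def is_surrogate_pair(code_units: int) -> bool:
    """Hypothetical stand-in for SP: validate a packed UTF-16 sequence."""
    bits = code_units & 0xFFFFFFFF  # view the signed value as 32 raw bits
    high = bits & 0xFFFF            # expected: high surrogate
    low = bits >> 16                # expected: low surrogate
    return 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
```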

UTF_8_Decode


[Pure]
public static method UTF_8_Decode → (1)

codeUnits in : int32

The UTF-8 code unit sequence, as returned by UTF_8_Encode.

returns → int32

The code point, or -1 iff codeUnits in is invalid.

Converts the given UTF-8 code unit sequence to a code point.
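
The decoding step can be sketched in Python by unpacking the integer into bytes and handing them to a strict UTF-8 decoder. This assumes the packed layout documented for UTF_8_Encode (first code unit in the lowest byte, unused bytes set to zero). The function name `utf8_decode_sketch` is hypothetical, and the exact validation rules of the real method may differ.

```python
def utf8_decode_sketch(code_units: int) -> int:
    """Hypothetical sketch: decode a packed UTF-8 sequence to a code point."""
    if code_units == -1:
        return -1
    bits = code_units & 0xFFFFFFFF      # view the signed value as 32 raw bits
    data = bits.to_bytes(4, "little")   # first code unit is the lowest byte
    # Drop the unused zero bytes, but keep one byte so U+0000 still decodes.
    data = data.rstrip(b"\x00") or b"\x00"
    try:
        text = data.decode("utf-8")     # strict: rejects malformed sequences
    except UnicodeDecodeError:
        return -1
    # A well-formed packed value holds exactly one code point.
    return ord(text) if len(text) == 1 else -1
```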

UTF_8_DecodeCount


[Pure]
public static method UTF_8_DecodeCount → (1)

codeUnit in : int32

The first UTF-8 code unit in the sequence.

returns → int32

The number of code units in the sequence. Will be 0 iff codeUnit in is invalid.

Returns the number of code units in a sequence that starts with the given code unit.
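
The sequence length follows from the bit pattern of the first code unit. The Python sketch below uses the lead byte ranges of well-formed UTF-8 from the Unicode standard; that the method also rejects 0xC0, 0xC1 and 0xF5 to 0xFF is an assumption. The function name `utf8_decode_count_sketch` is hypothetical.

```python
def utf8_decode_count_sketch(code_unit: int) -> int:
    """Hypothetical sketch: sequence length implied by a UTF-8 lead byte."""
    if 0x00 <= code_unit <= 0x7F:
        return 1  # ASCII, single code unit
    if 0xC2 <= code_unit <= 0xDF:
        return 2  # 0xC0 and 0xC1 would only form overlong encodings
    if 0xE0 <= code_unit <= 0xEF:
        return 3
    if 0xF0 <= code_unit <= 0xF4:
        return 4  # 0xF5 and above would exceed U+10FFFF
    return 0      # continuation byte or otherwise invalid lead byte
```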

UTF_8_Encode


[Pure]
public static method UTF_8_Encode → (1)

codePoint in : int32

The code point.

returns → int32

The UTF-8 code unit sequence, encoded as follows:

  1 code unit:
    0x000000FF : first code unit
    0xFFFFFF00 : not used, set to zero

  2 code units:
    0x000000FF : first code unit
    0x0000FF00 : second code unit
    0xFFFF0000 : not used, set to zero

  3 code units:
    0x000000FF : first code unit
    0x0000FF00 : second code unit
    0x00FF0000 : third code unit
    0xFF000000 : not used, set to zero

  4 code units:
    0x000000FF : first code unit
    0x0000FF00 : second code unit
    0x00FF0000 : third code unit
    0xFF000000 : fourth code unit

Will be -1 iff codePoint in is invalid.

Converts the given code point to a UTF-8 code unit sequence.
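
The packing described above can be illustrated in Python by encoding the code point with the standard UTF-8 codec and reading the bytes as a little-endian integer. The sketch assumes that surrogate code points count as invalid input, which is not stated explicitly here. The function name `utf8_encode_sketch` is hypothetical.

```python
def utf8_encode_sketch(code_point: int) -> int:
    """Hypothetical sketch: pack the UTF-8 encoding of a code point into an int32."""
    if code_point < 0 or code_point > 0x10FFFF:
        return -1  # outside the Unicode codespace
    if 0xD800 <= code_point <= 0xDFFF:
        return -1  # assumption: surrogates are treated as invalid
    data = chr(code_point).encode("utf-8")
    # Little-endian: the first code unit lands in the lowest byte.
    packed = int.from_bytes(data, "little")
    # Reinterpret as signed 32-bit; 4-unit sequences set the top bit.
    return packed - 0x100000000 if packed & 0x80000000 else packed
```

A sequence of 4 code units always yields a negative value under this layout, because the fourth code unit is a continuation byte with its top bit set. It can never equal -1, since 0xFF is not a valid UTF-8 code unit.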

UTF_8_EncodeCount


[Pure]
public static method UTF_8_EncodeCount → (1)

codePoint in : int32

The code point.

returns → int32

The number of UTF-8 code units. Will be 0 iff codePoint in is invalid.

Returns the number of UTF-8 code units for the given code point.
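
The count depends only on the magnitude of the code point. The Python sketch below uses the standard UTF-8 range boundaries and assumes that surrogate code points count as invalid; the function name `utf8_encode_count_sketch` is hypothetical.

```python
def utf8_encode_count_sketch(code_point: int) -> int:
    """Hypothetical sketch: number of UTF-8 code units for a code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        return 0  # outside the Unicode codespace
    if 0xD800 <= code_point <= 0xDFFF:
        return 0  # assumption: surrogates are treated as invalid
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4
```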