/******************************************************************************
* This file is part of The Unicode Tools Of Rexx (TUTOR) *
* See https://rexx.epbcn.com/TUTOR/ *
* and https://github.com/JosepMariaBlasco/TUTOR *
* Copyright © 2023-2025 Josep Maria Blasco <josep.maria.blasco@epbcn.com> *
* License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0) *
******************************************************************************/
The Rexx Preprocessor for Unicode implements a series of new built-in functions (BIFs). Follow this link if you want to read about modifications to existing BIFs.
Returns the string converted to the BYTES format. BYTES
strings are composed of 8-bit bytes, and every character in the string
can be an arbitrary 8-bit value, including binary data. Rexx
built-in functions operate at the byte level, and no Unicode features
are available (for example, LOWER operates only on the ranges
"A".."Z" and "a".."z"). This is equivalent to
Classic Rexx strings, but with some enhancements. See the description of
the BYTES class for details.
Converts string to a CODEPOINTS string and returns it. CODEPOINTS strings are composed of Unicode codepoints, and every character in the string can be an arbitrary Unicode codepoint. The argument string has to contain well-formed UTF-8, or a Syntax error will be raised. When working with CODEPOINTS strings, Rexx built-in functions operate at the codepoint level, and can produce much richer results than when operating on BYTES strings.
Please note that CODEPOINTS, GRAPHEMES and TEXT strings are
guaranteed to contain well-formed UTF-8 sequences. To test if a string
contains well-formed UTF-8, you can use the
DECODE(string,"UTF-8") or UTF8(string)
function calls.
Returns a string, in character format, that represents string converted to Unicode codepoints.
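By default, C2U returns a list of blank-separated hexadecimal representations of the codepoints. The format argument selects a different format for the returned string:
* When format is omitted, codepoints are returned without any prefix. Codepoints larger than "FFFF"X will have their leading zeros removed, if any. Codepoints smaller than "10000"X will always have four digits (by adding zeros to the left if necessary).
* When format is "U+", each codepoint is prefixed with "U+".
* When format is "Na", each codepoint is replaced by its name or label, enclosed in parentheses: C2U("S","Na") == "(LATIN CAPITAL LETTER S)", and C2U("0A"X,"Na") == "(<control-000A>)".
Examples (assuming an ambient encoding of UTF-8):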
C2U("Sí") = "0053 00ED" -- And "0053 00ED"U == "53 C3AD"X == "Sí".
C2U("Sí","U+") = "U+0053 U+00ED" -- Again, "U+0053 U+00ED"U == "53 C3AD"X == "Sí".
C2U("Sí","Na") = "(LATIN CAPITAL LETTER S) (LATIN SMALL LETTER I WITH ACUTE)"
-- And "(LATIN CAPITAL LETTER S) (LATIN SMALL LETTER I WITH ACUTE)"U == "Sí"
C2U("Sí","UTF-32") = "0000 0053 0000 00ED"X
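The three formats shown in the examples can be approximated outside Rexx. The following Python sketch is not TUTOR code: the function name `c2u` and its format strings are assumptions that mirror the examples above, and labels for unnamed codepoints (such as control characters) are not handled.

```python
import unicodedata

def c2u(text: str, fmt: str = "") -> str:
    """Approximate C2U for the default, "U+" and "Na" formats (illustrative only)."""
    parts = []
    for ch in text:
        # At least four hexadecimal digits, and no extra leading zeros beyond that.
        code = f"{ord(ch):04X}"
        if fmt == "":
            parts.append(code)
        elif fmt == "U+":
            parts.append("U+" + code)
        elif fmt == "Na":
            # unicodedata.name raises ValueError for unnamed codepoints.
            parts.append("(" + unicodedata.name(ch) + ")")
        else:
            raise ValueError(f"Invalid option '{fmt}'")
    return " ".join(parts)

print(c2u("S\u00ED"))        # 0053 00ED
print(c2u("S\u00ED", "U+"))  # U+0053 U+00ED
print(c2u("\U0001F514"))     # 1F514
```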
Tests whether a string is encoded according to a certain encoding, and optionally decodes it to a certain format.
DECODE works as an encoding validator when format is omitted, and as a decoder when format is specified. It is an error to omit format and to specify a value for error_handling at the same time (that is, if format was omitted, then error_handling should be omitted too).
When DECODE is used as validator, it returns a boolean value,
indicating if the string is well-formed according to the specified
encoding. For example, DECODE(string,"UTF-8") returns
1 when string contains well-formed UTF-8, and
0 if it contains ill-formed UTF-8.
To use DECODE as a decoder, you have to specify a format. This argument accepts a blank-separated set of tokens. Each token can have one of the following values: UTF8, UTF-8, UTF32, or UTF-32 (duplicates are allowed and ignored). When UTF8 or UTF-8 has been specified, a UTF-8 representation of the decoded string is returned. When UTF32 or UTF-32 has been specified, a UTF-32 representation of the decoded string is returned. When both have been specified, a two-item array is returned. The first item of the array is the UTF-8 representation of the decoded string, and the second item is the UTF-32 representation.
The optional error_handling argument determines the
behaviour of the function when the format argument has been
specified. If it has the value "" (the default) or
NULL, a null string is returned when a decoding
error is encountered. If it has the value REPLACE, any
ill-formed character will be replaced by the Unicode Replacement
Character (U+FFFD). If it has the value
SYNTAX, a syntax condition will be raised when a
decoding error is encountered.
Examples:
DECODE(string, "UTF-16") -- Returns 1 if string contains proper UTF-16, and 0 otherwise
var = DECODE(string, "UTF-16", "UTF-8") -- Decodes string to the UTF-8 format. A null string is returned if string contains ill-formed UTF-16.
DECODE(string, "UTF-16",,"SYNTAX") -- The fourth argument is checked for validity and then ignored.
DECODE(string, "UTF-16",,"POTATO") -- Syntax error (Invalid option 'POTATO').
var = DECODE(string, "UTF-16", "UTF-8", "REPLACE") -- Decodes string to the UTF-8 format. Ill-formed character sequences are replaced by U+FFFD.
var = DECODE(string, "UTF-16", "UTF-8", "SYNTAX") -- Decodes string to the UTF-8 format. Any ill-formed character sequence will raise a Syntax error.
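The validator and decoder modes, together with the three error-handling options, can be modelled as follows. This is a Python sketch under stated assumptions, not the TUTOR implementation: the function `decode` is hypothetical, it only supports "UTF-8" as the target format, and it relies on Python's codec names (for example `utf-16-be`, where the byte order is made explicit).

```python
def decode(data: bytes, encoding: str, fmt: str = "", error_handling: str = ""):
    """Validate data (fmt omitted) or decode it to UTF-8 bytes (fmt == "UTF-8")."""
    option = error_handling.upper()
    if option not in ("", "NULL", "REPLACE", "SYNTAX"):
        raise ValueError(f"Invalid option '{error_handling}'")
    if fmt == "":
        # Validator mode: report well-formedness as a boolean.
        try:
            data.decode(encoding)
            return True
        except UnicodeDecodeError:
            return False
    # Decoder mode: REPLACE substitutes U+FFFD, everything else decodes strictly.
    errors = "replace" if option == "REPLACE" else "strict"
    try:
        text = data.decode(encoding, errors=errors)
    except UnicodeDecodeError:
        if option == "SYNTAX":
            raise          # surface the error to the caller
        return b""         # default: a null string signals failure
    return text.encode("utf-8")

ill_formed = b"\xD8\x00\x00\x41"  # a lone high surrogate followed by "A" (UTF-16BE)
print(decode(ill_formed, "utf-16-be"))           # validator mode
print(decode(ill_formed, "utf-16-be", "UTF-8"))  # decoder mode, default handling
```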
ENCODE first validates that string contains well-formed UTF-8. Once the string is validated, encoding is attempted using the specified encoding. ENCODE returns the encoded string, or a null string if validation or encoding failed. You can influence the behaviour of the function when an error is encountered by specifying the optional error_handling argument.
* When error_handling is not specified, is "" or is NULL (the default), a null string is returned if an error is encountered.
* When error_handling has the value SYNTAX, a Syntax error is raised if an error is encountered.
Examples:
ENCODE(string, "IBM1047") -- The encoded string, or "" if string cannot be encoded to IBM1047.
ENCODE(string, "IBM1047","SYNTAX") -- The encoded string. If the encoding fails, a Syntax error is raised.
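The two-step behaviour (validate as UTF-8, then encode) can be sketched in Python. The function `encode` below is hypothetical and only illustrates the control flow; Python's `cp037` EBCDIC codec is used as a stand-in for IBM1047, which is an assumption made for the sake of a runnable example.

```python
def encode(data: bytes, encoding: str, error_handling: str = "") -> bytes:
    """Validate data as UTF-8, then re-encode it; "" (b"") signals failure by default."""
    option = error_handling.upper()
    if option not in ("", "NULL", "SYNTAX"):
        raise ValueError(f"Invalid option '{error_handling}'")
    try:
        text = data.decode("utf-8")   # step 1: validation
        return text.encode(encoding)  # step 2: encoding
    except UnicodeError:              # covers both decode and encode failures
        if option == "SYNTAX":
            raise
        return b""

print(encode(b"A", "cp037"))                        # "A" in EBCDIC
print(encode("\u20AC".encode("utf-8"), "cp037"))    # the euro sign is not in cp037
```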
Converts string to a GRAPHEMES string and returns it. GRAPHEMES strings are composed of extended grapheme clusters, and every character in a GRAPHEMES string can be an arbitrary extended grapheme cluster. The argument string has to contain well-formed UTF-8, or a Syntax error is raised. When working with GRAPHEMES strings, Rexx built-in functions operate at the extended grapheme cluster level, and can produce much richer results than when operating with BYTES or CODEPOINTS strings.
Please note that CODEPOINTS, GRAPHEMES and TEXT strings are
guaranteed to contain well-formed UTF-8 sequences. To test if a string
contains well-formed UTF-8, you can use the
DECODE(string,"UTF-8") or UTF8(string)
function calls.
Returns the hexadecimal Unicode codepoint corresponding to name, or the null string if name does not correspond to a Unicode codepoint.
N2P accepts names, as defined in the second
column of UnicodeData.txt (that is, the Unicode "Name"
["Na"] property), like "LATIN CAPITAL LETTER F" or
"BELL"; aliases, as defined in
NameAliases.txt, like "LF" or
"FORM FEED", and labels identifying codepoints
that have no names, like "<Control-0001>" or
"<Private Use-E000>".
When specifying a name, case is ignored, as are certain
characters: spaces, medial dashes (except for the
"HANGUL JUNGSEONG O-E" codepoint) and underscores that
replace dashes. Hence, "BELL", "bell" and
"Bell" are all equivalent, as are
"LATIN CAPITAL LETTER F",
"Latin capital letter F" and
"latin_capital_letter_f".
Returned codepoints will be normalized, i.e., they will have a minimum length of four digits, and they will never start with a zero if they have more than four digits.
Examples:
N2P("LATIN CAPITAL LETTER F") = "0046" -- Padded to four digits
N2P("BELL") = "1F514" -- Not "01F514"
N2P("Potato") = "1F954" -- Unicode has "Potato" (a vegetable emoticon)..
N2P("Potatoes") = "" -- ..but no "Potatoes".
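A rough Python analogue of the lookup and of the normalization of the returned codepoint is shown below. It is a sketch, not TUTOR code: `n2p` is a hypothetical name, the lookup relies on Python's `unicodedata.lookup` (which knows names and aliases but not labels), and the loose matching of spaces, medial dashes and underscores described above is not implemented.

```python
import unicodedata

def n2p(name: str) -> str:
    """Return the normalized hexadecimal codepoint for name, or "" if unknown."""
    try:
        found = unicodedata.lookup(name)
    except KeyError:
        return ""
    if len(found) != 1:
        # lookup() may also resolve named sequences, which are not single codepoints.
        return ""
    # Minimum of four digits; longer codepoints never get a leading zero.
    return f"{ord(found):04X}"

print(n2p("LATIN CAPITAL LETTER F"))  # 0046
print(n2p("BELL"))                    # 1F514
```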
Returns the name or label corresponding to the hexadecimal Unicode codepoint argument, or the null string if the codepoint has no name or label.
The argument codepoint is first verified for validity. If it is not a valid hexadecimal number or it is out-of-range, a null string is returned. If the codepoint is found to be valid, it is then normalized: if it has fewer than four digits, zeros are added to the left, until the codepoint has exactly four digits; and if the codepoint has more than four digits, leading zeros are removed, until no more leading zeros are found or the codepoint has exactly four digits.
Once the codepoint has been validated and normalized, it is uppercased, and the Unicode Character Database is then searched for the "Name" ("Na") property.
If the codepoint has a name, that name is returned. If the
codepoint does not have a name but it has a label, like
"<control-0010>", then that label is returned. In all
other cases, the null string is returned.
Note. Labels are always enclosed between
"<" and ">" signs. This makes it easy to
distinguish them from names.
Examples:
P2N("46") = "LATIN CAPITAL LETTER F" -- Normalized to "0046"
P2N("0046") = "LATIN CAPITAL LETTER F" -- Normalized to "0046"
P2N("0000046") = "LATIN CAPITAL LETTER F" -- Normalized to "0046"
P2N("1F342") = "FALLEN LEAF" -- An emoji
P2N("0012") = "<control-0012>" -- A label, not a name
P2N("XXX") = "" -- Invalid codepoint
P2N("110000") = "" -- Out-of-range
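The validation and normalization steps can be expressed compactly. The Python sketch below is illustrative only: `normalize_codepoint` and `p2n` are hypothetical names, the name lookup uses Python's `unicodedata` module, and only control-character labels are generated (other label kinds, such as surrogates or private use, are left out).

```python
import string
import unicodedata

def normalize_codepoint(code: str) -> str:
    """Return the normalized codepoint, or "" when it is invalid or out of range."""
    if not code or any(c not in string.hexdigits for c in code):
        return ""                # not a hexadecimal number
    value = int(code, 16)
    if value > 0x10FFFF:
        return ""                # outside the Unicode codespace
    return f"{value:04X}"        # pad to four digits, drop extra leading zeros, uppercase

def p2n(code: str) -> str:
    """Return the name, or a control label, for a hexadecimal codepoint."""
    normalized = normalize_codepoint(code)
    if normalized == "":
        return ""
    ch = chr(int(normalized, 16))
    name = unicodedata.name(ch, "")
    if name:
        return name
    if unicodedata.category(ch) == "Cc":
        return f"<control-{normalized}>"
    return ""

print(p2n("0000046"))  # LATIN CAPITAL LETTER F
print(p2n("0012"))     # <control-0012>
```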
If you specify only string, it returns TEXT when string is a TEXT string, GRAPHEMES when string is a GRAPHEMES string, CODEPOINTS when string is a CODEPOINTS string, and BYTES when string is a BYTES string. If you specify type, it returns 1 when string matches the type. Otherwise, it returns 0. The following are valid types:
Converts string to a TEXT string and returns it. TEXT strings are composed of extended grapheme clusters, and every character in a TEXT string can be an arbitrary extended grapheme cluster. The argument string has to contain well-formed UTF-8, or a Syntax error is raised. When working with TEXT strings, Rexx built-in functions operate at the extended grapheme cluster level, and can produce much richer results than when operating with BYTES or CODEPOINTS strings.
Please note that CODEPOINTS, GRAPHEMES and TEXT strings are
guaranteed to contain well-formed UTF-8 sequences. To test if a string
contains well-formed UTF-8, you can use the
DECODE(string,"UTF-8") or UTF8(string)
function calls.
Returns a string, in character format, that represents
u-string converted to characters. U-string must be a
blank-separated sequence of hexadecimal codepoints, or parenthesized
codepoint names, aliases or labels (separator blanks are not needed
outside parentheses). The function will succeed if and only if an
equivalent "U-string"U string would be syntactically
correct, and will raise a Syntax error otherwise.
Function can be one of:
* toLowercase: converts string to lowercase. The mapping is based on the
UnicodeData.txt file of the Unicode Character Database
(UCD). Two exceptions to this mapping are defined in the
SpecialCasing.txt file of the UCD. One exception is due to
the fact that the mapping is not one to one:
"0130"U, LATIN CAPITAL LETTER I WITH DOT ABOVE, lowercases
to "0069 0307"U. The second exception is for
"03A3"U, GREEK CAPITAL LETTER SIGMA, which lowercases to
"03C2"U, the final sigma, only in certain contexts (i.e., when it is not in a
medial position).
* toUppercase: converts string to uppercase. The mapping is based on the
UnicodeData.txt file of the Unicode Character Database
(UCD), but a number of exceptions, defined in the
SpecialCasing.txt file of the UCD, have to be applied.
Additionally, the Iota-subscript, "0345"X, receives
special treatment.
Examples:
UNICODE("Café", toNFD) -- "Cafe" || "301"U
UNICODE("Café","isNFD") -- 0 (Since "Café" normalizes to something else)
UNICODE("Cafe" || "301"U,"isNFD") -- 1
UNICODE("Café",toUppercase) -- "CAFÉ"
UNICODE("ὈΔΥΣΣΕΎΣ"T,toLowercase) -- "ὀδυσσεύς" (note the difference between medial and final sigmas)
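For comparison, the same normalization and case-mapping behaviour can be observed with Python's standard library, which also applies the SpecialCasing rules for U+0130 and for the final sigma. This sketch is only an analogy; `to_nfd` and `is_nfd` are hypothetical helper names, not TUTOR functions.

```python
import unicodedata

def to_nfd(text: str) -> str:
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)

def is_nfd(text: str) -> bool:
    """A string is in NFD when normalizing it changes nothing."""
    return text == to_nfd(text)

cafe = "Caf\u00E9"                       # precomposed e with acute
print(to_nfd(cafe) == "Cafe\u0301")      # True: "e" followed by a combining acute accent
print(is_nfd(cafe), is_nfd("Cafe\u0301"))

# Context-sensitive lowercasing: capital sigma becomes a final sigma at the end of a word.
print("\u03A3\u0391\u03A3".lower() == "\u03C3\u03B1\u03C2")
# One-to-many lowercasing: U+0130 maps to "i" plus a combining dot above.
print("\u0130".lower() == "i\u0307")
```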
The first argument, code, must be either a UTF-32 codepoint (i.e., a four-byte BYTES string representing a 32-bit positive integer) or a hexadecimal codepoint (without the leading "U+").
The string name must be one of:
UNICODE(AA, Property,Alphabetic) -- 1 ("ª", Feminine ordinal indicator)
UNICODE(301, Property, Canonical_Combining_Class) -- 230 ("301"U, Combining acute accent)
UNICODE(C7, Property, Canonical_Decomposition_Mapping) -- "0043 0327" ("Ç", Latin capital letter C with Cedilla)
UNICODE(B8, Property, Case_Ignorable) -- 1 ("B8"U, Cedilla)
UNICODE(F8, Property, Cased) -- 1 ("ø", Latin small letter o with stroke)
UNICODE(110, Property, Changes_When_Lowercased) -- 1 ("Đ", Latin capital letter D with stroke)
UNICODE(128, Property, Changes_When_Casefolded) -- 1 ("Ĩ", Latin capital letter I with tilde)
UNICODE(222, Property, Changes_When_Casemapped) -- 1 ("Ȣ", Latin capital letter Ou)
UNICODE(105, Property, Changes_When_Titlecased) -- 1 ("ą", Latin small letter a with ogonek)
UNICODE(113, Property, Changes_When_Uppercased) -- 1 ("ē", Latin small letter e with macron)
UNICODE(340, Property, Full_Composition_Exclusion) -- 1 ("◌̀ ", Combining grave tone mark)
UNICODE(7A, Property, Lowercase) -- 1
UNICODE(7C, Property, Math) -- 1
UNICODE(41, Property, Name) -- "LATIN CAPITAL LETTER A"
UNICODE(D800, Property, Name) -- "<surrogate-D800>"
UNICODE(313, Property, NFC_Quick_Check) -- "M"
UNICODE(38C, Property, NFD_Quick_Check) -- "N"
UNICODE(CD5, Property, NFKC_Quick_Check) -- "M"
UNICODE(BC, Property, NFKD_Quick_Check) -- "N"
UNICODE(730, Property, Other_Alphabetic) -- 1
UNICODE(2071, Property, Other_Lowercase) -- 1
UNICODE(2160, Property, Other_Uppercase) -- 1
UNICODE(41, Property, Simple_Lowercase_Mapping) -- "0061"
UNICODE(61, Property, Simple_Uppercase_Mapping) -- "0041"
UNICODE(3F3, Property, Soft_Dotted) -- 1
UNICODE(102, Property, Uppercase) -- 1
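A few of these properties are also exposed by Python's `unicodedata` module, which makes it easy to cross-check values. The sketch below is an assumption-laden analogy: `prop` is a hypothetical helper that covers only three properties, and, unlike TUTOR, it returns an empty string instead of a label for unnamed codepoints.

```python
import unicodedata

def prop(code: str, name: str):
    """Look up a small subset of properties for a hexadecimal codepoint."""
    ch = chr(int(code, 16))
    if name == "Canonical_Combining_Class":
        return unicodedata.combining(ch)
    if name == "Canonical_Decomposition_Mapping":
        return unicodedata.decomposition(ch)
    if name == "Name":
        return unicodedata.name(ch, "")
    raise ValueError(f"Unsupported property '{name}'")

print(prop("301", "Canonical_Combining_Class"))       # 230
print(prop("C7", "Canonical_Decomposition_Mapping"))  # 0043 0327
print(prop("41", "Name"))                             # LATIN CAPITAL LETTER A
```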
Note: Although this routine is part of TUTOR, The Unicode Tools Of Rexx, it can also be used separately, as it has no dependencies on the other components of TUTOR.
╭───────╮ ┌────────┐ ╭───╮ ╭───╮
▸▸─┤ UTF8( ├──┤ string ├──┤ , ├─┬────────────┬─┬──────────────────┬─┬──────────────────────────┬─┤ ) ├─▸◂
╰───────╯ └────────┘ ╰───╯ │ ┌────────┐ │ │ ╭───╮ ┌────────┐ │ │ ╭───╮ ┌────────────────┐ │ ╰───╯
└─┤ format ├─┘ └─┤ , ├─┤ target ├─┘ └─┤ , ├─┤ error_handling ├─┘
└────────┘ ╰───╯ └────────┘ ╰───╯ └────────────────┘
Tests whether string contains well-formed UTF-8 (this is the default when format has not been specified), or is a well-formed string in the format encoding. Optionally, it decodes it to a certain set of target encodings.
UTF8 works as a format encoding validator when target is omitted, and as a decoder when target is specified. It is an error to omit target and to specify a value for error_handling at the same time (that is, if target was omitted, then error_handling should be omitted too).
When UTF8 is used as validator, it returns a boolean value,
indicating if the string is well-formed according to the format
encoding. For example, UTF8(string) returns
1 when string contains well-formed UTF-8, and
0 if it contains ill-formed UTF-8.
UTF8 always returns BYTES strings, except when it is used as a
standalone routine (i.e., not in combination with
Unicode.cls, the RXU Rexx Preprocessor for Unicode, etc.),
in which case it returns standard ooRexx strings.
UTF8 performs a verification, at initialization time, to see whether .Bytes is a .Class, and, additionally, if .Bytes subclasses .String. If both conditions are met, UTF8 returns BYTES strings; if not, it returns standard ooRexx strings.
The format argument can be omitted or specified as the null string, in which case UTF-8 is assumed, or it can be one of UTF8 (or UTF-8), UTF8Z (or UTF-8Z), WTF8 (or WTF-8), CESU8 (or CESU-8), and MUTF8 (or MUTF-8).
UTF-8 and UTF-8Z do not allow sequences containing lone surrogates. All the other formats allow lone surrogates.
To use UTF8 as a decoder, you have to specify a target encoding. This argument accepts a single encoding, or a blank-separated set of tokens.
Each token can have one of the following values: UTF8 (or UTF-8), WTF8 (or WTF-8), UTF32 (or UTF-32), WTF32 (or WTF-32).
The W- forms of the encodings allow lone surrogates, while the U- forms do not.
Duplicates, when specified, are ignored. If one of the specified encodings is a W-encoding, the rest of the encodings should also be W-encodings. If format allows lone surrogates (i.e., if it is not UTF-8 or UTF-8Z), then all the specified encodings should be W-encodings.
When several targets have been specified, a stem is returned. The stem will contain a tail for every specified encoding name (uppercased, and without dashes), and the compound variable value will be the decoded string.
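The optional error_handling argument determines the behaviour of the function when a decoding error is encountered. It is an error to specify error_handling without specifying target at the same time.
* When error_handling is omitted (the default), a null string is returned when a decoding error is encountered.
* When error_handling has the value REPLACE, any ill-formed character is replaced by the Unicode Replacement Character ("FFFD"U).
* When error_handling has the value SYNTAX, a Syntax error is raised when a decoding error is encountered.
Specifying format and target. Combination examples: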
UTF8("00"X, utf8, utf8) -- "00"X. Validate and return UTF-8
UTF8("00"X, utf8, wtf8) -- "00"X. Validate and return WTF-8
UTF8("00"X, mutf8, utf8) -- Syntax error: MUTF-8 allows lone surrogates, but UTF-8 does not
UTF8("00"X, mutf8, wtf8) -- "". "00"X is ill-formed MUTF-8
UTF8("00"X, utf8, utf8 utf32) -- A stem s.: s.utf8 == "00"X, and s.utf32 == "0000 0000"X
UTF8("00"X, utf8, wtf8 wtf32) -- A stem s.: s.wtf8 == "00"X, and s.wtf32 == "0000 0000"X
UTF8("00"X, utf8, utf8 wtf32) -- Syntax error: cannot specify UTF-8 and WTF-32 at the same time
Validation examples:
UTF8("") -- 1 (The null string always validates)
UTF8("ascii") -- 1 (Equivalent to UTF8("ascii", "UTF-8") )
UTF8("José") -- 1
UTF8("FF"X) -- 0 ("FF"X is ill-formed)
UTF8("00"X) -- 1 (ASCII)
UTF8("00"X, "UTF-8Z") -- 0 (UTF-8Z encodes "00"U differently)
UTF8("C080"X) -- 0 (Overlong encoding: "C0"X is an illegal byte in UTF-8)
UTF8("C080"X, "UTF-8Z") -- 1
UTF8("C081"X, "UTF-8Z") -- 0 (Only "C080" is well-formed)
UTF8("ED A0 80"X) -- 0 (High surrogate)
UTF8("ED A0 80"X,"WTF-8") -- 1 (WTF-8 allows lone surrogates)
UTF8("F0 9F 94 94"X) -- 1 ( "(Bell)"U )
UTF8("F0 9F 94 94"X,"CESU-8") -- 0 ( CESU-8 doesn't allow four-byte sequences... )
UTF8("ED A0 BD ED B4 94"X,"CESU-8") -- 1 ( ...it expects two three-byte surrogates instead)
Error handling:
-- "C080" is ill-formed utf8
UTF8("C080"X,,utf8) -- "" (By default, UTF8 returns the null string when an error is found)
UTF8("C080"X,,utf8, replace) -- "EFBFBD EFBFBD"X ("EFBFBD" is the Unicode Replacement character)
-- "C0"X is ill-formed, and then "80"X is ill-formed too
-- That's why we get two replacement characters
UTF8("C080"X,,utf8, syntax) -- Syntax error 23.900:
-- "Invalid UTF-8 sequence in position 1 of string: 'C0'X".
Conversion examples:
UTF8("José",,UTF32) -- "0000004A 0000006F 00000073 000000E9"X ("é" is "E9"U)
UTF8("FF"X,,UTF32) -- "" (an error)
UTF8("FF"X,,UTF32,REPLACE) -- "0000FFFD"X (the replacement character, "FFFD"U, in UTF-32)
UTF8("FF"X,,UTF32,SYNTAX) -- Raises a Syntax error
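The error-handling and conversion examples can be reproduced with Python's codecs. This is a sketch under assumptions: `utf8_to_utf32` is a hypothetical function, big-endian UTF-32 without a byte order mark is assumed because that is what the examples above show, and Python's own replacement policy is used for ill-formed input.

```python
def utf8_to_utf32(data: bytes, error_handling: str = "") -> bytes:
    """Decode UTF-8 and return big-endian UTF-32; b"" signals an error by default."""
    option = error_handling.upper()
    errors = "replace" if option == "REPLACE" else "strict"
    try:
        text = data.decode("utf-8", errors=errors)
    except UnicodeDecodeError:
        if option == "SYNTAX":
            raise
        return b""
    return text.encode("utf-32-be")

print(utf8_to_utf32("Jos\u00E9".encode("utf-8")).hex())  # four bytes per codepoint
print(utf8_to_utf32(b"\xFF", "REPLACE").hex())           # 0000fffd

# "C0"X and "80"X are each ill-formed on their own, so two replacement characters result.
print(b"\xC0\x80".decode("utf-8", errors="replace").encode("utf-8").hex())
```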
See The Unicode® Standard. Version 15.0 – Core Specification, p. 125:
Table 3-7. Well-Formed UTF-8 Byte Sequences
| Code Points | First Byte | Second Byte | Third Byte | Fourth Byte |
|---|---|---|---|---|
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | ***A0..BF*** | 80..BF | |
| U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
| U+D000..U+D7FF | ED | ***80..9F*** | 80..BF | |
| U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | ***90..BF*** | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | ***80..8F*** | 80..BF | 80..BF |
In Table 3-7, cases where a trailing byte range is not 80..BF are shown in bold italic to draw attention to them. These exceptions to the general pattern occur only in the second byte of a sequence.
Based on this table, on the first run, UTF8 will build a Finite State Machine. States will be coded into two TRANSLATE tables, stored in the .local directory.
| Bytes | Mapping | Description |
|---|---|---|
| 00..7F | "A" | ASCII byte |
| 80..BF | "C" | Continuation byte |
| C0..C1 | "I" | Illegal byte |
| C2..DF | "20"X | Two-bytes sequence |
| E0 | "3a"X | Three bytes, case (a) |
| E1..EC | "3b"X | Three bytes, case (b) |
| ED | "3c"X | Three bytes, case (c) |
| EE..EF | "3b"X | Three bytes, case (b) |
| F0 | "4a"X | Four bytes, case (a) |
| F1..F3 | "4b"X | Four bytes, case (b) |
| F4 | "4c"X | Four bytes, case (c) |
| F5..FF | "I" | Illegal byte |
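The idea of classifying every byte through a translation table and then walking the classes can be sketched in Python, where `bytes.translate` plays a role similar to Rexx's TRANSLATE. This is a minimal illustration built from the table above, not the TUTOR source: the function names are hypothetical, and a single table plus a small dictionary of second-byte ranges is used instead of the two tables mentioned in the text.

```python
ASCII, CONT, ILLEGAL = ord("A"), ord("C"), ord("I")

def build_class_table() -> bytes:
    """Build a 256-entry table that maps each byte to its class."""
    table = bytearray(256)
    ranges = [
        (0x00, 0x7F, ASCII), (0x80, 0xBF, CONT), (0xC0, 0xC1, ILLEGAL),
        (0xC2, 0xDF, 0x20), (0xE0, 0xE0, 0x3A), (0xE1, 0xEC, 0x3B),
        (0xED, 0xED, 0x3C), (0xEE, 0xEF, 0x3B), (0xF0, 0xF0, 0x4A),
        (0xF1, 0xF3, 0x4B), (0xF4, 0xF4, 0x4C), (0xF5, 0xFF, ILLEGAL),
    ]
    for low, high, byte_class in ranges:
        for byte in range(low, high + 1):
            table[byte] = byte_class
    return bytes(table)

CLASS_TABLE = build_class_table()

# Allowed range for the second byte, per lead-byte class (Table 3-7).
SECOND_BYTE = {
    0x20: (0x80, 0xBF), 0x3A: (0xA0, 0xBF), 0x3B: (0x80, 0xBF), 0x3C: (0x80, 0x9F),
    0x4A: (0x90, 0xBF), 0x4B: (0x80, 0xBF), 0x4C: (0x80, 0x8F),
}

def is_well_formed_utf8(data: bytes) -> bool:
    classes = data.translate(CLASS_TABLE)  # classify every byte in one pass
    i, n = 0, len(data)
    while i < n:
        byte_class = classes[i]
        if byte_class == ASCII:
            i += 1
            continue
        if byte_class not in SECOND_BYTE:  # stray continuation or illegal byte
            return False
        length = byte_class >> 4           # 0x20 -> 2, 0x3A -> 3, 0x4A -> 4
        if i + length > n:                 # truncated sequence
            return False
        low, high = SECOND_BYTE[byte_class]
        if not low <= data[i + 1] <= high:
            return False
        if any(not 0x80 <= b <= 0xBF for b in data[i + 2:i + length]):
            return False
        i += length
    return True

print(is_well_formed_utf8(b"\xF0\x9F\x94\x94"))  # True: U+1F514
print(is_well_formed_utf8(b"\xED\xA0\x80"))      # False: a surrogate
```

Encoding the sequence length in the high nibble of the class value keeps the walker free of per-class branches; only the second byte needs a class-specific range check.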
UTF-8Z is identical to UTF-8, with only one exception: "00"U is encoded using the overlong encoding "C080"X, so that a well-formed UTF-8Z string cannot contain NULL characters. This allows the continued use of old-style C string functions, which expect strings to be terminated by a NULL character.
For UTF8Z, table 3-7 has to be modified in the following way:
| Bytes | Mapping | Description |
|---|---|---|
| 00 | "I" | Illegal byte |
| 01..7F | "A" | ASCII byte |
| 80..BF | "C" | Continuation byte |
| C0 | "0" | "C080"X -> "0000"U, error otherwise |
| C1 | "I" | Illegal byte |
| C2..DF | "20"X | Two-bytes sequence |
| E0 | "3a"X | Three bytes, case (a) |
| E1..EC | "3b"X | Three bytes, case (b) |
| ED | "3c"X | Three bytes, case (c) |
| EE..EF | "3b"X | Three bytes, case (b) |
| F0 | "4a"X | Four bytes, case (a) |
| F1..F3 | "4b"X | Four bytes, case (b) |
| F4 | "4c"X | Four bytes, case (c) |
| F5..FF | "I" | Illegal byte |
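Because "C0"X can never be a continuation byte, UTF-8Z validation can also be expressed as a reduction to plain UTF-8 validation. The Python sketch below takes that shortcut instead of the table-driven approach described here; `is_utf8z` is a hypothetical name and Python's strict decoder stands in for the validator.

```python
def is_utf8(data: bytes) -> bool:
    """Well-formedness check using Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def is_utf8z(data: bytes) -> bool:
    """UTF-8Z: no literal "00"X bytes, and "C080"X stands for U+0000."""
    if b"\x00" in data:
        return False
    # Map the overlong NUL back to a plain NUL, then validate as UTF-8.
    return is_utf8(data.replace(b"\xC0\x80", b"\x00"))

print(is_utf8z(b"\x00"))      # False
print(is_utf8z(b"\xC0\x80"))  # True
print(is_utf8z(b"\xC0\x81"))  # False
```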
See The WTF-8 encoding.
For WTF-8, table 3-7 has to be modified in the following way:
| Bytes | Mapping | Description |
|---|---|---|
| 00..7F | "A" | ASCII byte |
| 80..BF | "C" | Continuation byte |
| C0..C1 | "I" | Illegal byte |
| C2..DF | "20"X | Two-bytes sequence |
| E0 | "3a"X | Three bytes, case (a) |
| E1..EC | "3b"X | Three bytes, case (b) |
| ED | "3d"X | Three bytes, case (d): 2nd byte in 80..9F, normal char; in A0..AF, lead surrogate; in B0..BF, trail surrogate; surrogate pair: error |
| EE..EF | "3b"X | Three bytes, case (b) |
| F0 | "4a"X | Four bytes, case (a) |
| F1..F3 | "4b"X | Four bytes, case (b) |
| F4 | "4c"X | Four bytes, case (c) |
| F5..FF | "I" | Illegal byte |
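The rule for the "ED"X lead byte (lone surrogates are accepted, encoded surrogate pairs are not) can be illustrated with Python's `surrogatepass` error handler. This is a hedged sketch rather than the TUTOR state machine; `is_wtf8` is a hypothetical name.

```python
def is_wtf8(data: bytes) -> bool:
    """Accept UTF-8 plus lone surrogates; reject encoded surrogate pairs."""
    try:
        # surrogatepass lets three-byte surrogate encodings through as lone code points.
        text = data.decode("utf-8", errors="surrogatepass")
    except UnicodeDecodeError:
        return False
    for first, second in zip(text, text[1:]):
        # A lead surrogate directly followed by a trail surrogate is a pair: an error.
        if "\ud800" <= first <= "\udbff" and "\udc00" <= second <= "\udfff":
            return False
    return True

print(is_wtf8(bytes.fromhex("EDA080")))        # True: lone lead surrogate
print(is_wtf8(bytes.fromhex("EDA0BDEDB494")))  # False: surrogate pair
```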
See Unicode Technical Report #26. COMPATIBILITY ENCODING SCHEME FOR UTF-16: 8-BIT (CESU-8).
For CESU-8, table 3-7 has to be modified in the following way:
| Bytes | Mapping | Description |
|---|---|---|
| 00..7F | "A" | ASCII byte |
| 80..BF | "C" | Continuation byte |
| C0..C1 | "I" | Illegal byte |
| C2..DF | "20"X | Two-bytes sequence |
| E0 | "3a"X | Three bytes, case (a) |
| E1..EC | "3b"X | Three bytes, case (b) |
| ED | "3e"X | Three bytes, case (e) |
| EE..EF | "3b"X | Three bytes, case (b) |
| F0..FF | "I" | Illegal byte |
MUTF-8 (Modified UTF-8) is identical to CESU-8, except for the encoding of "00"U, which is the overlong sequence "C080"X.
See the Wikipedia entry about MUTF-8.
For MUTF-8, table 3-7 has to be modified in the following way:
| Bytes | Mapping | Description |
|---|---|---|
| 00 | "I" | Illegal byte |
| 01..7F | "A" | ASCII byte |
| 80..BF | "C" | Continuation byte |
| C0 | "0" | "C080"X -> "0000"U, error otherwise |
| C1 | "I" | Illegal byte |
| C2..DF | "20"X | Two-bytes sequence |
| E0 | "3a"X | Three bytes, case (a) |
| E1..EC | "3b"X | Three bytes, case (b) |
| ED | "3e"X | Three bytes, case (e) |
| EE..EF | "3b"X | Three bytes, case (b) |
| F0..FF | "I" | Illegal byte |
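To make the difference between these two encodings and UTF-8 concrete, the following Python sketch encodes text as CESU-8 and MUTF-8. It is illustrative only and not part of TUTOR; `cesu8_encode` and `mutf8_encode` are hypothetical names.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode supplementary characters as two three-byte surrogates (CESU-8)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000
            lead = 0xD800 + (cp >> 10)     # high (lead) surrogate
            trail = 0xDC00 + (cp & 0x3FF)  # low (trail) surrogate
            out += chr(lead).encode("utf-8", "surrogatepass")
            out += chr(trail).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)

def mutf8_encode(text: str) -> bytes:
    """MUTF-8: CESU-8, with U+0000 written as the overlong sequence C0 80."""
    return cesu8_encode(text).replace(b"\x00", b"\xC0\x80")

print(cesu8_encode("\U0001F514").hex())  # eda0bdedb494, the BELL emoji
print(mutf8_encode("a\x00b"))
```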