UTF8.CLS is an ooRexx package containing a public ooRexx routine called UTF8.
Note: Although this routine is distributed as part of TUTOR, The Unicode Tools Of Rexx, it can also be used separately, as it has no dependencies on the other components of TUTOR.
Tests whether string contains well-formed UTF-8 (this is the default when format has not been specified), or is a well-formed string in the format encoding. Optionally, it decodes it to a certain set of target encodings.
UTF8 works as a format encoding validator when target is omitted, and as a decoder when target is specified. It is an error to omit target and to specify a value for error_handling at the same time (that is, if target was omitted, then error_handling should be omitted too).
When UTF8 is used as a validator, it returns a boolean value indicating whether string is well-formed according to the format encoding. For example, ``UTF8(string)`` returns 1 when string contains well-formed UTF-8, and 0 when it contains ill-formed UTF-8.
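As a point of comparison, the validator behaviour for the default format can be sketched in a few lines of Python. This is not the UTF8.cls implementation; the function name `is_well_formed_utf8` is hypothetical, and the sketch simply relies on Python's strict UTF-8 decoder, which rejects overlong forms and surrogates.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True when data is well-formed UTF-8, False otherwise."""
    try:
        # The strict decoder raises on overlong forms, surrogates and stray bytes.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_well_formed_utf8("José".encode("utf-8")))  # a well-formed string
print(is_well_formed_utf8(b"\xff"))                  # "FF"X is ill-formed
```

Like the routine described here, the sketch treats the null string as well-formed.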
UTF8 always returns BYTES strings, except when it is used as a standalone routine (i.e., not in combination with ``Unicode.cls``, the RXU Rexx Preprocessor for Unicode, etc.), in which case it returns standard ooRexx strings.
UTF8 performs a verification, at initialization time, to see whether .Bytes is a .Class, and, additionally, if .Bytes subclasses .String. If both conditions are met, UTF8 returns BYTES strings; if not, it returns standard ooRexx strings.
The format argument can be omitted or specified as the null string, in which case UTF-8 is assumed, or it can be one of UTF8 (or UTF-8), UTF8Z (or UTF-8Z), WTF8 (or WTF-8), CESU8 (or CESU-8), and MUTF8 (or MUTF-8).
UTF-8 and UTF-8Z do not allow sequences containing lone surrogates. All the other formats allow lone surrogates.
To use UTF8 as a decoder, you have to specify a target encoding. This argument accepts a single encoding, or a blank-separated set of tokens.
Each token can have one of the following values: UTF8 (or UTF-8), WTF8 (or WTF-8), UTF32 (or UTF-32), WTF32 (or WTF-32).
The W- forms of the encodings allow lone surrogates, while the U- forms do not.
Duplicates, when specified, are ignored. If one of the specified encodings is a W-encoding, the rest of the encodings should also be W-encodings. If format allows lone surrogates (i.e., if it is not UTF-8 or UTF-8Z), then all the specified encodings should be W-encodings.
When several targets have been specified, a stem is returned. The stem will contain a tail for every specified encoding name (uppercased, and without dashes), and the compound variable value will be the decoded string.
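The rules for the target argument can be summarised with a small Python sketch. It is only an illustration of the rules as stated above, not code from UTF8.cls: the function name `normalize_targets` and the use of `ValueError` in place of a Rexx Syntax error are assumptions.

```python
def normalize_targets(source_format: str, target: str) -> list:
    """Return the stem tails for a blank-separated target list, or raise ValueError."""
    strict_formats = {"", "UTF8", "UTF8Z"}          # formats that forbid lone surrogates
    known_targets = {"UTF8", "WTF8", "UTF32", "WTF32"}

    tails = []
    for token in target.split():
        tail = token.upper().replace("-", "")       # uppercased, without dashes
        if tail not in known_targets:
            raise ValueError(f"unknown target encoding: {token}")
        if tail not in tails:                       # duplicates are ignored
            tails.append(tail)

    any_w = any(tail.startswith("W") for tail in tails)
    all_w = all(tail.startswith("W") for tail in tails)
    if any_w and not all_w:
        raise ValueError("cannot mix U- and W-encodings")

    source = source_format.upper().replace("-", "")
    if source not in strict_formats and not all_w:
        raise ValueError("format allows lone surrogates, so every target must be a W-encoding")
    return tails


print(normalize_targets("utf8", "UTF-8 utf32 utf8"))
```

When the returned list has more than one entry, each entry corresponds to one tail of the returned stem.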
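The optional error_handling argument determines the behaviour of the function when a decoding error is encountered. By default, the null string is returned; REPLACE substitutes the Unicode Replacement Character for each ill-formed sequence, and SYNTAX raises a Syntax error. It is an error to specify error_handling without specifying target at the same time.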
    UTF8("00"X, utf8, utf8)        -- "00"X. Validate and return UTF-8
    UTF8("00"X, utf8, wtf8)        -- "00"X. Validate and return WTF-8
    UTF8("00"X, mutf8, utf8)       -- Syntax error: MUTF-8 allows lone surrogates, but UTF-8 does not
    UTF8("00"X, mutf8, wtf8)       -- "". "00"X is ill-formed MUTF-8
    UTF8("00"X, utf8, utf8 utf32)  -- A stem s.: s.utf8 == "00"X, and s.utf32 == "0000 0000"X
    UTF8("00"X, utf8, wtf8 wtf32)  -- A stem s.: s.wtf8 == "00"X, and s.wtf32 == "0000 0000"X
    UTF8("00"X, utf8, utf8 wtf32)  -- Syntax error: cannot specify UTF-8 and WTF-32 at the same time
    UTF8("")                             -- 1 (The null string always validates)
    UTF8("ascii")                        -- 1 (Equivalent to UTF8("ascii", "UTF-8") )
    UTF8("José")                         -- 1
    UTF8("FF"X)                          -- 0 ("FF"X is ill-formed)
    UTF8("00"X)                          -- 1 (ASCII)
    UTF8("00"X, "UTF-8Z")                -- 0 (UTF-8Z encodes "00"U differently)
    UTF8("C080"X)                        -- 0 (Overlong encoding, ill-formed UTF-8)
    UTF8("C080"X, "UTF-8Z")              -- 1
    UTF8("C081"X, "UTF-8Z")              -- 0 (Only "C080" is well-formed)
    UTF8("ED A0 80"X)                    -- 0 (High surrogate)
    UTF8("ED A0 80"X,"WTF-8")            -- 1 (WTF-8 allows lone surrogates)
    UTF8("F0 9F 94 94"X)                 -- 1 ( "(Bell)"U )
    UTF8("F0 9F 94 94"X,"CESU-8")        -- 0 ( CESU-8 doesn't allow four-byte sequences... )
    UTF8("ED A0 BD ED B4 94"X,"CESU-8")  -- 1 ( ...it expects two three-byte surrogates instead)
    -- "C080" is ill-formed utf8
    UTF8("C080"X,,utf8)           -- "" (By default, UTF8 returns the null string when an error is found)
    UTF8("C080"X,,utf8, replace)  -- "EFBFBD EFBFBD"X ("EFBFBD" is the Unicode Replacement character)
                                  -- "C0"X is ill-formed, and then "80"X is ill-formed too
                                  -- That's why we get two replacement characters
    UTF8("C080"X,,utf8, syntax)   -- Syntax error 23.900:
                                  -- "Invalid UTF-8 sequence in position 1 of string: 'C0'X".
    UTF8("José",,UTF32)          -- "0000004A 0000006F 00000073 000000E9"X ("é" is "E9"U)
    UTF8("FF"X,,UTF32)           -- "" (an error)
    UTF8("FF"X,,UTF32,REPLACE)   -- "0000FFFD"X (the replacement character, "FFFD"U, in UTF-32)
    UTF8("FF"X,,UTF32,SYNTAX)    -- Raises a Syntax error
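The three error-handling behaviours shown in these examples can be mimicked in Python as follows. This is a rough model, not the routine itself: the function name `decode_utf8`, its parameters, and the use of `UnicodeDecodeError` in place of a Rexx Syntax error are assumptions. The replacement granularity is that of Python's decoder, which agrees with the two-replacement-character result shown for "C080"X but is not guaranteed to agree with UTF8.cls for every input.

```python
def decode_utf8(data: bytes, target: str = "utf-8", error_handling: str = "") -> bytes:
    """Decode UTF-8 input and re-encode it in the target encoding (hypothetical model)."""
    python_codec = {"utf-8": "utf-8", "utf-32": "utf-32-be"}[target]
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        if error_handling == "replace":
            # Each ill-formed sequence becomes U+FFFD.
            text = data.decode("utf-8", errors="replace")
        elif error_handling == "syntax":
            raise                      # propagate the error to the caller
        else:
            return b""                 # default: signal the error with an empty result
    return text.encode(python_codec)


print(decode_utf8(b"\xc0\x80", error_handling="replace").hex())
print(decode_utf8("José".encode("utf-8"), target="utf-32").hex())
```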
See The Unicode® Standard. Version 15.0 – Core Specification, p. 125:
In Table 3-7, cases where a trailing byte range is not 80..BF are shown in bold italic to draw attention to them. These exceptions to the general pattern occur only in the second byte of a sequence.
Table 3-7. Well-Formed UTF-8 Byte Sequences
Code Points | First Byte | Second Byte | Third Byte | Fourth Byte |
U+0000..U+007F | 00..7F | | | |
U+0080..U+07FF | C2..DF | 80..BF | | |
U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
U+D000..U+D7FF | ED | 80..9F | 80..BF | |
U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
Based on this table, on the first run, UTF8 will build a Finite State Machine. States will be coded into two TRANSLATE tables, stored in the .local directory.
Bytes | Mapping | Description |
00..7F | "A" | ASCII byte |
80..BF | "C" | Continuation byte |
C0..C1 | "I" | Illegal byte |
C2..DF | "20"X | Two-bytes sequence |
E0 | "3a"X | Three bytes, case (a) |
E1..EC | "3b"X | Three bytes, case (b) |
ED | "3c"X | Three bytes, case (c) |
EE..EF | "3b"X | Three bytes, case (b) |
F0 | "4a"X | Four bytes, case (a) |
F1..F3 | "4b"X | Four bytes, case (b) |
F4 | "4c"X | Four bytes, case (c) |
F5..FF | "I" | Illegal byte |
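The idea of classifying bytes through a translate table can be illustrated in Python, whose `bytes.translate` plays a role similar to Rexx TRANSLATE. The sketch below uses the class codes from the table above, but everything else (the names `CLASS_TABLE`, `SECOND_BYTE`, `is_well_formed`, and the loop that stands in for the state machine) is an assumption and not the actual UTF8.cls code, which uses two TRANSLATE tables.

```python
def build_class_table() -> bytes:
    """Build a 256-byte table mapping every byte value to its class code."""
    table = bytearray(256)
    ranges = [
        (0x00, 0x7F, ord("A")),  # ASCII byte
        (0x80, 0xBF, ord("C")),  # continuation byte
        (0xC0, 0xC1, ord("I")),  # illegal byte
        (0xC2, 0xDF, 0x20),      # two-byte sequence
        (0xE0, 0xE0, 0x3A),      # three bytes, case (a)
        (0xE1, 0xEC, 0x3B),      # three bytes, case (b)
        (0xED, 0xED, 0x3C),      # three bytes, case (c)
        (0xEE, 0xEF, 0x3B),      # three bytes, case (b)
        (0xF0, 0xF0, 0x4A),      # four bytes, case (a)
        (0xF1, 0xF3, 0x4B),      # four bytes, case (b)
        (0xF4, 0xF4, 0x4C),      # four bytes, case (c)
        (0xF5, 0xFF, ord("I")),  # illegal byte
    ]
    for first, last, code in ranges:
        for byte in range(first, last + 1):
            table[byte] = code
    return bytes(table)


CLASS_TABLE = build_class_table()

# Allowed range of the second byte for each lead-byte class (Table 3-7).
SECOND_BYTE = {
    0x20: (0x80, 0xBF), 0x3A: (0xA0, 0xBF), 0x3B: (0x80, 0xBF), 0x3C: (0x80, 0x9F),
    0x4A: (0x90, 0xBF), 0x4B: (0x80, 0xBF), 0x4C: (0x80, 0x8F),
}


def is_well_formed(data: bytes) -> bool:
    """Validate UTF-8 by walking the class codes of the input."""
    classes = data.translate(CLASS_TABLE)
    i = 0
    while i < len(data):
        code = classes[i]
        if code == ord("A"):
            i += 1
            continue
        if code not in SECOND_BYTE:      # "C" or "I" cannot start a sequence
            return False
        length = code >> 4               # high nibble of a lead class is the sequence length
        if i + length > len(data):       # truncated sequence
            return False
        low, high = SECOND_BYTE[code]
        if not low <= data[i + 1] <= high:
            return False
        if any(classes[j] != ord("C") for j in range(i + 2, i + length)):
            return False
        i += length
    return True


print("José".encode("utf-8").translate(CLASS_TABLE))
print(is_well_formed(b"\xf0\x9f\x94\x94"), is_well_formed(b"\xc0\x80"))
```

Choosing class codes whose high nibble equals the sequence length, as the table above does, lets the length be read directly from the code.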
UTF-8Z is identical to UTF-8, with only one exception: "00"U is encoded using the overlong encoding "C080"X, so that a well-formed UTF-8Z string cannot contain NULL characters. This allows the continued use of old-style string C functions, which expect strings to be terminated by a NULL character.
For UTF8Z, table 3-7 has to be modified in the following way:
Bytes | Mapping | Description |
00 | "I" | Illegal byte |
01..7F | "A" | ASCII byte |
80..BF | "C" | Continuation byte |
C0 | "0" | "C080"X -> "0000"U, error otherwise |
C1 | "I" | Illegal byte |
C2..DF | "20"X | Two-bytes sequence |
E0 | "3a"X | Three bytes, case (a) |
E1..EC | "3b"X | Three bytes, case (b) |
ED | "3c"X | Three bytes, case (c) |
EE..EF | "3b"X | Three bytes, case (b) |
F0 | "4a"X | Four bytes, case (a) |
F1..F3 | "4b"X | Four bytes, case (b) |
F4 | "4c"X | Four bytes, case (c) |
F5..FF | "I" | Illegal byte |
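The relationship between UTF-8 and UTF-8Z can be sketched in Python by treating "C080"X as a stand-in for the "00"X byte. The function names `utf8_to_utf8z` and `is_well_formed_utf8z` are hypothetical, and the approach (byte replacement followed by Python's strict decoder) is an illustration rather than the table-driven method described here.

```python
def utf8_to_utf8z(data: bytes) -> bytes:
    """Re-encode every "00"X byte of well-formed UTF-8 as the overlong "C080"X."""
    return data.replace(b"\x00", b"\xc0\x80")


def is_well_formed_utf8z(data: bytes) -> bool:
    """Return True when data is well-formed UTF-8Z."""
    if b"\x00" in data:
        return False                       # a literal "00"X byte is illegal in UTF-8Z
    try:
        # "C0"X never occurs in well-formed UTF-8, so every "C080"X is an encoded "00"U.
        data.replace(b"\xc0\x80", b"\x00").decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(utf8_to_utf8z(b"a\x00b").hex())
print(is_well_formed_utf8z(b"\xc0\x80"), is_well_formed_utf8z(b"\xc0\x81"))
```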
See The WTF-8 encoding.
For WTF-8, table 3-7 has to be modified in the following way:
Bytes | Mapping | Description |
00..7F | "A" | ASCII byte |
80..BF | "C" | Continuation byte |
C0..C1 | "I" | Illegal byte |
C2..DF | "20"X | Two-bytes sequence |
E0 | "3a"X | Three bytes, case (a) |
E1..EC | "3b"X | Three bytes, case (b) |
ED | "3d"X | Three bytes, case (d): 2nd byte in 80..9F, normal char; in A0..AF, lead surrogate; in B0..BF, trail surrogate; surrogate pair: error |
EE..EF | "3b"X | Three bytes, case (b) |
F0 | "4a"X | Four bytes, case (a) |
F1..F3 | "4b"X | Four bytes, case (b) |
F4 | "4c"X | Four bytes, case (c) |
F5..FF | "I" | Illegal byte |
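The WTF-8 rule for the "ED"X lead byte (lone surrogates are accepted, but a lead surrogate immediately followed by a trail surrogate is an error) can be approximated in Python with the `surrogatepass` error handler. The function name `is_well_formed_wtf8` is hypothetical and the sketch is not the state machine used by UTF8.cls.

```python
def is_well_formed_wtf8(data: bytes) -> bool:
    """Return True when data is well-formed WTF-8."""
    try:
        # surrogatepass lets three-byte encodings of surrogates through.
        text = data.decode("utf-8", errors="surrogatepass")
    except UnicodeDecodeError:
        return False
    for first, second in zip(text, text[1:]):
        # A lead surrogate followed by a trail surrogate should have been
        # encoded as a single four-byte sequence, so it is an error here.
        if "\ud800" <= first <= "\udbff" and "\udc00" <= second <= "\udfff":
            return False
    return True


print(is_well_formed_wtf8(b"\xed\xa0\x80"))              # lone lead surrogate
print(is_well_formed_wtf8(b"\xed\xa0\xbd\xed\xb4\x94"))  # surrogate pair
```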
See Unicode Technical Report #26. COMPATIBILITY ENCODING SCHEME FOR UTF-16: 8-BIT (CESU-8).
For CESU-8, table 3-7 has to be modified in the following way:
Bytes | Mapping | Description |
00..7F | "A" | ASCII byte |
80..BF | "C" | Continuation byte |
C0..C1 | "I" | Illegal byte |
C2..DF | "20"X | Two-bytes sequence |
E0 | "3a"X | Three bytes, case (a) |
E1..EC | "3b"X | Three bytes, case (b) |
ED | "3e"X | Three bytes, case (e) |
EE..EF | "3b"X | Three bytes, case (b) |
F0..FF | "I" | Illegal byte |
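To make the CESU-8 scheme concrete, the following Python sketch encodes a string by first splitting it into UTF-16 code units and then writing each unit as a UTF-8-style sequence, so that supplementary characters become two three-byte surrogates. The names `encode_cesu8` and `is_well_formed_cesu8` are hypothetical. The validator follows the description in this document (no four-byte sequences, lone surrogates allowed) and is not the UTF8.cls code.

```python
def encode_cesu8(text: str) -> bytes:
    """Encode text as CESU-8: one UTF-8-style sequence per UTF-16 code unit."""
    utf16 = text.encode("utf-16-be", errors="surrogatepass")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # surrogatepass allows surrogate code units to be written as three bytes.
        out += chr(unit).encode("utf-8", errors="surrogatepass")
    return bytes(out)


def is_well_formed_cesu8(data: bytes) -> bool:
    """Return True when data contains no four-byte sequences and otherwise decodes."""
    if any(byte >= 0xF0 for byte in data):
        return False                       # F0..FF are illegal in CESU-8
    try:
        data.decode("utf-8", errors="surrogatepass")
    except UnicodeDecodeError:
        return False
    return True


print(encode_cesu8("\U0001F514").hex())    # the bell character, U+1F514
```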
MUTF-8 (Modified UTF-8) is identical to CESU-8, except for the encoding of "00"U, which is the overlong sequence "C080"X.
See the Wikipedia entry about MUTF-8.
For MUTF-8, table 3-7 has to be modified in the following way:
Bytes | Mapping | Description |
00 | "I" | Illegal byte |
01..7F | "A" | ASCII byte |
80..BF | "C" | Continuation byte |
C0 | "0" | "C080"X -> "0000"U, error otherwise |
C1 | "I" | Illegal byte |
C2..DF | "20"X | Two-bytes sequence |
E0 | "3a"X | Three bytes, case (a) |
E1..EC | "3b"X | Three bytes, case (b) |
ED | "3e"X | Three bytes, case (e) |
EE..EF | "3b"X | Three bytes, case (b) |
F0..FF | "I" | Illegal byte |
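Putting the two differences from UTF-8 together, a MUTF-8 encoder can be sketched in Python as shown below. The function name `encode_mutf8` is hypothetical, and the code only illustrates the encoding rules stated above ("00"U as "C080"X, supplementary characters as two three-byte surrogates).

```python
def encode_mutf8(text: str) -> bytes:
    """Encode text as MUTF-8 (CESU-8 plus the overlong encoding of "00"U)."""
    out = bytearray()
    for char in text:
        code = ord(char)
        if code == 0:
            out += b"\xc0\x80"             # overlong form, so no "00"X byte is emitted
        elif code > 0xFFFF:
            code -= 0x10000
            lead = 0xD800 + (code >> 10)
            trail = 0xDC00 + (code & 0x3FF)
            for surrogate in (lead, trail):
                out += chr(surrogate).encode("utf-8", errors="surrogatepass")
        else:
            out += char.encode("utf-8", errors="surrogatepass")
    return bytes(out)


print(encode_mutf8("a\x00b").hex())
print(encode_mutf8("\U0001F514").hex())
```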