UTF-8 utilities



UTF8.CLS is an ooRexx package containing a public ooRexx routine called UTF8.

Documentation for the 0.5 release of the UTF8 function

Note: Although this routine is distributed as part of TUTOR, The Unicode Tools Of Rexx, it can also be used separately, as it has no dependencies on the other components of TUTOR.


[Figure: railroad diagram for the UTF8 function]

Tests whether string is well-formed in the format encoding (UTF-8 by default, when format has not been specified). Optionally, it decodes string to one or more target encodings.

UTF8 works as a format encoding validator when target is omitted, and as a decoder when target is specified. It is an error to omit target and to specify a value for error_handling at the same time (that is, if target was omitted, then error_handling should be omitted too).

When UTF8 is used as validator, it returns a boolean value, indicating if the string is well-formed according to the format encoding. For example, ``UTF8(string)`` returns 1 when string contains well-formed UTF-8, and 0 if it contains ill-formed UTF-8.

Type of the returned value(s)

UTF8 returns BYTES strings, except when it is used as a standalone routine (i.e., not in combination with ``Unicode.cls``, the RXU Rexx Preprocessor for Unicode, etc.), in which case it returns standard ooRexx strings.

UTF8 performs a verification, at initialization time, to see whether .Bytes is a .Class, and, additionally, if .Bytes subclasses .String. If both conditions are met, UTF8 returns BYTES strings; if not, it returns standard ooRexx strings.

Valid formats

The format argument can be omitted or specified as the null string, in which case UTF-8 is assumed, or it can be one of UTF8 (or UTF-8), UTF8Z (or UTF-8Z), WTF8 (or WTF-8), CESU8 (or CESU-8), and MUTF8 (or MUTF-8).

  • The UTF-8 encoding is described in The Unicode® Standard. Version 15.0 – Core Specification, pp. 124 ff.
  • UTF-8Z is identical to UTF-8, with a single exception: "00"U, in UTF-8Z, is encoded using the overlong sequence "C080"X, while in UTF-8 it is encoded as "00"X.
  • The WTF-8 encoding is described in The WTF-8 encoding. It extends UTF-8 by allowing lone surrogate codepoints, encoded as standard three-byte sequences. Surrogate pairs are not allowed: they should be encoded using four-byte sequences.
  • The CESU-8 encoding is described in the Unicode Technical Report #26. Supplementary characters (i.e., codepoints greater than "FFFF"U) are first encoded using two surrogates, as in UTF-16, and then each surrogate is encoded as in WTF-8, giving a total of six bytes. Four-byte sequences are ill-formed, and lone surrogates are admitted.
  • The MUTF-8 encoding (see the Wikipedia entry about MUTF-8) is identical to CESU-8, with a single difference: it encodes "00"U in the same way as UTF-8Z.

UTF-8 and UTF-8Z do not allow sequences containing lone surrogates. All the other formats allow lone surrogates.
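
The differences between the formats can be made concrete with a small model. The following is a Python sketch, not part of UTF8.cls; the function names are hypothetical and chosen for illustration only. It encodes a single code point using the generic UTF-8 bit layout (without rejecting surrogates, as WTF-8 does), and derives CESU-8 and MUTF-8 from it as described above.

```python
def encode_generic(cp: int) -> bytes:
    """Generic UTF-8 bit layout; surrogates are NOT rejected (as in WTF-8)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])


def encode_cesu8(cp: int) -> bytes:
    """CESU-8: a supplementary code point becomes two three-byte surrogates."""
    if cp <= 0xFFFF:
        return encode_generic(cp)
    offset = cp - 0x10000
    lead = 0xD800 + (offset >> 10)       # high (lead) surrogate
    trail = 0xDC00 + (offset & 0x3FF)    # low (trail) surrogate
    return encode_generic(lead) + encode_generic(trail)


def encode_mutf8(cp: int) -> bytes:
    """MUTF-8: CESU-8, except that U+0000 is the overlong sequence C0 80."""
    return b"\xC0\x80" if cp == 0 else encode_cesu8(cp)


print(encode_generic(0x1F514).hex(" ").upper())  # F0 9F 94 94
print(encode_cesu8(0x1F514).hex(" ").upper())    # ED A0 BD ED B4 94
print(encode_mutf8(0).hex(" ").upper())          # C0 80
```

The code point U+1F514 was chosen because the same byte sequences appear in the validation examples of this document.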

Decoding with UTF8

To use UTF8 as a decoder, you have to specify a target encoding. This argument accepts a single encoding, or a blank-separated set of tokens.

Each token can have one of the following values: UTF8 (or UTF-8), WTF8 (or WTF-8), UTF32 (or UTF-32), WTF32 (or WTF-32).

The W- forms of the encodings allow lone surrogates, while the U- forms do not.

Duplicates, when specified, are ignored. If one of the specified encodings is a W-encoding, the rest of the encodings should also be W-encodings. If format allows lone surrogates (i.e., if it is not UTF-8 or UTF-8Z), then all the specified encodings should be W-encodings.

When several targets have been specified, a stem is returned. The stem will contain a tail for every specified encoding name (uppercased, and without dashes), and the compound variable value will be the decoded string.
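
The rules for combining targets can be summarised in a few lines. The following Python sketch models them under stated assumptions: `check_targets` is a hypothetical name, the exception type and messages are illustrative (the real routine raises the Syntax conditions listed in this document), and format itself is not validated here.

```python
ALIASES = {
    "UTF8": "UTF8", "UTF-8": "UTF8", "WTF8": "WTF8", "WTF-8": "WTF8",
    "UTF32": "UTF32", "UTF-32": "UTF32", "WTF32": "WTF32", "WTF-32": "WTF32",
}
# Formats that do not allow lone surrogates ("" means the default, UTF-8).
STRICT_FORMATS = {"", "UTF8", "UTF8Z"}


def check_targets(fmt: str, target: str):
    """Return the normalised target names, or raise ValueError on a conflict."""
    names = []
    for token in target.upper().split():
        if token not in ALIASES:
            raise ValueError(f"Invalid target '{token}'")
        name = ALIASES[token]
        if name not in names:            # duplicates are ignored
            names.append(name)
    w_count = sum(1 for name in names if name.startswith("W"))
    if 0 < w_count < len(names):         # W- and U- forms cannot be mixed
        raise ValueError(f"Conflicting targets '{target}'")
    if fmt.upper().replace("-", "") not in STRICT_FORMATS and w_count < len(names):
        raise ValueError(f"Conflicting target '{target}' and format '{fmt}'")
    return names


print(check_targets("utf8", "utf8 UTF-32 utf8"))   # ['UTF8', 'UTF32']
print(check_targets("mutf8", "wtf8 wtf32"))        # ['WTF8', 'WTF32']
```

The returned names are uppercased and without dashes, which matches the tails of the stem described above.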

Error handling

The optional error_handling argument determines the behaviour of the function when a decoding error is encountered. It is an error to specify error_handling without specifying target at the same time.

  • When error_handling has the value "" (the default) or NULL, a null string is returned when a decoding error is encountered.
  • When error_handling has the value REPLACE, any ill-formed character will be replaced by the Unicode Replacement Character (``"FFFD"U``).
  • When error_handling has the value SYNTAX, a syntax condition will be raised when a decoding error is encountered.
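
For readers more familiar with other environments, the three modes have close analogues in Python's built-in UTF-8 decoder. The sketch below is an analogy only, not the UTF8 routine: `decode` is a hypothetical wrapper, `UnicodeDecodeError` stands in for the Rexx syntax condition, and Python's choice of how many bytes each replacement character covers may differ from UTF8's for inputs other than the ones shown.

```python
def decode(data: bytes, error_handling: str = "") -> str:
    """Decode UTF-8 using one of three error-handling modes."""
    mode = error_handling.upper()
    if mode == "REPLACE":
        # Each ill-formed sequence becomes U+FFFD.
        return data.decode("utf-8", errors="replace")
    if mode == "SYNTAX":
        # Raises UnicodeDecodeError on the first ill-formed sequence.
        return data.decode("utf-8", errors="strict")
    if mode in ("", "NULL"):
        # Default: signal the error by returning the null string.
        try:
            return data.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            return ""
    raise ValueError(f"Invalid error handling '{error_handling}'")


print(repr(decode(b"\xC0\x80")))             # ''
print(repr(decode(b"\xC0\x80", "replace")))  # '\ufffd\ufffd'
```

One consequence of the default mode is that a null result is ambiguous when the input may itself be the null string; callers that need to tell the two cases apart can validate first or use SYNTAX.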

Conditions

  • Syntax 93.900. Invalid option 'option'.
  • Syntax 93.900. Invalid format 'format'.
  • Syntax 93.900. Invalid target 'target'.
  • Syntax 93.900. Invalid error handling 'error_handling'.
  • Syntax 93.900. Conflicting target target and format format.
  • Syntax 23.900. Invalid format sequence in position n of string: 'hex-value'X.

Examples

Specifying format and target. Combination examples:

UTF8("00"X, utf8,  utf8)       -- "00"X. Validate and return UTF-8
UTF8("00"X, utf8,  wtf8)       -- "00"X. Validate and return WTF-8
UTF8("00"X, mutf8, utf8)       -- Syntax error: MUTF-8 allows lone surrogates, but UTF-8 does not
UTF8("00"X, mutf8, wtf8)       -- "". "00"X is ill-formed MUTF-8
UTF8("00"X, utf8,  utf8 utf32) -- A stem s.: s.utf8 == "00"X, and s.utf32 == "0000 0000"X
UTF8("00"X, utf8,  wtf8 wtf32) -- A stem s.: s.wtf8 == "00"X, and s.wtf32 == "0000 0000"X
UTF8("00"X, utf8,  utf8 wtf32) -- Syntax error: cannot specify UTF-8 and WTF-32 at the same time

Validation examples:

UTF8("")                             -- 1  (The null string always validates)
UTF8("ascii")                        -- 1  (Equivalent to UTF8("ascii", "UTF-8") )
UTF8("José")                         -- 1
UTF8("FF"X)                          -- 0  ("FF"X is ill-formed)
UTF8("00"X)                          -- 1  (ASCII)
UTF8("00"X, "UTF-8Z")                -- 0  (UTF-8Z encodes "00"U differently)
UTF8("C080"X)                        -- 0  (Overlong sequence: ill-formed UTF-8)
UTF8("C080"X, "UTF-8Z")              -- 1
UTF8("C081"X, "UTF-8Z")              -- 0  (Only "C080" is well-formed)
UTF8("ED A0 80"X)                    -- 0  (High surrogate)
UTF8("ED A0 80"X,"WTF-8")            -- 1  (WTF-8 allows surrogates)
UTF8("F0 9F 94 94"X)                 -- 1  ( "(Bell)"U )
UTF8("F0 9F 94 94"X,"CESU-8")        -- 0  ( CESU-8 doesn't allow four-byte sequences... )
UTF8("ED A0 BD ED B4 94"X,"CESU-8")  -- 1  ( ...it expects two three-byte surrogates instead)

Error handling:

                                     -- "C080" is ill-formed utf8
UTF8("C080"X,,utf8)                  -- "" (By default, UTF8 returns the null string when an error is found)
UTF8("C080"X,,utf8, replace)         -- "EFBFBD EFBFBD"X ("EFBFBD" is the Unicode Replacement character)
                                     -- "C0"X is ill-formed, and then "80"X is ill-formed too
                                     -- That's why we get two replacement characters
UTF8("C080"X,,utf8, syntax)          -- Syntax error 23.900:
                                     -- "Invalid UTF-8 sequence in position 1 of string: 'C0'X".

Conversion examples:

UTF8("José",,UTF32)                  -- "0000004A 0000006F 00000073 000000E9"X ("é" is "E9"U)
UTF8("FF"X,,UTF32)                   -- "" (an error)
UTF8("FF"X,,UTF32,REPLACE)           -- "0000FFFD"X ("FFFD"U, the Unicode Replacement Character)
UTF8("FF"X,,UTF32,SYNTAX)            -- Raises a Syntax error
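
The UTF-32 values in the conversion examples can be cross-checked independently. The following Python sketch uses only the standard codecs; `utf8_to_utf32_hex` is a hypothetical helper name, and big-endian byte order is assumed because that is the order shown in the examples.

```python
def utf8_to_utf32_hex(data: bytes) -> str:
    """Decode UTF-8 and show each code point as four big-endian bytes in hex."""
    raw = data.decode("utf-8").encode("utf-32-be")
    return " ".join(raw[i:i + 4].hex().upper() for i in range(0, len(raw), 4))


print(utf8_to_utf32_hex("José".encode("utf-8")))
# 0000004A 0000006F 00000073 000000E9
```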

Implementation notes

See The Unicode® Standard. Version 15.0 – Core Specification, p. 125:

Table 3-7. Well-Formed UTF-8 Byte Sequences

Code Points          First Byte  Second Byte  Third Byte  Fourth Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF      80..BF
U+0800..U+0FFF       E0          A0..BF*      80..BF
U+1000..U+CFFF       E1..EC      80..BF       80..BF
U+D000..U+D7FF       ED          80..9F*      80..BF
U+E000..U+FFFF       EE..EF      80..BF       80..BF
U+10000..U+3FFFF     F0          90..BF*      80..BF      80..BF
U+40000..U+FFFFF     F1..F3      80..BF       80..BF      80..BF
U+100000..U+10FFFF   F4          80..8F*      80..BF      80..BF

In Table 3-7, cases where a trailing byte range is not 80..BF are marked with an asterisk (bold italic in the original) to draw attention to them. These exceptions to the general pattern occur only in the second byte of a sequence.

Based on this table, on the first run, UTF8 will build a Finite State Machine. States will be coded into two TRANSLATE tables, stored in the .local directory.
  • The range 00..7F is mapped to "A" (for "A"SCII).
  • The range 80..BF is mapped to "C" (for "C"ontinuation characters). A few bytes will require manual checking.
  • The values C0, C1 and F5..FF are always illegal in a UTF-8 string. We add rows for these ranges, and we map the corresponding codes to "I" (for "I"llegal).
  • The range C2..DF is mapped to "20"X (the "2" in "20" reminds us that we will find a 2-byte group, if the string is well-formed).
  • The range E0..EF is mapped to the "3x"X values, "3a", "3b" and "3c". The "3" reminds us that we will find a 3-byte group, if the string is well-formed; the final "a", "b" and "c" allow us to differentiate the cases, and perform the corresponding tests.
  • Similarly, the F0..F4 range is mapped to "4a"X, "4b"X and "4c"X, as described below.

Table 3-7 (modified)

Bytes    Mapping  Description
00..7F   "A"      ASCII byte
80..BF   "C"      Continuation byte
C0..C1   "I"      Illegal byte
C2..DF   "20"X    Two-byte sequence
E0       "3a"X    Three bytes, case (a)
E1..EC   "3b"X    Three bytes, case (b)
ED       "3c"X    Three bytes, case (c)
EE..EF   "3b"X    Three bytes, case (b)
F0       "4a"X    Four bytes, case (a)
F1..F3   "4b"X    Four bytes, case (b)
F4       "4c"X    Four bytes, case (c)
F5..FF   "I"      Illegal byte
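
The classification above can be modelled outside Rexx. The following Python sketch is not the UTF8.cls code: it builds a single 256-entry table (this section does not describe how the two TRANSLATE tables are split), uses `bytes.translate` as a stand-in for the Rexx TRANSLATE built-in, and then walks the class codes, checking the second-byte ranges of Table 3-7. All names are hypothetical.

```python
def build_table() -> bytes:
    """Map every byte value to its class code, following the modified table."""
    table = bytearray(b"I" * 256)            # default: illegal (C0, C1, F5..FF)
    rows = [(0x00, 0x7F, ord("A")), (0x80, 0xBF, ord("C")), (0xC2, 0xDF, 0x20),
            (0xE0, 0xE0, 0x3A), (0xE1, 0xEC, 0x3B), (0xED, 0xED, 0x3C),
            (0xEE, 0xEF, 0x3B), (0xF0, 0xF0, 0x4A), (0xF1, 0xF3, 0x4B),
            (0xF4, 0xF4, 0x4C)]
    for first, last, code in rows:
        for value in range(first, last + 1):
            table[value] = code
    return bytes(table)


TABLE = build_table()

# Allowed range of the SECOND byte for each lead-byte class.
# Third and fourth bytes are always plain continuation bytes (80..BF).
SECOND = {0x20: (0x80, 0xBF),
          0x3A: (0xA0, 0xBF), 0x3B: (0x80, 0xBF), 0x3C: (0x80, 0x9F),
          0x4A: (0x90, 0xBF), 0x4B: (0x80, 0xBF), 0x4C: (0x80, 0x8F)}


def is_well_formed(data: bytes) -> bool:
    """Return True when data is well-formed UTF-8 according to Table 3-7."""
    classes = data.translate(TABLE)          # one class code per input byte
    i, n = 0, len(data)
    while i < n:
        code = classes[i]
        if code == ord("A"):                 # ASCII: a group of one byte
            i += 1
            continue
        if code not in SECOND:               # stray "C" or illegal "I"
            return False
        length = code >> 4                   # high nibble: 2, 3 or 4
        if i + length > n:                   # truncated group
            return False
        low, high = SECOND[code]
        if not low <= data[i + 1] <= high:
            return False
        if any(classes[j] != ord("C") for j in range(i + 2, i + length)):
            return False
        i += length
    return True


print(is_well_formed(bytes.fromhex("F09F9494")))  # True
print(is_well_formed(bytes.fromhex("C080")))      # False
```

Encoding the group length in the high nibble of the class code is what lets the loop skip ahead without a second lookup; the low nibble selects which second-byte test applies.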

Table 3-7 (modified for UTF8Z)

UTF-8Z is identical to UTF-8, with only one exception: "00"U is encoded using the overlong encoding "C080"X, so that a well-formed UTF-8Z string cannot contain "00"X bytes. This allows the continued use of old-style C string functions, which expect strings to be terminated by a NULL character.

For UTF8Z, table 3-7 has to be modified in the following way:

Bytes    Mapping  Description
00       "I"      Illegal byte
01..7F   "A"      ASCII byte
80..BF   "C"      Continuation byte
C0       "0"      "C080"X -> "0000"U, error otherwise
C1       "I"      Illegal byte
C2..DF   "20"X    Two-byte sequence
E0       "3a"X    Three bytes, case (a)
E1..EC   "3b"X    Three bytes, case (b)
ED       "3c"X    Three bytes, case (c)
EE..EF   "3b"X    Three bytes, case (b)
F0       "4a"X    Four bytes, case (a)
F1..F3   "4b"X    Four bytes, case (b)
F4       "4c"X    Four bytes, case (c)
F5..FF   "I"      Illegal byte
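
Because the two formats differ only in how "00"U is encoded, converting between them reduces to a byte substitution. The following Python sketch illustrates this; the function names are hypothetical, and both functions assume that their input is already well-formed in the source format (they do not validate it).

```python
def utf8_to_utf8z(data: bytes) -> bytes:
    """In well-formed UTF-8, a 00 byte can only be U+0000."""
    return data.replace(b"\x00", b"\xC0\x80")


def utf8z_to_utf8(data: bytes) -> bytes:
    """In well-formed UTF-8Z, C0 can only occur as the first byte of C0 80."""
    return data.replace(b"\xC0\x80", b"\x00")


sample = b"a\x00b"
encoded = utf8_to_utf8z(sample)
print(encoded.hex(" ").upper())          # 61 C0 80 62
print(b"\x00" in encoded)                # False
print(utf8z_to_utf8(encoded) == sample)  # True
```

The absence of "00"X bytes in the converted string is the property that makes it safe to pass to functions expecting a terminating NULL.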

Table 3-7 (modified for WTF-8)

See The WTF-8 encoding.

For WTF-8, table 3-7 has to be modified in the following way:

Bytes    Mapping  Description
00..7F   "A"      ASCII byte
80..BF   "C"      Continuation byte
C0..C1   "I"      Illegal byte
C2..DF   "20"X    Two-byte sequence
E0       "3a"X    Three bytes, case (a)
E1..EC   "3b"X    Three bytes, case (b)
ED       "3d"X    Three bytes, case (d): 2nd byte in 80..9F, normal char; in A0..AF, lead surrogate; in B0..BF, trail surrogate; surrogate pair: error
EE..EF   "3b"X    Three bytes, case (b)
F0       "4a"X    Four bytes, case (a)
F1..F3   "4b"X    Four bytes, case (b)
F4       "4c"X    Four bytes, case (c)
F5..FF   "I"      Illegal byte
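
Case (d) is the only row that needs to look at more than one group. The following Python sketch shows one way to express the two checks it describes; it is an illustration with hypothetical names, not the UTF8.cls logic. The pair search assumes that the input has already passed the structural checks, so that every ED byte starts a three-byte group.

```python
import re


def ed_kind(second: int) -> str:
    """Classify the byte that follows ED in a three-byte WTF-8 group."""
    if 0x80 <= second <= 0x9F:
        return "normal"          # U+D000..U+D7FF
    if 0xA0 <= second <= 0xAF:
        return "lead"            # lead (high) surrogate, U+D800..U+DBFF
    if 0xB0 <= second <= 0xBF:
        return "trail"           # trail (low) surrogate, U+DC00..U+DFFF
    return "error"               # not a continuation byte


# A lead surrogate immediately followed by a trail surrogate is ill-formed
# WTF-8: the pair should have been encoded as one four-byte sequence.
SURROGATE_PAIR = re.compile(rb"\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF]")


def has_surrogate_pair(data: bytes) -> bool:
    return SURROGATE_PAIR.search(data) is not None


print(ed_kind(0xA0))                                       # lead
print(has_surrogate_pair(bytes.fromhex("EDA080")))         # False
print(has_surrogate_pair(bytes.fromhex("EDA0BDEDB494")))   # True
```

Note that the last input is exactly the sequence that CESU-8 requires for a supplementary character, which is why the CESU-8 table uses a different case for the ED row.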

Table 3-7 (modified for CESU-8)

See Unicode Technical Report #26. COMPATIBILITY ENCODING SCHEME FOR UTF-16: 8-BIT (CESU-8).

For CESU-8, table 3-7 has to be modified in the following way:

Bytes    Mapping  Description
00..7F   "A"      ASCII byte
80..BF   "C"      Continuation byte
C0..C1   "I"      Illegal byte
C2..DF   "20"X    Two-byte sequence
E0       "3a"X    Three bytes, case (a)
E1..EC   "3b"X    Three bytes, case (b)
ED       "3e"X    Three bytes, case (e)
EE..EF   "3b"X    Three bytes, case (b)
F0..FF   "I"      Illegal byte

Table 3-7 (modified for MUTF-8)

MUTF-8 (Modified UTF-8) is identical to CESU-8, except for the encoding of "00"U, which is the overlong sequence "C080"X.

See the Wikipedia entry about MUTF-8.

For MUTF-8, table 3-7 has to be modified in the following way:


Bytes    Mapping  Description
00       "I"      Illegal byte
01..7F   "A"      ASCII byte
80..BF   "C"      Continuation byte
C0       "0"      "C080"X -> "0000"U, error otherwise
C1       "I"      Illegal byte
C2..DF   "20"X    Two-byte sequence
E0       "3a"X    Three bytes, case (a)
E1..EC   "3b"X    Three bytes, case (b)
ED       "3e"X    Three bytes, case (e)
EE..EF   "3b"X    Three bytes, case (b)
F0..FF   "I"      Illegal byte

Copyright © 1992-2025, EPBCN & Josep Maria Blasco.