New built-in functions


/******************************************************************************
 * This file is part of The Unicode Tools Of Rexx (TUTOR)                     *
 * See https://rexx.epbcn.com/TUTOR/                                          *
 *     and https://github.com/JosepMariaBlasco/TUTOR                          *
 * Copyright © 2023-2025 Josep Maria Blasco <josep.maria.blasco@epbcn.com>    *
 * License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)  *
 ******************************************************************************/

The Rexx Preprocessor for Unicode implements a series of new built-in functions (BIFs). Modifications to existing BIFs are described separately.

BYTES

Diagram for the BYTES BIF

Returns the string converted to the BYTES format. BYTES strings are composed of 8-bit bytes, and every character in the string can be an arbitrary 8-bit value, including binary data. Rexx built-in functions operate at the byte level, and no Unicode features are available (for example, LOWER operates only on the ranges "A".."Z" and "a".."z"). This is equivalent to Classic Rexx strings, but with some enhancements. See the description of the BYTES class for details.

CODEPOINTS

Diagram for the CODEPOINTS BIF

Converts string to a CODEPOINTS string and returns it. CODEPOINTS strings are composed of Unicode codepoints, and every character in the string can be an arbitrary Unicode codepoint. The argument string has to contain well-formed UTF-8, or a Syntax error will be raised. When working with CODEPOINTS strings, Rexx built-in functions operate at the codepoint level, and can produce much richer results than when operating on BYTES strings.

Please note that CODEPOINTS, GRAPHEMES and TEXT strings are guaranteed to contain well-formed UTF-8 sequences. To test if a string contains well-formed UTF-8, you can use the DECODE(string,"UTF-8") or UTF8(string) function calls.

C2U (Character to Unicode)

Diagram for the C2U BIF

Returns a string, in character format, that represents string converted to Unicode codepoints.

By default, C2U returns a list of blank-separated hexadecimal representations of the codepoints. The format argument selects a different format for the returned string:

  • When format is the null string or CODES (the default), C2U returns a list of blank-separated hexadecimal codepoints. Codepoints larger than "FFFF"X will have their leading zeros removed, if any. Codepoints smaller than "10000"X will always have four digits (by adding zeros to the left if necessary).
  • When format is U+, a list of hexadecimal codepoints is returned. Each codepoint is prefixed with the characters "U+".
  • When format is NAMES, each codepoint is substituted by its corresponding name or label, between parentheses. For example, C2U("S","NAMES") == "(LATIN CAPITAL LETTER S)", and C2U("0A"X,"NAMES") == "(<control-000A>)".
  • When format is UTF-32, a UTF-32 representation of string is returned.

Examples (assuming an ambient encoding of UTF-8):

 C2U("Sí")       = "0053 00ED"       -- And "0053 00ED"U == "53 C3AD"X == "Sí".
 C2U("Sí","U+")  = "U+0053 U+00ED"   -- Again, "U+0053 U+00ED"U == "53 C3AD"X == "Sí".
 C2U("Sí","Na")  = "(LATIN CAPITAL LETTER S) (LATIN SMALL LETTER I WITH ACUTE)"
                                     -- And "(LATIN CAPITAL LETTER S) (LATIN SMALL LETTER I WITH ACUTE)"U == "Sí"
 C2U("Sí","UTF-32") = "0000 0053 0000 00ED"X
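
The CODES, U+ and NAMES formats can be approximated outside Rexx. The following is a minimal Python sketch, not TUTOR code: the helper name c2u_sketch is hypothetical, it relies on Python's unicodedata module (whose Unicode version may differ from the one TUTOR uses), and it does not emulate labels such as "<control-000A>" or the UTF-32 format.

```python
import unicodedata

def c2u_sketch(text, fmt="CODES"):
    """Approximate C2U for the CODES, U+ and NAMES formats (hypothetical helper)."""
    # At least four hexadecimal digits; longer codepoints get no leading zeros.
    codes = [format(ord(ch), "04X") for ch in text]
    if fmt in ("", "CODES"):
        return " ".join(codes)
    if fmt == "U+":
        return " ".join("U+" + code for code in codes)
    if fmt == "NAMES":
        # unicodedata.name raises ValueError for codepoints that have no name
        # (control characters, for example); labels are not emulated here.
        return " ".join("(" + unicodedata.name(ch) + ")" for ch in text)
    raise ValueError("Invalid format " + repr(fmt))

print(c2u_sketch("S\u00ED"))           # 0053 00ED
print(c2u_sketch("S\u00ED", "U+"))     # U+0053 U+00ED
print(c2u_sketch("S\u00ED", "NAMES"))  # (LATIN CAPITAL LETTER S) (LATIN SMALL LETTER I WITH ACUTE)
```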

DECODE

Diagram for the DECODE BIF

Tests whether a string is encoded according to a certain encoding, and optionally decodes it to a certain format.

DECODE works as an encoding validator when format is omitted, and as a decoder when format is specified. It is an error to omit format and to specify a value for error_handling at the same time (that is, if format was omitted, then error_handling should be omitted too).

When DECODE is used as validator, it returns a boolean value, indicating if the string is well-formed according to the specified encoding. For example, DECODE(string,"UTF-8") returns 1 when string contains well-formed UTF-8, and 0 if it contains ill-formed UTF-8.

To use DECODE as a decoder, you have to specify a format. This argument accepts a blank-separated set of tokens. Each token can have one of the following values: UTF8, UTF-8, UTF32, or UTF-32 (duplicates are allowed and ignored). When UTF8 or UTF-8 has been specified, a UTF-8 representation of the decoded string is returned. When UTF32 or UTF-32 has been specified, a UTF-32 representation of the decoded string is returned. When both have been specified, a two-item array is returned. The first item of the array is the UTF-8 representation of the decoded string, and the second item contains the UTF-32 representation of the decoded string.

The optional error_handling argument determines the behaviour of the function when the format argument has been specified. If it has the value "" (the default) or NULL, a null string is returned when a decoding error is encountered. If it has the value REPLACE, any ill-formed character will be replaced by the Unicode Replacement Character (U+FFFD). If it has the value SYNTAX, a syntax condition will be raised when a decoding error is encountered.

Examples:

DECODE(string, "UTF-16")                           -- Returns 1 if string contains proper UTF-16, and 0 otherwise
var = DECODE(string, "UTF-16", "UTF-8")            -- Decodes string to the UTF-8 format. A null string is returned if string contains ill-formed UTF-16.
DECODE(string, "UTF-16",,"SYNTAX")                 -- The fourth argument is checked for validity and then ignored.
DECODE(string, "UTF-16",,"POTATO")                 -- Syntax error (Invalid option 'POTATO').
var = DECODE(string, "UTF-16", "UTF-8", "REPLACE") -- Decodes string to the UTF-8 format. Ill-formed character sequences are replaced by U+FFFD.
var = DECODE(string, "UTF-16", "UTF-8", "SYNTAX")  -- Decodes string to the UTF-8 format. Any ill-formed character sequence will raise a Syntax error.

ENCODE

Diagram for the ENCODE BIF

ENCODE first validates that string contains well-formed UTF-8. Once the string is validated, encoding is attempted using the specified encoding. ENCODE returns the encoded string, or a null string if validation or encoding failed. You can influence the behaviour of the function when an error is encountered by specifying the optional error_handling argument.

  • When error_handling is not specified, is "" or is NULL (the default), a null string is returned if an error is encountered.
  • When error_handling has the value SYNTAX, a Syntax error is raised if an error is encountered.

Examples:

ENCODE(string, "IBM1047")                          -- The encoded string, or "" if string cannot be encoded to IBM1047.
ENCODE(string, "IBM1047","SYNTAX")                 -- The encoded string. If the encoding fails, a Syntax error is raised.

GRAPHEMES

Diagram for the GRAPHEMES BIF

Converts string to a GRAPHEMES string and returns it. GRAPHEMES strings are composed of extended grapheme clusters, and every character in a GRAPHEMES string can be an arbitrary extended grapheme cluster. The argument string has to contain well-formed UTF-8, or a Syntax error is raised. When working with GRAPHEMES strings, Rexx built-in functions operate at the extended grapheme cluster level, and can produce much richer results than when operating with BYTES or CODEPOINTS strings.

Please note that CODEPOINTS, GRAPHEMES and TEXT strings are guaranteed to contain well-formed UTF-8 sequences. To test if a string contains well-formed UTF-8, you can use the DECODE(string,"UTF-8") or UTF8(string) function calls.

N2P (Name to codePoint)

Diagram for the N2P BIF

Returns the hexadecimal Unicode codepoint corresponding to name, or the null string if name does not correspond to a Unicode codepoint.

N2P accepts names, as defined in the second column of UnicodeData.txt (that is, the Unicode "Name" ["Na"] property), like "LATIN CAPITAL LETTER F" or "BELL"; aliases, as defined in NameAliases.txt, like "LF" or "FORM FEED", and labels identifying codepoints that have no names, like "<Control-0001>" or "<Private Use-E000>".

When specifying a name, case is ignored, as are certain characters: spaces, medial dashes (except for the "HANGUL JUNGSEONG O-E" codepoint) and underscores that replace dashes. Hence, "BELL", "bell" and "Bell" are all equivalent, as are "LATIN CAPITAL LETTER F", "Latin capital letter F" and "latin_capital_letter_f".

Returned codepoints will be normalized, i.e., they will have a minimum length of four digits, and they will never start with a zero if they have more than four digits.

Examples:

N2P("LATIN CAPITAL LETTER F") =  "0046"       -- Padded to four digits
N2P("BELL")                   = "1F514"       -- Not "01F514"
N2P("Potato")                 = "1F954"       -- Unicode has "Potato" (a vegetable emoji)..
N2P("Potatoes")               = ""            -- ..but no "Potatoes".

P2N (codePoint to Name)

Diagram for the P2N BIF

Returns the name or label corresponding to the hexadecimal Unicode codepoint argument, or the null string if the codepoint has no name or label.

The argument codepoint is first verified for validity. If it is not a valid hexadecimal number or it is out-of-range, a null string is returned. If the codepoint is found to be valid, it is then normalized: if it has fewer than four digits, zeros are added to the left, until the codepoint has exactly four digits; and if the codepoint has more than four digits, leading zeros are removed, until no more leading zeros are found or the codepoint has exactly four digits.

Once the codepoint has been validated and normalized, it is uppercased, and the Unicode Character Database is then searched for the "Name" ("Na") property.

If the codepoint has a name, that name is returned. If the codepoint does not have a name but it has a label, like "<control-0010>", then that label is returned. In all other cases, the null string is returned.

Note. Labels are always enclosed between "<" and ">" signs. This makes it easy to distinguish them from names.

Examples:

P2N("46")      =  "LATIN CAPITAL LETTER F"    -- Normalized to "0046"
P2N("0046")    =  "LATIN CAPITAL LETTER F"    -- Normalized to "0046"
P2N("0000046") =  "LATIN CAPITAL LETTER F"    -- Normalized to "0046"
P2N("1F342")   =  "FALLEN LEAF"               -- An emoji
P2N("0012")    =  "<control-0012>"            -- A label, not a name
P2N("XXX")     =  ""                          -- Invalid codepoint
P2N("110000")  =  ""                          -- Out-of-range
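
The validation and normalization steps described above can be sketched as follows. This is a minimal Python illustration, not the TUTOR implementation; the function name normalize_codepoint is hypothetical, and the name lookup itself is left out.

```python
def normalize_codepoint(code):
    """Return the normalized hex codepoint, or "" when code is invalid (hypothetical helper)."""
    digits = code.upper()
    if digits == "" or any(d not in "0123456789ABCDEF" for d in digits):
        return ""                    # Not a hexadecimal number
    value = int(digits, 16)
    if value > 0x10FFFF:
        return ""                    # Outside the Unicode range
    # Pad to four digits; longer codepoints are printed without leading zeros.
    return format(value, "04X")

for code in ("46", "0046", "0000046", "01F342", "XXX", "110000"):
    print(repr(code), "->", repr(normalize_codepoint(code)))
```

Converting to an integer and formatting it back with a minimum width of four covers both halves of the rule at once: short values are padded, and redundant leading zeros disappear.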

STRINGTYPE

Diagram for the STRINGTYPE BIF

If you specify only string, it returns TEXT when string is a TEXT string, GRAPHEMES when string is a GRAPHEMES string, CODEPOINTS when string is a CODEPOINTS string, and BYTES when string is a BYTES string. If you specify type, it returns 1 when string matches the type. Otherwise, it returns 0. The following are valid types:

  • BYTES. Returns 1 if the string is a BYTES string.
  • CODEPOINTS. Returns 1 if the string is a CODEPOINTS string.
  • GRAPHEMES. Returns 1 if the string is a GRAPHEMES string.
  • TEXT. Returns 1 if the string is a TEXT string.

TEXT

Diagram for the TEXT BIF

Converts string to a TEXT string and returns it. TEXT strings are composed of extended grapheme clusters, and every character in a TEXT string can be an arbitrary extended grapheme cluster. The argument string has to contain well-formed UTF-8, or a Syntax error is raised. When working with TEXT strings, Rexx built-in functions operate at the extended grapheme cluster level, and can produce much richer results than when operating with BYTES or CODEPOINTS strings.

Please note that CODEPOINTS, GRAPHEMES and TEXT strings are guaranteed to contain well-formed UTF-8 sequences. To test if a string contains well-formed UTF-8, you can use the DECODE(string,"UTF-8") or UTF8(string) function calls.

U2C (Unicode to Character)

Diagram for the U2C BIF

Returns a string, in character format, that represents u-string converted to characters. U-string must be a blank-separated sequence of hexadecimal codepoints, or parenthesized codepoint names, aliases or labels (separator blanks are not needed outside parentheses). The function will succeed if and only if an equivalent "U-string"U string would be syntactically correct, and produce a syntax error otherwise.

UNICODE (Functional form)

Diagram for the functional form of the UNICODE BIF

Function can be one of:

  • isNFC: returns 1 when string is normalized to the NFC format, and 0 otherwise.
  • isNFD: returns 1 when string is normalized to the NFD format, and 0 otherwise.
  • toLowercase: returns toLowercase(X), as defined in rule R2 of section "Default Case Conversion" of The Unicode Standard, Version 15.0 – Core Specification: "Map each character C in X to Lowercase_Mapping(C)". Broadly speaking, Lowercase_Mapping(C) implements the Simple_Lowercase_Mapping property, as defined in the UnicodeData.txt file of the Unicode Character Database (UCD). Two exceptions to this mapping are defined in the SpecialCasing.txt file of the UCD. One exception is due to the fact that the mapping is not one to one: "0130"U, LATIN CAPITAL LETTER I WITH DOT ABOVE, lowercases to "0069 0307"U. The second exception is for "03A3"U, GREEK CAPITAL LETTER SIGMA, which lowercases to "03C2"U (the final sigma) only in certain contexts (i.e., when it is not in a medial position), and to "03C3"U otherwise.
  • toUppercase: returns toUppercase(X), as defined in rule R1 of section "Default Case Conversion" of The Unicode Standard, Version 15.0 – Core Specification: "Map each character C in X to Uppercase_Mapping(C)". Broadly speaking, Uppercase_Mapping(C) implements the Simple_Uppercase_Mapping property, as defined in the UnicodeData.txt file of the Unicode Character Database (UCD), but a number of exceptions, defined in the SpecialCasing.txt file of the UCD, have to be applied. Additionally, the Iota-subscript, "0345"U, receives special treatment.
  • toNFC: returns string normalized to the NFC format.
  • toNFD: returns string normalized to the NFD format.

Examples:

UNICODE("Café", toNFD)                            -- "Cafe" || "301"U
UNICODE("Café","isNFD")                           -- 0 (Since "Café" normalizes to something else)
UNICODE("Cafe" || "301"U,"isNFD")                 -- 1
UNICODE("Café",toUppercase)                       -- "CAFÉ"
UNICODE("ὈΔΥΣΣΕΎΣ"T,toLowercase)                  -- "ὀδυσσεύς" (note the difference between medial and final sigmas)
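
For readers who want to cross-check these results outside Rexx, the same normalization and case-mapping behaviour can be observed with Python's standard library. This is only an illustrative sketch: it uses Python's unicodedata module and str methods rather than TUTOR, and Python's Unicode data may be of a different version.

```python
import unicodedata

cafe_nfc = "Caf\u00E9"     # "é" as a single precomposed codepoint
cafe_nfd = "Cafe\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# toNFD and toNFC
to_nfd = unicodedata.normalize("NFD", cafe_nfc)
to_nfc = unicodedata.normalize("NFC", cafe_nfd)

# isNFD: a string is in NFD when normalizing it to NFD changes nothing
is_nfd_precomposed = unicodedata.normalize("NFD", cafe_nfc) == cafe_nfc
is_nfd_decomposed = unicodedata.normalize("NFD", cafe_nfd) == cafe_nfd

# Case mapping, including the two lowercase exceptions from SpecialCasing.txt
upper = cafe_nfc.upper()
sigmas = "\u03A3\u0391\u03A3".lower()   # capital sigma, alpha, sigma
dotted_i = "\u0130".lower()             # LATIN CAPITAL LETTER I WITH DOT ABOVE

print(to_nfd == cafe_nfd, to_nfc == cafe_nfc)      # True True
print(is_nfd_precomposed, is_nfd_decomposed)       # False True
print([format(ord(c), "04X") for c in sigmas])     # ['03C3', '03B1', '03C2']
print([format(ord(c), "04X") for c in dotted_i])   # ['0069', '0307']
```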

UNICODE ("Property" form)

Diagram for the property form of the UNICODE BIF

The first argument, code, must be either a UTF-32 codepoint (i.e., a four-byte BYTES string representing a 32-bit positive integer) or a hexadecimal codepoint (without the leading "U+").

The string name must be one of:

  • Alphabetic: returns a boolean.
  • Alpha: an alias for Alphabetic.
  • Canonical_Combining_Class: returns an integer between 0 and 254.
  • Canonical_Decomposition_Mapping: returns one or two normalized hex codepoints [Non-standard property: this corresponds to the Decomposition_Mapping column (number 6, 1-based, in UnicodeData.txt), when the mapping is not a compatibility mapping (i.e., it does not start with a "<" character)]
  • Case_Ignorable: returns a boolean.
  • Cased: returns a boolean.
  • CCC: an alias for Canonical_Combining_Class.
  • Changes_When_Casefolded: returns a boolean.
  • Changes_When_Casemapped: returns a boolean.
  • Changes_When_Lowercased: returns a boolean.
  • Changes_When_Titlecased: returns a boolean.
  • Changes_When_Uppercased: returns a boolean.
  • CI: an alias for Case_Ignorable.
  • Comp_Ex: an alias for Full_Composition_Exclusion.
  • CWCF: an alias for Changes_When_Casefolded.
  • CWCM: an alias for Changes_When_Casemapped.
  • CWL: an alias for Changes_When_Lowercased.
  • CWT: an alias for Changes_When_Titlecased.
  • CWU: an alias for Changes_When_Uppercased.
  • Full_Composition_Exclusion: returns a boolean.
  • Lowercase: returns a boolean.
  • Lower: an alias for Lowercase.
  • Math: returns a boolean.
  • Na: an alias for Name.
  • Name: returns the name or label corresponding to the code argument. This corresponds to the (1-based) column number 2 of UnicodeData.txt. This is a modified property, since it returns labels when there is no name to return. If you want only names, discard returned values that start with a "<" character.
  • NFC_Quick_Check: returns either Y, N or M.
  • NFC_QC: an alias for NFC_Quick_Check.
  • NFD_Quick_Check: returns either Y or N.
  • NFD_QC: an alias for NFD_Quick_Check.
  • NFKC_Quick_Check: returns either Y, N or M.
  • NFKC_QC: an alias for NFKC_Quick_Check.
  • NFKD_Quick_Check: returns either Y or N.
  • NFKD_QC: an alias for NFKD_Quick_Check.
  • OAlpha: an alias for Other_Alphabetic.
  • OLower: an alias for Other_Lowercase.
  • OUpper: an alias for Other_Uppercase.
  • Other_Alphabetic: returns a boolean.
  • Other_Lowercase: returns a boolean.
  • Other_Uppercase: returns a boolean.
  • SD: an alias for Soft_Dotted.
  • Simple_Lowercase_Mapping: returns the lowercase version of the argument code, or code itself when the character has no explicit lowercase mapping. This corresponds to the (1-based) column number 14 of UnicodeData.txt.
  • Simple_Uppercase_Mapping: returns the uppercase version of the argument code, or code itself when the character has no explicit uppercase mapping. This corresponds to the (1-based) column number 13 of UnicodeData.txt.
  • slc: an alias for Simple_Lowercase_Mapping.
  • Soft_Dotted: returns a boolean.
  • suc: an alias for Simple_Uppercase_Mapping.
  • Uppercase: returns a boolean.
  • Upper: an alias for Uppercase.

Examples

UNICODE(AA, Property,Alphabetic)                            -- 1 ("ª", Feminine ordinal indicator)
UNICODE(301, Property, Canonical_Combining_Class)           -- 230 ("301"U, Combining acute accent)
UNICODE(C7, Property, Canonical_Decomposition_Mapping)      -- "0043 0327" ("Ç", Latin capital letter C with Cedilla)
UNICODE(B8, Property, Case_Ignorable)                       -- 1 ("B8"U, Cedilla)
UNICODE(F8, Property, Cased)                                -- 1 ("ø", Latin small letter o with stroke)
UNICODE(110, Property, Changes_When_Lowercased)             -- 1 ("Đ", Latin capital letter D with stroke)
UNICODE(128, Property, Changes_When_Casefolded)             -- 1 ("Ĩ", Latin capital letter I with tilde)
UNICODE(222, Property, Changes_When_Casemapped)             -- 1 ("Ȣ", Latin capital letter Ou)
UNICODE(105, Property, Changes_When_Titlecased)             -- 1 ("ą", Latin small letter a with ogonek)
UNICODE(113, Property, Changes_When_Uppercased)             -- 1 ("ē", Latin small letter e with macron)
UNICODE(340, Property, Full_Composition_Exclusion)          -- 1 ("◌̀ ", Combining grave tone mark)
UNICODE(7A, Property, Lowercase)                            -- 1
UNICODE(7C, Property, Math)                                 -- 1
UNICODE(41, Property, Name)                                 -- "LATIN CAPITAL LETTER A"
UNICODE(D800, Property, Name)                               -- "<surrogate-D800>"
UNICODE(313, Property, NFC_Quick_Check)                     -- "M"
UNICODE(38C, Property, NFD_Quick_Check)                     -- "N"
UNICODE(CD5, Property, NFKC_Quick_Check)                    -- "M"
UNICODE(BC, Property, NFKD_Quick_Check)                     -- "N"
UNICODE(730, Property, Other_Alphabetic)                    -- 1
UNICODE(2071, Property, Other_Lowercase)                    -- 1
UNICODE(2160, Property, Other_Uppercase)                    -- 1
UNICODE(41, Property, Simple_Lowercase_Mapping)             -- "0061"
UNICODE(61, Property, Simple_Uppercase_Mapping)             -- "0041"
UNICODE(3F3, Property, Soft_Dotted)                         -- 1
UNICODE(102, Property, Uppercase)                           -- 1
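
A few of these properties can be cross-checked with Python's unicodedata module, which exposes the corresponding UnicodeData.txt columns. The sketch below is an assumption-laden illustration, not TUTOR code: prop_sketch is a hypothetical helper, only three properties are covered, labels are not emulated, and returning "" when there is no canonical decomposition is a choice made for this sketch.

```python
import unicodedata

def prop_sketch(code, name):
    """Look up a few UnicodeData.txt properties for a hex codepoint (hypothetical helper)."""
    ch = chr(int(code, 16))
    if name in ("Canonical_Combining_Class", "CCC"):
        return unicodedata.combining(ch)
    if name == "Canonical_Decomposition_Mapping":
        mapping = unicodedata.decomposition(ch)
        # Compatibility mappings start with a "<tag>"; they are not canonical.
        return "" if mapping.startswith("<") else mapping
    if name in ("Name", "Na"):
        # Unlike the UNICODE BIF, no label is returned for unnamed codepoints.
        return unicodedata.name(ch, "")
    raise ValueError("Property not covered by this sketch: " + name)

print(prop_sketch("301", "Canonical_Combining_Class"))       # 230
print(prop_sketch("C7", "Canonical_Decomposition_Mapping"))  # 0043 0327
print(prop_sketch("41", "Name"))                             # LATIN CAPITAL LETTER A
```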

UTF8

Diagram for the UTF8 BIF

Note: Although this routine is part of TUTOR, The Unicode Tools Of Rexx, it can also be used separately, as it has no dependencies on the other components of TUTOR.

   ╭───────╮  ┌────────┐  ╭───╮                                                                  ╭───╮
▸▸─┤ UTF8( ├──┤ string ├──┤ , ├─┬────────────┬─┬──────────────────┬─┬──────────────────────────┬─┤ ) ├─▸◂
   ╰───────╯  └────────┘  ╰───╯ │ ┌────────┐ │ │ ╭───╮ ┌────────┐ │ │ ╭───╮ ┌────────────────┐ │ ╰───╯
                                └─┤ format ├─┘ └─┤ , ├─┤ target ├─┘ └─┤ , ├─┤ error_handling ├─┘
                                  └────────┘     ╰───╯ └────────┘     ╰───╯ └────────────────┘

Tests whether string contains well-formed UTF-8 (this is the default when format has not been specified), or is a well-formed string in the format encoding. Optionally, it decodes it to a certain set of target encodings.

UTF8 works as a format encoding validator when target is omitted, and as a decoder when target is specified. It is an error to omit target and to specify a value for error_handling at the same time (that is, if target was omitted, then error_handling should be omitted too).

When UTF8 is used as validator, it returns a boolean value, indicating if the string is well-formed according to the format encoding. For example, UTF8(string) returns 1 when string contains well-formed UTF-8, and 0 if it contains ill-formed UTF-8.

Type of the returned value(s)

UTF8 always returns BYTES strings, except when it is used as a standalone routine (i.e., not in combination with Unicode.cls, the RXU Rexx Preprocessor for Unicode, etc.), in which case it returns standard ooRexx strings.

UTF8 performs a verification, at initialization time, to see whether .Bytes is a .Class, and, additionally, if .Bytes subclasses .String. If both conditions are met, UTF8 returns BYTES strings; if not, it returns standard ooRexx strings.

Valid formats

The format argument can be omitted or specified as the null string, in which case UTF-8 is assumed, or it can be one of UTF8 (or UTF-8), UTF8Z (or UTF-8Z), WTF8 (or WTF-8), CESU8 (or CESU-8), and MUTF8 (or MUTF-8).

  • The UTF-8 encoding is described in The Unicode® Standard. Version 15.0 – Core Specification, pp. 124 ss.
  • UTF-8Z is identical to UTF-8, with a single exception: "00"U, in UTF-8Z, is encoded using the overlong sequence "C080"X, while in UTF-8 it is encoded as "00"X.
  • The WTF-8 encoding is described in The WTF-8 encoding. It extends UTF-8 by allowing lone surrogate codepoints, encoded as standard three-byte sequences. Surrogate pairs are not allowed: they should be encoded using four-byte sequences.
  • The CESU-8 encoding is described in the Unicode Technical Report #26. Supplementary characters (i.e., codepoints greater than "FFFF"U) are first encoded using two surrogates, as in UTF-16, and then each surrogate is encoded as in WTF-8, giving a total of six bytes. Four-byte sequences are ill-formed, and lone surrogates are admitted.
  • The MUTF-8 encoding (see the Wikipedia entry about MUTF-8) is identical to CESU-8, with a single difference: it encodes "00"U in the same way as UTF-8Z.

UTF-8 and UTF-8Z do not allow sequences containing lone surrogates. All the other formats allow lone surrogates.
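
To make the differences between these formats concrete, here is a minimal Python sketch of CESU-8 and MUTF-8 encoding as described in the list above. It is not part of TUTOR: the function names are hypothetical, and Python's "surrogatepass" error handler is used only as a convenient way to emit the three-byte form of a surrogate.

```python
def cesu8_encode(text):
    """Encode a Python string as CESU-8 (hypothetical helper)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Supplementary character: split into a UTF-16 surrogate pair...
            cp -= 0x10000
            pair = (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF))
            # ...and encode each surrogate as a three-byte sequence.
            for unit in pair:
                out += chr(unit).encode("utf-8", errors="surrogatepass")
        else:
            # BMP codepoints (lone surrogates included) are encoded as usual.
            out += ch.encode("utf-8", errors="surrogatepass")
    return bytes(out)

def mutf8_encode(text):
    """Encode as MUTF-8: CESU-8, but with U+0000 written as the overlong C0 80."""
    # A 00 byte can only come from U+0000, so a byte-level replace is enough.
    return cesu8_encode(text).replace(b"\x00", b"\xc0\x80")

print(cesu8_encode("\U0001F514").hex())   # eda0bdedb494 (six bytes for BELL)
print(mutf8_encode("a\x00b").hex())       # 61c08062
```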

Decoding with UTF8

To use UTF8 as a decoder, you have to specify a target encoding. This argument accepts a single encoding, or a blank-separated set of tokens.

Each token can have one of the following values: UTF8 (or UTF-8), WTF8 (or WTF-8), UTF32 (or UTF-32), WTF32 (or WTF-32).

The W- forms of the encodings allow lone surrogates, while the U- forms do not.

Duplicates, when specified, are ignored. If one of the specified encodings is a W-encoding, the rest of the encodings should also be W-encodings. If format allows lone surrogates (i.e., if it is not UTF-8 or UTF-8Z), then all the specified encodings should be W-encodings.

When several targets have been specified, a stem is returned. The stem will contain a tail for every specified encoding name (uppercased, and without dashes), and the compound variable value will be the decoded string.

Error handling

The optional error_handling argument determines the behaviour of the function when a decoding error is encountered. It is an error to specify error_handling without specifying target at the same time.

  • When error_handling has the value "" (the default) or NULL, a null string is returned when a decoding error is encountered.
  • When error_handling has the value REPLACE, any ill-formed character will be replaced by the Unicode Replacement Character ("FFFD"U).
  • When error_handling has the value SYNTAX, a syntax condition will be raised when a decoding error is encountered.

Conditions

  • Syntax 93.900. Invalid option 'option'.
  • Syntax 93.900. Invalid format 'format'.
  • Syntax 93.900. Invalid target 'target'.
  • Syntax 93.900. Invalid error handling 'error_handling'.
  • Syntax 93.900. Conflicting target target and format format.
  • Syntax 23.900. Invalid format sequence in position n of string: 'hex-value'X.

Examples

Specifying format and target. Combination examples:

UTF8("00"X, utf8,  utf8)                           -- "00"X. Validate and return UTF-8
UTF8("00"X, utf8,  wtf8)                           -- "00"X. Validate and return WTF-8
UTF8("00"X, mutf8, utf8)                           -- Syntax error: MUTF-8 allows lone surrogates, but UTF-8 does not
UTF8("00"X, mutf8, wtf8)                           -- "". "00"X is ill-formed MUTF-8
UTF8("00"X, utf8,  utf8 utf32)                     -- A stem s.: s.utf8 == "00"X, and s.utf32 == "0000 0000"X
UTF8("00"X, utf8,  wtf8 wtf32)                     -- A stem s.: s.wtf8 == "00"X, and s.wtf32 == "0000 0000"X
UTF8("00"X, utf8,  utf8 wtf32)                     -- Syntax error: cannot specify UTF-8 and WTF-32 at the same time

Validation examples:

UTF8("")                                          -- 1  (The null string always validates)
UTF8("ascii")                                     -- 1  (Equivalent to UTF8("ascii", "UTF-8") )
UTF8("José")                                      -- 1
UTF8("FF"X)                                       -- 0  ("FF"X is ill-formed)
UTF8("00"X)                                       -- 1  (ASCII)
UTF8("00"X, "UTF-8Z")                             -- 0  (UTF-8Z encodes "00"U differently)
UTF8("C080"X)                                     -- 0  (Overlong sequence, ill-formed in UTF-8)
UTF8("C080"X, "UTF-8Z")                           -- 1
UTF8("C081"X, "UTF-8Z")                           -- 0  (Only "C080" is well-formed)
UTF8("ED A0 80"X)                                 -- 0  (High surrogate)
UTF8("ED A0 80"X,"WTF-8")                         -- 1  (WTF-8 allows lone surrogates)
UTF8("F0 9F 94 94"X)                              -- 1  ( "(Bell)"U )
UTF8("F0 9F 94 94"X,"CESU-8")                     -- 0  ( CESU-8 doesn't allow four-byte sequences... )
UTF8("ED A0 BD ED B4 94"X,"CESU-8")               -- 1  ( ...it expects two three-byte surrogates instead)
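
Several of the validation results above can be reproduced with Python's built-in UTF-8 codec, which is strict by default. The sketch below is an illustration only and does not use TUTOR; the function names are hypothetical. Note that "surrogatepass" is only a rough stand-in for WTF-8, since it also accepts encoded surrogate pairs, which WTF-8 rejects.

```python
def is_utf8(data):
    """Return True when data (bytes) is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def is_utf8_with_surrogates(data):
    """Like is_utf8, but three-byte encoded surrogates are accepted."""
    try:
        data.decode("utf-8", errors="surrogatepass")
        return True
    except UnicodeDecodeError:
        return False

print(is_utf8(b""))                              # True
print(is_utf8(b"\xff"))                          # False
print(is_utf8(b"\xc0\x80"))                      # False (overlong sequence)
print(is_utf8(b"\xed\xa0\x80"))                  # False (high surrogate)
print(is_utf8_with_surrogates(b"\xed\xa0\x80"))  # True
print(is_utf8(b"\xf0\x9f\x94\x94"))              # True
```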

Error handling:

                                                  -- "C080" is ill-formed utf8
UTF8("C080"X,,utf8)                               -- "" (By default, UTF8 returns the null string when an error is found)
UTF8("C080"X,,utf8, replace)                      -- "EFBFBD EFBFBD"X ("EFBFBD" is the Unicode Replacement character)
                                                  -- "C0"X is ill-formed, and then "80"X is ill-formed too
                                                  -- That's why we get two replacement characters
UTF8("C080"X,,utf8, syntax)                       -- Syntax error 23.900:
                                                  -- "Invalid UTF-8 sequence in position 1 of string: 'C0'X".

Conversion examples:

UTF8("José",,UTF32)                               -- "0000004A 0000006F 00000073 000000E9"X ("é" is "E9"U)
UTF8("FF"X,,UTF32)                                -- "" (an error)
UTF8("FF"X,,UTF32,REPLACE)                        -- "0000FFFD"X ("FFFD"U, the replacement character)
UTF8("FF"X,,UTF32,SYNTAX)                         -- Raises a Syntax error
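
The three error-handling modes and the UTF-32 target can be imitated with Python's codecs. This is a hedged sketch rather than a description of how UTF8 is implemented: utf8_to is a hypothetical helper, only the UTF8 and UTF32 targets are covered, and big-endian UTF-32 without a byte order mark is assumed because that is what the examples above show.

```python
def utf8_to(data, target="UTF8", error_handling="NULL"):
    """Decode UTF-8 bytes and re-encode them to target (hypothetical helper)."""
    codec = {"UTF8": "utf-8", "UTF32": "utf-32-be"}[target]
    errors = "replace" if error_handling == "REPLACE" else "strict"
    try:
        text = data.decode("utf-8", errors=errors)
    except UnicodeDecodeError:
        if error_handling == "SYNTAX":
            raise                 # Stand-in for raising a Syntax condition
        return b""                # NULL (the default): return the null string
    return text.encode(codec)

print(utf8_to("Jos\u00E9".encode("utf-8"), "UTF32").hex())
# 0000004a0000006f00000073000000e9
print(utf8_to(b"\xc0\x80", "UTF8", "REPLACE").hex())
# efbfbdefbfbd (one replacement character per ill-formed byte)
print(utf8_to(b"\xff", "UTF32", "REPLACE").hex())
# 0000fffd
```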

Implementation notes

See The Unicode® Standard. Version 15.0 – Core Specification, p. 125:

Table 3-7. Well-Formed UTF-8 Byte Sequences

Code Points          First Byte  Second Byte  Third Byte  Fourth Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF      80..BF
U+0800..U+0FFF       E0          A0..BF       80..BF
U+1000..U+CFFF       E1..EC      80..BF       80..BF
U+D000..U+D7FF       ED          80..9F       80..BF
U+E000..U+FFFF       EE..EF      80..BF       80..BF
U+10000..U+3FFFF     F0          90..BF       80..BF      80..BF
U+40000..U+FFFFF     F1..F3      80..BF       80..BF      80..BF
U+100000..U+10FFFF   F4          80..8F       80..BF      80..BF

In Table 3-7, cases where a trailing byte range is not 80..BF are shown in bold italic to draw attention to them. These exceptions to the general pattern occur only in the second byte of a sequence.

Based on this table, on the first run, UTF8 will build a Finite State Machine. States will be coded into two TRANSLATE tables, stored in the .local directory.

  • The range 00..7F is mapped to "A" (for "A"SCII).
  • The range 80..BF is mapped to "C" (for "C"ontinuation characters). A few bytes will require manual checking.
  • The values C0, C1 and F5..FF are always illegal in a UTF-8 string. We add rows for these ranges, and we map the corresponding codes to "I" (for "I"llegal).
  • The range C2..DF is mapped to "20"X (the "2" in "20" reminds us that we will find a 2-byte group, if the string is well-formed).
  • The range E0..EF is mapped to the "3x"X values, "3a", "3b" and "3c". The "3" reminds us that we will find a 3-byte group, if the string is well-formed; the final "a", "b" and "c" allow us to differentiate the cases, and perform the corresponding tests.
  • Similarly, the F0..F4 range is mapped to "4a"X, "4b"X and "4c"X, as described below.

Table 3-7 (modified)

Bytes    Mapping  Description
00..7F   "A"      ASCII byte
80..BF   "C"      Continuation byte
C0..C1   "I"      Illegal byte
C2..DF   "20"X    Two-bytes sequence
E0       "3a"X    Three bytes, case (a)
E1..EC   "3b"X    Three bytes, case (b)
ED       "3c"X    Three bytes, case (c)
EE..EF   "3b"X    Three bytes, case (b)
F0       "4a"X    Four bytes, case (a)
F1..F3   "4b"X    Four bytes, case (b)
F4       "4c"X    Four bytes, case (c)
F5..FF   "I"      Illegal byte
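
The classification idea behind the modified table can be sketched in Python, where bytes.translate plays the role of the Rexx TRANSLATE BIF. This is an illustration under stated assumptions, not the TUTOR source: the names build_table, SECOND and is_well_formed_utf8 are hypothetical, a single translate table is used instead of two, and only plain UTF-8 is handled.

```python
def build_table():
    """Build a 256-byte table mapping each byte to its class from the modified Table 3-7."""
    table = bytearray(256)
    rows = [
        (0x00, 0x7F, ord("A")), (0x80, 0xBF, ord("C")), (0xC0, 0xC1, ord("I")),
        (0xC2, 0xDF, 0x20),
        (0xE0, 0xE0, 0x3A), (0xE1, 0xEC, 0x3B), (0xED, 0xED, 0x3C), (0xEE, 0xEF, 0x3B),
        (0xF0, 0xF0, 0x4A), (0xF1, 0xF3, 0x4B), (0xF4, 0xF4, 0x4C),
        (0xF5, 0xFF, ord("I")),
    ]
    for low, high, cls in rows:
        for byte in range(low, high + 1):
            table[byte] = cls
    return bytes(table)

TABLE = build_table()

# Allowed range of the second byte for every lead-byte class (Table 3-7).
SECOND = {
    0x20: (0x80, 0xBF),
    0x3A: (0xA0, 0xBF), 0x3B: (0x80, 0xBF), 0x3C: (0x80, 0x9F),
    0x4A: (0x90, 0xBF), 0x4B: (0x80, 0xBF), 0x4C: (0x80, 0x8F),
}

def is_well_formed_utf8(data):
    """Validate bytes against Table 3-7 using the classification table."""
    classes = data.translate(TABLE)       # One class byte per input byte
    i, n = 0, len(data)
    while i < n:
        cls = classes[i]
        if cls == ord("A"):               # ASCII: a complete one-byte sequence
            i += 1
            continue
        if cls not in SECOND:             # "C" or "I" cannot start a sequence
            return False
        length = cls >> 4                 # High nibble: 2, 3 or 4 bytes
        if i + length > n:
            return False                  # Truncated sequence
        low, high = SECOND[cls]
        if not low <= data[i + 1] <= high:
            return False                  # Second byte outside its special range
        if any(classes[j] != ord("C") for j in range(i + 2, i + length)):
            return False                  # Remaining bytes must be continuations
        i += length
    return True

print(is_well_formed_utf8(b"\xf0\x9f\x94\x94"))   # True
print(is_well_formed_utf8(b"\xed\xa0\x80"))       # False (surrogate)
```

Encoding the sequence length in the high nibble of the class value, as the "20"X, "3x"X and "4x"X mappings suggest, means the validator needs no separate length table.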

Table 3-7 (modified for UTF8Z)

UTF-8Z is identical to UTF-8, with only one exception: "00"U is encoded using the overlong encoding "C080"X, so that a well-formed UTF-8Z string cannot contain NULL characters. This allows the continued use of old-style C string functions, which expect strings to be terminated by a NULL character.

For UTF8Z, table 3-7 has to be modified in the following way:

Bytes    Mapping  Description
00       "I"      Illegal byte
01..7F   "A"      ASCII byte
80..BF   "C"      Continuation byte
C0       "0"      "C080"X -> "0000"U, error otherwise
C1       "I"      Illegal byte
C2..DF   "20"X    Two-bytes sequence
E0       "3a"X    Three bytes, case (a)
E1..EC   "3b"X    Three bytes, case (b)
ED       "3c"X    Three bytes, case (c)
EE..EF   "3b"X    Three bytes, case (b)
F0       "4a"X    Four bytes, case (a)
F1..F3   "4b"X    Four bytes, case (b)
F4       "4c"X    Four bytes, case (c)
F5..FF   "I"      Illegal byte

Table 3-7 (modified for WTF-8)

See The WTF-8 encoding.

For WTF-8, table 3-7 has to be modified in the following way:

Bytes    Mapping  Description
00..7F   "A"      ASCII byte
80..BF   "C"      Continuation byte
C0..C1   "I"      Illegal byte
C2..DF   "20"X    Two-bytes sequence
E0       "3a"X    Three bytes, case (a)
E1..EC   "3b"X    Three bytes, case (b)
ED       "3d"X    Three bytes, case (d): 2nd byte in 80..9F, normal char; in A0..AF, lead surrogate; in B0..BF, trail surrogate; surrogate pair: error
EE..EF   "3b"X    Three bytes, case (b)
F0       "4a"X    Four bytes, case (a)
F1..F3   "4b"X    Four bytes, case (b)
F4       "4c"X    Four bytes, case (c)
F5..FF   "I"      Illegal byte

Table 3-7 (modified for CESU-8)

See Unicode Technical Report #26. COMPATIBILITY ENCODING SCHEME FOR UTF-16: 8-BIT (CESU-8).

For CESU-8, table 3-7 has to be modified in the following way:

Bytes    Mapping  Description
00..7F   "A"      ASCII byte
80..BF   "C"      Continuation byte
C0..C1   "I"      Illegal byte
C2..DF   "20"X    Two-bytes sequence
E0       "3a"X    Three bytes, case (a)
E1..EC   "3b"X    Three bytes, case (b)
ED       "3e"X    Three bytes, case (e)
EE..EF   "3b"X    Three bytes, case (b)
F0..FF   "I"      Illegal byte

Table 3-7 (modified for MUTF-8)

MUTF-8 (Modified UTF-8) is identical to CESU-8, except for the encoding of "00"U, which is the overlong sequence "C080"X.

See the Wikipedia entry about MUTF-8.

For MUTF-8, table 3-7 has to be modified in the following way:

Bytes    Mapping  Description
00       "I"      Illegal byte
01..7F   "A"      ASCII byte
80..BF   "C"      Continuation byte
C0       "0"      "C080"X -> "0000"U, error otherwise
C1       "I"      Illegal byte
C2..DF   "20"X    Two-bytes sequence
E0       "3a"X    Three bytes, case (a)
E1..EC   "3b"X    Three bytes, case (b)
ED       "3e"X    Three bytes, case (e)
EE..EF   "3b"X    Three bytes, case (b)
F0..FF   "I"      Illegal byte