Rexx built-in functions for Unicode: enhancements and modifications


/******************************************************************************
 * This file is part of The Unicode Tools Of Rexx (TUTOR)                     *
 * See https://rexx.epbcn.com/TUTOR/                                          *
 *     and https://github.com/JosepMariaBlasco/TUTOR                          *
 * Copyright © 2023-2025 Josep Maria Blasco <josep.maria.blasco@epbcn.com>    *
 * License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)  *
 ******************************************************************************/

This article concentrates on documenting enhancements and modifications to existing built-in functions (BIFs). If you want to know about new BIFs, please refer to the accompanying document New built-in functions.

Introduction: What are the enhanced built-in functions and how are they implemented

Statement of the problem

The purpose of RXU, the Rexx Preprocessor for Unicode, is to offer a Unicode-enhanced Rexx experience that is as seamless and as simple as possible. A Unicode-enabled Rexx program ("an RXU program" for short) is a program written in a language based on standard (oo)Rexx and enhanced with a set of Unicode-specific additions and modifications.

As an example of additions, RXU programs allow for four new types of literal strings. These are described in an accompanying document, New types of strings. There is also a set of new built-in functions, described in another document.

Modifications become necessary when the behaviour of already existing mechanisms of Rexx has to be altered. In our case, for instance, we will expect that RXU programs know how to manage Unicode strings, and thus bring the rich set of features of Rexx to the Unicode world. But this will mean that existing BIFs will have to operate with new entities (i.e., Unicode strings) and, of course, they will most probably have to produce new and different results when processing these new entities.

We are then confronted with the task of enhancing, and in this sense redefining, existing BIFs. But redefining BIFs in Rexx is quite difficult.

Ways to substitute BIFs. Necessity of a preprocessor

As is well known, built-in functions (BIFs) are second in the Rexx search order:

Functions are searched in the following sequence: internal routines, built-in functions, external functions (rexxref, 7.2.1, "Search Order").

As a consequence, when one wants to redefine a BIF, the only possible way is to write an internal function with the same name:

If the call or function invocation uses a literal string, then the search for internal label is bypassed. This bypass mechanism allows you to extend the capabilities of an existing internal function, for example, and call it as a built-in function or external routine under the same name as the existing internal function. To call the target built-in or external routine from inside your internal routine, you must use a literal string for the function name (Ibid.).

If, as we stated above, we want to offer an experience that is "as seamless and as simple as possible", the only way to achieve that is to implement a preprocessor. The alternative would be to define a kind of "epilog" that would contain all the redefined functions, and ask the programmers to copy it at the bottom of their programs: a maintenance nightmare, and nothing that can be called "seamless" or "simple".

Ways to substitute BIFs, part II

A preprocessor could add such an epilog to RXU programs in an automated way. But, if we counted on the idea of a (sufficiently powerful) preprocessor, we could also opt for a different strategy. Instead of writing an internal routine for each BIF that we wanted to modify or enhance, we could substitute the name of each BIF in every BIF call, and call a different function instead. Now, that different function would have a new name, an external function name. Clashes with existing BIF names would disappear, and, with them, the need to define internal routines. That's a much neater solution. Indeed, if working with ooRexx, all the external routines can be grouped in a few packages, and the task of the preprocessor will practically be reduced, beyond the substitution of names and the implementation of new string types, to the trivial addition of a ::Requires directive or a function call that enables the new external functions.

The RXU preprocessor for Unicode follows this approach. It substitutes calls to an arbitrary Rexx BIF, say F, with calls to !F, i.e., an exclamation mark, "!", is prepended to the BIF name. For example, the preprocessor would translate Length(var) to !Length(var).

Subtleties of substitution

The basic idea of such a substitution is very easy to explain but, as often happens with basic ideas, its concrete realization is anything but trivial. You cannot simply pick every occurrence of, say, "LENGTH" and blindly substitute it with "!LENGTH": that would unintentionally transform method calls, like n = var~length, for example.

OK, you could say: let's restrict ourselves to the case where a BIF name is followed by a left parenthesis. But this leaves out CALL statements, and there are methods that have arguments anyway...
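
The pitfalls just described can be made concrete with a toy experiment. The following Python sketch is only an illustration of why text-level substitution is fragile; it is not how RXU works, and the function names and the two-name BIF list are hypothetical.

```python
import re

# Hypothetical subset of BIF names, for illustration only.
BIF_NAMES = r"(LENGTH|POS)"

def naive_substitute(source: str) -> str:
    """Prefix every occurrence of a BIF name with '!' (too eager)."""
    return re.sub(rf"\b{BIF_NAMES}\b", r"!\1", source, flags=re.IGNORECASE)

def cautious_substitute(source: str) -> str:
    """Prefix only names directly followed by '(' and not preceded by '~'."""
    pattern = rf"(?<![~\w!]){BIF_NAMES}(?=\()"
    return re.sub(pattern, r"!\1", source, flags=re.IGNORECASE)

print(naive_substitute("n = var~length"))      # n = var~!length (a broken method call)
print(cautious_substitute("n = Length(var)"))  # n = !Length(var)
print(cautious_substitute("n = var~length"))   # n = var~length (left alone)
print(cautious_substitute("Call Length var"))  # Call Length var (the CALL form is missed)
```

Even the cautious variant would still rewrite text inside literal strings and comments, which is one more reason why a real solution needs to tokenize the program instead of pattern-matching on its text.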

The RXU Rexx Preprocessor for Unicode handles all these complexities, and many more, except one: if there is an internal routine with the same name as a BIF, it substitutes names anyway. It should not, but it's beyond its power, in the current version. This limitation will be addressed in a future release.


Alphabetic list of implemented BIFs

An alphabetic list of Unicode-enabled BIFs follows. This list will be updated as new functions are enabled for Unicode. In most cases, only the new functionality is described.

C2X (Character to heXadecimal)

Diagram for the C2X BIF

Returns a BYTES string that represents string converted to hexadecimal.

There has been much debate about C2X. RXU follows a very simple approach to determine what should be returned: always return the C2X of the BYTES value of the argument.

So, for example, Text("(Man)"U) == "👨"T is a TEXT string. Its UTF-8 representation, i.e., its conversion to BYTES, is the UTF-8 representation of the codepoint for the "Man" character, that is, "F0 9F 91 A8"X. Unsurprisingly, C2X("👨"T) returns the hexadecimal digits of these bytes:

C2X("👨"T) = "F09F91A8"
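
The same computation can be checked outside Rexx. The following Python sketch (an illustration only; the helper name c2x_of_text is hypothetical) encodes U+1F468 MAN to UTF-8 and prints the hexadecimal digits of the resulting bytes.

```python
def c2x_of_text(text: str) -> str:
    """Hex digits of the UTF-8 bytes of text, in upper case."""
    return text.encode("utf-8").hex().upper()

man = "\U0001F468"          # U+1F468 MAN
print(c2x_of_text(man))     # F09F91A8
```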

CHARIN

Diagram for the CHARIN BIF

The CHARIN BIF is enhanced by supporting the encoding options specified in the STREAM OPEN command.

  • When an encoding is not specified for a stream, the standard BIF is called.
  • When an encoding is specified, the action taken depends on the encoding target.
    • When the encoding target is TEXT (the default), a TEXT string, composed of grapheme clusters, is returned. The string is normalized to NFC before being returned.
    • When the encoding target is GRAPHEMES, a GRAPHEMES string, composed of grapheme clusters, is returned. The string is returned as-is, without attempting any normalization.
    • When the encoding target is CODEPOINTS, the appropriate number of Unicode codepoints is read and returned in a string.
  • The handling of ill-formed Unicode sequences depends on the value of the encoding error_handling.
    • When error_handling is set to REPLACE (the default), any ill-formed character will be replaced by the Unicode Replacement Character (U+FFFD).
    • When error_handling is set to SYNTAX, a Syntax condition will be raised.
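
The interplay between the encoding target and the error handling can be sketched in a few lines of Python. This is an analogy under stated assumptions, not the RXU implementation: the function name decode_chunk and its parameters are hypothetical, Python's "replace" and "strict" decoders stand in for REPLACE and SYNTAX, and the exact number of replacement characters produced for a given ill-formed sequence may differ between implementations.

```python
import unicodedata

def decode_chunk(data: bytes, encoding: str = "utf-8",
                 target: str = "TEXT", error_handling: str = "REPLACE") -> str:
    """Decode raw bytes roughly as an encoded stream read is described to do."""
    # REPLACE substitutes U+FFFD; SYNTAX is modelled by letting the decoder raise.
    errors = "replace" if error_handling == "REPLACE" else "strict"
    text = data.decode(encoding, errors=errors)
    # Only the TEXT target normalizes to NFC; other targets return data as-is.
    if target == "TEXT":
        text = unicodedata.normalize("NFC", text)
    return text

decomposed = b"e\xcc\x81"                                 # "e" + U+0301 COMBINING ACUTE ACCENT
print(len(decode_chunk(decomposed)))                      # 1 (composed to U+00E9)
print(len(decode_chunk(decomposed, target="GRAPHEMES")))  # 2 (left untouched)
print(decode_chunk(b"a\xffb") == "a\ufffdb")              # True
```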

As a precaution, character positioning is disabled in some circumstances:

  • When the encoding is a variable-length encoding.
  • When the encoding is a fixed-length encoding, but a target of TEXT or GRAPHEMES has been requested.

Character positioning at the start of the stream (that is, when start is specified as 1) will work unconditionally.

Please refer to the accompanying document Stream functions for Unicode for a comprehensive overview of the stream functions for Unicode-enabled streams.

CHAROUT

Diagram for the CHAROUT BIF

The CHAROUT BIF is enhanced by supporting the encoding options specified in the STREAM OPEN command.

  • When an encoding has not been specified for a stream, the standard BIF is called.
  • When the string type is TEXT, GRAPHEMES or CODEPOINTS, the string representation is well-formed UTF-8 and will be used as-is.
  • When the string type is BYTES, it will be checked for UTF-8 well-formedness.
  • In both cases, the resulting string is then encoded using the encoding specified in the STREAM OPEN command.
  • When SYNTAX was specified as the stream error_handling option, a Syntax error is raised in case an encoding error is found, or if the argument string contains ill-formed UTF-8.
  • When REPLACE was specified as the stream error_handling option, ill-formed characters will be replaced by the Unicode Replacement Character (U+FFFD).

As a precaution, character positioning is disabled in some circumstances:

  • When the encoding is a variable-length encoding.
  • When the encoding is a fixed-length encoding, but a target of TEXT or GRAPHEMES has been requested.

Character positioning at the start of the stream (that is, when start is specified as 1) will work unconditionally.

Please refer to the accompanying document Stream functions for Unicode for a comprehensive overview of the stream functions for Unicode-enabled streams.

CHARS

Diagram for the CHARS BIF

The CHARS BIF is modified to support the encoding options specified in the STREAM OPEN command.

  • When an encoding has not been specified for stream name, the standard BIF is called.
  • When an encoding has been specified for stream name, the behaviour of CHARS depends on the stream encoding options.
    • When the encoding is variable-length or the target type is TEXT or GRAPHEMES, the CHARS function returns 1 to indicate that data is present in the stream, or 0 if no data is present.
    • When the encoding is fixed length and the target type is CODEPOINTS, the standard BIF is called to obtain the number of remaining bytes. If this number is an exact multiple of the encoding length, the result of dividing the number of bytes left by the number of bytes per character of the encoding is returned.
    • In all other cases, 1 is returned.
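
The decision rules above can be summarized as a small function. The following Python sketch is a hypothetical restatement of the list, not the RXU source: the function name and its parameters are illustrative, and remaining_bytes stands for the value the standard BIF would return.

```python
def chars_result(remaining_bytes: int, bytes_per_char: int,
                 variable_length: bool, target: str) -> int:
    """Value returned by CHARS for an encoded stream, following the rules above."""
    if variable_length or target in ("TEXT", "GRAPHEMES"):
        # Only presence or absence of data is reported.
        return 1 if remaining_bytes > 0 else 0
    if remaining_bytes % bytes_per_char == 0:
        # Fixed-length encoding and CODEPOINTS target: an exact character count.
        return remaining_bytes // bytes_per_char
    # A partial character is left over: just report that data is present.
    return 1

print(chars_result(12, 4, False, "CODEPOINTS"))  # 3
print(chars_result(12, 4, False, "TEXT"))        # 1
print(chars_result(10, 4, False, "CODEPOINTS"))  # 1
```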

Please refer to the accompanying document Stream functions for Unicode for a comprehensive overview of the stream functions for Unicode-enabled streams.

CENTER (or CENTRE)

Diagram for the CENTER BIF

Works as the standard BIF does, but it operates on bytes, codepoints or extended grapheme clusters depending on whether string is a BYTES string, a CODEPOINTS string, or a GRAPHEMES or a TEXT string, respectively. Before ensuring that the pad character is one character in length, pad is first converted, if necessary, to the type of string. If this conversion fails, a Syntax error is raised.

Examples.

....+....1....+....2....+....3....+....4....+....5
Center("Man"Y,5)                                  -- " Man "
Center("Man"Y,5,"+")                              -- "+Man+"
Center("Man"Y,5,"👨")                             -- Syntax error ('CENTER argument 3 must be a single character; found "👨"')
Center("Man"P,5,"👨")                             -- "👨Man👨"
Center("Man"P,5,"(Man)(Zwj)(Man)"U)               -- Syntax error ('CENTER argument 3 must be a single character; found "👨‍👨"')
Center("Man"T,5,"(Man)(Zwj)(Man)"U)               -- "👨‍👨Man👨‍👨"
Center("Man"T,5,"FF"X)                            -- Syntax error ("Invalid UTF-8 sequence in position 1 of string: 'FF'X")

COPIES

Diagram for the COPIES BIF

Works as the standard BIF does, but it operates on bytes, codepoints or extended grapheme clusters depending on whether string is a BYTES string, a CODEPOINTS string, or a GRAPHEMES or TEXT string, respectively.

DATATYPE

Diagram for the DATATYPE BIF

A new type is admitted, C, for uniCode. Datatype(string, "C") returns 1 if and only if string follows the Unicode string format, namely, if it consists of a blank-separated series of:

  • Valid hexadecimal Unicode codepoints, like 61, or 200D, or 1F514.
  • Valid hexadecimal Unicode codepoints prefixed with U+ or u+, like u+61, or U+200D, or u+1F514.
  • Names, aliases or labels that designate a Unicode codepoint, enclosed between parentheses, like (Latin small letter A), (ZWJ), (Bell), or (<Control-001d>). Items enclosed between parentheses do not need to be separated by blanks.

Examples.

DATATYPE('string','C')                            -- 0
DATATYPE('61','C')                                -- 1
DATATYPE('U61','C')                               -- 0 (it's U+ or u+, not U)
DATATYPE('U+61','C')                              -- 1
DATATYPE('10661','C')                             -- 1
DATATYPE('110000','C')                            -- 0 (max Unicode scalar is 10FFFF)
DATATYPE('(Man)','C')                             -- 1
DATATYPE('(Man','C')                              -- 0 (missing right parenthesis)
DATATYPE('(Man)(Zwj)(Woman)','C')                 -- 1
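
To make the format rules tangible, here is a rough Python validator for the same notation. It is a sketch under assumptions, not the RXU algorithm: the function name is hypothetical, names are resolved with Python's unicodedata.lookup (which does not know labels such as <Control-001d> and may not resolve every alias), and the treatment of the empty string is arbitrary.

```python
import re
import unicodedata

# One item: a parenthesized name, or hex digits with an optional U+/u+ prefix.
# A hex item must be followed by a blank, a "(" or the end of the string.
ITEM = re.compile(r"\(([^()]+)\)|(?:[Uu]\+)?([0-9A-Fa-f]+)(?![^\s(])")

def looks_like_unicode_format(string: str) -> bool:
    """Approximate check of the blank-separated codepoint notation."""
    pos, end = 0, len(string)
    while pos < end:
        if string[pos].isspace():
            pos += 1
            continue
        match = ITEM.match(string, pos)
        if match is None:
            return False
        name, digits = match.groups()
        if name is not None:
            try:
                unicodedata.lookup(name.strip().upper())
            except KeyError:
                return False
        elif int(digits, 16) > 0x10FFFF:
            return False          # beyond the largest Unicode scalar value
        pos = match.end()
    return True

print(looks_like_unicode_format("U+61"))     # True
print(looks_like_unicode_format("U61"))      # False
print(looks_like_unicode_format("110000"))   # False
```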

LEFT

Diagram for the LEFT BIF

Works as the standard BIF does, but it operates on bytes, codepoints or extended grapheme clusters depending on whether string is a BYTES string, a CODEPOINTS string, or a GRAPHEMES or a TEXT string, respectively. Before ensuring that the pad character is one character in length, pad is first converted, if necessary, to the type of string. If this conversion fails, a Syntax error is raised.

LENGTH

Diagram for the LENGTH BIF

When string is a BYTES string, it returns the number of bytes in string. When string is a CODEPOINTS string, it returns the number of codepoints in string. When string is a GRAPHEMES or a TEXT string, it returns the number of extended grapheme clusters in string.

Examples.

Length("a")                                       -- 1
Length("á")                                       -- 2 ("á" is "C3 A1"X)
Length("á"P)                                      -- 1 ("á" is 1 codepoint)
Length("👨‍👩")                                      -- 11 bytes, namely "F09F91A8E2808DF09F91A9"X
Length("👨‍👩"P)                                     -- 3 codepoints (Man + Zwj + Woman)
Length("👨‍👩"T)                                     -- 1 grapheme cluster
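
The three ways of counting can be reproduced in Python for comparison. This is only an illustration: the helper names are hypothetical, and the cluster counter is a deliberately simplified approximation (it joins combining marks and ZWJ sequences only) and not an implementation of the extended grapheme cluster rules of UAX #29.

```python
import unicodedata

ZWJ = "\u200D"

def count_bytes(s: str) -> int:
    return len(s.encode("utf-8"))

def count_codepoints(s: str) -> int:
    return len(s)

def count_clusters_simplified(s: str) -> int:
    """Rough cluster count: a codepoint joins the previous one if it is a
    combining mark, a ZWJ, or directly follows a ZWJ. NOT full UAX #29."""
    count, previous = 0, ""
    for ch in s:
        joins = previous != "" and (
            ch == ZWJ or previous == ZWJ or unicodedata.combining(ch) != 0)
        if not joins:
            count += 1
        previous = ch
    return count

couple = "\U0001F468\u200D\U0001F469"       # Man + Zwj + Woman
print(count_bytes(couple))                  # 11
print(count_codepoints(couple))             # 3
print(count_clusters_simplified(couple))    # 1
```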

LINEIN

Diagram for the LINEIN BIF

The LINEIN BIF is enhanced by supporting the encoding options specified in the STREAM OPEN command.

  • When an encoding has not been specified for stream name, the standard BIF is called.
  • When an encoding has been specified, a line is read, taking into account the end-of-line conventions defined by the encoding. The line is then decoded to UTF-8, and returned as a TEXT string (the default), as a GRAPHEMES string, when GRAPHEMES has been specified as an encoding option of the STREAM OPEN command, or as a CODEPOINTS string, when CODEPOINTS has been specified as an encoding option of the STREAM OPEN command.
  • If an error is found in the decoding process, the behaviour of the LINEIN BIF is determined by the error_handling method specified as an encoding option of the STREAM OPEN command.
    • When SYNTAX has been specified, a Syntax error is raised.
    • When REPLACE has been specified, any character that cannot be decoded will be replaced with the Unicode Replacement character (U+FFFD).

Line-end handling

Preliminary note. Rexx honors Windows line-end sequences ("0D0A"X) and Unix-like line-end characters ("0A"X), and it does so both in Windows and in Unix-like systems. You can try it for yourself by creating a file that contains "31610d0a32610d33610a34610a0d3563"X and reading it line by line both on Windows and on Linux.

What happens when we are using a multi-byte encoding like UTF-16 or UTF-32? On the one hand, we will be getting false positives: "000A"X is a line end, but "0Ahh"X is not, irrespective of the value of hh. On the other hand, we will be getting lost sequences: a "000D"X that immediately precedes a "000A"X should be removed by Rexx, but the current versions do not remove it.

All these details have to be taken into account by this routine.
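
Both problems are easy to reproduce. The following Python sketch is an illustration only (the function names are hypothetical and this is not the RXU routine): it contrasts a byte-oriented split on "0A"X with a split performed after decoding UTF-16BE, using a sample that contains U+0A41, whose first byte is "0A"X.

```python
def naive_byte_split(data: bytes) -> list:
    """Split on every 0x0A byte, as a byte-oriented reader would."""
    return data.split(b"\x0a")

def split_utf16be_lines(data: bytes) -> list:
    """Decode first, then split on LF and drop a CR that precedes it."""
    text = data.decode("utf-16-be")
    lines = text.split("\n")
    return [line[:-1] if line.endswith("\r") else line for line in lines]

# "ab" CR LF, then U+0A41 followed by "c", LF, then "d"
sample = "ab\r\n\u0a41c\nd".encode("utf-16-be")

print(len(naive_byte_split(sample)))      # 4 pieces: one false positive
print(len(split_utf16be_lines(sample)))   # 3 lines
```

The naive split also leaves the "000D"X code unit (and a stray "00"X byte) attached to the first piece, which is the "lost sequence" problem mentioned above.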

Implementation restriction. Line positioning when line > 1 is not implemented when:

  • The end-of-line character is not "0A"X.
  • The encoding number of bytes per char is greater than 1.
  • The encoding is not fixed-length.

Some or all of these restrictions may be eliminated in a future release.

Please refer to the accompanying document Stream functions for Unicode for a comprehensive overview of the stream functions for Unicode-enabled streams.

LINEOUT

Diagram for the LINEOUT BIF

The LINEOUT BIF is enhanced by supporting the encoding options specified in the STREAM OPEN command.

  • When an encoding has not been specified for stream name, the standard BIF is called.
  • When an encoding has been specified for stream name, the string is encoded to that encoding; additionally, the encoding end-of-line sequence is used.

Implementation restriction. When line > 1, line positioning is not implemented in the following cases:

  • When the encoding is a variable-length encoding.
  • When the length of the encoding end-of-line character is greater than 1.
  • When the end-of-line character is not "0A"X.

Some or all of these restrictions may be eliminated in a future release.

Please refer to the accompanying document Stream functions for Unicode for a comprehensive overview of the stream functions for Unicode-enabled streams.

LINES

Diagram for the LINES BIF

The LINES BIF is modified to support the encoding options specified in the STREAM OPEN command.

Implementation restriction. LINES(name,"Count") will fail with a Syntax error when:

  • The encoding is not fixed-length.
  • The length of the encoding is greater than 1.
  • The encoding end-of-line character is different from "0A"X.

Some or all of these restrictions may be eliminated in a future release.

Please refer to the accompanying document Stream functions for Unicode for a comprehensive overview of the stream functions for Unicode-enabled streams.

LOWER

Diagram for the LOWER BIF

Works as the standard BIF does, but it operates on bytes, codepoints or extended grapheme clusters depending on whether string is a BYTES string, a CODEPOINTS string, or a GRAPHEMES or a TEXT string, respectively. When operating on CODEPOINTS, GRAPHEMES or TEXT strings, it implements the toLowercase(X) definition, as defined in rule R2 of section "Default Case Conversion" of The Unicode Standard, Version 15.0 – Core Specification:

Map each character C in X to Lowercase_Mapping(C).

Broadly speaking, Lowercase_Mapping(C) implements the Simple_Lowercase_Mapping property, as defined in the UnicodeData.txt file of the Unicode Character Database (UCD). Two exceptions to this mapping are defined in the SpecialCasing.txt file of the UCD. The first exception is not one-to-one: "0130"U, LATIN CAPITAL LETTER I WITH DOT ABOVE, lowercases to "0069 0307"U. The second exception is "03A3"U, GREEK CAPITAL LETTER SIGMA, which lowercases to "03C2"U, the final sigma, only in certain contexts (i.e., when it is not in a medial position).

Examples.

Lower("THIS")                                     -- "this"
Lower("MAMÁ"Y)                                    -- "mamÁ", since "MAMÁ"Y is a Classic Rexx string
Lower("MAMÁ"P)                                    -- "mamá"
Lower('ÁÉÍÓÚÝÀÈÌÒÙÄËÏÖÜÂÊÎÔÛÑÃÕÇ'T)               -- 'áéíóúýàèìòùäëïöüâêîôûñãõç'
Lower('ὈΔΥΣΣΕΎΣ'T)                                -- 'ὀδυσσεύς' (note the difference between medial and final sigmas)
Lower('Aİ')                                       -- 'ai̇' ("6169CC87"X)
Length(Lower('Aİ'))                               -- 3
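
For comparison, Python's built-in str.lower() also applies these two special cases, so it can be used to cross-check the expected results. The sketch below is an illustration only and says nothing about how RXU implements the mapping; the byte-level call mimics the behaviour on a Classic Rexx (BYTES) string, where only ASCII letters change.

```python
dotted = "A\u0130"                   # "A" + LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = dotted.lower()
print(lowered == "ai\u0307")         # True: U+0130 becomes U+0069 U+0307
print(len(lowered))                  # 3 codepoints

sigmas = "\u039F\u03A3\u039F\u03A3"  # capital omicron, sigma, omicron, sigma
print(sigmas.lower() == "\u03BF\u03C3\u03BF\u03C2")  # True: medial, then final sigma

# Byte-level lowercasing leaves the UTF-8 bytes of U+00C1 untouched.
print("MAM\u00C1".encode("utf-8").lower() == b"mam\xc3\x81")  # True
```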

POS

Diagram for the POS BIF

Works as the standard BIF does, but it operates on bytes, codepoints or extended grapheme clusters depending on whether haystack is a BYTES string, a CODEPOINTS string, or a GRAPHEMES or a TEXT string, respectively. If necessary, needle is converted to the type of haystack. If this conversion fails, a Syntax error is raised.

Examples:

Pos('s','string')                                 -- 1
needle = '👨'                                    -- A BYTES string
haystack = '(Woman)(Zwj)(Man)'U                   -- Another BYTES string
Pos(needle,haystack)                              -- 8
needle   = CODEPOINTS(needle)                     -- 1 codepoint
haystack = CODEPOINTS(haystack)                   -- 3 codepoints
Pos(needle,haystack)                              -- 3
needle   = TEXT(needle)                           -- 1 grapheme cluster
haystack = TEXT(haystack)                         -- 1 grapheme cluster
Pos(needle,haystack)                              -- 0 (not found)
Pos('FF'X,haystack)                               -- Syntax error ("FF"X is ill-formed)
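
The byte and codepoint positions can be verified with a short Python sketch, searching for the Man character (U+1F468) inside Woman + Zwj + Man. The helper names are hypothetical; like the Rexx BIF, they return a 1-based position, or 0 when the needle is not found.

```python
def pos_bytes(needle: str, haystack: str) -> int:
    """1-based position counted in UTF-8 bytes (0 if absent)."""
    return haystack.encode("utf-8").find(needle.encode("utf-8")) + 1

def pos_codepoints(needle: str, haystack: str) -> int:
    """1-based position counted in codepoints (0 if absent)."""
    return haystack.find(needle) + 1

man = "\U0001F468"
haystack = "\U0001F469\u200D\U0001F468"   # Woman + Zwj + Man

print(pos_bytes(man, haystack))           # 8
print(pos_codepoints(man, haystack))      # 3
```

At the grapheme cluster level the haystack is a single cluster, so a one-cluster needle that differs from it cannot be found, which is why the TEXT search returns 0.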

REVERSE

Diagram for the REVERSE BIF

Works as the standard BIF does, but it operates on bytes, codepoints or extended grapheme clusters depending on whether string is a BYTES string, a CODEPOINTS string, or a GRAPHEMES or a TEXT string, respectively.

Examples:

string = '(Woman)(Zwj)(Man)'U
Say string                                        -- ‍‍👩‍👨
Say string~c2x                                    -- F09F91A9E2808DF09F91A8
Say REVERSE(string)~c2x                           -- A8919FF08D80E2A9919FF0
string = CODEPOINTS(string)
Say REVERSE(string)                               -- 👨‍👩, i.e., '(Man)(Zwj)(Woman)'U
string = TEXT(string)
Say string == REVERSE(string)                     -- 1, since LENGTH(string) == 1
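
The byte-level and codepoint-level results can be reproduced with Python slices. This is an illustration only, with hypothetical helper names.

```python
def reverse_bytes_hex(s: str) -> str:
    """Hex digits of the UTF-8 bytes of s, taken in reverse order."""
    return s.encode("utf-8")[::-1].hex().upper()

def reverse_codepoints(s: str) -> str:
    return s[::-1]

string = "\U0001F469\u200D\U0001F468"    # Woman + Zwj + Man

print(reverse_bytes_hex(string))         # A8919FF08D80E2A9919FF0
print(reverse_codepoints(string) == "\U0001F468\u200D\U0001F469")  # True
```

Note that the byte-reversed value is no longer well-formed UTF-8, which is why the example above displays it with c2x instead of printing it.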

RIGHT

Diagram for the RIGHT BIF

Works as the standard BIF does, but it operates on bytes, codepoints or extended grapheme clusters depending on whether string is a BYTES string, a CODEPOINTS string, or a GRAPHEMES or a TEXT string, respectively. Before ensuring that the pad character is one character in length, pad is first converted, if necessary, to the type of string. If this conversion fails, a Syntax error is raised.

STREAM

The STREAM BIF is enhanced by adding encoding options to the OPEN and QUERY commands. In this version, ENCODING should be the last option specified, and it cannot be used with BINARY streams.

New options for the OPEN command

A new ENCODING fragment is added to the STREAM OPEN COMMAND:

Diagram for the STREAM COMMAND OPEN BIF

The format of the ENCODING fragment is the following:

Diagram for the ENCODING options of the STREAM COMMAND OPEN BIF

The encoding options are as follows:

  • ENCODING encoding specifies that the file is encoded (for reading) or is to be encoded (for writing) using the encoding encoding.
  • ENCODING encoding can be followed by any of SYNTAX, REPLACE, TEXT, GRAPHEMES or CODEPOINTS, in any order.
  • Only one of TEXT, GRAPHEMES or CODEPOINTS can be specified; TEXT is the default. This option determines the type of the strings (STRINGTYPE) that will be returned by the CHARIN and LINEIN BIFs.
  • Only one of SYNTAX or REPLACE can be specified; REPLACE is the default. When REPLACE is specified, ill-formed byte sequences are replaced by the Unicode Replacement Character (U+FFFD); when SYNTAX is specified, any ill-formed byte sequence raises a Syntax condition.
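
The option rules above (any order, at most one option of each kind, TEXT and REPLACE as defaults) can be expressed as a small parser. The following Python sketch is hypothetical: the function name, the returned dictionary and the decision to fold everything to upper case are assumptions made for illustration, not the behaviour of Stream.cls.

```python
TARGETS = ("TEXT", "GRAPHEMES", "CODEPOINTS")
ERROR_HANDLINGS = ("SYNTAX", "REPLACE")

def parse_encoding_options(words: str) -> dict:
    """Parse 'ENCODING name [options]' and apply the documented defaults."""
    tokens = words.upper().split()
    if len(tokens) < 2 or tokens[0] != "ENCODING":
        raise ValueError("expected: ENCODING name [options]")
    result = {"name": tokens[1], "target": None, "error_handling": None}
    for token in tokens[2:]:
        if token in TARGETS:
            key = "target"
        elif token in ERROR_HANDLINGS:
            key = "error_handling"
        else:
            raise ValueError(f"unknown option: {token}")
        if result[key] is not None:
            raise ValueError(f"only one {key} option can be specified")
        result[key] = token
    # Defaults: TEXT target, REPLACE error handling.
    result["target"] = result["target"] or "TEXT"
    result["error_handling"] = result["error_handling"] or "REPLACE"
    return result

print(parse_encoding_options("ENCODING UTF-16 SYNTAX CODEPOINTS"))
```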

New QUERY ENCODING commands

Diagram for the QUERY ENCODING STREAM COMMAND BIF
  • QUERY ENCODING returns a string consisting of three words, or a null string if no encoding was specified. If the returned string is not empty, it will contain the official encoding name, the encoding target (that is, TEXT, GRAPHEMES or CODEPOINTS), and the encoding error_handling (that is, SYNTAX or REPLACE).
  • QUERY ENCODING NAME returns the stream encoding official name, or a null string if no encoding was specified.
  • QUERY ENCODING TARGET returns TEXT, GRAPHEMES or CODEPOINTS, or a null string if no encoding was specified.
  • QUERY ENCODING ERROR returns SYNTAX or REPLACE, or a null string if no encoding was specified.
  • QUERY ENCODING LASTERROR returns the value of the characters that could not be encoded or decoded by the last stream operation. QUERY ENCODING LASTERROR will return a null string if no encoding or decoding errors have been produced in the stream name, or when the last operation was successful; if there was an error in the last stream operation, the offending line will be returned.

Modifications and restrictions to the SEEK and POSITION STREAM commands

Implementation restrictions. SEEK and POSITION will raise a Syntax error in the following cases:

For character positioning:

  • When the encoding is variable-length.
  • When TEXT has been selected as the encoding target type.

Positioning the stream at the start of the stream with an offset of "=1" will unconditionally succeed.

For line positioning, all the restrictions listed for character positioning apply, and, additionally:

  • When the encoding specifies a line-end different from "0A"X.

Some or all of these restrictions may be eliminated in a future release.

Note. The source code for the enhanced stream operations can be found in the file Stream.cls.

Please refer to the stream.rxu program in the samples subdirectory for examples.

Please refer to the accompanying document Stream functions for Unicode for a comprehensive overview of the stream functions for Unicode-enabled streams.

SUBSTR

   ╭─────────╮  ┌────────┐  ╭───╮  ┌───┐  ╭───╮                                    ╭───╮
▸▸─┤ SUBSTR( ├──┤ string ├──┤ , ├──┤ n ├──┤ , ├─┬────────────┬──┬────────────────┬─┤ ) ├─▸◂
   ╰─────────╯  └────────┘  ╰───╯  └───┘  ╰───╯ │ ┌────────┐ │  │ ╭───╮  ┌─────┐ │ ╰───╯
                                                └─┤ length ├─┘  └─┤ , ├──┤ pad ├─┘
                                                  └────────┘      ╰───╯  └─────┘

Works as the standard BIF does, but it operates on bytes, codepoints or extended grapheme clusters depending on whether string is a BYTES string, a CODEPOINTS string, or a GRAPHEMES or a TEXT string, respectively. Before ensuring that the pad character is one character in length, pad is first converted, if necessary, to the type of string. If this conversion fails, a Syntax error is raised.

UPPER

Diagram for the UPPER BIF

Works as the standard BIF does, but it operates on bytes, codepoints or extended grapheme clusters depending on whether string is a BYTES string, a CODEPOINTS string, or a GRAPHEMES or a TEXT string, respectively. When operating on CODEPOINTS, GRAPHEMES or TEXT strings, it implements the toUppercase(X) definition, as defined in rule R1 of section "Default Case Conversion" of The Unicode Standard, Version 15.0 – Core Specification:

Map each character C in X to Uppercase_Mapping(C).

Broadly speaking, Uppercase_Mapping(C) implements the Simple_Uppercase_Mapping property, as defined in the UnicodeData.txt file of the Unicode Character Database (UCD), but a number of exceptions, defined in the SpecialCasing.txt file of the UCD, have to be applied. Additionally, the iota subscript, "0345"U, receives special treatment.

Examples.

Upper("this")                                     -- "THIS"
Upper("mamá"Y)                                    -- "MAMá", since "mamá"Y is a Classic Rexx string
Upper("mamá"P)                                    -- "MAMÁ"
Upper('áéíóúýàèìòùäëïöïÿâêîôûñãõç')               -- 'ÁÉÍÓÚÝÀÈÌÒÙÄËÏÖÏŸÂÊÎÔÛÑÃÕÇ'
Upper('ᾴ')                                        -- 'ΆΙ' ("03B1 0345 0301"U --> "0391 0301 0399"U)
Upper('Straße')                                   -- 'STRASSE' (See the uppercasing of the German eszett)
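
The difference between byte-level and Unicode-aware uppercasing, and the one-to-many mapping of the eszett, can be cross-checked with Python's built-in methods. This is an illustration only, not the RXU implementation, and it does not cover the special treatment of the iota subscript.

```python
word = "mam\u00E1"                             # "mamá"

# Unicode-aware uppercasing, comparable to a CODEPOINTS or TEXT string.
print(word.upper() == "MAM\u00C1")             # True

# Byte-level uppercasing, comparable to a Classic Rexx (BYTES) string:
# only the ASCII letters change, the UTF-8 bytes of "á" stay as they are.
print(word.encode("utf-8").upper() == b"MAM\xc3\xa1")   # True

# One-to-many mapping taken from SpecialCasing.txt.
print("Stra\u00DFe".upper())                   # STRASSE
```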