Release notes for version 0.1d, 20230719

/******************************************************************************
 * This file is part of The Unicode Tools Of Rexx (TUTOR)                     *
 * See https://rexx.epbcn.com/TUTOR/                                          *
 *     and https://github.com/JosepMariaBlasco/TUTOR                          *
 * Copyright © 2023-2025 Josep Maria Blasco <josep.maria.blasco@epbcn.com>    *
 * License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)  *
 ******************************************************************************/

Today's release of the Unicode Toys contains a number of substantial enhancements.

The Rexx Preprocessor for Unicode, "rxu". You write a program with the ".rxu" extension; this program is a normal (oo)Rexx program, with some extensions (defined below). Then you enter "rxu programname arguments" on the command line, and the Rexx Preprocessor translates your ".rxu" file into a ".rex" one, and then calls this new ".rex" file with the supplied arguments.
The Rexx Tokenizer. Written in Rexx, it supports Regina, ANSI Rexx, ooRexx, and the corresponding Unicode extensions (defined below). This is a prototype.
New syntactic constructs:

U-strings, like "(Woman) (zwj) (Man) (zwj) (Woman) (zwj) (Girl) (Father Christmas)"U. They are Text strings, and are resolved at parse-time (to '👩‍👨‍👩‍👧🎅', in this case). You can include codepoints using the usual hex notation, or a Unicode name, alias or label between parenthesis. If a codepoint is invalid (i.e., > 10FFF or a surrogate), or if a string between parenthesis does not resolve to a codepoint, a syntax error is raised at parse time.

Example U-strings:

"(Father Christmas)"U = "🎅"      -- An emoji, by name
-- "(Father Christmasx)"U         -- Syntax error at parse time, no codepoint is named "Father Christmasx"
-- "(Father Christmas"U           -- Syntax error at parse time (missing right parenthesis)
"(New line)"U = "0A"X             -- An alias. See http://www.unicode.org/Public/UCD/latest/ucd/NameAliases.txt
"(LF)"U = "0A"X                   -- Another alias
--"<Control-000A>"U = "0A"X       -- 000A has no name, but it has a label
"000A"U = "0A"X                   -- A codepoint
--"110000"U                       -- Syntax error (codepoint > 10FFF)
"DB7F"                            -- Syntax error (a surrogate)
"(LATIN CAPITAL LETTER A)"U = "A" -- The official name for this codepoint
"(Latin capital letter A)"U = "A" -- Case insensitive
"(LATIN-CAPITAL-LETTER-A)"U = "A" -- The standard allows these variations
"(LATINCAPITALLETTERA)"U = "A"    -- And these too

T-strings, like "noël👩‍👨‍👩‍👧🎅"T. They are Text strings, and they are checked at parse time for UTF-8 correctness (i.e., if there is an invalid UTF-8 string, a syntax error is raised).

R-strings, like "noël👩‍👨‍👩‍👧🎅"R. They are Runes strings, and they are checked at parse time for UTF-8 correctness (i.e., if there is an invalid UTF-8 string, a syntax error is raised).

The RXU preprocessor handles the syntax checking for U-, T- and T-strings, and translates them to "normal" Rexx, assuming the use of Unicode.cls. For example, "noël👩‍👨‍👩‍👧🎅"T is translated to Text("noël👩‍👨‍👩‍👧🎅").

RXU also substitutes several BIF function calls and adds an exclamation mark at the beginning of the function name. For example, l = length(var) is translated to l = !length(var). Unicode.cls then defines the !-BIFs, which are rerouted to the corresponding BIMs (i.e., to Bytes [.String], Runes, or Text, as appropiate).

This way you can write Length("noël👩‍👨‍👩‍👧🎅"), and be sure that you will get the correct result (34 for Bytes, 12 for Runes, and 6 for Text), without having to add internal function handlers for Length.

RXU unconditionally adds a last ::Requires Unicode.cls line to the generated .rex file. The ooRexx processor doesn't complain if there are several of these, and this way we ensure that we have access to the new classes, the !-BIFs and the new BIFs (like BYTES(), TEXT() or RUNES()).

I would have liked to write some documentation, apart from the one that is included inside the source files, but I have opted to release the code first. That way I will be able to incorporate your comments/suggestions, etc., and I will not have to write the documentation twice.

I am especially interested in your comments about the syntax of the U-Strings, possible extensions, etc.

I am attaching a small sample program and its output below my signature.

Want to give it a try? Just download everything from https://github.com/RexxLA/rexx-repository/tree/master/ARB/standards/work-in-progress/unicode/UnicodeToys (including the subdirectories) and start to experiment.

Josep Maria

------------------------- Sample program "sample.rxu" -------------------------
text = "(Woman) (zwj) (Man) (zwj) (Woman) (zwj) (Girl) (Father Christmas)"U

Say "Text is: '"text"'."
Say "It is a" StringType(text) "string."
Say "Its length is" Length(text)"."
Do i = 1 To Length(text)
  Say "  "i":" text[i] "('"c2x(text[i])"'X)"
End
Say "Reversed, it's '"Reverse(text)"'."

Say

Say "Now we will convert the Text string to a Runes string."
text = Runes(text)
Say "Text is: '"text"'."
Say "It is a" StringType(text) "string."
Say "Its length is" Length(text)"."
Do i = 1 To Length(text)
  Say "  "i":" text[i] "('"c2x(text[i])"'X)"
End
Say "Reversed, it's '"Reverse(text)"'."

Say

text = "noël👩‍👨‍👩‍👧🎅"T
Say "'"text"'T is a" StringType(text) "string of length" Length(text)"."

Do i = 8 To 1 By -1
  Say "  Left('"text","i"') = '"Left(text,i)"'"
End

Say

text = "noël👩‍👨‍👩‍👧🎅"R
Say "'"text"'R is a" StringType(text) "string of length" Length(text)"."
Do i = 14 To 1 By -1
  Say "  Left('"text","Right(i,2)"') = '"Left(text,i)"'"
End

Say

text = "noël👩‍👨‍👩‍👧🎅"
Say "'"text"' is a" StringType(text) "string of length" Length(text)"."
---------------------- End of sample program "sample.rxu" ---------------------

-------------------- Output from the "rxu sample" command ---------------------
Text is: '👩‍👨‍👩‍👧🎅'.
It is a TEXT string.
Its length is 2.
  1: 👩‍👨‍👩‍👧 ('F09F91A9E2808DF09F91A8E2808DF09F91A9E2808DF09F91A7'X)
  2: 🎅 ('F09F8E85'X)
Reversed, it's '🎅👩‍👨‍👩‍👧'.

Now we will convert the Text string to a Runes string.
Text is: '👩‍👨‍👩‍👧🎅'.
It is a RUNES string.
Its length is 8.
  1: 👩 ('F09F91A9'X)
  2: ‍ ('E2808D'X)
  3: 👨 ('F09F91A8'X)
  4: ‍ ('E2808D'X)
  5: 👩 ('F09F91A9'X)
  6: ‍ ('E2808D'X)
  7: 👧 ('F09F91A7'X)
  8: 🎅 ('F09F8E85'X)
Reversed, it's '🎅👧‍👩‍👨‍👩'.

'noël👩‍👨‍👩‍👧🎅'T is a TEXT string of length 6.
  Left('noël👩‍👨‍👩‍👧🎅,8') = 'noël👩‍👨‍👩‍👧🎅  '
  Left('noël👩‍👨‍👩‍👧🎅,7') = 'noël👩‍👨‍👩‍👧🎅 '
  Left('noël👩‍👨‍👩‍👧🎅,6') = 'noël👩‍👨‍👩‍👧🎅'
  Left('noël👩‍👨‍👩‍👧🎅,5') = 'noël👩‍👨‍👩‍👧'
  Left('noël👩‍👨‍👩‍👧🎅,4') = 'noël'
  Left('noël👩‍👨‍👩‍👧🎅,3') = 'noë'
  Left('noël👩‍👨‍👩‍👧🎅,2') = 'no'
  Left('noël👩‍👨‍👩‍👧🎅,1') = 'n'

'noël👩‍👨‍👩‍👧🎅'R is a RUNES string of length 12.
  Left('noël👩‍👨‍👩‍👧🎅,14') = 'noël👩‍👨‍👩‍👧🎅  '
  Left('noël👩‍👨‍👩‍👧🎅,13') = 'noël👩‍👨‍👩‍👧🎅 '
  Left('noël👩‍👨‍👩‍👧🎅,12') = 'noël👩‍👨‍👩‍👧🎅'
  Left('noël👩‍👨‍👩‍👧🎅,11') = 'noël👩‍👨‍👩‍👧'
  Left('noël👩‍👨‍👩‍👧🎅,10') = 'noël👩‍👨‍👩‍'
  Left('noël👩‍👨‍👩‍👧🎅, 9') = 'noël👩‍👨‍👩'
  Left('noël👩‍👨‍👩‍👧🎅, 8') = 'noël👩‍👨‍'
  Left('noël👩‍👨‍👩‍👧🎅, 7') = 'noël👩‍👨'
  Left('noël👩‍👨‍👩‍👧🎅, 6') = 'noël👩‍'
  Left('noël👩‍👨‍👩‍👧🎅, 5') = 'noël👩'
  Left('noël👩‍👨‍👩‍👧🎅, 4') = 'noël'
  Left('noël👩‍👨‍👩‍👧🎅, 3') = 'noë'
  Left('noël👩‍👨‍👩‍👧🎅, 2') = 'no'
  Left('noël👩‍👨‍👩‍👧🎅, 1') = 'n'

'noël👩‍👨‍👩‍👧🎅' is a BYTES string of length 34.
----------------- End of output from the "rxu sample" command -----------------