Release notes for version 0.1d, 20230719
/******************************************************************************
* This file is part of The Unicode Tools Of Rexx (TUTOR) *
* See https://rexx.epbcn.com/TUTOR/ *
* and https://github.com/JosepMariaBlasco/TUTOR *
* Copyright ยฉ 2023-2025 Josep Maria Blasco <josep.maria.blasco@epbcn.com> *
* License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0) *
******************************************************************************/
Today's release of the Unicode Toys contains a number of substantial enhancements.
- The Rexx Preprocessor for Unicode, "rxu". You write a program with the ".rxu" extension; this program is a normal (oo)Rexx program, with some extensions (defined below). Then you enter "rxu programname arguments" on the command line, and the Rexx Preprocessor translates your ".rxu" file into a ".rex" one, and then calls this new ".rex" file with the supplied arguments.
- The Rexx Tokenizer. Written in Rexx, it supports Regina, ANSI Rexx, ooRexx, and the corresponding Unicode extensions (defined below). This is a prototype.
- New syntactic constructs:
U-strings, like "(Woman) (zwj) (Man) (zwj) (Woman) (zwj) (Girl) (Father Christmas)"U. They are Text strings, and are resolved at parse-time (to '๐ฉโ๐จโ๐ฉโ๐ง๐ ', in this case). You can include codepoints using the usual hex notation, or a Unicode name, alias or label between parenthesis. If a codepoint is invalid (i.e., > 10FFF or a surrogate), or if a string between parenthesis does not resolve to a codepoint, a syntax error is raised at parse time.
Example U-strings:
"(Father Christmas)"U = "๐
" -- An emoji, by name
-- "(Father Christmasx)"U -- Syntax error at parse time, no codepoint is named "Father Christmasx"
-- "(Father Christmas"U -- Syntax error at parse time (missing right parenthesis)
"(New line)"U = "0A"X -- An alias. See http://www.unicode.org/Public/UCD/latest/ucd/NameAliases.txt
"(LF)"U = "0A"X -- Another alias
--"<Control-000A>"U = "0A"X -- 000A has no name, but it has a label
"000A"U = "0A"X -- A codepoint
--"110000"U -- Syntax error (codepoint > 10FFF)
"DB7F" -- Syntax error (a surrogate)
"(LATIN CAPITAL LETTER A)"U = "A" -- The official name for this codepoint
"(Latin capital letter A)"U = "A" -- Case insensitive
"(LATIN-CAPITAL-LETTER-A)"U = "A" -- The standard allows these variations
"(LATINCAPITALLETTERA)"U = "A" -- And these too
T-strings, like "noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
"T. They are
Text strings, and they are checked at parse time for
UTF-8 correctness (i.e., if there is an invalid UTF-8 string, a syntax
error is raised).
R-strings, like "noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
"R. They are
Runes strings, and they are checked at parse time for
UTF-8 correctness (i.e., if there is an invalid UTF-8 string, a syntax
error is raised).
The RXU preprocessor handles the syntax checking for U-, T- and
T-strings, and translates them to "normal" Rexx, assuming the use of
Unicode.cls. For example, "noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
"T is translated to
Text("noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
").
RXU also substitutes several BIF function calls and adds an
exclamation mark at the beginning of the function name. For example,
l = length(var) is translated to
l = !length(var). Unicode.cls then defines the
!-BIFs, which are rerouted to the corresponding BIMs (i.e., to
Bytes [.String], Runes, or
Text, as appropiate).
This way you can write Length("noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
"), and be sure
that you will get the correct result (34 for Bytes, 12 for Runes, and 6
for Text), without having to add internal function handlers for
Length.
RXU unconditionally adds a last ::Requires Unicode.cls
line to the generated .rex file. The ooRexx processor
doesn't complain if there are several of these, and this way we ensure
that we have access to the new classes, the !-BIFs and the new BIFs
(like BYTES(), TEXT() or
RUNES()).
I would have liked to write some documentation, apart from the one that is included inside the source files, but I have opted to release the code first. That way I will be able to incorporate your comments/suggestions, etc., and I will not have to write the documentation twice.
I am especially interested in your comments about the syntax of the U-Strings, possible extensions, etc.
I am attaching a small sample program and its output below my signature.
Want to give it a try? Just download everything from https://github.com/RexxLA/rexx-repository/tree/master/ARB/standards/work-in-progress/unicode/UnicodeToys (including the subdirectories) and start to experiment.
Josep Maria
------------------------- Sample program "sample.rxu" -------------------------
text = "(Woman) (zwj) (Man) (zwj) (Woman) (zwj) (Girl) (Father Christmas)"U
Say "Text is: '"text"'."
Say "It is a" StringType(text) "string."
Say "Its length is" Length(text)"."
Do i = 1 To Length(text)
Say " "i":" text[i] "('"c2x(text[i])"'X)"
End
Say "Reversed, it's '"Reverse(text)"'."
Say
Say "Now we will convert the Text string to a Runes string."
text = Runes(text)
Say "Text is: '"text"'."
Say "It is a" StringType(text) "string."
Say "Its length is" Length(text)"."
Do i = 1 To Length(text)
Say " "i":" text[i] "('"c2x(text[i])"'X)"
End
Say "Reversed, it's '"Reverse(text)"'."
Say
text = "noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
"T
Say "'"text"'T is a" StringType(text) "string of length" Length(text)"."
Do i = 8 To 1 By -1
Say " Left('"text","i"') = '"Left(text,i)"'"
End
Say
text = "noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
"R
Say "'"text"'R is a" StringType(text) "string of length" Length(text)"."
Do i = 14 To 1 By -1
Say " Left('"text","Right(i,2)"') = '"Left(text,i)"'"
End
Say
text = "noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
"
Say "'"text"' is a" StringType(text) "string of length" Length(text)"."
---------------------- End of sample program "sample.rxu" ---------------------
-------------------- Output from the "rxu sample" command ---------------------
Text is: '๐ฉโ๐จโ๐ฉโ๐ง๐
'.
It is a TEXT string.
Its length is 2.
1: ๐ฉโ๐จโ๐ฉโ๐ง ('F09F91A9E2808DF09F91A8E2808DF09F91A9E2808DF09F91A7'X)
2: ๐
('F09F8E85'X)
Reversed, it's '๐
๐ฉโ๐จโ๐ฉโ๐ง'.
Now we will convert the Text string to a Runes string.
Text is: '๐ฉโ๐จโ๐ฉโ๐ง๐
'.
It is a RUNES string.
Its length is 8.
1: ๐ฉ ('F09F91A9'X)
2: โ ('E2808D'X)
3: ๐จ ('F09F91A8'X)
4: โ ('E2808D'X)
5: ๐ฉ ('F09F91A9'X)
6: โ ('E2808D'X)
7: ๐ง ('F09F91A7'X)
8: ๐
('F09F8E85'X)
Reversed, it's '๐
๐งโ๐ฉโ๐จโ๐ฉ'.
'noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
'T is a TEXT string of length 6.
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
,8') = 'noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
,7') = 'noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
,6') = 'noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
,5') = 'noรซl๐ฉโ๐จโ๐ฉโ๐ง'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
,4') = 'noรซl'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
,3') = 'noรซ'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
,2') = 'no'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
,1') = 'n'
'noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
'R is a RUNES string of length 12.
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
,14') = 'noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
,13') = 'noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
,12') = 'noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
,11') = 'noรซl๐ฉโ๐จโ๐ฉโ๐ง'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
,10') = 'noรซl๐ฉโ๐จโ๐ฉโ'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
, 9') = 'noรซl๐ฉโ๐จโ๐ฉ'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
, 8') = 'noรซl๐ฉโ๐จโ'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
, 7') = 'noรซl๐ฉโ๐จ'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
, 6') = 'noรซl๐ฉโ'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
, 5') = 'noรซl๐ฉ'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
, 4') = 'noรซl'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
, 3') = 'noรซ'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
, 2') = 'no'
Left('noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
, 1') = 'n'
'noรซl๐ฉโ๐จโ๐ฉโ๐ง๐
' is a BYTES string of length 34.
----------------- End of output from the "rxu sample" command -----------------
