Release notes for version 0.1d, 20230719
/******************************************************************************
* This file is part of The Unicode Tools Of Rexx (TUTOR) *
* See https://rexx.epbcn.com/TUTOR/ *
* and https://github.com/JosepMariaBlasco/TUTOR *
* Copyright © 2023-2025 Josep Maria Blasco <josep.maria.blasco@epbcn.com> *
* License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0) *
******************************************************************************/
Today's release of the Unicode Toys contains a number of substantial enhancements.
- The Rexx Preprocessor for Unicode, "rxu". You write a program with the ".rxu" extension; this program is a normal (oo)Rexx program, with some extensions (defined below). Then you enter "rxu programname arguments" on the command line, and the Rexx Preprocessor translates your ".rxu" file into a ".rex" one, and then calls this new ".rex" file with the supplied arguments.
- The Rexx Tokenizer. Written in Rexx, it supports Regina, ANSI Rexx, ooRexx, and the corresponding Unicode extensions (defined below). This is a prototype.
- New syntactic constructs:
U-strings, like "(Woman) (zwj) (Man) (zwj) (Woman) (zwj) (Girl) (Father Christmas)"U. They are Text strings, and are resolved at parse time (to '👩‍👨‍👩‍👧🎅', in this case). You can include codepoints using the usual hex notation, or a Unicode name, alias or label between parentheses. If a codepoint is invalid (i.e., > 10FFFF or a surrogate), or if a string between parentheses does not resolve to a codepoint, a syntax error is raised at parse time.
Example U-strings:
"(Father Christmas)"U = "🎅" -- An emoji, by name
-- "(Father Christmasx)"U -- Syntax error at parse time, no codepoint is named "Father Christmasx"
-- "(Father Christmas"U -- Syntax error at parse time (missing right parenthesis)
"(New line)"U = "0A"X -- An alias. See http://www.unicode.org/Public/UCD/latest/ucd/NameAliases.txt
"(LF)"U = "0A"X -- Another alias
--"<Control-000A>"U = "0A"X -- 000A has no name, but it has a label
"000A"U = "0A"X -- A codepoint
--"110000"U -- Syntax error (codepoint > 10FFFF)
--"DB7F"U -- Syntax error (a surrogate)
"(LATIN CAPITAL LETTER A)"U = "A" -- The official name for this codepoint
"(Latin capital letter A)"U = "A" -- Case insensitive
"(LATIN-CAPITAL-LETTER-A)"U = "A" -- The standard allows these variations
"(LATINCAPITALLETTERA)"U = "A" -- And these too
T-strings, like "noël👩‍👨‍👩‍👧🎅"T. They are Text strings, and they are checked at parse time for UTF-8 correctness (i.e., if there is an invalid UTF-8 string, a syntax error is raised).
R-strings, like "noël👩‍👨‍👩‍👧🎅"R. They are Runes strings, and they are checked at parse time for UTF-8 correctness (i.e., if there is an invalid UTF-8 string, a syntax error is raised).
The RXU preprocessor handles the syntax checking for U-, T- and R-strings, and translates them to "normal" Rexx, assuming the use of Unicode.cls. For example, "noël👩‍👨‍👩‍👧🎅"T is translated to Text("noël👩‍👨‍👩‍👧🎅").
RXU also rewrites calls to several BIFs, adding an exclamation mark at the beginning of the function name. For example, l = length(var) is translated to l = !length(var). Unicode.cls then defines the !-BIFs, which are rerouted to the corresponding BIMs (i.e., to Bytes [.String], Runes, or Text, as appropriate).
This way you can write Length("noël👩‍👨‍👩‍👧🎅"), and be sure that you will get the correct result (34 for Bytes, 12 for Runes, and 6 for Text), without having to add internal function handlers for Length.
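The effect of the rerouting can be seen by calling the same BIF on the three string types. This is a minimal sketch, assuming a hypothetical file "lengths.rxu" run with "rxu lengths"; the variable names are illustrative, and the expected values in the comments are the ones quoted in these notes.

```rexx
-- lengths.rxu (hypothetical file name): one BIF, three string types
b = "noël👩‍👨‍👩‍👧🎅"     -- unsuffixed literal: a Bytes string
r = "noël👩‍👨‍👩‍👧🎅"R    -- Runes string: one position per codepoint
t = "noël👩‍👨‍👩‍👧🎅"T    -- Text string: the whole ZWJ sequence is one position

Say StringType(b) Length(b)   -- BYTES 34
Say StringType(r) Length(r)   -- RUNES 12
Say StringType(t) Length(t)   -- TEXT 6
```

The program text calls Length three times in exactly the same way; only the type of the argument changes the result.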
RXU unconditionally adds a final ::Requires Unicode.cls line to the generated .rex file. The ooRexx processor doesn't complain if there are several of these, and this way we ensure that we have access to the new classes, the !-BIFs and the new BIFs (like BYTES(), TEXT() or RUNES()).
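Taken together, the three transformations described above (typed string literals, !-BIFs and the final ::Requires line) suggest the following before/after picture. This is a hand-written sketch based only on the rules stated in these notes, not actual preprocessor output; the file names "demo.rxu" and "demo.rex" are hypothetical, and the real generated file may differ in layout and detail.

```rexx
-- Source, as written in a hypothetical "demo.rxu":
--
--   t = "noël"T
--   l = length(t)
--   Say l
--
-- Approximate shape of the generated "demo.rex":

t = Text("noël")    -- T-string rewritten as a call to Text()
l = !length(t)      -- BIF call rerouted to the !-BIF defined by Unicode.cls
Say l

::Requires Unicode.cls
```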
I would have liked to write some documentation, apart from the one that is included inside the source files, but I have opted to release the code first. That way I will be able to incorporate your comments/suggestions, etc., and I will not have to write the documentation twice.
I am especially interested in your comments about the syntax of the U-Strings, possible extensions, etc.
I am attaching a small sample program and its output below my signature.
Want to give it a try? Just download everything from https://github.com/RexxLA/rexx-repository/tree/master/ARB/standards/work-in-progress/unicode/UnicodeToys (including the subdirectories) and start to experiment.
Josep Maria
------------------------- Sample program "sample.rxu" -------------------------
text = "(Woman) (zwj) (Man) (zwj) (Woman) (zwj) (Girl) (Father Christmas)"U
Say "Text is: '"text"'."
Say "It is a" StringType(text) "string."
Say "Its length is" Length(text)"."
Do i = 1 To Length(text)
Say " "i":" text[i] "('"c2x(text[i])"'X)"
End
Say "Reversed, it's '"Reverse(text)"'."
Say
Say "Now we will convert the Text string to a Runes string."
text = Runes(text)
Say "Text is: '"text"'."
Say "It is a" StringType(text) "string."
Say "Its length is" Length(text)"."
Do i = 1 To Length(text)
Say " "i":" text[i] "('"c2x(text[i])"'X)"
End
Say "Reversed, it's '"Reverse(text)"'."
Say
text = "noël👩‍👨‍👩‍👧🎅"T
Say "'"text"'T is a" StringType(text) "string of length" Length(text)"."
Do i = 8 To 1 By -1
Say " Left('"text","i"') = '"Left(text,i)"'"
End
Say
text = "noël👩‍👨‍👩‍👧🎅"R
Say "'"text"'R is a" StringType(text) "string of length" Length(text)"."
Do i = 14 To 1 By -1
Say " Left('"text","Right(i,2)"') = '"Left(text,i)"'"
End
Say
text = "noël👩‍👨‍👩‍👧🎅"
Say "'"text"' is a" StringType(text) "string of length" Length(text)"."
---------------------- End of sample program "sample.rxu" ---------------------
-------------------- Output from the "rxu sample" command ---------------------
Text is: '👩‍👨‍👩‍👧🎅'.
It is a TEXT string.
Its length is 2.
1: 👩‍👨‍👩‍👧 ('F09F91A9E2808DF09F91A8E2808DF09F91A9E2808DF09F91A7'X)
2: 🎅 ('F09F8E85'X)
Reversed, it's '🎅👩‍👨‍👩‍👧'.
Now we will convert the Text string to a Runes string.
Text is: '👩‍👨‍👩‍👧🎅'.
It is a RUNES string.
Its length is 8.
1: 👩 ('F09F91A9'X)
2: ‍ ('E2808D'X)
3: 👨 ('F09F91A8'X)
4: ‍ ('E2808D'X)
5: 👩 ('F09F91A9'X)
6: ‍ ('E2808D'X)
7: 👧 ('F09F91A7'X)
8: 🎅 ('F09F8E85'X)
Reversed, it's '🎅👧‍👩‍👨‍👩'.
'noël👩‍👨‍👩‍👧🎅'T is a TEXT string of length 6.
Left('noël👩‍👨‍👩‍👧🎅,8') = 'noël👩‍👨‍👩‍👧🎅'
Left('noël👩‍👨‍👩‍👧🎅,7') = 'noël👩‍👨‍👩‍👧🎅'
Left('noël👩‍👨‍👩‍👧🎅,6') = 'noël👩‍👨‍👩‍👧🎅'
Left('noël👩‍👨‍👩‍👧🎅,5') = 'noël👩‍👨‍👩‍👧'
Left('noël👩‍👨‍👩‍👧🎅,4') = 'noël'
Left('noël👩‍👨‍👩‍👧🎅,3') = 'noë'
Left('noël👩‍👨‍👩‍👧🎅,2') = 'no'
Left('noël👩‍👨‍👩‍👧🎅,1') = 'n'
'noël👩‍👨‍👩‍👧🎅'R is a RUNES string of length 12.
Left('noël👩‍👨‍👩‍👧🎅,14') = 'noël👩‍👨‍👩‍👧🎅'
Left('noël👩‍👨‍👩‍👧🎅,13') = 'noël👩‍👨‍👩‍👧🎅'
Left('noël👩‍👨‍👩‍👧🎅,12') = 'noël👩‍👨‍👩‍👧🎅'
Left('noël👩‍👨‍👩‍👧🎅,11') = 'noël👩‍👨‍👩‍👧'
Left('noël👩‍👨‍👩‍👧🎅,10') = 'noël👩‍👨‍👩‍'
Left('noël👩‍👨‍👩‍👧🎅, 9') = 'noël👩‍👨‍👩'
Left('noël👩‍👨‍👩‍👧🎅, 8') = 'noël👩‍👨‍'
Left('noël👩‍👨‍👩‍👧🎅, 7') = 'noël👩‍👨'
Left('noël👩‍👨‍👩‍👧🎅, 6') = 'noël👩‍'
Left('noël👩‍👨‍👩‍👧🎅, 5') = 'noël👩'
Left('noël👩‍👨‍👩‍👧🎅, 4') = 'noël'
Left('noël👩‍👨‍👩‍👧🎅, 3') = 'noë'
Left('noël👩‍👨‍👩‍👧🎅, 2') = 'no'
Left('noël👩‍👨‍👩‍👧🎅, 1') = 'n'
'noël👩‍👨‍👩‍👧🎅' is a BYTES string of length 34.
----------------- End of output from the "rxu sample" command -----------------