A toy ooRexx implementation of the General_Category Unicode property (20230711)


/******************************************************************************
 * This file is part of The Unicode Tools Of Rexx (TUTOR)                     *
 * See https://rexx.epbcn.com/TUTOR/                                          *
 *     and https://github.com/JosepMariaBlasco/TUTOR                          *
 * Copyright © 2023-2025 Josep Maria Blasco <josep.maria.blasco@epbcn.com>    *
 * License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)  *
 ******************************************************************************/

I have written a toy, pure ooRexx, implementation of the General_Category Unicode property.

General_Category (abbr: gc) can be found as the third column of UnicodeData.txt (see https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt). It maps codepoints to an enumeration (see https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf, section 4.5, "General Category", on p. 172 for details).
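To make the file layout concrete, here is a minimal sketch of pulling gc out of UnicodeData.txt lines. It is written in Python rather than ooRexx and is not taken from the TUTOR sources; the helper name parse_gc is hypothetical, and the four sample lines are copied from UnicodeData.txt. Note that the file also encodes some large ranges as a pair of "<..., First>" and "<..., Last>" lines, which this sketch does not handle.

```python
# A few lines as they appear in UnicodeData.txt; fields are separated by ';'.
SAMPLE = """\
0020;SPACE;Zs;0;WS;;;;;N;;;;;
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""

def parse_gc(line):
    """Return (codepoint, gc) for one UnicodeData.txt line (hypothetical helper)."""
    fields = line.split(";")
    # Field 1 is the hexadecimal codepoint, field 3 is General_Category.
    return int(fields[0], 16), fields[2]

# Map each sample codepoint to its General_Category value.
table = dict(parse_gc(line) for line in SAMPLE.splitlines())

for cp, gc in sorted(table.items()):
    print(f"U+{cp:04X} -> {gc}")
```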

Here is the list of possible values for gc:

  • Lu = Letter, uppercase
  • Ll = Letter, lowercase
  • Lt = Letter, titlecase
  • Lm = Letter, modifier
  • Lo = Letter, other
  • Mn = Mark, nonspacing
  • Mc = Mark, spacing combining
  • Me = Mark, enclosing
  • Nd = Number, decimal digit
  • Nl = Number, letter
  • No = Number, other
  • Pc = Punctuation, connector
  • Pd = Punctuation, dash
  • Ps = Punctuation, open
  • Pe = Punctuation, close
  • Pi = Punctuation, initial quote (may behave like Ps or Pe depending on usage)
  • Pf = Punctuation, final quote (may behave like Ps or Pe depending on usage)
  • Po = Punctuation, other
  • Sm = Symbol, math
  • Sc = Symbol, currency
  • Sk = Symbol, modifier
  • So = Symbol, other
  • Zs = Separator, space
  • Zl = Separator, line
  • Zp = Separator, paragraph
  • Cc = Other, control
  • Cf = Other, format
  • Cs = Other, surrogate
  • Co = Other, private use
  • Cn = Other, not assigned (including noncharacters)

Unicode implementations make ample use of this property (and, of course, of many others). For example, Go's unicode package provides a boolean function called "IsLetter" that returns true when gc is L* (that is, Lu, Ll, Lt, Lm or Lo).
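A predicate of this kind is easy to derive once gc is available. The following is a small Python sketch, not TUTOR code: the function name is_letter is hypothetical, and it relies on Python's standard unicodedata.category, which reports gc for whatever Unicode version the running Python was built with (not necessarily 15.0.0).

```python
import unicodedata

def is_letter(ch):
    """True when the General_Category of ch is Lu, Ll, Lt, Lm or Lo."""
    return unicodedata.category(ch).startswith("L")

# A letter from each L* subcategory, followed by two non-letters.
for ch in ("A", "a", "\u01c5", "\u02b0", "\u05d0", "1", " "):
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {is_letter(ch)}")
```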

The class file scans the included file UnicodeData.15.0.0.txt and builds a two-stage table, which is then stored in a binary file and reused on subsequent runs.
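For readers unfamiliar with the technique, here is a minimal Python sketch of a generic two-stage table. It is not the TUTOR implementation: the block size of 256, the names build_two_stage and lookup, and the in-memory layout are all assumptions made for illustration, and the gc values come from Python's unicodedata module instead of UnicodeData.15.0.0.txt. Writing the tables to a binary file is left out.

```python
import unicodedata

BLOCK = 256          # codepoints per block (an assumed size)
MAX_CP = 0x110000    # one past the last Unicode codepoint

def build_two_stage(gc_of):
    """Build (stage1, stage2, names) from a codepoint -> gc function."""
    names = []       # small integer code -> gc value
    code_of = {}     # gc value -> small integer code
    stage1 = []      # block number -> index into stage2
    stage2 = []      # distinct blocks, each a bytes object of gc codes
    seen = {}        # block contents -> index into stage2
    for start in range(0, MAX_CP, BLOCK):
        block = bytearray()
        for cp in range(start, start + BLOCK):
            gc = gc_of(cp)
            if gc not in code_of:
                code_of[gc] = len(names)
                names.append(gc)
            block.append(code_of[gc])
        key = bytes(block)
        if key not in seen:          # identical blocks are stored only once
            seen[key] = len(stage2)
            stage2.append(key)
        stage1.append(seen[key])
    return stage1, stage2, names

def lookup(tables, cp):
    """Return gc for cp using two indexed accesses."""
    stage1, stage2, names = tables
    return names[stage2[stage1[cp >> 8]][cp & 0xFF]]

tables = build_two_stage(lambda cp: unicodedata.category(chr(cp)))
print(len(tables[0]), "blocks,", len(tables[1]), "distinct")
print(lookup(tables, 0x41))
```

The space saving comes from the fact that many blocks are identical (for example, long unassigned or single-category ranges), so the first stage can point several block numbers at the same stored block.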

The main public routine is called, unsurprisingly, "GC". As a bonus, I've added an "Algorithmic_name_start" routine that returns the start of a codepoint name when that name is algorithmically computable (in other cases, it returns the null string). See the source comments for details.

You will also find a self-test. On my desktop machine, a rather aged i7-9700 @ 3 GHz, it checks about 0.5M codepoints/second.

I call this program a toy implementation because I have not spent much time making it robust. For example, it does not check for I/O errors. I have preferred to focus on functionality.

My intention is to produce, given time, a whole set of toy implementations. This will allow us to play with the concepts in practice, to do it in ooRexx, and to produce very quick prototypes, proofs of concept, et cetera.

You can download the program and the accompanying files from https://rexx.epbcn.com/TUTOR/ and https://github.com/JosepMariaBlasco/TUTOR/.