The MultiStageTable class
/******************************************************************************
* This file is part of The Unicode Tools Of Rexx (TUTOR) *
* See https://rexx.epbcn.com/TUTOR/ *
* and https://github.com/JosepMariaBlasco/TUTOR *
* Copyright © 2023-2025 Josep Maria Blasco <josep.maria.blasco@epbcn.com> *
* License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0) *
******************************************************************************/
This class, defined in the
/components/utilities/MultiStageTable.cls
package,
specializes in producing two-stage tables, three-stage tables, or, in
general multi-stage tables.
Multi-stage tables are recommended in The Unicode Standard 15.0, section 5.1, Data Structures for Character Conversion, "Multistage Tables", pp. 196–7.
This is not a general implementation of multi-stage tables, but a custom, tailored one, specific to Unicode.
The indexes for these tables run from 0 to the size of the buffer. Negative values will raise a syntax error, and indexes greater than the size of the buffer will return width copies of "00"X.
compress (Class method)
The compress method compresses a buffer and returns two smaller, compressed, tables, "chunks" and "offsets" (see below). The chunk_size argument, if specified, should be a power of two and a divisor of the length of the buffer; chunk_size defaults to 256.
Buffer is a n-byte string, where n is a multiple of chunk_size.
The compression technique works as follows: the buffer
source string is supposed to be compressible, i.e., it is supposed to
contain different segments which are identical (for example, because
they are copies of "00"X
). The buffer string will
be broken in a series of substrings of size chunk_size, and,
instead of storing the substring itself, we will store a reference to
the substring. Thus, when two identical substrings of buffer
are found, the first copy is stored, only once, in an indexed "chunks"
table, and its offset (divided by chunk_size) is stored in a
separate "offsets" table. A reference is supposed to be much smaller
than the subarray itself, and this makes for the desired
compression.
Note: To allow for maximum compression, we are supposing that the quantity of different sub-arrays does not exceed 256. This allows to store the references to the sub-arrays in a single byte.
[]
Returns the n-th element of the multi-stage table, when 0
< n <= table_size, or a string containing
width copies of "00"X
, when n >
table_size. Negative or non-numeric values of n will
raise a Syntax error.
new
Creates a new multi-stage table. This will be a two-stage table when
big_values is not specified or has the .Nil
value
(the default), and a three-stage table when big_values is
specified as a non-nil value.
The offset and chunks tables should have been created by the compress class method; chunk_size, when specified, should be the same value used in the compress call, and a power of two. Table_size should be the size of the buffer used in the compress call; it defaults to 217.
Width and big_values are optional. When width is specified for a two-stage table, a string of width bytes will be returned by the "[]" call; when width is specified for a three-stage table, the 1-byte value obtained from offset and chunks is multiplied by width and used as an index into big_values, to return a substring of width bytes.