Rexx.Tokenizer.cls, a Rexx Tokenizer
Notice: This is a copy of the Rexx Tokenizer. It was distributed with the TUTOR package, release 0.5, and it is kept here for reference purposes.
/******************************************************************************
* This file is part of The Unicode Tools Of Rexx (TUTOR) *
* See https://rexx.epbcn.com/TUTOR/ *
* and https://github.com/JosepMariaBlasco/TUTOR *
* Copyright © 2023-2024 Josep Maria Blasco <josep.maria.blasco@epbcn.com> *
* License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0) *
******************************************************************************/
Notice:
Although this class is part of TUTOR, The Unicode Tools Of Rexx, it can also be used separately, as it has no dependencies on the rest of components of TUTOR.
Introduction
The Rexx.Tokenizer.cls class file includes a set of ooRexx classes. The main class is Rexx.Tokenizer. It implements both a simple and a full Rexx tokenizer (see below for definitions of simple and full tokenizing). The getSimpleToken method returns basic Rexx tokens and non-token character sequences, like comments and whitespace, while the getFullToken method returns full tokens, after discarding null clauses, ignorable blanks and comments.
When requesting full tokens, an optional mechanism allows access to the
ignored simple tokens and some other tokens that are not ignorable but
that have been included ("absorbed") for your convenience: for example,
labels include their own colon, keyword instructions include the first
blank after the keyword, if any, and so on.
This help file starts with a high-level description of the tokenizer functionality, and ends with an enumeration and description of the tokenizer methods, and some implementation notes.
Subclasses and Unicode support
The tokenizer is intended to support all the syntactic constructs of Open Object Rexx (ooRexx), Regina Rexx and ANSI Rexx. You can select the desired syntax subset at instance creation time by selecting the appropriate class.
Rexx.Tokenizer -- The main class. Choose a subclass
ooRexx.Tokenizer -- Tokenizes programs written in ooRexx
Regina.Tokenizer -- Tokenizes programs written in Regina Rexx
ANSI.Rexx.Tokenizer -- Tokenizes programs written in ANSI Rexx
Subclasses starting with "Regina" accept the Regina Rexx syntax; subclasses starting with "ANSI.Rexx" accept only the ANSI Rexx syntax (for example, comments starting with "--" are accepted by Regina but not by ANSI); subclasses starting with "ooRexx" accept ooRexx syntax; for example, "[", "]" and "~" are valid characters for ooRexx subclasses but not for Regina or ANSI subclasses.
The tokenizer supports classic comments (including nested comments), line comments and strings. The ooRexx ::RESOURCE construct is also accepted.
When a Unicode class is used (see below), Y-, P-, G-, T- and U-suffixed strings are recognized, translated (in the case of U strings) and supported.
ooRexx.Unicode.Tokenizer -- Tokenizes programs written in ooRexx, with experimental Unicode extensions
Regina.Unicode.Tokenizer -- Tokenizes programs written in Regina Rexx, with experimental Unicode extensions
ANSI.Rexx.Unicode.Tokenizer -- Tokenizes programs written in ANSI Rexx, with experimental Unicode extensions
The full tokenizer is not a full AST parser, but it returns a lot of useful semantic information, like the instruction type, the directive type, the kind of variable (simple, stem or compound), etc.
Creating a tokenizer instance
To create a tokenizer instance, you will first need to construct a Rexx array containing the source to tokenize. This array will then be passed as an argument to the init method of the corresponding tokenizer class to produce an instance of the tokenizer for this particular source.
source = CharIn(inFile,,Chars(inFile))~makeArray -- Read the whole file into an array
tokenizer = .ooRexx.Tokenizer~new(source) -- Or Regina.Tokenizer, etc.
You will also have to decide whether you will be using the simple tokenizer (i.e., you will be getting tokens using the getSimpleToken tokenizer method), or whether you prefer to use the full tokenizer (i.e., you will be getting your tokens using the getFullToken tokenizer method).
tokenizer = .ooRexx.Tokenizer~new(source)
Do Forever
token. = tokenizer~getSimpleToken -- Or tokenizer~getFullToken
If token.class == END_OF_SOURCE | token.class == SYNTAX_ERROR Then Leave -- Constants are defined below
-- Do things with the token.
End
Both kinds of tokens are described below. If you have opted for the full tokenizer, you will also be able to select detailed or undetailed tokenizing. Detailed tokenizing returns all the simple tokens that constitute a full token as a property of the full token. Undetailed tokenizing returns only the full tokens, and discards the elementary, simple tokens once the full token has been constructed.
In any case, you will always be able to reconstitute the entirety of your source file by following the location attributes of the returned tokens.
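For example, here is a minimal sketch of such a reconstruction for a single token (the location format, startLine startCol endLine endCol, is described below; "source" is assumed to be the array that was passed to the tokenizer):
Parse Var token.location line1 col1 line2 col2 -- token. is a returned token
If line1 == line2 Then -- Single-line token: a simple SubStr suffices
  original = SubStr(source[line1], col1, col2 - col1)
Else Do -- Multi-line token (e.g., a classic comment)
  original = SubStr(source[line1], col1)
  Do line = line1 + 1 To line2 - 1
    original = original || .endOfLine || source[line]
  End
  original = original || .endOfLine || Left(source[line2], col2 - 1)
End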
An example: simple and full tokens
Structure of simple tokens
Let us start with a very simple piece of code, i = i + 1, stored in a file named test.rex, and run it through InspectTokens.rex with the -simple option. InspectTokens.rex is a sample utility program you will find in the parser directory.
InspectTokens -simple test.rex
Here is the output of the program, pretty-printed and commented for your convenience.
1 [1 1 1 1] END_OF_CLAUSE (BEGIN_OF_SOURCE): '' -- Automatically generated BEGIN_OF_SOURCE marker
2 [1 1 1 2] VAR_SYMBOL (SIMPLE_VAR): 'i' -- A simple variable (i.e., not a stem nor a compound variable)
3 [1 2 1 3] BLANK: ' ' -- A blank run consisting of a single blank
4 [1 3 1 4] OPERATOR: '=' -- An operator. It happens to work as an assignment in this position
5 [1 4 1 5] BLANK: ' ' -- Another blank
6 [1 5 1 6] VAR_SYMBOL (SIMPLE_VAR): 'i' -- The same variable as before
7 [1 6 1 7] BLANK: ' ' -- One blank more
8 [1 7 1 8] OPERATOR: '+' -- A plus sign, denoting addition
9 [1 8 1 9] BLANK: ' ' -- Still one more blank
10 [1 9 1 10] NUMBER (INTEGER): '1' -- A number (the smallest positive integer)
11 [1 10 1 10] END_OF_CLAUSE (END_OF_LINE): '' -- An END_OF_LINE indicator (which works as an implied semicolon)
- The first column is a counter.
- The second column is an aggregate, the location of the token. We have written it between [brackets]. It is of the form starting-position ending-position, where each position is a line-column pair. The ending position is the first character after the returned token. For example, the first "i" in the line runs from position (1,1) to position (1,2).
- The third column contains one or two values. When there are two, the second one is enclosed between parentheses. These are the class and the subclass of the token. They give a lot of information about the nature of the token (e.g., this is a NUMBER [class], subclass INTEGER; or this is a VAR_SYMBOL [class], subclass SIMPLE_VAR [i.e., not a stem or a compound variable]) and will be described below.
- The fourth column, after a colon and between simple quotes, is the value of the token. Generally speaking, this is the token itself, but in some cases (classic comments, resources) only an indicator is returned (you can always reconstitute the original comment or resource by referring to the location attribute of the token). In some other cases, the value contains an elaboration of the original token: for example, an X, B or U string will be interpreted, so that their value can be substituted in the source file ("(man)"U, for instance, will generate a value of "👨").
How does the InspectTokens.rex program work? Well, essentially what it does is the following: it instantiates a tokenizer instance, and then it runs it, by calling the getSimpleToken method, until either the end of the source is reached or a syntax error is encountered. Now, here is the trick: getSimpleToken returns tokens... which are Rexx stems! (You can already imagine the components of these stems.)
-- after
token. = tokenizerInstance~getSimpleToken
-- we have (assume that we have just scanned the second "i" of the above program)
token.class == VAR_SYMBOL -- The CLASS of the token
token.subClass == SIMPLE_VAR -- The SUBCLASS of the token
token.location == "1 5 1 6" -- The LOCATION of the token
token.value == "i" -- The VALUE of the token
Now you know practically everything there is to know about simple tokens (indeed, there are only two more things to know, if you limit yourself to simple tokenizing: error tokens, and end-of-source conditions; we will get to both of these shortly).
Structure of full tokens (undetailed)
What happens now if we want full tokens, instead of simple ones? We will call InspectTokens.rex with the -full option so that it calls getFullToken instead of getSimpleToken. We will also add the -nodetailed option for the moment:
InspectTokens -full -nodetailed test.rex
Let us have a look at its output. Some tokens are the same as before, but some others have experienced some modifications. Let us focus on those:
1 [1 1 1 1] END_OF_CLAUSE (BEGIN_OF_SOURCE): ''
2 [1 1 1 2] ASSIGNMENT_INSTRUCTION (SIMPLE_VAR): 'i'
3 [1 2 1 5] OPERATOR (COMPARISON_OPERATOR): '=' -- "=" has grown to include the blanks before and after
4 [1 5 1 6] VAR_SYMBOL (SIMPLE_VAR): 'i'
5 [1 6 1 9] OPERATOR (ADDITIVE_OPERATOR): '+' -- "+" has grown to include the blanks before and after
6 [1 9 1 10] NUMBER (INTEGER): '1'
7 [1 10 1 10] END_OF_CLAUSE (END_OF_LINE): ''
What has changed, exactly? Well, both the "=" operator and the "+" operator seem to have "grown". Indeed, they have "eaten" the corresponding blanks. This strictly follows the rules of Rexx: blanks before and after operator characters are ignored. The tokenizer ignores the blanks, but at the same time does not want to lose information, so it "expands" the absorbing tokens, making them wider, so that they can (so to speak) "accommodate" the ignored blanks: the "=" on line 3 now runs from (1 2 1 3) [where the previous blank is located] to (1 4 1 5) [where the next blank is located].
There are some other, subtle, changes in the returned results. The class of "i" has changed: it is no longer VAR_SYMBOL, but ASSIGNMENT_INSTRUCTION. The full tokenizer "knows" that i = i + 1 is an assignment instruction, and it passes this knowledge to us. Similarly, the subclass of "=" has changed. Previously, it was OPERATOR: all the tokenizer knew was that "=" was an operator character. Now it is COMPARISON_OPERATOR, which is more informative. Similarly, "+" now has a subclass of ADDITIVE_OPERATOR.
Structure of full tokens (detailed)
As we mentioned above, when using the full tokenizer, you have the option to request detailed tokenizing. You do so at instance creation time, by specifying the optional, boolean, detailed argument:
detailed = .true
tokenizer = .ooRexx.Tokenizer~new(array, detailed)
We will call our InspectTokens.rex utility program once more, but this time we will not specify the -nodetailed option, so that a detailed listing (the default when requesting full tokenizing) is produced.
InspectTokens -full test.rex
We will get output similar to the following:
1 [1 1 1 1] END_OF_CLAUSE (BEGIN_OF_SOURCE): ''
2 [1 1 1 2] ASSIGNMENT_INSTRUCTION (SIMPLE_VAR): 'i'
3 [1 2 1 5] OPERATOR (COMPARISON_OPERATOR): '=' -- If this token is the stem "token." ...
---> Absorbed:
1[1 2 1 3] BLANK: ' ' -- ...then these subtokens are in token.absorbed[1], ...
2[1 3 1 4] OPERATOR: '=' <== -- ...token.absorbed[2], and...
3[1 4 1 5] BLANK: ' ' -- ...token.absorbed[3].
4 [1 5 1 6] VAR_SYMBOL (SIMPLE_VAR): 'i'
5 [1 6 1 9] OPERATOR (ADDITIVE_OPERATOR): '+'
---> Absorbed:
1[1 6 1 7] BLANK: ' '
2[1 7 1 8] OPERATOR: '+' <== -- The "original" main token is indexed by token.cloneIndex, so that...
3[1 8 1 9] BLANK: ' ' -- ...token.absorbed[token.cloneIndex] is that token.
6 [1 9 1 10] NUMBER (INTEGER): '1'
7 [1 10 1 10] END_OF_CLAUSE (END_OF_LINE): ''
The non-indented lines are identical to the previous listing. The indented ones show some new components of a full token, when detailed tokenizing is requested:
- token.absorbed is an array of "absorbed" tokens. If there are no absorbed tokens, token.~hasIndex(absorbed) is false.
- token.cloneIndex is the index in token.absorbed of the "original" token. For example, when a "=" operator absorbs two blanks, these blanks are ignorable, but the "=" operator is the "original", main, non-ignorable token. In that case, token.cloneIndex will be the index of the "=" operator in the absorbed array (see the sketch below).
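As a minimal sketch, assuming the tokenizer was created with detailed == .true and that token. holds a full token, the absorbed tokens can be listed as follows:
If token.~hasIndex(absorbed) Then -- "absorbed" is unassigned: its value is "ABSORBED"
  Do i = 1 To token.absorbed~items
    t. = token.absorbed[i] -- Each element is itself a simple token stem
    If i == token.cloneIndex Then Say t.class":" t.value "<== the original token"
    Else Say t.class":" t.value
  End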
Constants, classes and subclasses
A token t. has a class, t.class, and a subclass, t.subclass. Classes and subclasses are defined in the tokenClasses constant of the Rexx.Tokenizer class. The tokenClasses constant itself is an array of constants, so that you can use the following code to replicate these constants in your own program:
Do constant over tokenizer~tokenClasses
Call Value constant[1], constant[2]
End
You should always use this construction, instead of relying on the internal values of the constants: these values can be changed without notice.
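For example, after running the loop above, you can compare token classes symbolically (a minimal sketch):
token. = tokenizer~getSimpleToken
If token.class == VAR_SYMBOL Then Say "Found a variable symbol:" token.value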
Here is the full value of the tokenClasses constant. Please note that the second element of each array is a placeholder, the character "*". This will be substituted with appropriate values by the tokenizer init method.
::Class Rexx.Tokenizer
/* ... */
::Constant tokenClasses ( -
( SYNTAX_ERROR , "*" ), - -- Special token returned when a Syntax error is found
( OPERATOR , "*" ), -
- -- +--- All subclasses of OPERATOR are full tokenizer only
( ADDITIVE_OPERATOR , "*" ), - -- | "+", "-"
( COMPARISON_OPERATOR , "*" ), - -- | "=", "\=", ">", "<", "><", "<>", ">=", "\<", "<=", "\>"
- -- | "==", "\==", ">>", "<<", ">>=", "\<<", "<<=", "\>>"
( CONCATENATION_OPERATOR , "*" ), - -- | "||"
( LOGICAL_OPERATOR , "*" ), - -- | "&", "|", "&&"
( MESSAGE_OPERATOR , "*" ), - -- | "~", "~~"
( MULTIPLICATIVE_OPERATOR , "*" ), - -- | "*", "/", "//", "%"
( POWER_OPERATOR , "*" ), - -- | "**"
( EXTENDED_ASSIGNMENT , "*" ), - -- | "+=", "-=", "*=", "/=", "%=", "//=", "||=", "&=", "|=", "&&=", "**="
- -- +--- All subclasses of OPERATOR are full tokenizer only
( SPECIAL , "*" ), -
( COLON , "*" ), -
( DIRECTIVE_START , "*" ), - -- "::" (Full tokenizer only, absorbed by directive)
( LPAREN , "*" ), -
( RPAREN , "*" ), -
( LBRACKET , "*" ), -
( RBRACKET , "*" ), -
( BLANK , "*" ), - -- May be ignorable, or not
( LINE_COMMENT , "*" ), - -- Up to but not including the end of the line
( CLASSIC_COMMENT , "*" ), - -- Infinite nesting allowed
( RESOURCE , "*" ), - -- The resource itself, i.e., the array of lines
( RESOURCE_DELIMITER , "*" ), - -- End delimiter, ends resource
( RESOURCE_IGNORED , "*" ), - -- After "::Resource name ;" or "::END delimiter"
( END_OF_SOURCE , "*" ), -
( END_OF_CLAUSE , "*" ), -
( BEGIN_OF_SOURCE , "*" ), - -- Dummy and inserted. Very convenient for simplification
( END_OF_LINE , "*" ), - -- Implied semicolon
( SEMICOLON , "*" ), - -- An explicit semicolon
( INSERTED_SEMICOLON , "*" ), - -- For example, after a label, THEN, ELSE, and OTHERWISE
-
- -- CLAUSE SUPPORT (Full tokenizer only)
- -- ==============
( LABEL , "*" ), - -- Includes and absorbs the COLON
- -- All DIRECTIVEs include and absorb the :: marker
( DIRECTIVE , "*" ), - --
( ANNOTATE_DIRECTIVE , "*" ), - --
( ATTRIBUTE_DIRECTIVE , "*" ), - --
( CLASS_DIRECTIVE , "*" ), - --
( CONSTANT_DIRECTIVE , "*" ), - --
( METHOD_DIRECTIVE , "*" ), - --
( OPTIONS_DIRECTIVE , "*" ), - --
( REQUIRES_DIRECTIVE , "*" ), - --
( RESOURCE_DIRECTIVE , "*" ), - --
( ROUTINE_DIRECTIVE , "*" ), - --
- --
( KEYWORD_INSTRUCTION , "*" ), - -- All KEYWORD_INSTRUCTIONs include the first blank after the keyword, if present
(ADDRESS_INSTRUCTION , "*" ), - --
(ARG_INSTRUCTION , "*" ), - --
(CALL_INSTRUCTION , "*" ), - --
(CALL_ON_INSTRUCTION , "*" ), - -- Includes the ON sub-keyword
(CALL_OFF_INSTRUCTION , "*" ), - -- Includes the OFF sub-keyword
(DO_INSTRUCTION , "*" ), - --
(DROP_INSTRUCTION , "*" ), - --
(ELSE_INSTRUCTION , "*" ), - -- Inserts a ";" after
(END_INSTRUCTION , "*" ), - --
(EXIT_INSTRUCTION , "*" ), - --
(EXPOSE_INSTRUCTION , "*" ), - --
(FORWARD_INSTRUCTION , "*" ), - --
(GUARD_INSTRUCTION , "*" ), - --
(IF_INSTRUCTION , "*" ), - --
(INTERPRET_INSTRUCTION , "*" ), - --
(ITERATE_INSTRUCTION , "*" ), - --
(LEAVE_INSTRUCTION , "*" ), - --
(LOOP_INSTRUCTION , "*" ), - --
(NOP_INSTRUCTION , "*" ), - --
(NUMERIC_INSTRUCTION , "*" ), - --
(OPTIONS_INSTRUCTION , "*" ), - --
(OTHERWISE_INSTRUCTION , "*" ), - -- Inserts a ";" after
(PARSE_INSTRUCTION , "*" ), - -- Includes UPPER, LOWER and CASELESS (as attributes too)
(PROCEDURE_INSTRUCTION , "*" ), - --
(PUSH_INSTRUCTION , "*" ), - --
(PULL_INSTRUCTION , "*" ), - --
(QUEUE_INSTRUCTION , "*" ), - --
(RAISE_INSTRUCTION , "*" ), - --
(REPLY_INSTRUCTION , "*" ), - --
(RETURN_INSTRUCTION , "*" ), - --
(SAY_INSTRUCTION , "*" ), - --
(SELECT_INSTRUCTION , "*" ), - --
(SIGNAL_INSTRUCTION , "*" ), - --
(SIGNAL_ON_INSTRUCTION , "*" ), - -- Includes SIGNAL ON
(SIGNAL_OFF_INSTRUCTION , "*" ), - -- Includes SIGNAL OFF
(THEN_INSTRUCTION , "*" ), - -- Inserts a ";" before and after
(TRACE_INSTRUCTION , "*" ), - --
(UPPER_INSTRUCTION , "*" ), - -- Regina only, no ANSI
(USE_INSTRUCTION , "*" ), - --
(WHEN_INSTRUCTION , "*" ), - --
( ASSIGNMENT_INSTRUCTION , "*" ), - -- Variable assignments, not message assignments
( COMMAND_OR_MESSAGE_INSTRUCTION , "*" ), - -- Cannot determine without arbitrarily large context
- -- End of CLAUSE SUPPORT
- -- =====================
( VAR_SYMBOL , "*" ), -
( SIMPLE_VAR , "*" ), -
( STEM_VAR , "*" ), -
( COMPOUND_VAR , "*" ), -
( NUMBER , "*" ), -
( INTEGER , "*" ), -
( FRACTIONAL , "*" ), -
( EXPONENTIAL , "*" ), -
( CONST_SYMBOL , "*" ), -
( PERIOD_SYMBOL , "*" ), -
( LITERAL_SYMBOL , "*" ), -
( ENVIRONMENT_SYMBOL , "*" ), -
( STRING , "*" ), -
( BINARY_STRING , "*" ), -
( HEXADECIMAL_STRING , "*" ), -
( CHARACTER_STRING , "*" ), -
( BYTES_STRING , "*" ), - -- Unicode only. Y suffix
( CODEPOINTS_STRING , "*" ), - -- Unicode only. P suffix
( GRAPHEMES_STRING , "*" ), - -- Unicode only. G suffix
( TEXT_STRING , "*" ), - -- Unicode only. T suffix
( UNOTATION_STRING , "*" ) - -- Unicode only. U suffix
)
You will notice that many classes and subclasses are marked as "full tokenizer only": they will only be returned as values when using the full tokenizer. Some others are marked as Unicode only, or Regina only, etc.
Error handling
When an error is encountered, tokenizing stops, and a special token is returned. Its class and subclass will be SYNTAX_ERROR, and a number of special attributes will be included, so that the error information is as complete as possible:
token.class = SYNTAX_ERROR
token.subclass = SYNTAX_ERROR
token.location = location in the source file where the error was found
token.value = main error message
-- Additional attributes, specific to SYNTAX_ERROR
token.number = the error number, in the format major.minor
token.message = the main error message (same as token.value)
token.secondaryMessage = the secondary error message, with all substitutions applied
token.line = line number where the error occurred (first word of .location)
If you want to print error messages that are identical to the ones printed by ooRexx, you can use the following code snippet:
If token.class == SYNTAX_ERROR Then Do
line = token.line
Parse Value token.number With major"."minor
Say
Say Right(line,6) "*-*" array[line] -- "array" contains the source code
Say "Error" major "running" inFile "line" line":" token.message -- "inFile" is the input filename
Say "Error" major"."minor": " token.secondaryMessage
Return -major -- Should be returned when Syntax error
End
Public methods
new (class method)
╭──────╮ ┌────────┐ ╭───╮
▸▸─┤ new( ├──┤ source ├──┬─────────────────────┬─┤ ) ├─▸◂
╰──────╯ └────────┘ │ ╭───╮ ┌──────────┐ │ ╰───╯
└─┤ , ├──┤ detailed ├─┘
╰───╯ └──────────┘
Returns a new tokenizer specialized to the source program. Source must be a (non-sparse) array of strings. The optional argument, detailed, has no effect when the tokenizer is used in "basic" mode. When used in "full" mode, detailed must be a boolean, which determines whether ignored (or "absorbed") tokens will be kept as an optional attribute of the returned full tokens. When detailed is 1 (the default), ignored tokens are kept as an array, which can be accessed using "absorbed" as a tail for the returned stem. When detailed is 0, ignored tokens are discarded.
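For instance (a sketch, using a one-line source array):
source = .Array~of("i = i + 1")
detailedTokenizer = .ooRexx.Tokenizer~new(source) -- detailed defaults to 1
terseTokenizer = .ooRexx.Tokenizer~new(source, .false) -- absorbed tokens are discarded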
getFullToken
╭──────────────╮
▸▸─┤ getFullToken ├─▸◂
╰──────────────╯
The getFullToken method selects the next "full" token in the source file and returns a stem containing the details that describe this token.
"Full" tokens build over "simple" tokens, by applying Rexx rules and ignoring certain elements:
- Classic comments and line comments are ignored.
- Blanks adjacent to special characters are ignored, except when they can be interpreted as a concatenation operator.
- Two consecutive end of clause markers (i.e., an explicit semicolon, or an end of line) are reduced to a single end of clause marker (the second one would constitute an ignorable null clause).
- Blanks at the beginning of a clause are ignored.
The ignoring process is not a simple discarding. On the one hand, the location of each full token is adjusted, so that the original source can always be reconstructed by examining the locations of the returned tokens. On the other hand, if the detailed parameter is specified as 1 when creating the tokenizer instance, all the ignored tokens, including the original non-ignored token, can be accessed as an array which is the value of token.absorbed.
Sequences of special characters are collected to see if they form a multi-character operator, like "==", an extended assignment token, like "+=", or a directive-start marker, like "::".
Error handling
When the tokenizer encounters a syntax error, it returns a special token describing the error. Please note that the full tokenizer detects a series of errors that are not detected by the simple tokenizer. For example, when a directive start sequence, "::", is followed by a symbol that is not the name of a directive, the full tokenizer emits an error and stops, but the simple tokenizer does not detect any error. A higher-level parser making use of the tokenizer may detect errors that are still earlier than the one returned. See the documentation for the syntax_error method for details.
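As a minimal sketch of the difference (assuming the class constants have been replicated as shown above):
source = .Array~of("::notadirective whatever")
tokenizer = .ooRexx.Tokenizer~new(source)
Do Forever
  token. = tokenizer~getFullToken -- getSimpleToken would not detect this error
  If token.class == SYNTAX_ERROR Then Do
    Say "Error" token.number":" token.message
    Leave
  End
  If token.class == END_OF_SOURCE Then Leave
End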
Important note
Using getSimpleToken and getFullToken with the same tokenizer instance can lead to unpredictable results.
getSimpleToken
╭────────────────╮
▸▸─┤ getSimpleToken ├─▸◂
╰────────────────╯
The getSimpleToken method selects the next token in the input file and returns a stem containing the details that describe this token.
The components of a returned stem t. are the following:
- t.class, selected from the non-indented elements of the tokenClasses constant, excluding those marked as "full tokenizer only".
- t.subclass, selected from the indented elements of the tokenClasses constant above (when there is no indented element, t.class == t.subclass). Subclasses marked as "full tokenizer only" are not returned when simple tokenizing.
- t.value. In general, this is the character representation of the token itself, but in some cases it can differ. For example, in the case of strings, this is the string value, independently of whether its specification used internal double quotes, or whether it is one of the X-, B- or U-suffixed strings. That is, in the ASCII encoding, t.value is identical whether the token was "a", "61"X, "0110 0001"B or "0061"U (see the sketch after this list).
- t.location. This component has the form startLine startCol endLine endCol, and identifies the start position of the token, and the end position, plus one character. startLine and endLine will always be identical, except in the case of multi-line comments.
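As a quick sketch of the value component (ASCII encoding assumed; constants replicated as shown earlier; U-suffixed strings would additionally require one of the Unicode tokenizer subclasses):
source = .Array~of('Say "a" "61"X "0110 0001"B')
tokenizer = .ooRexx.Tokenizer~new(source)
values = .Array~new
Do Forever
  token. = tokenizer~getSimpleToken
  If token.class == END_OF_SOURCE | token.class == SYNTAX_ERROR Then Leave
  If token.class == STRING Then values~append(token.value)
End
Say (values[1] == values[2]) (values[2] == values[3]) -- Displays "1 1"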
Important note
Using getSimpleToken and getFullToken with the same tokenizer instance can lead to unpredictable results.
syntax_error
╭───╮
┌───┤ , ├───┐
│ ╰───╯ │
╭───────────────╮ ┌──────┐ ╭───╮ ┌───────┐ ╭───╮ ┌─────┐ ╭───╮ │ ┌───────┐ │ ╭───╮
▸▸─┤ syntax_error( ├──┤ code ├──┤ , ├──┤ start ├──┤ , ├──┤ end ├──┤ , ├─┴─┤ value ├─┴─┤ ) ├─▸◂
╰───────────────╯ └──────┘ ╰───╯ └───────┘ ╰───╯ └─────┘ ╰───╯ └───────┘ ╰───╯
Returns a special type of token, SYNTAX_ERROR, that includes extra information to identify a syntax error. The arguments to syntax_error are:
- The error code, in the format major.minor.
- The start location and the end location. Their format is startLine startCol endLine endCol. The location of the error token will be the start position of the start location followed by the end position of the end location.
- The following arguments are the substitution instances for the secondary error message.
The tokenizer uses the syntax_error method to return special tokens when a syntax error is encountered. Both the class and the subclass components of the returned stem are SYNTAX_ERROR. Other components of the returned stem token. are:
- value is the main error message. Same as message.
- message is the main error message. Same as value.
- number is the error number, in the major.minor format, as specified in the first argument to syntax_error.
- secondaryMessage is the secondary error message, with all substitutions applied.
- line is the line number where the error occurred.
Implementation notes
Private routines
ErrorMessage
╭───────────────╮ ┌────────┐ ╭───╮ ┌─────────────┐ ╭───╮
▸▸───┤ ErrorMessage( ├──┤ number ├──┤ , ├──┤ subst_array ├──┤ ) ├─▸◂
╰───────────────╯ └────────┘ ╰───╯ └─────────────┘ ╰───╯
Returns an array containing the major and minor error messages (in this order) associated with the specified code, which has to take the form major.minor (where major is the major error code, and minor is the minor error code), with all placeholder instances substituted with the values of the array subst_array.
This routine returns different error messages, depending on the tokenizer subclass. For example, error 6.1 is 'Unmatched comment delimiter ("/*") on line &1', with one substitution instance, for ooRexx, but 'Unmatched comment delimiter ("/*")' for Regina Rexx, with no substitution instances.
Private methods
InitializeActionPairs
╭───────────────────────╮
▸▸─┤ InitializeActionPairs ├─▸◂
╰───────────────────────╯
InitializeActionPairs implements the Action. stem, which is the core of the finite state automaton implementing the full tokenizing phase. Simple tokens are examined in a window of two consecutive tokens, and a series of actions is activated by examining the classes of these tokens. For example, a BLANK adjacent to a COLON can always be ignored ("absorbed"), and so on.
InitializeCharacterCategories
╭───────────────────────────────╮
▸▸─┤ InitializeCharacterCategories ├─▸◂
╰───────────────────────────────╯
Each character in the "00"X.."FF"X range is assigned a character category, symbolized by a single character: digits ("0".."9") are assigned the "digit" category ("d"); letters ("a".."z" and "A".."Z", plus "_", "?", "!" and some other implementation-dependent characters) are assigned the "general_letter" ("l") category; and so on.
digit = "d"
general_letter = "l"
...
simple_symbol = general_letter || digit
var_symbol_char = simple_symbol || "."
...
Call AssignCharacterCategory digit, "0123456789"
Call AssignCharacterCategory general_letter, "_!?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
...
When we are about to tokenize a line L, we will use the TRANSLATE BIF to obtain a new string containing the character categories of each individual character in L.
input_line = 'id123.xyz = id123.xyz + 1'
character_categories = 'llddd.lll o llddd.lll o d'
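Here is a minimal sketch of that translation, with abbreviated tables covering only the "d", "l" and "o" categories:
line = "id123.xyz = id123.xyz + 1"
tableI = "0123456789" || "_!?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" || "=+"
tableO = Copies("d", 10) || Copies("l", 55) || Copies("o", 2)
Say Translate(line, tableO, tableI) -- Displays "llddd.lll o llddd.lll o d"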
This allows a very efficient determination of the token boundaries. For example, a run of "d" will identify a simple number, a run of "d" or "l" will identify a simple symbol, and so on. The fragment of code below shows how the tokenizer handles tokens that start with a letter; they can be either a simple variable, a stem variable, or a compound variable.
Call skipCharsUntilNot simple_symbol -- Skip all letters and digits
-- Neither a letter, a digit nor a period? This is a simple symbol
If thisCharIsNotA( "." ) Then Return Token( VAR_SYMBOL, SIMPLE_VAR )
-- That was a period. Skip it
Call nextChar
-- End of symbol? This is a stem
If thisCharIsNotA( var_symbol_char ) Then Return Token( VAR_SYMBOL, STEM_VAR )
-- If there is any stuff after the period, that's a compound symbol
Call skipCharsUntilNot var_symbol_char
Return Token( VAR_SYMBOL, COMPOUND_VAR )
InitializeClasses
╭───────────────────╮
▸▸─┤ InitializeClasses ├─▸◂
╰───────────────────╯
This method scans the tokenClasses vector and assigns the values of the corresponding constants. It also creates some useful compound values, like STRING_OR_SYMBOL, or CLAUSE.
InitializeDirectives
╭──────────────────────╮
▸▸─┤ InitializeDirectives ├─▸◂
╰──────────────────────╯
This method creates a stem that will be used to determine whether a symbol is a valid directive name or not.
InitializeKeywordInstructions
╭───────────────────────────────╮
▸▸─┤ InitializeKeywordInstructions ├─▸◂
╰───────────────────────────────╯
The InitializeKeywordInstructions method creates the keywordInstruction. stem, which allows us to determine whether a certain SIMPLE_VAR VAR_SYMBOL is a candidate to start a KEYWORD_INSTRUCTION or not. The stem is customized for the ooRexx, Regina and ANSI cases.
InitializeOperatorTable
╭─────────────────────────╮
▸▸─┤ InitializeOperatorTable ├─▸◂
╰─────────────────────────╯
The InitializeOperatorTable method creates the operator_subclass. stem. This stem allows us to determine which operator character combinations constitute valid Rexx operators, and which syntactic category (ADDITIVE_OPERATOR, LOGICAL_OPERATOR, etc.) should be assigned to each of them.
InitializeSimpleTokenizer
╭───────────────────────────╮
▸▸─┤ InitializeSimpleTokenizer ├─▸◂
╰───────────────────────────╯
The InitializeSimpleTokenizer method initializes a series of variables that will hold the context when tokenizing a ::RESOURCE directive.
InitializeStringSuffixes
╭──────────────────────────╮
▸▸─┤ InitializeStringSuffixes ├─▸◂
╰──────────────────────────╯
The InitializeStringSuffixes method builds a stem that maps string suffixes (i.e., "X", "B", "Y", "P", "G", "T" and "U") to their corresponding tokenClasses.
InitializeTokenizer
╭──────────────────────╮
▸▸─┤ InitializeTokenizer ├─▸◂
╰──────────────────────╯
The InitializeTokenizer method sets a number of variables to track the special context for THEN clauses, the special token supplied at begin-of-source, and the extra buffer used when we are forced, in the full tokenizer, to insert certain symbols: for example, a semicolon after a label, or after a THEN, ELSE or OTHERWISE clause.