LUTHOR
Lex Table Generator
Copyright InnerSystem Software. All Rights Reserved.
Version 2.0
Programs such as compilers and interpreters need to break down the input source code into recognizable units (tokens). The usual ‘quick and dirty’ method is to build a token by retrieving a character at a time and deciding if it's part of the same token. Once a token has been built, the program will do a series of string compares against a table of fixed strings. This is, however, extremely slow. Some programmers get around the speed problem by hashing the token first, thereby reducing the number of string compares that have to be done.
The ideal (and quickest) method is to decide, as each character is retrieved, a) whether it's part of the same token, and b) what valid token can be recognized at that point. This method requires that a token table be built which contains all permissible sequences of characters. The lex table generator has the job of building the table; the lexical analyzer (tokenizer) uses the table at runtime to break down the input into tokens.
A problem that tokenizers have to deal with is the question of when a token should be recognized. For instance, in one version of BASIC, the tokens PRINT and PRINTER are both valid tokens. Usually a tokenizer will use a 'greedy algorithm' to recognize the longest possible token.
Using our BASIC example, and given an input of 'PRINTERS:' (note the colon), the tokenizer would make the following decisions as it read the input:
P - var
PR - var
PRI - var
PRIN - var
PRINT - var or PRINT
PRINTE - var
PRINTER - var or PRINTER
PRINTERS - var
PRINTERS: - no valid transition. Last valid transition gave PRINTERS as var. Return that. Save ':' for next token.
Note that in order to operate properly, the tokenizer had to remember the last transition that resulted in a valid token. Once it runs into an input character that doesn't produce a valid transition, it must go back to the memorized point and return that. It must also push back the subsequent characters for use in the next token.
A token table is usually generated beforehand, then compiled into the program - the idea being that you presumably know the nature of the input stream when you are writing your program. The generator will produce a table which identifies all possible transitions in the language, and the tokens that are valid for each transition.
The lexical table generator takes as its input a file containing a set of token definitions and builds a data table which will allow the lexical analyzer routine in the program to identify tokens. The analyzer will always identify the longest token possible. This form of program is significantly faster than a series of literal compares; and the token set can be modified easily and swiftly just by editing the definition file and rebuilding the token table.
The purpose of Luthor is to allow the programmer to create a token definition file, which Luthor then uses to produce a tokenizer which can be included in a program to allow the program to break down an input stream into tokens.
What’s different about LUTHOR?
The most common lexical analyzer, called LEX, is a standard component of UNIX systems. It has also been repackaged for DOS in some cases, and has some clones and commercial versions, which are distributed under slightly different names.
LEX has several shortcomings. First, the generator actually generates program code, with the table embedded in the source. The code is cryptic and hard to follow; using it in your programs is not straightforward; and it makes heavy use of global variables. And most importantly, the code would have to be heavily modified if you wanted to have access to more than one lex table in a program. Even if you don’t want to modify the LEX code, or use multiple tables, the generated program code and global variables force you into a specific program style in order to incorporate the functionality.
LUTHOR has several new and useful features:
1. The lex engine is designed as a linkable library. This means that the interface to LUTHOR is well-defined and limited to the library function calls, improving code isolation and making side-effects less likely.
2. LUTHOR provides several interface methods for feeding the input stream: Your program can simply supply a filename and have LUTHOR return tokens from it; your program can supply the address of a function which will supply characters to LUTHOR; your program can directly supply characters one at a time and check for tokens on each call; or your program can submit strings to use as the source for tokens.
3. LUTHOR allows you to query the token value and text string for the current and next token in the stream.
4. LUTHOR allows your program to switch lex tables at any time for a given input stream.
5. LUTHOR supports run-time-loadable (and discardable) lex tables.
6. LUTHOR allows you to have multiple input streams active at the same time, each with its own assigned lexical table - or you can have multiple input streams sharing the same lex table.
7. LUTHOR allows you to define multiple lex tables in one lex definition file. This has the advantages of allowing you to keep related lex table definitions in a single file, and automatically prevents token value collisions since the token values are assigned sequentially in one pass.
8. LUTHOR allows you to optionally assign a structure and/or text value to each token in the definition file. These items produce additional ‘.h’ files which can be compiled into your code.
Generating lex tables using LUTHOR
Usage for the lex table generator is:
LEXGEN inputfile [outputprefix] [options]
Inputfile
    The name of the .DEF file that defines the lex table.
Outputprefix
    Defines the first 5 characters of the names of the output files. If this is not specified, the first 5 characters of the input filename are used.
Options:
-V
    Verbose. Using this option causes LEXGEN to generate large amounts of progress information.
-I file
    Specify input filename. This option is redundant, since the first file specified on the command line is taken as the input filename anyway.
-O prefix
    Specify output prefix. This option is also redundant, since the second word specified on the command line is taken as the output prefix.
-C {Y|N}
    Case sensitive. By default, the lex table is not case sensitive (PRINTER is the same as Printer is the same as pRINTer).
-B nn
    Set beginning token value. By default the token values are assigned sequentially starting with 1.
-F
    Create FLAGS table '.h' file. See the LUTHOR FILES section for a description of the FLAGS table.
-T
    Create TEXT table '.h' file. See the LUTHOR FILES section for a description of the TEXT table.
-L
    Create the lex table as a Loadable Lex Table. This option creates a file with the extension .LLT which can be loaded by the LUTHOR library at runtime. See the LUTHOR FILES section for more information on LLT files.
-D
    Turn debugging on. This option causes a file named 'record' to be created in the current directory. This file contains a lot of status and progress information, including a dump of the initial NFA and final DFA data.
-S
    Make lex objects static. This simply places the keyword 'static' in front of the storage declarations in the various '.h' files. This is useful if your linker is getting overwhelmed, or if you are compiling in a 16-bit environment: in the latter case, storage declarations may be placed in the near heap by default, and you may end up overrunning the 64k limit for near data.
-W nn
    Set FLAGS table width. Normally the flags table is built as a struct; by specifying a width in characters, you can instead create a flags table that consists of a two-dimensional character array. This is the only way to get a flags table when building an LLT (Loadable Lex Table).
-N
    Suppress banner. This has no effect on the evaluation version. If you are still using an unregistered version of the program, shame on you. Bad Programmer. Bad, BAD programmer.
LUTHOR FILES
This section describes the various input and output files that LUTHOR uses.
INPUT: The Lex Definition File (X.DEF)
LUTHOR generates lex tables from token definitions contained in a definition file (file extension .def). The definition file may also contain metacommands which control lexical table generation.
Metacommands in the .DEF file
Certain commands can be embedded in the lex definition file which have an effect on the files and tables produced. These metacommands are identified by a leading '%'.
%TokenStart n
    Sets the token value starting at the next lex token definition to n.
%OutputPrefix xxxxx
    Sets the (up to) 5-character prefix name to xxxxx. The prefix is used for naming the output '.h' files, and for generating some of the variable names in them. This change comes into effect with the next %Start.
%Start [nnn]
    Writes out the current lex table definition (if any), clears the internal tables, and starts a new lex table definition. The optional nnn sets the starting token value for the new table.
%Fwidth nn
    Specify the FLAGS table width if a two-dimensional character array is desired instead of the usual struct.
%Option Opt Val
    Sets an option that would otherwise have to be set on the command line. Val is Y or N, and must be supplied for all options.
    V [y|n]
        Verbose. LUTHOR will print 'progress reports' as it generates the definitions.
    C [y|n]
        Case sensitivity. If set to N, lex will recognize PRINT, print, and PrInT as the same token.
    B nnn
        Sets the beginning token value to nnn.
    F [y|n]
        Creates a FLAGS table.
    T [y|n]
        Creates a TEXT table.
    L [y|n]
        Creates a loadable lex table (.LLT) rather than a source-code .h file.
    D [y|n]
        Turns debugging on. This generates detailed operating information into a file called 'record'.
    S [y|n]
        Makes lex items static. This option simply places the keyword 'static' in the definition for the variable in the .h files.
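Pulling these together, a small definition file using metacommands might look like the following sketch. The token names, flag values, and patterns here are illustrative, not from a shipped sample; the FLAGS field is present on each line because %Option F Y is set:

```
%Option V N
%Option C N
%Option F Y
%OutputPrefix CALC1
%TokenStart 100

TK_PRINT   1,0   PRINT
TK_INT     0,1   -?[0-9]+

%Start 200
%OutputPrefix CALC2

TK_WORD    0,0   [A-Za-z]+
```

The %Start writes out the first table (tokens numbered from 100) and begins a second one (tokens numbered from 200) with a different output prefix.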
Lex Definition Strings
The lex definition of a token is a string that defines the sequence of characters that can be recognized as that token. Some tokens are simple, such as PRINT: the PRINT token simply consists of the letters P R I N T being received in the proper sequence. But other tokens, such as integers, real numbers, numbers in scientific notation, strings, and comments, may come in a large number of possible character sequences - yet all have to be recognized as the proper token type.
Some examples of valid lex definitions are:
PRINT
    matches the word 'PRINT'.
-?[0-9]+
    matches a positive or negative integer.
COLOR|COLOUR
    matches the American or British spellings of the word.
COLOU?R
    an alternate and equally valid definition of the word.
$[]*\.(80)?86
    A definition for the assembler directive '.86' or '.8086'. It is left-anchored (must be the first token on the line), but may have any amount of whitespace before the assembler directive.
Possible components of the lexical definition string are:
x
    Any specific single character; in this case, 'x'. Some characters may have to be escaped with a backslash, because they have meaning to the lexical generator. These characters are: $ (as first or last character), ^, *, ., (, ), +, [, ], \, ?, ~.
. (a period)
    Matches any character except linefeed (char(10)).
^x
    Control character. ^@ is char(0), ^A is char(1), etc.
[abc]
    Any of the characters in square brackets.
[]
    Matches whitespace: char(32), char(9), char(12), and char(13) (space, tab, formfeed, and carriage return, but not linefeed).
[a-d]
    Any range of characters.
[a-dh-z]
    The ranges a-d and h-z.
[~abc]
    All characters except a, b, and c.
~"
    Any character except the double quote.
Lexical elements can be modified using closure operators. These are:
*
    Zero or more occurrences of the preceding element.
+
    One or more occurrences of the preceding element.
?
    Zero or one occurrence of the preceding element.
Lexical definitions can be anchored to either the left or right, using the ‘$’ character. A left-anchored token must be the first token on a line in order to be recognized as that token. A right-anchored token must be the last token on the line in order to be recognized.
For instance, using this definition:
$\.NAME
the five-character sequence '.NAME' will be recognized only if it is the first token on a line. Otherwise, some other rule will be used.
A lexical definition can include OR functionality using ‘|’. For instance,
'~'*'|"~"*"
defines a string as being either a single quote (‘), followed by zero or more of any characters except a single quote, followed by a single quote; or a double quote (") followed by zero or more of any characters except a double quote, followed by a double quote.
A lexical definition can include parenthesized segments. For instance,
EQU(ATE)?
will match E Q U followed by zero or one occurrences of A T E, so this definition matches EQU or EQUATE.
The combination of the definition options can produce a quite complicated structure. For instance,
-?([0-9]+|([0-9]*\.[0-9]+)|([0-9]+\.[0-9]*))[Ee]-?[0-9]+
will recognize a number in scientific notation. It allows for zero or one negative signs; then a number which may be one or more digits, or zero or more digits plus a decimal plus one or more digits, or one or more digits plus a decimal plus zero or more digits; then the letter E in upper or lower case (if you have case sensitivity turned off, this extra definition is simply redundant); then zero or one negative signs; then one or more digits.
This complex definition is required because the first part of the number is valid with no decimal point, with numbers before the decimal but not after, with numbers after the decimal but not before; but it is not valid with a decimal point by itself.
Let's look at some of the examples used above, on a character-by-character basis.

"~"*"|'~'*'

A double quote, then zero or more of any character except a double quote, then a closing double quote; or the same pattern using single quotes. Either form is recognized as the same string token.

COLOU?R

The letters C, O, L, O, then zero or one U, then R. For those who are wondering, the British spelling of some words is different than the American spelling. So this particular trick would recognize either variant.

$[]*\.(80)?86

A '$' as the first character anchors the token to the start of the line; []* allows any amount of whitespace; \. is a literal period (escaped, since an unescaped '.' matches any character); (80)? optionally matches the digits 80; and 86 completes the directive. The result matches '.86' or '.8086' at the start of a line.
Lex Definition lines in the .DEF file
A definition line in the LUTHOR input file can have 2 to 4 parts, separated by whitespace:
TOKEN-NAME [FLAGS] [TEXT] DEFINITION-STRING
The first item is always the token name, and the last item is always the lexical definition string. The FLAGS section must be present if the -F option was specified, and the TEXT section must be present if the -T option was specified.
The FLAGS definition is assumed to be in a format suitable for a struct initialization. The TEXT definition is assumed to be suitable for a string. You must put quotes around the TEXT definition if it contains whitespace. There can be no whitespace in the FLAGS definition unless it is within a quoted string.
A simple definition line might look like this:

TK_HEX_N    0[Xx][0-9A-Fa-f]+|[0-9][0-9A-Fa-f]*[Hh]

A version of the same with all four fields would look like this:

TK_HEX_N    0,0,1,NULL    "Hex Number"    0[Xx][0-9A-Fa-f]+|[0-9][0-9A-Fa-f]*[Hh]
The struct table produced by LUTHOR is always named XXXXXFlagTab[] where XXXXX is the output prefix specified on the command line or by %OutputPrefix. The text table is always named XXXXXTextTab[].
OUTPUT: The XXXXXDEF.H file
The XXXXXDEF.H file is generated by LUTHOR based on the input .DEF file. It contains all the #define's for the lexical tokens, as well as a min and max #define.
OUTPUT: The XXXXXFLG.H file
The XXXXXFLG.H file is generated by LUTHOR based on the [FLAGS] section of the input file if the -F option is used. It is a zero-based table of struct XXXXXFlagTab[]. Correct lookup requires use of the min and max #define's in the XXXXXDEF.H file.
OUTPUT: The XXXXXTXT.H file
The XXXXXTXT.H file is generated by LUTHOR based on the [TEXT] section of the input file if the -T option is used. It is a zero-based table of char *XXXXXTextTab[]. Correct lookup requires use of the min and max #define's in the XXXXXDEF.H file.
OUTPUT: The XXXXXTAB.H file
This can be the largest file of the output set. It is generated by LUTHOR by default, unless the -L option is used to generate an LLT file instead. The file consists of two tables - a table of short ints called XXXXX_tab[] and a table of LEXNODE XXXXXTable[] items. There is also a single struct, LEXOBJECT XXXXXObject, whose pointer is passed to the runtime library. The structures for this include file are defined in LUTHOR.H, which is included automatically if it hasn't already been.
OUTPUT: The XXXXX.LLT file
If you have generated your lex table using the -L option, LUTHOR will generate an LLT file instead of the 2-4 .H files described above. The LLT file is a binary data file which essentially contains the data that would otherwise go into the XXXXXTAB.H file. The LLT file is loaded at runtime by passing the filename to the appropriate LUTHOR library routine.
This version of LUTHOR does not include support for loadable XXXXXTXT.H or XXXXXFLG.H files as part of the loadable lex table functionality. If you really need this support, one workaround is to generate the lex table twice - once with the -L option and once without. The .H files can be compiled into the EXE.
SAMPLE PROGRAMS
The SAMPLES subdirectory contains various subdirectories, numbered from 1 up. Each of these subdirectories contains files that illustrate an aspect of LUTHOR.
SAMPLES/1 Using LEXGEN with command-line Options.
SAMPLES/2 Using LEXGEN with metacommands to control options
SAMPLES/3 Using Multiple Input file definitions in a single .DEF file.
SAMPLES/4 Using the -T (text table) Option
SAMPLES/5 Using the -F (flags table) Option
SAMPLES/6 Simple use of LUTHOR functions to tokenize an input file.
SAMPLES/7 Using LexPushFile to process an include file.
SAMPLES/8 Using a second LexStream to process an INI file.
SAMPLES/9 Using Loadable Lex Tables.
SAMPLES/10 Using a LexStream set up with LexInputFunction
SAMPLES/11 Using a LexStream set up with LexInputByCall
SAMPLES/12 Changing lex tables in midstream using LexChangeTable
SAMPLES/13 Recovering from a bad token using AdvanceChar.
SAMPLES/14 Using LexInfo
SAMPLES/15 Using a LexStream set up with LexInputString
EXTERNAL VARIABLES
The following variables and library functions are available for use by an application program:
extern int lexerror
This int variable contains the last error, for functions that have an error return. The possible errors are defined in luthor.h.
VARIABLE TYPES
The LUTHOR library makes use of the following typedefs and structs. Many of the typedefs are simply ints; the only purpose of typedef'ing them is clarity.
typedef int LEXTABLE
A lex table is submitted to the LUTHOR engine by one of the LEX SETUP functions. Once this is done, the LEXTABLE value can be used to associate the table with any number of LEXSTREAMS. The LEXTABLE handle is valid until program termination or until the table is dropped using LexDropTable().
typedef int LEXSTREAM
A LEXTABLE, in association with an input stream of some kind, forms a LEXSTREAM. This handle specifies which input stream is being referred to when performing any of the TOKEN operations.
typedef struct LEXOBJECT
This structure is defined in LUTHOR.H. It is the structure of the object in XXXXXTAB.H that defines the lex table.
typedef struct LEXINFO
This structure is defined in LUTHOR.H. It contains the fields required to return the info specified by LexLineInfo.
INITIALIZATION ROUTINES
void LexInit(unsigned int options)
No return value.
Function initializes the lexical engine. It may be called explicitly, or will be called automatically the first time one of the other lexical routines is called, with options set to zero.
Options are defined by OR-ing the following constants:
LO_SUPPRESS - Suppresses the LUTHOR banner at initialization time.
SETUP ROUTINES
LEXTABLE LexSetObject(LEXOBJECT *item)
Returns a LEXTABLE handle.
Returns LXE_ERROR on error.
Function sets up a lexical analyzer table by passing a pointer to a LEXOBJECT. Returns an int handle which may be used to associate an input stream with this lex table.
LEXTABLE LexLoadObject(char *filename)
Returns a LEXTABLE handle.
Returns LXE_ERROR on error.
Function loads a Lexical analyzer table specified by filename. Returns a handle which may be used to associate an input stream with this lex table.
LEXSTREAM LexInputFile(LEXTABLE x, char *filename)
Returns a LEXSTREAM handle.
Returns LXE_ERROR on error.
Function assigns an input file to a given lex table, and returns a handle to a lexical analyzer stream.
LEXSTREAM LexInputFunction(LEXTABLE x, int (*fn)())
int (*fn)(); /* function for returning characters */
Returns a LEXSTREAM handle.
Returns LXE_ERROR on error.
Function assigns a function to be the source of input for a particular LEXSTREAM. The called function must take no arguments and must return an int value for the next character on each call. At EOF, the function must return EOF every time it is called.
LEXSTREAM LexInputByCall(LEXTABLE x)
Returns a LEXSTREAM handle.
Returns LXE_ERROR on error.
This function sets up a LEXSTREAM that can then be used with the BuildToken() function.
LEXSTREAM LexInputString(LEXTABLE x,char *string)
Returns a LEXSTREAM handle.
Returns LXE_ERROR on error.
Function sets up a LEXSTREAM using the supplied string as input. Further input can subsequently be submitted using LexSubmitString().
LEXTABLE LexChangeTable(LEXSTREAM x,LEXTABLE y)
Function changes the lex table being used on LEXSTREAM x to LEXTABLE y. The stack of pushed files is unchanged. The current token is unchanged, but the next token is reevaluated. Returns the handle of the previously active LEXTABLE.
FILE HANDLING ROUTINES
int LexPushFile(LEXSTREAM x, char *filename)
Returns 0 on success or LXE_ERROR on failure.
This function pushes a new file into the input stream, so that all further input comes from the new file until EOF. ThisToken() is not affected but NextToken() is reevaluated. When the input stream reaches EOF for the new file, further input continues from the former file, where it left off. The depth of stacking of files is limited only by the ability of the library to reallocate tables.
Note: When a file reaches EOF, one EOF token is returned. An AdvanceToken() will then retrieve the next token from the next pushed-down file. To determine if you are at true EOF, use LexLineInfo() to retrieve the current file pushed depth.
int LexCloseFile(LEXSTREAM x)
Function forces the current file closed for the specified LEXSTREAM, and causes it to return EOF as the next token. For LEXSTREAMs that have files pushed using LexPushFile(), the current file is closed and input continues from the next pushed-down file.
int LexCloseStream(LEXSTREAM x)
Once all input has been exhausted for a stream, that stream continues to exist, but returns only EOF for token values. To release resources, the stream must be explicitly closed. This function may also be used to prematurely close a stream. It will properly close all open files.
int LexDropTable(LEXTABLE x)
This function drops the specified lex table from the list of available tables. The function fails if the table is being used by an open stream, even one that has reached EOF. Use LexCloseStream() first to close all dependent streams.
The function releases memory if the specified lex table was an LLT; otherwise it just makes the lextable handle available for new lextables.
int LexSubmitString(LEXSTREAM x, char *string, int opt)
Returns buffer character count on success or LXE_ERROR on error.
This function submits a new string to a lexstream. Opt can be one of:
SSO_ATSTART
SSO_ATEND
SSO_REPLACE
These will, respectively, insert the new string at the current input location (forcing reevaluation of ThisToken()), append it to the end of the input, or completely replace the existing input (also forcing reevaluation).
The function returns the number of characters in the input buffer, including the current token, after submission. You can pass a NULL string to the function to get the current character count without affecting the buffer (value of opt is ignored). Clearing the input buffer using SSO_REPLACE requires a non-NULL, zero-length string.
TOKEN HANDLING FUNCTIONS
int ThisToken(LEXSTREAM x)
Returns the current token for a given LEXSTREAM, or EOF. Returns zero if a token cannot be recognized. In this case, one recovery option is to call AdvanceChar() until a token is recognized.
int NextToken(LEXSTREAM x)
Returns the value of the next token for a given LEXSTREAM. If the current token is zero or EOF, NextToken will return the same value.
char *ThisTokenText(LEXSTREAM x)
Returns a string pointer to the actual text of the current token, zero-terminated. This function returns a pointer to a static area, which is overwritten on every advance. Function returns NULL on EOF or zero token.
char *NextTokenText(LEXSTREAM x)
Returns a string pointer to the actual text of the next token. Error handling as per ThisTokenText().
int AdvanceToken(LEXSTREAM x)
Advance the stream to the next token. If the current token is zero or EOF, the call has no effect. Function returns the new current token.
int BuildToken(LEXSTREAM x, int c)
Returns valid token number, or zero if no token recognized yet, or EOF if all input is used up, or LXE_NOTRAN if input cannot be recognized as a token.
Function allows the program to submit characters one at a time to the lex table. The function will return zero until it has built a valid token, then it will return the token value. Input can be flushed by calling the routine with EOF (-1) as the value for c until the function returns EOF. If the function cannot resolve the received input into a valid token, it will return LXE_NOTRAN (-2) and flush its buffers of all pending input.
Notes:
• ThisTokenText(LEXSTREAM) will return NULL until a token is recognized, then will return the token text value until another character is submitted.
• Most of the other functions in the library are invalid for LEXSTREAMs that use BuildToken().
CHARACTER HANDLING FUNCTIONS
int ThisChar(LEXSTREAM x)
Returns the value of the character at the head of the stream (the first character of ThisTokenText() if current token is valid).
int AdvanceChar(LEXSTREAM x)
Function advances the input stream by one character and attempts to rebuild current token. Returns the new character at the head of the stream.
MISC FUNCTIONS
LEXINFO *LexLineInfo(LEXSTREAM x)
typedef struct lexinfostruct LEXINFO
Returns a pointer to a LEXINFO object that contains the following:
• char pointer to the name of the current input file (where applicable)
• Line number of the text line that the current token is on
• char pointer to the text of the current line.
• position in the current line of the start of the current token.
• Number of files ‘pushed’.
• Minimum and maximum token values for the lex table.
• Pointers to the Flags and Text Tables for LLT's.
The best book that I have found on the whole subject of lexical analysis is:
Compiler Design in C
by Allen Holub
Published by Prentice Hall, ISBN 0-13-155045-4
This book gives a clear, plain-English description of the theory and practice of lexical analysis and the design of lexical generators and tables.