Introduction
Installation:
Lexer definition file format:
Import section: between %Import and Import%
This section is copied into the output file before
the definition of the lexer class.
Class option: between %ClassOptions and ClassOptions%
The ClassOptions section allows you to customize
the generated lexer class, e.g. whether it is final and what it is named.
LexerClass: between %LexerClass and LexerClass%
This section contains code that must be part of the
lexer class, such as extra members and methods.
Java: between %Java and Java%
Auxiliary classes. The content is copied in after
the lexer class.
LexerDef: between %LexerDef and LexerDef%
The definition of the lexer: mappings from regular expressions
to actions.
Everything outside the sections is ignored. The only required section
is the LexerDef section.
You can put comments outside the sections and in the code (as
Java comments: //... and /*...*/).
In the following I will go through each of these sections.
Import section: between %Import and Import%
%Import
import java.util.*;
Import%
Class option: between %ClassOptions and ClassOptions%
LexerClass: between %LexerClass and LexerClass%
You should at all times avoid using these names as identifiers:
variables:
ULEX__index
ULEX__currentIndex
ULEX__currentState
ULEX__theReader
ULEX__tokenBuf
ULEX__c
ULEX__accepted
ULEX__start
methods:
next_token
ULEX__next
getText
getStartIndex
getEndIndex
LexerDef: between %LexerDef and LexerDef%
This section is the most important, and also the only one required.
It defines a number of mappings from regular expressions
to actions. Since next_token() is required to return java_cup.runtime.Symbol
objects, the action of a mapping is normally just a return new Symbol(..);
statement. The syntax is as follows:
mapping : REGEXP %{ CODE %}
The CODE contains code to be executed when accepting the REGEXP.
If the CODE section is empty (i.e. contains no text whatsoever), the string
matching REGEXP is ignored. For a description of the supported format of
regular expressions, see Regular expression format.
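To make the idea of "regular expression to action" mappings concrete, here is a minimal hand-written sketch in plain Java. It is not the code the generator produces. It assumes java.util.regex as a stand-in for the generated automaton, a small stand-in Symbol class instead of java_cup.runtime.Symbol, made-up token codes (EOF, NUMBER, IDENT), a longest-match rule, and that getText() would return the matched text. The comments show what the corresponding mappings might look like in a LexerDef section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MappingSketch {
    // Hypothetical stand-in for java_cup.runtime.Symbol, so the sketch compiles without CUP.
    static final class Symbol {
        final int sym;
        final Object value;
        Symbol(int sym, Object value) { this.sym = sym; this.value = value; }
    }

    // Hypothetical token codes; with CUP these would normally come from a generated class.
    static final int EOF = 0, NUMBER = 1, IDENT = 2;

    // One mapping: a regular expression plus an action.
    // A null action plays the role of an empty CODE section.
    static final class Mapping {
        final Pattern regexp;
        final Function<String, Symbol> action;
        Mapping(String regexp, Function<String, Symbol> action) {
            this.regexp = Pattern.compile(regexp);
            this.action = action;
        }
    }

    private final List<Mapping> mappings = new ArrayList<>();
    private final String input;
    private int index = 0;

    MappingSketch(String input) {
        this.input = input;
        // (['0'-'9'])+ %{ return new Symbol(NUMBER, getText()); %}
        mappings.add(new Mapping("[0-9]+", text -> new Symbol(NUMBER, text)));
        // (['a'-'z'])+ %{ return new Symbol(IDENT, getText()); %}
        mappings.add(new Mapping("[a-z]+", text -> new Symbol(IDENT, text)));
        // ' ' %{%}   (empty action: the matched string is ignored)
        mappings.add(new Mapping(" ", null));
    }

    // Returns the next token; the longest match wins, ties go to the earlier mapping (assumption).
    Symbol next_token() {
        while (index < input.length()) {
            Mapping best = null;
            int bestEnd = index;
            for (Mapping m : mappings) {
                Matcher matcher = m.regexp.matcher(input);
                matcher.region(index, input.length());
                if (matcher.lookingAt() && matcher.end() > bestEnd) {
                    best = m;
                    bestEnd = matcher.end();
                }
            }
            if (best == null) {
                throw new IllegalStateException("no mapping matches at index " + index);
            }
            String text = input.substring(index, bestEnd);
            index = bestEnd;
            if (best.action != null) {
                return best.action.apply(text);
            }
            // Empty action: skip the matched text and keep scanning.
        }
        return new Symbol(EOF, null);
    }

    public static void main(String[] args) {
        MappingSketch lexer = new MappingSketch("abc 42");
        for (Symbol s = lexer.next_token(); s.sym != EOF; s = lexer.next_token()) {
            System.out.println(s.sym + " " + s.value);
        }
    }
}
```

For the input "abc 42" the sketch returns an IDENT symbol for "abc", silently skips the space because its action is empty, and then returns a NUMBER symbol for "42".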
Java: between %Java and Java%
Regular expression format:
I was very much inspired by mosml-lex in the definition of the regular
expression format. However, the precedence rules are VERY obscure, so you
should use the ( regexp ) format when combining regular expressions. For
example, ('a''a'*) is the same as ('a''a')*. I will probably look into this
problem at some time, but for now, you should just use explicit parentheses:
(('a')('a'*))
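The two groupings above accept different sets of strings, which is why the explicit parentheses matter. The following sketch illustrates the difference using java.util.regex as a stand-in (an assumption: it is not the generator's own regular expression engine, and the class and method names are made up for illustration).

```java
import java.util.regex.Pattern;

class GroupingSketch {
    // Counterpart of (('a')('a'*)): one 'a' followed by zero or more 'a'.
    static final Pattern ONE_THEN_STAR = Pattern.compile("(a)(a*)");
    // Counterpart of ('a''a')*: zero or more repetitions of "aa".
    static final Pattern PAIR_STAR = Pattern.compile("(aa)*");

    static boolean oneThenStar(String s) {
        return ONE_THEN_STAR.matcher(s).matches();
    }

    static boolean pairStar(String s) {
        return PAIR_STAR.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Print, for each input, whether each grouping accepts the whole string.
        for (String s : new String[] {"", "a", "aa", "aaa"}) {
            System.out.println("\"" + s + "\": " + oneThenStar(s) + " / " + pairStar(s));
        }
    }
}
```

The empty string is accepted only by the second grouping, while "a" and "aaa" are accepted only by the first; "aa" is accepted by both.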
Here is the complete syntax for regular expressions:
'char'               character constant*
['char'-'char']      match any character in the range
^'char'              match any character that is not 'char'
( regexp )           same as regexp
regexp1 | regexp2    strings matching regexp1 or regexp2
regexp1 regexp2      strings matching regexp1 concatenated with strings matching regexp2
regexp+              one or more concatenations of strings matching regexp
regexp*              zero or more concatenations of strings matching regexp
regexp?              zero or one concatenation of strings matching regexp
eof                  end of file
.                    error - no other expression matched
* character constants:
The following escape sequences are supported: \b,
\t, \n, \f, \r, \", \', \\ and \number,
where number is the ASCII value (0-128).
For the characters (, ), [ and ] you MUST use the ASCII
value if the regular expression contains () or [] groupings:
( : 40
) : 41
[ : 91
] : 93
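As an illustration of how these character constants map to characters, here is a small Java sketch that decodes the text between the quotes of a constant. It is not taken from the generator. The class and method names are made up, and it assumes that \number is written in decimal, which is consistent with ( being 40 in the list above.

```java
class CharConstantSketch {
    // Decodes the text between the quotes of a character constant,
    // e.g. a plain character, backslash-n, or backslash-40.
    static char decode(String body) {
        if (body.length() == 1 && body.charAt(0) != '\\') {
            return body.charAt(0); // plain character
        }
        if (body.length() < 2 || body.charAt(0) != '\\') {
            throw new IllegalArgumentException("not a character constant: " + body);
        }
        String rest = body.substring(1);
        switch (rest) {
            case "b": return '\b';
            case "t": return '\t';
            case "n": return '\n';
            case "f": return '\f';
            case "r": return '\r';
            case "\"": return '"';
            case "'": return '\'';
            case "\\": return '\\';
            default:
                // Assumption: the number is a decimal ASCII value.
                int code = Integer.parseInt(rest);
                if (code < 0 || code > 128) { // range as stated in the text above
                    throw new IllegalArgumentException("value out of range: " + code);
                }
                return (char) code;
        }
    }

    public static void main(String[] args) {
        // The four characters that must be written by value inside groupings.
        for (String body : new String[] {"\\40", "\\41", "\\91", "\\93"}) {
            System.out.println(body + " -> " + decode(body));
        }
    }
}
```

Under these assumptions, a constant such as '\40' stands for ( and '\93' stands for ], so a literal parenthesis or bracket can appear inside a grouped expression without being confused with the grouping itself.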
Examples
I have provided some examples of lexer file definitions: