ULex version 0.2
online: http://www.oocities.org/ulrikm/java/ulex/ulex.htm


Introduction
Lexer definition file format
Regular expression format
Examples

Introduction



 

Installation:



Run install.bat. It copies the files to C:\umsoft\ulex
 

Lexer definition file format:


The lexer program is generated from the definition file you give as argument to ULex. The file has 5 sections:

Import section: between %Import and Import%
    This section is copied into the output file before the definition of the lexer class.
Class option: between %ClassOptions and ClassOptions%
    The ClassOptions section allows you to customize the generated lexer class, e.g. whether it is final and what its name should be.
LexerClass: between %LexerClass and LexerClass%
    Code that must be part of the generated lexer class: extra members and methods.
Java: between %Java and Java%
    Auxiliary classes. The content is copied in after the lexer class.
LexerDef: between %LexerDef and LexerDef%
    The definition of the lexer - regular expressions to actions

Everything outside the sections is ignored. The only required section is the LexerDef section.
You can put comments outside the sections and in the code (as Java comments: //... and /*...*/).

In the following I will go through each of these sections.

Import section: between %Import and Import%

This section will normally contain import declarations. E.g. if you have put a method in the LexerClass section that uses classes in the java.util package, you will put "import java.util.*;" here:
%Import
    import java.util.*;
Import%


Class option: between %ClassOptions and ClassOptions%
 

LexerClass: between %LexerClass and LexerClass%

Code in this section becomes part of the generated lexer class definition, so any member or method defined here is a member or method of that class.

You should at all times avoid using these names as identifiers:
    variables:
         ULEX__index
         ULEX__currentIndex
         ULEX__currentState
         ULEX__theReader
         ULEX__tokenBuf
         ULEX__c
         ULEX__accepted
         ULEX__start
    methods:
        next_token
        ULEX__next
        getText
        getStartIndex
        getEndIndex
 
 

LexerDef: between %LexerDef and LexerDef%

This section is the most important one, and also the only one that is required. It defines a number of mappings from regular expressions
to actions. Since next_token() is required to return java_cup.runtime.Symbol objects, an action is normally just a return new Symbol(..); statement. The syntax is as follows:

    mapping : REGEXP %{ CODE %}

The CODE contains the code to be executed when the REGEXP is accepted.
If the CODE section is empty (i.e. contains no text whatsoever), the string matching REGEXP is ignored. For a description of the supported regular expression format, see the section "Regular expression format".
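
The idea of mapping regular expressions to actions can be sketched in plain Java. This is not the code ULex generates. The Token class is a hypothetical stand-in for java_cup.runtime.Symbol, the rules are written in java.util.regex syntax rather than in the ULex format, and picking the longest match is an assumption of the sketch, not documented ULex behaviour. A null action plays the role of an empty CODE section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MappingSketch {
    // Hypothetical stand-in for java_cup.runtime.Symbol, so the sketch compiles alone.
    static final class Token {
        final int kind;
        final String text;
        Token(int kind, String text) { this.kind = kind; this.text = text; }
    }

    static final int NUMBER = 1;
    static final int IDENT = 2;

    // One mapping: a regular expression plus the action to run when it is accepted.
    static final class Rule {
        final Pattern regexp;
        final Function<String, Token> action; // null = empty CODE: ignore the match
        Rule(String regexp, Function<String, Token> action) {
            this.regexp = Pattern.compile(regexp);
            this.action = action;
        }
    }

    static final List<Rule> RULES = Arrays.asList(
        new Rule("[0-9]+", text -> new Token(NUMBER, text)),
        new Rule("[a-z]+", text -> new Token(IDENT, text)),
        new Rule("[ \\t\\n]+", null)); // whitespace is matched and then dropped

    static List<Token> tokenize(String input) {
        List<Token> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Rule matched = null;
            int end = pos;
            // Assumption of this sketch: the longest non-empty match wins.
            for (Rule r : RULES) {
                Matcher m = r.regexp.matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt() && m.end() > end) {
                    matched = r;
                    end = m.end();
                }
            }
            if (matched == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (matched.action != null) {
                out.add(matched.action.apply(input.substring(pos, end)));
            }
            pos = end;
        }
        return out;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("abc 42")) {
            System.out.println(t.kind + " " + t.text);
        }
    }
}
```

Here the input "abc 42" yields two tokens (an IDENT and a NUMBER); the space between them is matched by the third rule and ignored.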
 
 

Java: between %Java and Java%
 
 

Regular expression format


I was very much inspired by mosml-lex in the definition of the regular expression format. However, the precedence rules are VERY obscure, so you should use the ( regexp ) form when combining regular expressions. For example, ULex currently treats
('a''a'*) the same as ('a''a')*. I will probably look into this problem at some time, but for now you should just use parentheses: (('a')('a'*))
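
To see why the grouping matters, the two readings can be compared with java.util.regex. This is only an illustration in standard Java regex syntax, not in the ULex format, and it does not exercise ULex itself. The pattern (a)(a*) corresponds to the explicitly parenthesized (('a')('a'*)), and (aa)* corresponds to ('a''a')*.

```java
import java.util.regex.Pattern;

// Illustration with java.util.regex, NOT with ULex: the two groupings
// accept different sets of strings.
class GroupingDemo {
    // True if the whole input matches the given java.util.regex pattern.
    static boolean accepts(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String oneOrMore = "(a)(a*)"; // an 'a' followed by zero or more 'a'
        String pairs = "(aa)*";       // zero or more repetitions of "aa"

        for (String s : new String[] {"", "a", "aa", "aaa"}) {
            System.out.println("\"" + s + "\""
                    + "  (a)(a*): " + accepts(oneOrMore, s)
                    + "  (aa)*: " + accepts(pairs, s));
        }
    }
}
```

The strings "a" and "aaa" are accepted only by the first pattern, and the empty string only by the second, so the two forms really do define different tokens.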

Here is the complete syntax for regular expressions:
 

    'char'               character constant*
    ['char'-'char']      match any character in the range
    ^'char'              match any character that is not 'char'
    ( regexp )           same as regexp
    regexp1 | regexp2    strings matching regexp1 or regexp2
    regexp1 regexp2      strings matching regexp1 concatenated with strings matching regexp2
    regexp+              one or more concatenations of strings matching regexp
    regexp*              zero or more concatenations of strings matching regexp
    regexp?              one or zero concatenations of strings matching regexp
    eof                  end of file
    .                    error - no other expression matched

* character constants:
    The following escape sequences are supported: \b, \t, \n, \f, \r, \", \', \\ and \number,
    where number is the ASCII value (0-128).
    For the characters (, ), [ and ] you MUST use the ASCII value if the regular expression contains () or [] groupings:
        (    :    40
        )    :    41
        [    :    91
        ]    :    93
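
As a rough illustration of what these escape sequences denote, here is a small Java sketch that decodes the text between the quotes of a character constant. It is not taken from ULex: the class and method names are hypothetical, and reading \number as a decimal value is an assumption based on the table above (40 is the decimal ASCII value of '(').

```java
// Hypothetical helper, not part of ULex: decodes the body of a character
// constant, i.e. the text between the single quotes.
class CharConstantSketch {
    static char decode(String body) {
        // A plain character stands for itself.
        if (body.length() == 1 && body.charAt(0) != '\\') {
            return body.charAt(0);
        }
        if (body.length() < 2 || body.charAt(0) != '\\') {
            throw new IllegalArgumentException("not a character constant: " + body);
        }
        String rest = body.substring(1);
        switch (rest) {
            case "b":  return '\b';
            case "t":  return '\t';
            case "n":  return '\n';
            case "f":  return '\f';
            case "r":  return '\r';
            case "\"": return '"';
            case "'":  return '\'';
            case "\\": return '\\';
            default:
                // \number: assumed to be a decimal ASCII value in the documented range.
                // Integer.parseInt throws NumberFormatException for non-numeric text.
                int code = Integer.parseInt(rest);
                if (code < 0 || code > 128) {
                    throw new IllegalArgumentException("value out of range: " + code);
                }
                return (char) code;
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("\\40")); // the constant '\40'
        System.out.println(decode("\\93")); // the constant '\93'
    }
}
```

Under these assumptions, '\40' and '\41' would be the way to write ( and ) inside a regular expression that also uses ( ) for grouping.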
 
 

Examples


I have provided some examples of lexer definition files:

  • java.lex : produces JavaLexer.java
        A lexer for a subset of Java (e.g. most keywords are treated as identifiers).
        The .java file to be lexed is given as argument to JavaLexer.