Chue Wai Lian's JavaCC Article

Parsing Information from HyperText Markup Language (HTML) Files with Java Compiler Compiler (JavaCC)

ABSTRACT

     The rapid growth of the Internet has led to the development of Internet II. Web surfers view information retrieved from the Internet as rich and relevant. This article explores the basic concepts of automatic parser generation and the features of JavaCC, a parser generator from Sun Microsystems. An application example that parses information from an HTML file illustrates these concepts.

INTRODUCTION

     A parser converts human-readable text into a data structure known as a parse tree, which a program can process. The process is similar to compiling a Java source file (.java) into the corresponding Java bytecode (.class), which can be executed on any Java Virtual Machine.
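
     As a rough illustration of what converting text into a parse tree means, the sketch below hand-writes a tiny Java parser for sums such as 1+2+3. The names ParseTreeSketch, Node and parse are invented for this illustration and are unrelated to the code that JavaCC generates; a real grammar would be far richer.

```java
import java.util.ArrayList;
import java.util.List;

public class ParseTreeSketch {
    /** A parse tree node: a number leaf, or a "+" node with two children. */
    public static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label) { this.label = label; }

        @Override
        public String toString() {
            if (children.isEmpty()) return label;
            return "(" + label + " " + children.get(0) + " " + children.get(1) + ")";
        }
    }

    private final String input;
    private int pos = 0;

    private ParseTreeSketch(String input) { this.input = input; }

    /** Grammar rule: sum := number ( "+" number )* */
    private Node parseSum() {
        Node left = parseNumber();
        while (pos < input.length() && input.charAt(pos) == '+') {
            pos++; // consume the '+'
            Node plus = new Node("+");
            plus.children.add(left);
            plus.children.add(parseNumber());
            left = plus; // the tree grows to the left
        }
        if (pos != input.length()) {
            throw new IllegalArgumentException("Unexpected character at position " + pos);
        }
        return left;
    }

    /** Grammar rule: number := one or more digits */
    private Node parseNumber() {
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
        if (start == pos) {
            throw new IllegalArgumentException("Number expected at position " + pos);
        }
        return new Node(input.substring(start, pos));
    }

    public static Node parse(String text) {
        return new ParseTreeSketch(text).parseSum();
    }

    public static void main(String[] args) {
        System.out.println(parse("1+2+3")); // prints (+ (+ 1 2) 3)
    }
}
```

     The printed form shows the tree shape: the outer "+" node has the inner sum 1+2 as its first child and the leaf 3 as its second.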

     JavaCC is currently the most popular automatic parser generator for use with Java applications. It reads a high-level grammar specification (a .jj file) and converts it to a Java program (.java files) that can recognise matches to the grammar. In addition to the parser generator itself, JavaCC provides other standard capabilities related to parser generation, such as tree building, actions and debugging. Both JavaCC and the parsers it generates may be executed on a variety of Java platforms ("Write Once, Run Anywhere").

Figure 1 Flowchart illustrating the steps involved in creating a JavaCC parser program

     The name and function of each generated Java file are listed in Table 1.

Table 1 Description of the Generated Java Files

Generated Java File          Description
---------------------------  -----------
TokenMgrError.java           Returns a detailed message for the Error when it is thrown by the token manager to indicate a lexical error.
ParseException.java          This exception is thrown when parse errors are encountered.
Token.java                   Describes the input token stream.
ASCII_CharStream.java        An implementation of interface CharStream, where the stream is assumed to contain only ASCII characters (i.e. without UNICODE processing).
ParserName.java              This is the main parser program.
ParserNameTokenManager.java  Token Manager for the parser.
ParserNameConstants.java     Definition of constants for the parser.

     The power of automatic parser generation is that it allows programmers to concentrate on the grammar and NOT worry about the correctness of the implementation. This can be a tremendous time-saver in both simple and complex projects.

JAVACC GRAMMAR FILE

     All JavaCC grammar files must have parser specifications while lexical specifications are optional. Reference [1] gives a detailed description of the JavaCC Grammar File. Table 2 shows the structures that may be used in parser and lexical specifications.

Table 2 Valid Structures in Expansions/Regular Expressions

Valid Structures in Expansions / Regular Expressions  Meaning
----------------------------------------------------  -------
e1 | e2 | e3 | . . .                                  A choice of e1, e2, e3, etc. (i.e. Logical OR).
( . . . )+                                            One or more occurrences of . . .
( . . . )*                                            Zero or more occurrences of . . .
( . . . )?                                            An optional occurrence of . . .
[ . . . ]                                             A pattern that is matched by the characters (individual characters or character ranges) specified in . . .
~[ . . . ]                                            A pattern that matches any characters NOT specified in . . .
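
     Most of these structures have close analogues in the regular expressions of Java's own java.util.regex package, which may help readers who already know that notation. The sketch below is only an analogy: JavaCC quotes its string literals and writes negation as ~[ . . . ], whereas java.util.regex writes it as [^ . . . ]. The class name StructureSketch and the sample strings are invented for this illustration.

```java
import java.util.regex.Pattern;

public class StructureSketch {
    /** Returns true if the whole text matches the java.util.regex pattern. */
    public static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // e1 | e2 : a choice between alternatives
        System.out.println(matches("cat|dog", "dog"));     // true
        // ( ... )+ : one or more occurrences
        System.out.println(matches("(ab)+", "ababab"));    // true
        System.out.println(matches("(ab)+", ""));          // false
        // ( ... )* : zero or more occurrences
        System.out.println(matches("(ab)*", ""));          // true
        // ( ... )? : an optional occurrence
        System.out.println(matches("colo(u)?r", "color")); // true
        // [ ... ] : a character list, e.g. JavaCC ["a"-"z"]
        System.out.println(matches("[a-z]+", "aries"));    // true
        // ~[ ... ] : anything NOT listed, e.g. JavaCC ~["<",">"]
        System.out.println(matches("[^<>]+", "ARIES"));    // true
        System.out.println(matches("[^<>]+", "<B>"));      // false
    }
}
```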

PARSING AN HTML WEB PAGE

     In general, collecting and parsing information from an HTML file may be divided into two steps:

a.    The HTML file is collected from an external source via Transmission Control Protocol / Internet Protocol (TCP / IP).
b.    The HTML file is parsed to extract the relevant information.
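
     The two steps might be wired together in Java roughly as follows. This is a minimal sketch under stated assumptions: the class name FetchSketch, its helper methods and the ISO-8859-1 character set are choices made for this illustration, and the call to the generated parser is left as a comment because the parser class name depends on the grammar file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class FetchSketch {
    /** Step a: open the page over HTTP, which runs on top of TCP/IP. */
    public static Reader open(String address) throws IOException {
        // Assumption: the page is encoded as ISO-8859-1.
        return new BufferedReader(new InputStreamReader(
                URI.create(address).toURL().openStream(),
                StandardCharsets.ISO_8859_1));
    }

    /** Reads everything from the reader into one string. */
    public static String readAll(Reader in) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buffer = new char[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stand-in keeps the sketch runnable offline;
        // replace it with open("http://...") to fetch a real page.
        try (Reader in = new StringReader("<HTML><B>ARIES</B></HTML>")) {
            String html = readAll(in);
            System.out.println(html.length() + " characters read");
            // Step b: hand the text to the generated parser. The class
            // name is hypothetical, e.g.:
            // new Horoscope(new StringReader(html)).ParseFile();
        }
    }
}
```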

     An application example to extract horoscope information will be used to illustrate the process of writing a high-level grammar specification (i.e. Horoscope.jj).

HOROSCOPE PARSER PROGRAM

     Please refer to http://www.astrology-online.com/daily.htm for the daily updated horoscope web page. Suppose that only the following will be extracted: the date of the horoscope forecast (i.e. "These are for DayOfWeek MM/DD/YY"), the name of each of the twelve signs (i.e. ARIES, …, PISCES) and the horoscope for each sign. The following software was used for development: Java Development Kit (JDK) and JavaCC from Sun Microsystems.

     In order to "partition" the web page into information blocks, each containing part of the desired information, the source code of the web page was examined to identify unique features that "bookmark" the start and end of each block. These features are defined as TOKENs. Note also that white space between HTML tags is skipped and NOT considered for parsing, since it does NOT contain relevant information. White space inside a run of text remains part of the TEXT token, because the token manager prefers the longest match; this is why the productions call trim() on the extracted text.

// LEXICAL SPECIFICATIONS

SKIP :
{
          // Skip white space
          // Space | Newline | Tab | Carriage Return
          < (" " | "\n" | "\t" | "\r" )+ >
}

TOKEN [IGNORE_CASE] :
{
          // Start of Sign
          < SSIGN: "<I><U>" >
          |
          // Start of Change HTML tag
          < SCHANGE: "<!--" ("-")+ "CHANGE" <TEXT> ">" >
          |
          // Start of End HTML tag
          < SEND: "<!--" ("-")+ "END" <TEXT> ">" >
          |
          // End of the horoscope for the 12 signs
          < EHORO: "<!---EDIT NOTHING ELSE--->" >
          |
          // HTML tag in a HTML file
          < TAG: "<" (~[">"])+ ">" >
          |
          // Text in a HTML file
          < TEXT: (~["<",">"])+ >
}
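
     To make the effect of these definitions concrete, the following Java sketch imitates the token manager by hand using java.util.regex. It is an approximation written for this article, not the code JavaCC generates: the class name LexerSketch, the rule table and the sample input are assumptions. The rules are tried in the same order as in the TOKEN block, which for these particular tokens gives the same result as JavaCC's rule of taking the longest match and breaking ties by declaration order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LexerSketch {
    // Token kinds in the same order as the TOKEN block.
    private static final String[] KINDS = {
        "SSIGN", "SCHANGE", "SEND", "EHORO", "TAG", "TEXT"
    };

    // java.util.regex translations of the JavaCC regular expressions.
    private static final int CI = Pattern.CASE_INSENSITIVE; // [IGNORE_CASE]
    private static final Pattern[] PATTERNS = {
        Pattern.compile("<I><U>", CI),
        Pattern.compile("<!---+CHANGE[^<>]+>", CI),
        Pattern.compile("<!---+END[^<>]+>", CI),
        Pattern.compile("<!---EDIT NOTHING ELSE--->", CI),
        Pattern.compile("<[^>]+>", CI),
        Pattern.compile("[^<>]+", CI),
    };

    // The SKIP rule: space, newline, tab, carriage return.
    private static final Pattern WHITE_SPACE = Pattern.compile("[ \\n\\t\\r]+");

    /** Splits the input into entries of the form "KIND:image". */
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String kind = null;
            String image = null;
            for (int r = 0; r < KINDS.length && kind == null; r++) {
                Matcher m = PATTERNS[r].matcher(input).region(pos, input.length());
                if (m.lookingAt()) {
                    kind = KINDS[r];
                    image = m.group();
                }
            }
            if (kind == null) {
                // Roughly what TokenMgrError reports in a generated parser.
                throw new IllegalStateException("Lexical error at position " + pos);
            }
            pos += image.length();
            // A run consisting only of white space is skipped.
            boolean skipped = kind.equals("TEXT") && WHITE_SPACE.matcher(image).matches();
            if (!skipped) {
                tokens.add(kind + ":" + image);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Made-up sample input in the style of the horoscope page.
        String html = "<HTML>\n<!---CHANGE ARIES HERE--->A sample forecast."
                    + "<!---END ARIES--->\n<I><U><B>TAURUS</B></U></I>";
        for (String token : tokenize(html)) {
            System.out.println(token);
        }
    }
}
```

     Note that the order of the rules matters: if TAG were listed before SCHANGE, every HTML comment would be classified as an ordinary TAG and the "bookmarks" would be lost.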

     The parsing process may be divided into three major JavaCC productions:

a.    Header – Find where the content of interest starts.
b.    Content – Get the content of interest and process its text.
c.    Footer – Find where the content of interest ends.

// PARSER SPECIFICATION

/*
** JavaCC Production : ParseFile()
** Description : Divide the parsing procedure into three parts.
*/
void ParseFile() :
{
}
{
          Header() (GetContent())+ Footer()
}

     Notice that the date of the horoscope forecast lies in between "<!---CHANGE DATE AND DAY HERE--->" and "<!---END CHANGE DATE AND DAY---->" HTML comment tags. The date of the horoscope forecast will be extracted in the JavaCC production Header as follows:

// PARSER SPECIFICATION

/*
** JavaCC Production : Header()
** Description : Find where the content of interest starts.
** Parameters : Token TokenDate - Contents of horoscope date.
*/
void Header() :
{
          Token TokenDate = null ;
}
{
          (<TAG> | <TEXT>)+
          <SCHANGE> TokenDate=<TEXT>
          <SEND> (<TAG> | <TEXT>)+
          {
                    HoroscopeData.StrForecastDate = TokenDate.image.trim() ;
          }
}

     From the web page, notice that the format and layout of each block of horoscope forecast is the same for all twelve signs. Hence, a single JavaCC production, GetContent, was designed to handle one block and is simply repeated for each sign. Note that in ParseFile(), the JavaCC production GetContent was specified to repeat one or more times (i.e. (GetContent())+); this is repetition rather than recursion, since GetContent never calls itself. Upon parsing the web page, GetContent will be executed exactly twelve times, once for each forecast from ARIES to PISCES.
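
     Conceptually, the expansion (GetContent())+ behaves like a do-while loop that keeps going for as long as the next token could start another block, which here means an SSIGN token. The Java sketch below imitates that behaviour on a plain list of token kinds. The class RepetitionSketch is hypothetical, and its getContent method is a simplified stand-in that accepts any mix of TAG and TEXT tokens where the real production is stricter.

```java
import java.util.Arrays;
import java.util.List;

public class RepetitionSketch {
    private final List<String> kinds; // token kinds in input order
    private int pos = 0;

    private RepetitionSketch(List<String> kinds) { this.kinds = kinds; }

    /** One-token lookahead: is the next token of the given kind? */
    private boolean nextIs(String kind) {
        return pos < kinds.size() && kinds.get(pos).equals(kind);
    }

    /** Consumes one token of the given kind or reports a parse error. */
    private void expect(String kind) {
        if (!nextIs(kind)) {
            throw new IllegalStateException("Expected " + kind + " at token " + pos);
        }
        pos++;
    }

    /** (<TAG> | <TEXT>)+ */
    private void tagsOrText() {
        do {
            if (nextIs("TAG")) expect("TAG"); else expect("TEXT");
        } while (nextIs("TAG") || nextIs("TEXT"));
    }

    /** Simplified stand-in for one horoscope block. */
    private void getContent() {
        expect("SSIGN");
        tagsOrText();
        expect("SCHANGE");
        expect("TEXT");
        expect("SEND");
        tagsOrText();
    }

    /** ( GetContent() )+ : parses one or more blocks and counts them. */
    public static int countBlocks(List<String> kinds) {
        RepetitionSketch parser = new RepetitionSketch(kinds);
        int count = 0;
        do {
            parser.getContent();
            count++;
        } while (parser.nextIs("SSIGN")); // repeat while another block starts
        return count;
    }

    public static void main(String[] args) {
        List<String> kinds = Arrays.asList(
            "SSIGN", "TAG", "TEXT", "TAG", "SCHANGE", "TEXT", "SEND", "TAG",
            "SSIGN", "TAG", "TEXT", "TAG", "SCHANGE", "TEXT", "SEND", "TAG",
            "EHORO");
        System.out.println(countBlocks(kinds)); // prints 2
    }
}
```

     The loop stops at the EHORO token because it cannot start another block; in the grammar, that is the point where Footer() takes over.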

     Observe that the names of the twelve signs are in italics and underlined (i.e. "<I><U>" HTML tags). The horoscope for the corresponding sign lies in between "<!---CHANGE SIGNNAME HERE--->" and "<!---END SIGNNAME--->" HTML comment tags. The sign name and the corresponding horoscope will be extracted in the JavaCC production GetContent as follows:

// PARSER SPECIFICATION

/*
** JavaCC Production : GetContent()
** Description : Get the content of interest.
**                      Process content text.
** Parameters : Token TokenSign - Contents of Sign name.
**                      Token TokenStory - Contents of horoscope.
*/
void GetContent() :
{
          Token TokenSign = null ;
          Token TokenStory = null ;
}
{
          <SSIGN> (<TAG>)+ TokenSign=<TEXT> (<TAG>)+
          <TEXT> (<TAG> | <TEXT>)+
          <SCHANGE>
          TokenStory=<TEXT>
          <SEND> (<TAG> | <TEXT>)+
          {
                    // Get Sign
                    HoroscopeData.AStrSign[i] = TokenSign.image.trim() ;

                    // Get Story
                    HoroscopeData.AStrContent[i++] = TokenStory.image.trim() ;
          }
}
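
     The productions above store their results in a class named HoroscopeData and use a counter i, neither of which is shown in this article. The Java sketch below suggests one possible shape for such a holder. Only the field names StrForecastDate, AStrSign and AStrContent come from the grammar; the class name HoroscopeDataSketch, the constant SIGN_COUNT, the addForecast helper and the sample strings are assumptions made for illustration.

```java
public class HoroscopeDataSketch {
    /** Twelve signs, ARIES to PISCES. */
    public static final int SIGN_COUNT = 12;

    /** Date of the forecast, filled in by the Header production. */
    public static String StrForecastDate = "";

    /** Sign names and forecasts, filled in by the GetContent production. */
    public static final String[] AStrSign = new String[SIGN_COUNT];
    public static final String[] AStrContent = new String[SIGN_COUNT];

    // Plays the role of the counter i used in GetContent (assumed to be
    // declared in the parser class in the original program).
    private static int i = 0;

    /** Stores one sign and its forecast, trimming surrounding white space. */
    public static void addForecast(String sign, String story) {
        if (i >= SIGN_COUNT) {
            throw new IllegalStateException("More than " + SIGN_COUNT + " signs");
        }
        AStrSign[i] = sign.trim();
        AStrContent[i++] = story.trim();
    }

    /** Number of forecasts stored so far. */
    public static int count() {
        return i;
    }

    public static void main(String[] args) {
        // Placeholder text, not taken from the real web page.
        addForecast("  ARIES ", "\n  A sample forecast.  ");
        System.out.println(AStrSign[0] + ": " + AStrContent[0]); // prints ARIES: A sample forecast.
    }
}
```

     The bounds check is a design choice of this sketch: it reports a page with more than twelve blocks with a clear message instead of an ArrayIndexOutOfBoundsException.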

     For the JavaCC production Footer, the "<!---EDIT NOTHING ELSE--->" HTML comment tag marks the end of the horoscope forecast. The rest of the HTML file (made up of HTML tags or text in the HTML file) will be "ignored".

// PARSER SPECIFICATION

/*
** JavaCC Production : Footer()
** Description : Find where the content of interest ends.
** Parameters : None.
*/
void Footer() :
{
}
{
          <EHORO> (<TAG> | <TEXT>)+
}

CONCLUSION

     JavaCC is an extremely useful tool for automatic parser generation. Like Java, both JavaCC and the parsers generated by JavaCC are platform-independent. The major advantage of using JavaCC to parse HTML files is that a programmer only needs to concentrate on the grammar. Only minor modifications to the grammar file need to be made if the format and layout of the HTML file change. However, if a C++ or Visual Basic program were written by hand to parse HTML files, major modifications to the source code would be required for the same change. The HTML parser discussed earlier may be extended to parse HTML files in other languages, provided that UNICODE is used to encode the contents of the HTML web pages.

REFERENCES

[1] Sun Microsystems JavaCC Web Site, http://www.sun.com/suntest/products/JavaCC/index.html

[2] Sun Microsystems JDK Web Site, http://www.javasoft.com/products/jdk/1.1/index.html


Copyright© 2007 by Chue Wai Lian

Last updated on Wednesday, 26 December 2007

Please send all mails to wailian at hotmail dot com