
The software is composed of three parts:
1. Crawler and Parser
Web Crawler
The Crawler class implements the Enumeration interface and traverses the web starting at a given URL. It retrieves HTML files and parses them for new URLs and words. The URLs are added to a to-do queue to be scanned; the words are used in the information table. Each HTML file is returned as a URLConnection object by the nextElement() method. Given a domain, the web crawler crawls all the links within it.
The first thing the crawler does when it visits a web site is check a special file in the root of each server called robots.txt, a plain text file containing exclusions that indicate paths the crawler should not pursue. These exclusions are stored in a hashtable for later use. The NoRobots class implements the Robots Exclusion Standard through its ok() method. One by one, the crawler fetches URLs from the to-do queue, and each path is checked against any exclusions for the site. When ok() returns true, the web page is visited and the URL's DocID and LastModified data are put into the DocID hashtable. If the content type of the web page is "text/html", an htmlScanner object is created to parse the contents of the file, looking for URLs and words. After that, the doThisURL() method is called to discard URLs that are not located within the given domain. New URLs the crawler parses out are checked against the DocID hashtable; if the hashtable does not contain a URL, the URL is assigned a DocID and added to the to-do queue. During crawling, whenever a broken or erroneous link is encountered, a MalformedURLException is thrown and caught, and the user is notified of the error. The crawler repeats this process; when there are no URLs left in the to-do queue, the crawling is complete. The Crawler architecture is shown in Figure 1 below.
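The loop below is a minimal sketch of this process. The helper names robotsOk, scanForLinks, and inDomain are illustrative stand-ins for NoRobots.ok(), the htmlScanner, and doThisURL(); the sketch shows the flow of the crawl loop rather than the actual implementation.

    import java.io.IOException;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.*;

    /** Minimal sketch of the crawl loop; helper names are illustrative only. */
    public class CrawlLoopSketch {
        static long nextDocID = 1;
        static Queue<URL> toDo = new ArrayDeque<>();              // to-do queue of URLs
        static Map<String, long[]> docIDTable = new HashMap<>();  // URL -> {DocID, lastModified}

        static void crawl() {
            while (!toDo.isEmpty()) {
                URL url = toDo.poll();
                if (!robotsOk(url)) continue;                     // robots.txt exclusion check
                try {
                    URLConnection conn = url.openConnection();
                    docIDTable.put(url.toString(),
                            new long[]{nextDocID++, conn.getLastModified()});
                    String type = conn.getContentType();
                    if (type != null && type.startsWith("text/html")) {
                        for (URL found : scanForLinks(conn)) {    // htmlScanner's role
                            if (inDomain(found) && !docIDTable.containsKey(found.toString())) {
                                toDo.add(found);                  // unseen, in-domain URL: queue it
                            }
                        }
                    }
                } catch (IOException e) {                         // broken or error link
                    System.err.println("Skipping " + url + ": " + e.getMessage());
                }
            }
        }

        // Stubs standing in for NoRobots.ok(), the Acme parser, and doThisURL().
        static boolean robotsOk(URL u) { return true; }
        static List<URL> scanForLinks(URLConnection c) { return Collections.emptyList(); }
        static boolean inDomain(URL u) { return true; }
    }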
During crawling, the crawler will encounter web sites that use JavaScript, frames, image maps, and dynamically generated pages. Some sites use the SSL (Secure Sockets Layer) HTTPS protocol to ensure that private data is encrypted and authenticated in transit over the web. Still other sites have private, password-protected sections or use plug-ins such as Macromedia Flash, which do not generate HTML web pages. All of these can confuse the crawler; in the current design, the crawler simply ignores such pages.

Since the web is constantly changing, with links going out of date and new ones being created, it is very important to update our data repository regularly, so the crawler revisits the web periodically. When the crawler visits a page, it also fetches the date of last modification, both for updating purposes and as information for users.
Parser
We use the Acme parser package downloaded from the web. When a URL is found, its path is normalized: the reference part and query string are removed, and relative paths are made absolute. This parser can only parse out new URLs, so we added a gotText() method to extract the text that is not inside a tag from the HTML file. The extracted text is filtered to throw away meaningless symbols among the words. These words and their corresponding positions are later sent to the WordID hashtable for processing.
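A minimal sketch of this normalization step is shown below, written with java.net.URI for clarity rather than the Acme parser's own code; the example URL and the method name are purely illustrative assumptions.

    import java.net.URI;
    import java.net.URISyntaxException;

    /** Sketch: resolve a relative link against the current page and
     *  strip the query string and reference part. */
    public class UrlNormalizeSketch {
        static String normalize(URI page, String link) throws URISyntaxException {
            URI abs = page.resolve(link);                        // relative path -> absolute
            return new URI(abs.getScheme(), abs.getAuthority(),  // keep scheme://host/path
                           abs.getPath(), null, null)            // drop ?query and #reference
                    .toString();
        }

        public static void main(String[] args) throws URISyntaxException {
            URI page = new URI("http://example.edu/docs/index.html");   // illustrative URL
            System.out.println(normalize(page, "../papers/a.html?sort=date#sec2"));
            // prints http://example.edu/papers/a.html
        }
    }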
2. Information Table
WordID Table
After a document has been parsed, all its words are sent to a static WordIDTB object to update the lexicon table. Each word is assigned a unique WordID and stored in a hashtable, which uses the String of the word as the key and the assigned WordID as the value.
The WordID table could be updated one word at a time, but we prefer to update all the words in one document at once. There are two advantages to this second updating method:
The Hashtable class in the java.util package is used; the key is the String of the word, and the value is a unique WordID. A key point for the parser is to eliminate non-content-bearing "stopwords" such as "a", "and", "the", etc. An efficient way of doing this is to store all those words into the hashtable when the hashtable's constructor is called, assigning a WordID of 0 to each of them. Then, whenever we encounter such a word, we find that it is already in the table with ID 0, so we do not have to update it in the database.
Here is an overview of the updating process. After a stream of document text is passed in for update, words are retrieved one by one and each is assigned a position number. Each word is looked up in the hashtable. If it is in the table with a non-zero ID, that value is retrieved; if it is not in the table, the current maximum WordID is increased by 1 and assigned to the word. Finally, the WordID, together with the current DocID and the word's position, is written to the database.
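The sketch below illustrates this process under the assumptions above: stopwords are preloaded with WordID 0, and new words receive the next maximum WordID. The class, field, and method names are illustrative, not the actual WordIDTB interface.

    import java.util.Hashtable;

    /** Sketch of the WordID table update; stopwords are preloaded with ID 0
     *  so they are never written to the database. */
    public class WordIDTableSketch {
        private final Hashtable<String, Integer> table = new Hashtable<>();
        private int maxWordID = 0;                      // current maximum WordID

        public WordIDTableSketch() {
            for (String stop : new String[]{"a", "and", "the"}) {
                table.put(stop, 0);                     // stopwords get WordID 0
            }
        }

        /** Returns the WordID for a word, assigning a new one if unseen;
         *  0 means "stopword, do not store". */
        public int idFor(String word) {
            Integer id = table.get(word);
            if (id != null) return id;                  // known word (or stopword)
            table.put(word, ++maxWordID);               // new word: next WordID
            return maxWordID;
        }

        /** Update one parsed document: each non-stopword's (WordID, DocID, position)
         *  triple would be sent on to the database. */
        public void updateDocument(long docID, String[] words) {
            for (int pos = 0; pos < words.length; pos++) {
                int wordID = idFor(words[pos]);
                if (wordID != 0) {
                    // database.add(wordID, docID, pos);  // see the Database section
                }
            }
        }
    }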
DocID table
The DocID table is used to store all the URLs inside the given domain, but it stores more than the URLs themselves: we assign a unique DocID to each web document, save the document's last update time for database-updating efficiency, and count all the links that point to each URL.
Again the default Hashtable class in the java.util package is used. The URL itself, as a String, is the key for table lookup, and the value corresponding to the key is a structure containing the DocID, a long value for the last update time, and an integer counting the URL's backlinks.
The process for using and updating the DocID table is as follows. Whenever a URL is encountered during parsing, it is looked up in the hashtable. If it is already in the table, its backlink count is increased by 1 and an "already in-queue" message is returned to the parser, so the URL is not pushed onto the to-do queue again; this avoids parsing the same document multiple times. If it is not yet in the table, the current maximum DocID is increased by 1 and assigned to it, and the assigned DocID, together with the last update time and a count of 1, is stored in the table. A "Not-In-Queue" message is then returned to the parser so the URL is pushed onto the to-do queue for future crawling.
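A small sketch of this lookup follows, assuming an illustrative DocEntry structure for the (DocID, last update time, backlink count) value; it is not the actual class.

    import java.util.Hashtable;

    /** Sketch of the DocID table lookup; names are illustrative. */
    public class DocIDTableSketch {
        static class DocEntry {
            long docID;
            long lastModified;    // last update time of the document
            int backlinks;        // count of links pointing to this URL
            DocEntry(long id, long modified) { docID = id; lastModified = modified; backlinks = 1; }
        }

        private final Hashtable<String, DocEntry> table = new Hashtable<>();
        private long maxDocID = 0;   // current maximum DocID

        /** Returns true if the URL is "already in-queue",
         *  false if it is new and should be pushed onto the to-do queue. */
        public boolean seen(String url, long lastModified) {
            DocEntry entry = table.get(url);
            if (entry != null) {
                entry.backlinks++;               // one more link points here
                return true;                     // do not queue again
            }
            table.put(url, new DocEntry(++maxDocID, lastModified));
            return false;                        // "Not-In-Queue": push for crawling
        }
    }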
3. Database
The main functionality of the database is to store all the words (except the stopwords) of all the HTML documents visited during the crawling phase. Specifically, for every word we see in the documents, we record the DocID of the document in which it appears and the location, or position, of the word in that document. The other main functionality of the database is to support efficient search for a given word: given a word, the database should quickly return information such as the documents in which it appears and the positions of the word within those documents.
We have chosen a tree data structure for storing and retrieving word information. In Java, the TreeMap class serves this purpose. The keys in the TreeMap are the WordIDs, and the values are Hashtable objects. The keys in those Hashtables are DocIDs, and the corresponding values are vectors containing all the positions of the word in that specific document.
To add a piece of word information, consisting of the WordID, the DocID, and a position, to our database, we first check whether the word already exists in the database. If it does, we retrieve the corresponding Hashtable object and check whether the document is already recorded. If it is, we update the word-position vector with the new position; if the document is not in the database, we build a new vector, add the new position to it, and then update the Hashtable object.
If the word is not in our database, we build a new vector, add the new position to it, build a new Hashtable, add the (DocID, vector) pair to the Hashtable, and then add the Hashtable to the TreeMap.
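The following sketch shows this insertion logic on the TreeMap-of-Hashtables structure described above; the class and method names are illustrative assumptions.

    import java.util.Hashtable;
    import java.util.TreeMap;
    import java.util.Vector;

    /** Sketch of the inverted index: WordID -> (DocID -> positions). */
    public class WordDatabaseSketch {
        // TreeMap keyed by WordID; each value maps DocID to the word's positions.
        private final TreeMap<Integer, Hashtable<Long, Vector<Integer>>> index = new TreeMap<>();

        public void add(int wordID, long docID, int position) {
            Hashtable<Long, Vector<Integer>> docs = index.get(wordID);
            if (docs == null) {                       // word not yet in the database
                docs = new Hashtable<>();
                index.put(wordID, docs);
            }
            Vector<Integer> positions = docs.get(docID);
            if (positions == null) {                  // first occurrence in this document
                positions = new Vector<>();
                docs.put(docID, positions);
            }
            positions.add(position);                  // record where the word appears
        }

        /** Lookup used at search time: all (DocID, positions) pairs for a word. */
        public Hashtable<Long, Vector<Integer>> lookup(int wordID) {
            return index.get(wordID);
        }
    }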