Software Architecture of a Web Crawler



The software is composed of three parts:

  1. Crawler and Parser: this is the kernel of the design. We begin from a given URL and domain, download each document, and parse it, retrieving the URLs inside. Each valid, first-seen URL is assigned a DocID and placed in the to-do queue. After each document is parsed, the next URL in the to-do queue is taken out for crawling.
  2. Information Table: we maintain two tables during the crawling. The first is a WordID table: every distinct word in the web documents is assigned a unique WordID and stored in the WordID table. Words are treated as case-insensitive to keep the table from growing too large. The second table stores all URLs inside the given domain; a unique DocID is assigned to each distinct URL for search purposes. To reduce search time, both tables are implemented as hashtables.
  3. Database: a database is maintained to store each word and its appearances across all documents. The WordID is the primary key for the table, followed by a list of the DocIDs of documents that contain the word and the positions of the word in each document. The positions are needed for importance ranking when answering keyword searches.

1. Crawler and Parser

Web Crawler

The Crawler class implements the Enumeration interface and traverses the web starting at a given URL. It retrieves HTML files and parses them for new URLs and words. The URLs are added to a to-do queue to be scanned; the words are used in the information table. The HTML file is returned as a URLConnection object by the nextElement() method. Given a domain, the web crawler crawls over all the links within it.

The first thing the crawler does when it visits a web site is check a special file in the root of the server called robots.txt. This plain text file contains exclusions indicating paths the crawler should not pursue; these exclusions are stored in a hashtable for later use. The ok() method of the NoRobots class implements the Robots Exclusion Standard. One by one, the crawler fetches a URL from the to-do queue and checks its path against any exclusions for the site. When ok() returns true, the web page is visited and the DocID and last-modified date of the URL are put into the DocID hashtable. If the content type of the page is "text/html", an htmlScanner object is created to parse the contents of the file, looking for URLs and words. After that, the doThisURL() method discards the URLs that are not located within the given domain. Each new URL the crawler parses out is checked against the DocID hashtable; if the hashtable does not contain it, the URL is assigned a DocID and added to the to-do queue. Whenever the crawler hits a broken or erroneous link, a MalformedURLException is thrown and caught, and the user is notified of the error. The crawler repeats this process; when there are no URLs left in the to-do queue, the crawl is complete. The crawler architecture is shown in Figure 1 below.
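The queue-driven loop described above can be sketched in Java as follows. This is a minimal in-memory simulation, not the actual Crawler class: the names CrawlLoop, extractLinks() and the fake three-page web are illustrative assumptions, and ok() only mimics the robots.txt exclusion check.

```java
import java.util.*;

// Simplified sketch of the crawl loop: pop a URL, check exclusions,
// parse out links, enqueue first-seen URLs with fresh DocIDs.
public class CrawlLoop {
    // URL -> DocID; doubles as the "already seen" set.
    static Map<String, Integer> docIdTable = new Hashtable<>();
    static Deque<String> todo = new ArrayDeque<>();
    static int nextDocId = 1;

    // Stand-in for NoRobots.ok(): paths under /private/ are excluded.
    static boolean ok(String url) {
        return !url.contains("/private/");
    }

    // Stand-in for fetching and parsing a page; returns the links it contains.
    static List<String> extractLinks(String url) {
        Map<String, List<String>> fakeWeb = new HashMap<>();
        fakeWeb.put("http://example.com/", Arrays.asList(
            "http://example.com/a", "http://example.com/private/x"));
        fakeWeb.put("http://example.com/a", Arrays.asList("http://example.com/"));
        return fakeWeb.getOrDefault(url, Collections.emptyList());
    }

    static void enqueue(String url) {
        if (!docIdTable.containsKey(url)) {      // first-seen URL only
            docIdTable.put(url, nextDocId++);    // assign a DocID
            todo.addLast(url);
        }
    }

    public static void main(String[] args) {
        enqueue("http://example.com/");
        while (!todo.isEmpty()) {
            String url = todo.removeFirst();
            if (!ok(url)) continue;              // robots.txt exclusion
            for (String link : extractLinks(url)) enqueue(link);
        }
        System.out.println(docIdTable.size() + " URLs seen");
    }
}
```

The crawl terminates exactly as the text says: once every discovered URL has been popped and processed, the queue is empty and the loop exits.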

During the crawl, the crawler will encounter web sites with JavaScript, frames, image maps, and dynamically generated pages. Some sites use SSL (Secure Sockets Layer) and the HTTPS protocol to ensure that private data is encrypted and authenticated in transit. Still others have private, password-protected sections, or use plug-ins such as Macromedia Flash that do not produce plain HTML pages. All of these can confuse the crawler; in the current design, the crawler simply ignores such pages.

Figure 1: Web Crawler Architecture

Since the web is constantly changing, with links going out of date and new ones being created, it is important to update our data repository regularly, so the crawler revisits the web periodically. When the crawler visits a page it also fetches the last-modified date, both for updating purposes and for the information of users.

Parser

We use the Acme parser package downloaded from the web. When a URL is found, its path is normalized: the reference part (fragment) and the query string are removed, and relative paths are made absolute. This parser can only parse out new URLs, so we added a gotText() method to extract the text of the HTML file that is not inside a tag. The extracted text is filtered to discard meaningless symbols among the words. These words and their corresponding positions are later sent to the WordID hashtable for processing.
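The normalization step can be sketched with java.net.URI, which handles both the relative-to-absolute resolution and the removal of the query string and fragment. The class and method names here are illustrative, not part of the Acme package.

```java
import java.net.URI;

// Sketch of URL normalization: resolve a relative link against the page it
// was found on, then rebuild the URL without query string or fragment.
public class UrlNormalizer {
    static String normalize(String base, String link) {
        URI resolved = URI.create(base).resolve(link);
        try {
            // Passing null for query and fragment drops both parts.
            return new URI(resolved.getScheme(), resolved.getAuthority(),
                           resolved.getPath(), null, null).toString();
        } catch (java.net.URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://example.com/dir/page.html",
                                     "../other.html?x=1#sec2"));
        // → http://example.com/other.html
    }
}
```

Normalizing before the DocID lookup matters: without it, the same document reached via different relative paths or fragments would receive multiple DocIDs and be crawled repeatedly.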


2. Information Table

WordID Table

After a document has been parsed, all of its words are sent to a static WordIDTB object to update the lexicon table. Each word is assigned a unique WordID and stored in a hashtable that uses the word's String as the key and the assigned WordID as the value.

The WordID table could be updated one word at a time, but we prefer to update all the words of a document at once. This batch approach has two advantages:

  1. We need to store the position of each word in each document in our database, and if we update a whole document at once, we can assign the positions to the words as we go through the document.
  2. We can use a buffered stream to read in the document and create a Thread to do the updating work in parallel. We can also free the buffer space after each update, making better use of memory.

The Hashtable class in the java.util package is used; the key is the String of the word, and the value is a unique WordID. A key point for the parser is to eliminate non-content-bearing "stopwords" such as "a", "and", "the", etc. An efficient way of doing this is to store all those words into the hashtable when its constructor is called, assigning them all a WordID of 0. Then, whenever we encounter such a word, we find that it is already in the table with ID 0, so we don't have to write it to the database.

The updating process works as follows. After a document stream is passed in for update, words are retrieved one by one and assigned position numbers. Each word is looked up in the hashtable. If it is in the table with a non-zero ID, that value is retrieved; if it is not in the table, the current maximum WordID is incremented and assigned to the word. Finally, the WordID, together with the current DocID and the word's position, is written to the database.
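The stopword seeding and lookup-or-assign logic above can be sketched like this. The field and method names (table, maxWordId, idFor) are illustrative, not the real WordIDTB interface, and the stopword list is truncated for the example.

```java
import java.util.*;

// Sketch of the WordID table: stopwords are pre-seeded with WordID 0 so
// they are recognized and skipped; new words get the next WordID.
public class WordIDTB {
    static Hashtable<String, Integer> table = new Hashtable<>();
    static int maxWordId = 0;

    static {
        for (String stop : new String[] {"a", "and", "the"})
            table.put(stop, 0);                  // stopwords get WordID 0
    }

    // Returns the WordID for a word, assigning a new one on first sight.
    // Words are lower-cased so the table is not case-sensitive.
    static int idFor(String word) {
        String key = word.toLowerCase();
        Integer id = table.get(key);
        if (id == null) {
            id = ++maxWordId;                    // current maximum WordID + 1
            table.put(key, id);
        }
        return id;                               // 0 means "stopword, skip"
    }

    public static void main(String[] args) {
        String[] doc = {"The", "crawler", "and", "the", "parser"};
        for (int pos = 0; pos < doc.length; pos++) {
            int id = idFor(doc[pos]);
            if (id == 0) continue;               // do not send stopwords on
            // In the real system the (WordID, DocID, position) triple
            // would be written to the database here.
            System.out.println("WordID " + id + " at position " + pos);
        }
    }
}
```

Note that positions are assigned over the whole token stream, including stopwords, so the recorded positions still reflect where each word sits in the original document.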

DocID table

The DocID table stores all the URLs inside the given domain, but it holds more than that: we assign a unique DocID to each web document, save the document's last update time for database updating efficiency, and count all the links that point to each URL.

Again the Hashtable class from the java.util package is used. The URL itself, as a String, is the key for table search, and the value corresponding to the key is a structure containing the DocID, a long holding the last update time, and an integer counting the URL's backlinks.

The process for using and updating the DocID table is as follows. Whenever a URL is encountered during parsing, it is looked up in the hashtable. If it is already in the table, its backlink count is incremented and an "already in-queue" message is returned to the parser, so the URL won't be pushed into the to-do queue again; this avoids parsing the same document more than once. If it is not yet in the table, the current maximum DocID is incremented and assigned to it, and a new entry containing the assigned DocID, the last update time, and a backlink count of 1 is stored. A "Not-In-Queue" message is returned to the parser, which pushes the URL into the to-do queue for future crawling.
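A minimal sketch of that entry structure and lookup, assuming illustrative names (DocEntry, touch); the boolean return stands in for the "already in-queue" / "Not-In-Queue" messages described above.

```java
import java.util.*;

// Sketch of the DocID table: URL -> (DocID, last update time, backlinks).
public class DocIDTable {
    static class DocEntry {
        int docId;
        long lastModified;
        int backlinks;
        DocEntry(int docId, long lastModified) {
            this.docId = docId;
            this.lastModified = lastModified;
            this.backlinks = 1;                  // the link that found it
        }
    }

    static Hashtable<String, DocEntry> table = new Hashtable<>();
    static int maxDocId = 0;

    // Returns true ("Not-In-Queue": enqueue the URL) if the URL is new;
    // false ("already in-queue") after bumping its backlink count.
    static boolean touch(String url, long lastModified) {
        DocEntry e = table.get(url);
        if (e != null) {
            e.backlinks++;                       // count one more backlink
            return false;
        }
        table.put(url, new DocEntry(++maxDocId, lastModified));
        return true;
    }

    public static void main(String[] args) {
        boolean first = touch("http://example.com/a", 0L);
        boolean again = touch("http://example.com/a", 0L);
        System.out.println(first + " " + again + " backlinks="
            + table.get("http://example.com/a").backlinks);
        // → true false backlinks=2
    }
}
```

Keeping the backlink counter inside the same entry means one hashtable probe serves both duplicate detection and the link-count statistic.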


3. Database

The main functionality of the database is to store all the words (except the stopwords) of all the HTML documents visited during the crawling phase. Specifically, for every word we see in the documents, we record the DocID of the document in which it appears and the position of the word within that document. The other main functionality is to support efficient search: given a word, the database should quickly return information such as the documents in which it appears and its positions in those documents.

We have chosen a tree data structure for storing and retrieving word information. In Java, the TreeMap class serves this purpose. The keys in the TreeMap are the WordIDs; the values are Hashtable objects.

The keys in those Hashtables are DocIDs and the corresponding values are vectors that contain all the positions of each word in the specific document.

To add a piece of word information containing the WordID, the DocID and a position to our database, we first check whether the word already exists in the database. If it does, we retrieve the corresponding Hashtable object and check whether the document is already recorded. If so, we update the word's position vector with the new position; if the document is not yet recorded, we build a new vector, add the new position to it, and then update the Hashtable object.

If the word is not in our database, we build a new vector, add the new position into the vector, build a new Hashtable, add the pair (DocId, vector) to the Hashtable, and then add the Hashtable to the TreeMap.
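The two insertion paths above can be collapsed into one short method on the nested TreeMap-of-Hashtables structure. This is a sketch under the structure the text describes; the names WordDatabase and addWord are illustrative.

```java
import java.util.*;

// Sketch of the three-level index described above:
// TreeMap<WordID, Hashtable<DocID, Vector<position>>>.
public class WordDatabase {
    static TreeMap<Integer, Hashtable<Integer, Vector<Integer>>> db = new TreeMap<>();

    // Record that wordId occurs in document docId at the given position,
    // creating the per-word Hashtable and per-document Vector as needed.
    static void addWord(int wordId, int docId, int position) {
        Hashtable<Integer, Vector<Integer>> docs =
            db.computeIfAbsent(wordId, k -> new Hashtable<>());
        Vector<Integer> positions =
            docs.computeIfAbsent(docId, k -> new Vector<>());
        positions.add(position);
    }

    public static void main(String[] args) {
        addWord(7, 1, 3);    // WordID 7 at position 3 of document 1
        addWord(7, 1, 10);   // same word later in the same document
        addWord(7, 2, 5);    // same word in another document
        System.out.println(db.get(7).get(1) + " " + db.get(7).get(2));
        // → [3, 10] [5]
    }
}
```

A keyword search then reduces to a single logarithmic TreeMap lookup by WordID, which returns the complete document-and-position map for that word at once.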

