Internet Harvesting Agent Design
There are four fundamental classes of problems to be addressed
in the design of harvesting robots.
1. Harvesting robots should be nice.
They should neither waste network resources
nor overload the Internet servers from which the search engine obtains its
information (a minimal sketch of one such politeness strategy follows this list).
2. They should do so
while still maximizing performance on an Internet that is growing at
an ever-increasing rate.
3. Robot software should maintain the integrity of the database,
i.e. it should
ensure that records in the database are updated when needed and are
removed when resources disappear.
We will not dwell upon all these gory details here.
4. The robot, or possibly some auxiliary software, should parse the retrieved
documents to produce database records.
This is a complicated task. First, HTML is a moving target and, although
it is in principle easy to parse, it is in reality a programmer's nightmare:
the proportion of documents with unpredictable, erroneous syntax is high,
and in addition there are different, partly incompatible
DTDs (Document Type Definitions). Secondly,
the records should contain an optimal amount of information, split into
appropriate fields,
in order to make exhaustive searches with high precision possible.
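To illustrate the first point, the sketch below enforces a minimum delay between successive requests to the same host, one common politeness strategy. The delay value, and the structure as a whole, are illustrative assumptions rather than the policy of any particular robot.

import time
import urllib.request
from urllib.parse import urlparse

# Minimal politeness sketch: enforce a per-host minimum delay between
# requests. The 10-second figure is an illustrative assumption, not a
# value taken from any particular crawler.
MIN_DELAY = 10.0      # seconds between requests to one host
last_request = {}     # host -> time of the most recent request

def polite_fetch(url):
    host = urlparse(url).netloc
    elapsed = time.time() - last_request.get(host, 0.0)
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)  # back off before revisiting
    last_request[host] = time.time()
    with urllib.request.urlopen(url) as response:
        return response.read()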
There are several instruments available for achieving these goals,
and we will devote this and the subsequent sections to a description
of what is known about how this is achieved in different implementations.
However, for the reasons described above,
little is actually known.
In this report we will only skim over the fourth class of problems:
suffice it to say that there are a number of software packages available,
APIs as well as stand-alone programs, for the parsing of documents.
The record structures used in the major search services have been dealt with
earlier in this report.
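As a sketch of what such a package does, the fragment below uses a tolerant parser from the Python standard library to reduce a (possibly malformed) document to a simple record; the choice of fields is an illustrative assumption.

from html.parser import HTMLParser

# Sketch of a lenient document parser that extracts the title and the
# link anchors into a simple record. html.parser is forgiving of broken
# markup; the record fields chosen here are illustrative assumptions.
class RecordBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.record = {"title": "", "links": []}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.record["links"].append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.record["title"] += data

builder = RecordBuilder()
builder.feed("<html><title>Example</title><a href='/next'>next</a>")
print(builder.record)  # {'title': 'Example', 'links': ['/next']}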
The production of good index data does not depend only
on harvesting robot design.
Some collaboration from Web-masters is needed,
both to prevent robots from retrieving and indexing material that is of
little general interest and to guide them to valuable information.
Until recently, the only means Web-masters have had at their disposal
for communicating with robots was the
robots.txt file,
the so-called robot exclusion standard,
which uses a simple text format that is easy to parse.
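Honoring the file takes little effort on the robot's side. As a sketch, the Python standard library ships a parser for the exclusion standard; the host and the user-agent name below are placeholder assumptions.

from urllib.robotparser import RobotFileParser

# Sketch: consult the robot exclusion standard before fetching a page.
# The host and the user-agent name "MyRobot" are placeholders.
rp = RobotFileParser("http://www.example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

if rp.can_fetch("MyRobot", "http://www.example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")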
One of the criticisms raised against the standard
is that only Web-masters can use it, and that authors should also be
able to give directions to robots.
Mauldin et al. have recently proposed that the HTML
META tag could be used for this purpose
(http://www.kollar.com/robots.html).
The proposed tag has the form

<META NAME="ROBOTS" CONTENT="robots-terms">

where the content robots-terms is a list that may contain one or
more of the keywords ALL, NONE, INDEX, NOINDEX, FOLLOW, and NOFOLLOW.
Entries in the list should be separated by commas.
Quoting from the original proposal, the meanings of these keywords are:
- NONE: Tells all robots to ignore this page (equivalent to: NOINDEX, NOFOLLOW).
- ALL: There are no restrictions on indexing this page, or following links from this page to determine pages to index (equivalent to: INDEX, FOLLOW).
- INDEX: All robots are welcome to include this page in search services.
- NOINDEX: This page may not be indexed by a search service.
- FOLLOW: Robots are welcome to follow links from this page to find other pages.
- NOFOLLOW: Robots are not to follow links from this page.
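A minimal sketch of how a robot might reduce these terms to a pair of decisions, assuming the keyword semantics quoted above (the helper and its regular expression are illustrative, not part of the proposal):

import re

# Hypothetical helper: extract the ROBOTS META directives from a page
# and reduce them to (index, follow) flags per the quoted keywords.
def robot_directives(html):
    index, follow = True, True  # default: ALL
    match = re.search(
        r'<META\s+NAME="ROBOTS"\s+CONTENT="([^"]*)"', html, re.IGNORECASE)
    if match:
        terms = {t.strip().upper() for t in match.group(1).split(",")}
        if "NONE" in terms or "NOINDEX" in terms:
            index = False
        if "NONE" in terms or "NOFOLLOW" in terms:
            follow = False
    return index, follow

print(robot_directives('<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">'))
# -> (False, True)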
In addition to these META name/content pairs the authors suggest
the addition of the following:
<META NAME="DOCUMENT-STATE" CONTENT="STATIC">
<META NAME="DOCUMENT-STATE" CONTENT="DYNAMIC">
A document whose state is dynamic is modified continuously;
its content is thus not suitable for indexing.
In the words of the proposal:
-
``The creator of a document may wish to tell a spider or a user bookmarking a page whether the content of the url is designed to
change from access to access (dynamic). In this manner the spider would not save the url, and the web browser may or may not
bookmark it. While this information could be conveyed in the /robots.txt file, it is possible that the user may not have the ability to
modify it.''
Mauldin et al. also propose that authors might need
a way to direct robots to an alternative URL pointing to a static description
of the resource, a good idea for pages with dynamic content.
Again they suggest the use of the META tag:
<META NAME="URL" CONTENT="absolute url">
where ``absolute url'' refers to the alternative page.
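Taken together, the two proposals let a robot skip volatile content and index a stable substitute instead. A sketch under the same illustrative assumptions as above:

import re

# Hypothetical helper: read a named META tag's CONTENT value, if present.
def meta_content(html, name):
    match = re.search(
        r'<META\s+NAME="%s"\s+CONTENT="([^"]*)"' % name, html, re.IGNORECASE)
    return match.group(1) if match else None

def index_target(html, url):
    """Return the URL a robot should index for this page."""
    if (meta_content(html, "DOCUMENT-STATE") or "STATIC").upper() == "DYNAMIC":
        # Dynamic content: index the static alternative instead, if given.
        return meta_content(html, "URL")
    return url

page = ('<META NAME="DOCUMENT-STATE" CONTENT="DYNAMIC">'
        '<META NAME="URL" CONTENT="http://www.example.com/static.html">')
print(index_target(page, "http://www.example.com/live.html"))
# -> http://www.example.com/static.html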