Internet Harvesting Agent Design
There are four fundamental classes of problems to be addressed
in the design of harvesting robots.
1. Harvesting robots should be nice.
They should neither waste network resources
nor overload the Internet servers from which the search engine obtains its
information (a minimal sketch of one such politeness strategy follows this list).
2. They should do so
while still maximizing performance on an Internet that is growing at
an ever-increasing rate.
3. Robot software should maintain the integrity of the database,
i.e. it should
ensure that records in the database are updated when needed and are
removed when resources disappear.
We will not dwell upon all these gory details here.
4. The robot, or possibly some auxiliary software, should parse the retrieved
documents to produce database records.
This is a complicated task. First, HTML is a moving target and, although
it is in principle easy to parse, it is in reality a programmer's nightmare:
the proportion of documents with unpredictable, erroneous syntax is high,
and in addition there are different, partly incompatible
DTDs (Document Type Definitions). Secondly,
the records should contain an optimal amount of information, split into
appropriate fields,
in order to make exhaustive searches with high precision possible.
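To illustrate the first point, the sketch below enforces a minimum delay between successive requests to the same host, one common politeness strategy. The delay value, and the structure as a whole, are illustrative assumptions rather than the policy of any particular robot.

import time
import urllib.request
from urllib.parse import urlparse

# Minimal politeness sketch: enforce a per-host minimum delay between
# requests. The 10-second figure is an illustrative assumption, not a
# value taken from any particular crawler.
MIN_DELAY = 10.0      # seconds between requests to one host
last_request = {}     # host -> time of the most recent request

def polite_fetch(url):
    host = urlparse(url).netloc
    elapsed = time.time() - last_request.get(host, 0.0)
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)  # back off before revisiting
    last_request[host] = time.time()
    with urllib.request.urlopen(url) as response:
        return response.read()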
There are several instruments available for achieving these goals,
and we will devote this and the subsequent sections to a description
of what is known about how this is achieved in different implementations.
However, for the reasons described above,
little is actually known.
In this report we will only skim over the fourth class of problems:
suffice it to say that there are a number of software packages available,
APIs as well as stand-alone programs, for the parsing of documents.
The record structures used in the major search services have been dealt with
earlier in this report.
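As a sketch of what such a package does, the fragment below uses a tolerant parser from the Python standard library to reduce a (possibly malformed) document to a simple record; the choice of fields is an illustrative assumption.

from html.parser import HTMLParser

# Sketch of a lenient document parser that extracts the title and the
# link anchors into a simple record. html.parser is forgiving of broken
# markup; the record fields chosen here are illustrative assumptions.
class RecordBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.record = {"title": "", "links": []}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.record["links"].append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.record["title"] += data

builder = RecordBuilder()
builder.feed("<html><title>Example</title><a href='/next'>next</a>")
print(builder.record)  # {'title': 'Example', 'links': ['/next']}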
The production of good index data does not depend only
on harvesting robot design.
Some collaboration from Web-masters is needed,
both to prevent robots from retrieving and indexing material that is of
little general interest and to guide them to valuable information.
Until recently, the only means Web-masters have had at their disposal
for communicating with robots was the
robots.txt file,
the so-called robot exclusion standard,
which uses a simple text format that is easy to parse.
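Honoring the file takes little effort on the robot's side. As a sketch, the Python standard library ships a parser for the exclusion standard; the host and the user-agent name below are placeholder assumptions.

from urllib.robotparser import RobotFileParser

# Sketch: consult the robot exclusion standard before fetching a page.
# The host and the user-agent name "MyRobot" are placeholders.
rp = RobotFileParser("http://www.example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

if rp.can_fetch("MyRobot", "http://www.example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")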
One of the criticisms raised against the standard
is that only Web-masters can use it, and that authors should also be
able to give directions to robots.
Mauldin et al. have recently proposed that the HTML
META tag could be used for this purpose
(http://www.kollar.com/robots.html).
The proposed tag has the form

<META NAME="ROBOTS" CONTENT="robots-terms">

where the content robots-terms is a list that may contain one or
more of the keywords ALL, NONE, INDEX, NOINDEX, FOLLOW, and NOFOLLOW.
Entries in the list should be separated by commas.
Quoting from the original proposal, the meanings of these keywords are:
- NONE: Tells all robots to ignore this page (equivalent to: NOINDEX, NOFOLLOW).
- ALL: There are no restrictions on indexing this page, or following links from this page to determine pages to index (equivalent to: INDEX, FOLLOW).
- INDEX: All robots are welcome to include this page in search services.
- NOINDEX: This page may not be indexed by a search service.
- FOLLOW: Robots are welcome to follow links from this page to find other pages.
- NOFOLLOW: Robots are not to follow links from this page.
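A minimal sketch of how a robot might reduce these terms to a pair of decisions, assuming the keyword semantics quoted above (the helper and its regular expression are illustrative, not part of the proposal):

import re

# Hypothetical helper: extract the ROBOTS META directives from a page
# and reduce them to (index, follow) flags per the quoted keywords.
def robot_directives(html):
    index, follow = True, True  # default: ALL
    match = re.search(
        r'<META\s+NAME="ROBOTS"\s+CONTENT="([^"]*)"', html, re.IGNORECASE)
    if match:
        terms = {t.strip().upper() for t in match.group(1).split(",")}
        if "NONE" in terms or "NOINDEX" in terms:
            index = False
        if "NONE" in terms or "NOFOLLOW" in terms:
            follow = False
    return index, follow

print(robot_directives('<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">'))
# -> (False, True)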
In addition to these META name/content pairs the authors suggest
the addition of the following:
<META NAME="DOCUMENT-STATE" CONTENT="STATIC">
<META NAME="DOCUMENT-STATE" CONTENT="DYNAMIC">
A document whose state is dynamic is modified continuously;
its content is thus not suitable for indexing.
In the words of the proposal:
-
``The creator of a document may wish to tell a spider or a user bookmarking a page whether the content of the url is designed to
change from access to access (dynamic). In this manner the spider would not save the url, and the web browser may or may not
bookmark it. While this information could be conveyed in the /robots.txt file, it is possible that the user may not have the ability to
modify it.''
Mauldin et al. also propose that authors might need
a way to direct robots to an alternative URL pointing to a static description
of the resource, a good idea for pages with dynamic content.
Again they suggest the use of the META tag:
<META NAME="URL" CONTENT="absolute url">
where ``absolute url'' refers to the alternative page.
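Taken together, the two proposals let a robot skip volatile content and index a stable substitute instead. A sketch under the same illustrative assumptions as above:

import re

# Hypothetical helper: read a named META tag's CONTENT value, if present.
def meta_content(html, name):
    match = re.search(
        r'<META\s+NAME="%s"\s+CONTENT="([^"]*)"' % name, html, re.IGNORECASE)
    return match.group(1) if match else None

def index_target(html, url):
    """Return the URL a robot should index for this page."""
    if (meta_content(html, "DOCUMENT-STATE") or "STATIC").upper() == "DYNAMIC":
        # Dynamic content: index the static alternative instead, if given.
        return meta_content(html, "URL")
    return url

page = ('<META NAME="DOCUMENT-STATE" CONTENT="DYNAMIC">'
        '<META NAME="URL" CONTENT="http://www.example.com/static.html">')
print(index_target(page, "http://www.example.com/live.html"))
# -> http://www.example.com/static.html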