Information Retrieval (IR), also called "Document Detection", generally
performs two functions: document search and document routing.
TIPSTER Generic IR (2000) defines these terms as follows:
- document search - the selection of documents from an existing collection
of documents.
- document routing - the dissemination of incoming documents to appropriate
users on the basis of user interest profiles.
In commercial use today, most
IR systems seem to concentrate on the first of those two functions.
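The difference between the two functions can be illustrated with a minimal sketch. It assumes the simplest possible matching rule (shared words), and the names `terms`, `search`, `route`, the sample collection, and the user profiles are all hypothetical, chosen only for illustration. Real systems use far more sophisticated matching.

```python
import re

def terms(text):
    """Return the set of lowercased words in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(query, collection):
    """Document search: select documents from an existing collection.

    Assumption: a document matches when it contains every query word.
    """
    wanted = terms(query)
    return [doc_id for doc_id, text in collection.items()
            if wanted <= terms(text)]

def route(incoming_text, profiles):
    """Document routing: send one incoming document to interested users.

    Assumption: a profile is a string of interest words, and a user is
    interested when the document shares at least one word with it.
    """
    doc_terms = terms(incoming_text)
    return [user for user, interests in profiles.items()
            if terms(interests) & doc_terms]

# Search: the collection is fixed, the query varies.
collection = {
    "a": "Tea prices in China",
    "b": "Coffee harvest in Brazil",
}
print(search("tea China", collection))

# Routing: the profiles are fixed, the documents arrive over time.
profiles = {
    "alice": "tea coffee",
    "bob": "steel",
}
print(route("Brazil coffee harvest falls", profiles))
```

In search, the collection stays put and queries come and go. In routing, the standing profiles play the role of queries and each new document is tested against them.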
One of the greatest
limitations of IR is that most commercially available technologies currently
rely heavily on keyword searches. This can be very frustrating for the user,
who has to work out which keywords will work best for a particular search.
It is also difficult because people do not naturally think in terms of "keywords"
when they look for information.
Some systems try to alleviate this
problem by using a natural language interface that allows a user to input a normally-worded
question (such as "what is the price of tea in China?"). The system then processes this
query and essentially converts it into a keyword search by removing question words,
prepositions, articles, and similar function words, such as "what", "the", "is", "of", and "in". The words
remaining are considered keywords, and the query submitted might look like "price tea
China". Documents that contain these words will then be retrieved for the user.
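The conversion described above can be sketched in a few lines. This is only an illustration under stated assumptions: the stopword list is a small hand-picked set, the function names `to_keywords` and `retrieve` are hypothetical, and the three sample documents are invented. It does not reproduce the method of any particular commercial system.

```python
import re

# Assumed stopword list for this sketch: question words, prepositions,
# articles, and a few other function words.
STOPWORDS = {"what", "is", "the", "of", "in", "a", "an", "for", "to", "on"}

def to_keywords(question):
    """Strip punctuation and stopwords, keeping the remaining words in order."""
    tokens = re.findall(r"[A-Za-z0-9]+", question)
    return [t for t in tokens if t.lower() not in STOPWORDS]

def retrieve(keywords, documents):
    """Return ids of documents that contain every keyword (case-insensitive)."""
    wanted = {k.lower() for k in keywords}
    hits = []
    for doc_id, text in documents.items():
        words = {w.lower() for w in re.findall(r"[A-Za-z0-9]+", text)}
        if wanted <= words:
            hits.append(doc_id)
    return hits

keywords = to_keywords("what is the price of tea in China?")
print(" ".join(keywords))  # price tea China

# Invented sample documents, for illustration only.
documents = {
    "d1": "The price of tea in China rose last year.",
    "d2": "China exports tea to many countries.",
    "d3": "Price guide to antique tea tins made in China.",
}
print(retrieve(keywords, documents))  # ['d1', 'd3']
```

Note that `d3` is retrieved because it contains all three keywords, even though it is about antique tins rather than the cost of tea. Matching on words alone says nothing about what the user meant.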
Unfortunately, as anyone who has
performed a keyword search on a commercial search engine such as AltaVista or InfoSeek
can tell you, the documents returned are often of questionable relevance to the
user's original query.
Current research in this area is
primarily concerned with improving the accuracy of the process and the relevance of
the documents returned. Formal evaluations, in the form of the Text REtrieval Conferences
(TRECs 1 through 9), have demonstrated that "Retrieval system effectiveness has
approximately doubled in the seven years since TREC-1"
(NIST 2000). However, there is still much progress to be made.
(TREC-7, TREC-8, and TREC-9 evaluations)
TREC-9 URL corrected, 21 February 2001. EJB