Evaluations of IE systems have been going on since the first Message
Understanding Conference was conducted by Beth Sundheim for the Navy
in 1987 (Chinchor 2000). Beginning in 1990, the MUC-3 conference was sponsored
by DARPA. A description of
the MUC-3 evaluation is available in
Chinchor et al. (1993). More recently,
the 1998 Hub-4 Broadcast News and Hub-5 Large Vocabulary Conversational Speech
Recognition evaluations of Speech Transcription have included subtasks
involving IE
(http://www.nist.gov/speech/tests/index.htm).
In the Hub-4 Broadcast News
evaluation, the IE task used was called Named Entity (NE). The purpose of
the NE task "is to identify named expressions in broadcast news transcriptions"
(Robinson et al. 1999). These named
expressions included most proper names (such
as names of persons, locations, and organizations) as well as certain numerical
expressions (including monetary amounts, time expressions, and percents)
(Chinchor et al. 1998).
The transcriptions were made
for the most part by automatic speech recognition (ASR) systems, and often
included various types of errors.
The inclusion of ASR errors
made the task somewhat more difficult than it had been during the MUC series of
evaluations, since those evaluations used text as the source material. However,
systems in the Broadcast News evaluation performed remarkably well. The best system
in the 1999 Broadcast News evaluation received an F-score of 91, while the
best system in the MUC-7 evaluation received an F-score of 93 (Robinson et al.
1999).
Robinson et al. (1999) also
includes brief examples of "noisy speech transcriptions" followed by the same
noisy transcriptions with Named Entity tags included.
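Schematically, NE annotation of this kind wraps spans of the transcript in SGML-style tags such as ENAMEX (names, with a TYPE attribute), TIMEX (time expressions), and NUMEX (numerical expressions). A minimal sketch, using an invented sentence rather than actual evaluation data:

```python
import re

# Invented example transcript (lowercase, as ASR output typically was);
# not taken from the actual Hub-4 evaluation data.
plain = "president clinton met executives of i b m in new york on tuesday"

# MUC/Hub-4 style SGML annotation: ENAMEX for names, TIMEX for time
# expressions (NUMEX, not shown, covers monetary amounts and percents).
annotated = (
    'president <ENAMEX TYPE="PERSON">clinton</ENAMEX> met executives of '
    '<ENAMEX TYPE="ORGANIZATION">i b m</ENAMEX> in '
    '<ENAMEX TYPE="LOCATION">new york</ENAMEX> on '
    '<TIMEX TYPE="DATE">tuesday</TIMEX>'
)

# Stripping the tags recovers the underlying transcript unchanged,
# which is what lets system output be scored against the reference.
stripped = re.sub(r'</?(?:ENAMEX|TIMEX|NUMEX)[^>]*>', '', annotated)
assert stripped == plain
```

Note that the annotation is purely additive: removing the tags yields the original transcription, including any ASR errors it contains.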
In 1999, at the Broadcast News
evaluation conference, a new task was proposed to replace the
difficult Scenario Template (ST) task, which was last performed in the MUC-7 evaluation.
The best score achieved on the ST task was 51% accuracy (Chinchor 2000). In the
proposed replacement task, dubbed the "Event99" task, the annotators were
able to achieve over 80% accuracy, indicating that the task was likely defined
clearly enough to be usable in a formal evaluation
(Hirschman et al. 1999).
Unfortunately, the Broadcast News evaluation moved in a different direction
for 2000, and the Event99 task has not yet been formally evaluated.
This page last modified November 13, 2006 by Erica Brown.
http://www.oocities.org/ejb_wd/IE-intro.html
© 2000-2006, Erica Jean Lindsey Brown, All rights reserved