Study: AltaVista's Scooter
The AltaVista Method
- AltaVista's Web Robot, Scooter, made its first crawl on July 4, 1995
- AltaVista claims that Scooter is the fastest Web robot in existence.
- Using computers working in parallel, Scooter harvests approximately 6 million
pages per day
- Scooter obeys the Standard for Robot Exclusion
- Scooter covers HTTP server content and USENET
- Scooter is international in scope
- Scooter harvests the full text of pages
- Scooter is "widely recognized as one of the most sophisticated digital beings
of its kind." -- Levins
- See the Detailed Specifications of Scooter.
- Scooter gathers URLs and sends them to Fetch
- Fetch grabs a copy of the site and sends the data to Vista
- Vista indexes the data
- Results are retrieved by users with the AltaVista GUI
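
The Scooter, Fetch, Vista hand-off described above can be pictured with a small sketch. It is only an illustration under stated assumptions, not AltaVista's actual code: the in-memory FAKE_WEB dictionary, the ROBOTS_TXT rules, and the function names scooter, fetch, vista and search are all hypothetical. The exclusion check uses Python's standard urllib.robotparser module.

```python
from collections import defaultdict
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/a.html": "Scooter crawls the Web",
    "http://example.com/b.html": "Vista indexes every word",
    "http://example.com/private/c.html": "not for robots",
}

# Hypothetical exclusion rules in robots.txt form.
ROBOTS_TXT = ["User-agent: *", "Disallow: /private/"]


def scooter(candidate_urls, robots_lines, agent="Scooter"):
    """Gather URLs, dropping any that the exclusion rules disallow."""
    rules = RobotFileParser()
    rules.parse(robots_lines)
    return [url for url in candidate_urls if rules.can_fetch(agent, url)]


def fetch(urls, web):
    """Grab a copy of each page and pass (url, text) pairs along."""
    return [(url, web[url]) for url in urls]


def vista(pages):
    """Index the full text: map each lowercased word to the URLs containing it."""
    index = defaultdict(set)
    for url, text in pages:
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, word):
    """Stand-in for the user-facing query step."""
    return sorted(index.get(word.lower(), set()))


urls = scooter(sorted(FAKE_WEB), ROBOTS_TXT)
index = vista(fetch(urls, FAKE_WEB))
print(search(index, "word"))
```

Each stage only consumes the previous stage's output, which is the property that lets the real stages run on separate machines in parallel. The page under /private/ never reaches the indexer because the gathering stage filters it out first.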
Vista (the indexer)
Vista's ranking algorithm
- Vista - the indexer - is also known as Ni2 (Net indexer 2)
- Vista began as a way to index e-mail, then evolved to indexing
newsgroups (classified by keyword only)
- Digital - the maker of AltaVista - estimated that all the text on the
Web amounted to about 1 terabyte (1000 GB), and decided that its servers
and full-text indexing software could handle that amount of data. It then
set to work making Vista able to index every single
word on the Web.
- Vista can index 1 GB of text per hour.
- It generates a link for each word that Scooter brings back.
- It eliminates duplicates and ranks entries so that queries will produce
more effective results.
- Documents score higher (as more relevant) if search terms
appear in the first few words of the document, especially the title
- Documents score higher if search terms appear close to one
another in the document
- Documents score higher if they contain more of the search
terms than other documents
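
The three ranking criteria above can be combined into a toy scoring function. This is a minimal sketch under stated assumptions, not Vista's actual algorithm: the function name score, the early_window parameter, and all of the weights are hypothetical, and tokenization is a plain whitespace split.

```python
def score(title, body, query, early_window=10):
    """Toy relevance score; higher means more relevant. Weights are assumptions."""
    terms = {t.lower() for t in query.split()}
    title_words = title.lower().split()
    words = title_words + body.lower().split()

    # Positions in the document (title first) where any query term occurs.
    positions = [i for i, w in enumerate(words) if w in terms]
    matched = {words[i] for i in positions}

    # Criterion 1: terms in the title and in the first few words.
    title_hits = sum(1 for w in title_words if w in terms)
    early_hits = sum(1 for i in positions if i < early_window)

    # Criterion 2: terms close to one another. Find the smallest gap
    # between occurrences of two different query terms.
    best_gap = None
    for a, b in zip(positions, positions[1:]):
        if words[a] != words[b]:
            gap = b - a
            best_gap = gap if best_gap is None else min(best_gap, gap)
    proximity = 1.0 / best_gap if best_gap is not None else 0.0

    # Criterion 3: how many of the distinct query terms the document contains.
    coverage = len(matched) / len(terms) if terms else 0.0

    return 2.0 * title_hits + 1.0 * early_hits + 1.0 * proximity + 3.0 * coverage


doc_a = ("AltaVista search engine", "Scooter is the web robot behind the index")
doc_b = ("Cooking notes",
         "a long recipe that mentions a search once and much later an unrelated engine")
doc_c = ("Gardening", "roses and tulips")

query = "search engine"
ranked = sorted([doc_a, doc_b, doc_c], key=lambda d: score(d[0], d[1], query), reverse=True)
for title, _ in ranked:
    print(title)
```

Both doc_a and doc_b contain both query terms, but doc_a has them in the title and adjacent to each other, so it scores higher; doc_c contains neither term and scores zero.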