How Web Robots Work


To understand how a Web robot works, it first helps to understand how a browser works. In essence, a browser is simply a program that responds to user input by sending Hypertext Transfer Protocol (HTTP) commands over the Internet to retrieve Web pages and display them on the computer screen. Web pages are actually HyperText Markup Language (HTML) files containing text, formatting codes, and other information defining content. "Other information" typically includes the addresses of files containing bitmapped images and hypertext links to other HTML pages.

Clicking a hypertext link to a remote Web page initiates a series of actions. First, the browser retrieves the Uniform Resource Locator (URL) of the destination from the current page. Then it establishes a connection to the remote server, issues an HTTP GET command to retrieve the HTML document, and displays the document on the screen. Normally, the text of the document is displayed first. Images are stored under separate URLs so they can be downloaded separately; references to them are embedded in the document's text, telling the browser both where to find each image and where to place it on the page.
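
To make the sequence concrete, here is a minimal sketch in Python of the GET step described above, using the standard urllib module. The URL is only an illustrative placeholder, not a destination the article refers to.

    # Minimal sketch: fetch one HTML document the way a browser would,
    # by issuing an HTTP GET for the page's URL.
    from urllib.request import urlopen

    url = "http://www.example.com/"                  # hypothetical destination page
    with urlopen(url) as response:                   # connect to the server and issue the GET
        html = response.read().decode("utf-8", errors="replace")

    # The raw HTML text arrives first; any images it references are stored
    # under separate URLs and would be downloaded in separate requests.
    print(html[:200])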

A Web robot is a browser with an autopilot. Instead of waiting for a user to click a hypertext link, the robot downloads a page from the Web, scans it for links to other sites, picks a URL, and jumps. At the new site it begins the process again. When it encounters a page that contains no links, it backs up a level or two and follows one of the links it skipped earlier. Once it's turned loose, a robot that employs a simple recursive navigation algorithm can cover vast distances in cyberspace, and, because the Web changes daily, the robot's path can change daily, too. In a conceptual sense, the robot "crawls" the Web like a spider. All it needs is a place to start.
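
Here is a rough Python sketch of that recursive crawl. The class and function names, the link filter, and the depth limit are illustrative assumptions, not the workings of any particular robot.

    # A minimal sketch of the recursive "crawl": download a page, pull out its
    # links, pick one, and jump; repeat until there is nothing left to follow.
    from urllib.request import urlopen
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collects the href targets of <a> tags as a page is parsed."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value and value.startswith("http"):
                        self.links.append(value)

    def crawl(url, depth=2):
        if depth == 0:                               # arbitrary limit so the example stops
            return
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            return                                   # unreachable page: back up and try another link
        collector = LinkCollector()
        collector.feed(html)
        print(url, "->", len(collector.links), "links")
        for link in collector.links:
            crawl(link, depth - 1)                   # jump to the next site and repeat

    crawl("http://www.example.com/")                 # all it needs is a place to start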

What does a robot do with the pages it downloads? That depends on why the robot was created in the first place. Robots that index the Web typically use proprietary algorithms to generate document abstracts that go into a huge database of Web sites and their contents. They also build indexes correlating words and phrases found in HTML documents to URLs, so that, for example, someone using a search site can quickly get a list of all Web pages containing the word "Cozumel." One reason indexing robots can visit so many sites so fast is that they only have to grab HTML files. They don't have to download images, because the search databases can only index text.
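
As a rough illustration, the following Python sketch builds the kind of word-to-URL index described above. The sample pages and the crude tokenizer are hypothetical stand-ins for whatever text an indexing robot actually downloaded.

    # Build an index correlating words with the URLs of pages containing them.
    import re
    from collections import defaultdict

    pages = {                                        # hypothetical pages already fetched by the robot
        "http://example.com/travel": "Diving in Cozumel is popular with visitors.",
        "http://example.com/home":   "Welcome to my home page.",
    }

    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(url)                     # record that this word appears on this page

    print(sorted(index["cozumel"]))                  # -> ['http://example.com/travel']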

Not all robots were built to index Web pages. Some exist solely to identify links that point to pages that no longer exist. Others walk the Web to gather statistics on Web usage--for example, to determine which sites are the most popular by counting the number of references to them in other Web pages, or to tally the total number of Web pages as a way of measuring the Web's growth.
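
A link-checking robot of the first kind might reduce to something like the following Python sketch; the list of links is hypothetical, and a real robot would gather it by crawling.

    # Report links that point to pages that no longer exist (or can't be reached).
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError

    links = ["http://www.example.com/", "http://www.example.com/no-such-page"]

    for url in links:
        try:
            with urlopen(url, timeout=10):
                pass                                 # the page answered; nothing to report
        except HTTPError as err:
            print(f"{url} is dead (HTTP {err.code})")
        except URLError:
            print(f"{url} could not be reached")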

It's not difficult to imagine why robots can place a strain on a server's resources. When a robot visits a site, it can request a large number of documents in a short period of time: It's as if a hundred or more extra users logged in all at once. Some servers will handle it; others won't. That's one reason a set of informal protocols has emerged to describe "good" robot behavior. A well-behaved robot, for example, will space requests to a given server at least a minute apart so the server doesn't become overwhelmed. It will also identify itself so its owner can be contacted if problems arise, and perform internal checks on the URLs it visits to avoid redundant accesses.
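
Those three habits can be sketched in a few lines of Python. The robot name, contact address, and helper function below are assumptions chosen for illustration; only the one-minute spacing comes from the guideline above.

    # "Good behavior" sketch: identify the robot, space out requests to each
    # server, and skip URLs that have already been visited.
    import time
    from urllib.parse import urlparse
    from urllib.request import Request, urlopen

    DELAY = 60.0                                     # at least a minute between requests to one server
    last_visit = {}                                  # host -> time of the robot's last request there
    seen = set()                                     # URLs already fetched, to avoid redundant accesses

    def polite_fetch(url):
        if url in seen:
            return None                              # already fetched; don't bother the server again
        host = urlparse(url).netloc
        wait = DELAY - (time.time() - last_visit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)                         # space requests to this server apart
        request = Request(url, headers={
            "User-Agent": "ExampleRobot/1.0 (owner@example.com)"   # hypothetical identification
        })
        with urlopen(request, timeout=10) as response:
            html = response.read().decode("utf-8", errors="replace")
        last_visit[host] = time.time()
        seen.add(url)
        return html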


