The Standard for Robot Exclusion



In 1994, while the Web was still in its infancy, a group of Internet users who recognized the problems that unruly robots could cause got together by e-mail and developed an unofficial standard known as the Standard for Robot Exclusion (SRE). The SRE defines a protocol that permits site managers to exclude robots from designated areas of their Web sites.

One very pragmatic use for the SRE is to prevent robots run by search sites from indexing temporary HTML documents that probably won't be around this time tomorrow. Many Web sites (search sites especially) generate HTML documents on the fly to fulfill requests from users, then go back and delete those documents when they're no longer needed. Obviously, it does no one any good to have pages like these included in Web indexes. Another use for the SRE is to steer robots clear of pages that are under construction, or to instruct them to avoid your site altogether.

Using the SRE to exclude robots from all or part of a Web site is simplicity itself. All you do is create an HTTP-accessible file on the Web server and assign it the local URL /robots.txt. This is a plain-text file containing simple directives that spell out access policies for robots. SRE-compliant robots look for this file when they first connect to a server and abide by the access policies contained therein. If there is no robots.txt file, a robot may assume that it's welcome anywhere in your Web site. Note that the onus is on you, the site manager, to tell robots where they can and cannot go. Also be aware that not all robots are SRE-compliant, and adherence to the directives in robots.txt is strictly voluntary on the part of the robot. Fortunately, the vast majority of existing robots either support the SRE or are in the process of having SRE support added.

What does a robots.txt file look like? Here's a simple one that asks all robots to stay away from /tmp/documents and its subdirectories:

    # Sample robots.txt file
    User-agent: *
    Disallow: /tmp/documents/

The first line is a comment. You can place comments anywhere in a robots.txt file provided each is preceded by a pound sign (#). The second line designates the robots to which the access policies apply; "*" means all robots. The third line disallows access to any URL whose path begins with /tmp/documents/, which in effect covers that directory and everything below it in the hierarchy. You can include multiple Disallow lines to prohibit access to two or more directories.
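To make these parsing rules concrete, here is a minimal sketch, in Python, of how a compliant robot might interpret a robots.txt file. The function name is_allowed and its record-handling details are illustrative assumptions for this article, not part of any standard library:

```python
def is_allowed(robots_txt, user_agent, path):
    """Return True if user_agent may fetch path under this robots.txt text.

    A simplified, illustrative reading of the SRE: comments (introduced
    by '#') may appear anywhere, a record names one or more robots with
    User-agent lines, and each Disallow line blocks any path beginning
    with the given prefix.
    """
    records = {}       # agent name (lowercased) -> list of disallowed prefixes
    agents = []        # robots named by the current record's User-agent lines
    seen_rule = False  # has the current record already listed a Disallow?
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()     # drop comments and whitespace
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if seen_rule:                       # a new record begins here
                agents, seen_rule = [], False
            agents.append(value.lower())
        elif field == "disallow":
            seen_rule = True
            for agent in agents:
                prefixes = records.setdefault(agent, [])
                if value:                       # an empty Disallow allows everything
                    prefixes.append(value)
    # A record naming the robot specifically takes precedence over "*".
    rules = records.get(user_agent.lower(), records.get("*", []))
    return not any(path.startswith(prefix) for prefix in rules)

robots = """\
# Sample robots.txt file
User-agent: *
Disallow: /tmp/documents/
"""
print(is_allowed(robots, "WebCrawler", "/welcome.html"))           # True
print(is_allowed(robots, "WebCrawler", "/tmp/documents/old.html")) # False
```

A production robot would need to be more careful (URL-encoding, malformed lines, and so on), but the prefix-matching logic is the heart of the protocol.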

In some cases, it might be desirable to allow certain robots access to areas that are off limits to other robots. The following robots.txt file allows unrestricted site access to a robot named Hal-9000 but prohibits others from accessing either /tmp/documents or /under_construction. Note that each robot's record begins with a User-agent line and that records are separated by a blank line:

    # Sample robots.txt file
    User-agent: *
    Disallow: /tmp/documents/
    Disallow: /under_construction/

    User-agent: Hal-9000
    Disallow:
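Python's standard library happens to ship a parser for exactly this format (urllib.robotparser), which makes it easy to verify what a compliant robot would conclude from a file like the one above. The robot name WebCrawler below is just a stand-in for "any other robot":

```python
import urllib.robotparser

robots_txt = """\
# Sample robots.txt file
User-agent: *
Disallow: /tmp/documents/
Disallow: /under_construction/

User-agent: Hal-9000
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Hal-9000 has its own record with an empty Disallow, so it may go anywhere.
print(rp.can_fetch("Hal-9000", "/tmp/documents/page.html"))    # True
# Every other robot falls back to the catch-all "*" record.
print(rp.can_fetch("WebCrawler", "/tmp/documents/page.html"))  # False
print(rp.can_fetch("WebCrawler", "/welcome.html"))             # True
```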

To tell SRE-compliant robots to skip your site altogether, simply create a robots.txt file containing these two statements:

    User-agent: *
    Disallow: /

Upon seeing this, well-behaved robots immediately disconnect and go find another server.
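As a quick sanity check, the same standard-library parser confirms that this two-line file shuts a compliant robot out of every path on the server (the robot name AnyBot here is arbitrary):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# Every path begins with "/", so every request is disallowed.
print(rp.can_fetch("AnyBot", "/"))              # False
print(rp.can_fetch("AnyBot", "/welcome.html"))  # False
```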

