Definition

The Robots Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention used to limit the impact of automated web crawlers (spiders) on a web server. Well-behaved crawlers consult a site's robots.txt file before retrieving pages and visit only the pages that the file permits.

Overview

Web administrators who wish to limit bots’ actions on their Web server need to create a plain text file named “robots.txt.” The file must always have this name, and it must reside in the Web server’s root document directory. In addition, only one file is allowed per Web site. Note that the robots.txt file is a standard that is voluntarily supported by bot programmers, so malicious bots . . . often ignore this file.
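Because the file has a fixed name and sits in the root document directory, a crawler can work out where to look from any page address on the site. The following is a minimal sketch of that lookup in Python; the function name robots_url and the example.com address are illustrative assumptions, not part of the standard.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the address of robots.txt for the site that serves page_url.

    Hypothetical helper: keeps the scheme and host of the page and
    replaces the path with /robots.txt, dropping any query or fragment.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Any page on the site maps to the same single robots.txt location.
location = robots_url("https://www.example.com/docs/page.html?x=1")
print(location)
```

Every page on the same host resolves to the same location, which matches the rule that only one file is allowed per site.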

The robots.txt file is a simple text file that contains some keywords and file specifications. Each line of the file is either blank or consists of a single keyword and its related information. The keywords are used to tell robots which portions of a Web site are excluded.[1]
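As an illustration of the keyword-per-line layout, the sketch below parses a small sample file with Python's standard urllib.robotparser module and asks whether two pages may be fetched. The sample rules, the crawler name ExampleBot, and the example.com addresses are assumptions made for the example. User-agent and Disallow are the two keywords most commonly seen in such files.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content: one keyword and its value per line.
SAMPLE = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content supplied as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(SAMPLE)

# A well-behaved crawler checks each address before retrieving it.
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/private/data.html")
print(allowed, blocked)
```

In this sample the rules apply to every crawler because the User-agent line uses the wildcard, so the page under /private/ is excluded while the home page is not. Nothing enforces the answer; a crawler that skips the check can still request the excluded page.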

References

  1. NIST Special Publication 800-44, at 5-7.

