Robots.txt


Robots.txt is a text file used to instruct search engine robots on how to crawl the pages of a website. Webmasters use this file to tell web robots which parts of a site they may visit. It is part of the Robots Exclusion Protocol (REP), a group of web standards that govern how robots crawl the web and access and index content so that search engines can serve it to users. In practice, a robots.txt file allows or disallows particular web-crawling software from accessing specific parts of a website. The file must be located in the top-level directory of the website. Its crawl instructions specify which user agents may crawl which parts of the site.
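Because the file has to sit in the top-level directory, its address can be worked out from any page URL on the same site. The following is a minimal sketch in Python; the function name robots_url and the example.com address are illustrative assumptions, not part of any standard.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that hosts page_url.

    robots_url is a hypothetical helper name used only for this sketch.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host, drop the page path, query, and fragment:
    # the file is expected at the root of the site.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.example.com/blog/post?id=7"))
# prints https://www.example.com/robots.txt
```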

The basic format of a robots.txt file looks like this:

User-agent: [user-agent name]

Disallow/Allow: [URL path that must not, or may, be crawled]

Together, the two lines above form a complete robots.txt file. Site owners can create files with several such groups of lines when they need more complex rules.
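To see how such rules are interpreted, here is a small sketch that uses the urllib.robotparser module from Python's standard library. The rule text, the bot names ExampleBot and OtherBot, and the example.com addresses are all made up for illustration. Real crawlers may also resolve edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with two groups of rules:
# ExampleBot is kept out of the whole site, every other
# robot is kept out of /private/ only.
RULES = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# ExampleBot matches the first group, so nothing is allowed.
print(parser.can_fetch("ExampleBot", "https://www.example.com/index.html"))  # False

# OtherBot falls back to the "*" group.
print(parser.can_fetch("OtherBot", "https://www.example.com/index.html"))  # True
print(parser.can_fetch("OtherBot", "https://www.example.com/private/report.html"))  # False
```

The rules are passed in as text here so the sketch runs without network access. To load a live file, the same class offers set_url() and read().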

How does it work?

Search engines have two primary duties:

  1. They crawl the web to find content.

  2. They index that content so that it can be served to users who search for related topics through keywords.

Search engines follow links to get from one site to another. In doing so, they crawl across billions of websites and links, a process called spidering. Crawling begins when a bot arrives at a website. Before it spiders the site, the bot looks for a robots.txt file so that it can take instructions from the site owner. If the bot finds one, it reads the file and follows those instructions. Malware robots and other malicious crawlers, however, may simply ignore robots.txt, and the file itself is publicly readable; that is why website owners should not include any sensitive information in it.
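The order of operations described above (read robots.txt first, then follow links) can be sketched as a toy crawler. This is only an illustration under stated assumptions: the site is a hypothetical in-memory dictionary rather than real web pages, and the names SITE, ROBOTS_TXT, crawl, and ExampleBot are invented for the example.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical site: each page maps to the links found on it.
SITE = {
    "https://www.example.com/": [
        "https://www.example.com/about",
        "https://www.example.com/admin/",
    ],
    "https://www.example.com/about": ["https://www.example.com/"],
    "https://www.example.com/admin/": ["https://www.example.com/admin/users"],
    "https://www.example.com/admin/users": [],
}

# Hypothetical robots.txt for that site.
ROBOTS_TXT = "User-agent: *\nDisallow: /admin/\n"


def crawl(start, user_agent="ExampleBot"):
    """Return the pages a well-behaved bot would visit, in order."""
    # Step 1: read the site's instructions before spidering.
    rules = RobotFileParser()
    rules.parse(ROBOTS_TXT.splitlines())

    # Step 2: follow links breadth-first, skipping disallowed pages.
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if not rules.can_fetch(user_agent, url):
            continue  # the site owner asked robots to stay out
        visited.append(url)
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("https://www.example.com/"))
# ['https://www.example.com/', 'https://www.example.com/about']
```

Nothing in this sketch forces a robot to perform the check; a crawler that skips the can_fetch() call would reach the /admin/ pages anyway, which is why robots.txt should not be relied on to protect sensitive content.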

