Preventing Search Engine Access with /robots.txt
The /robots.txt file is simply a text file in your site's root directory that tells all “good” robots (the ones that aren’t harvesting email addresses or engaged in other spammy pursuits) which parts of your site not to crawl. By default, all content that is accessible to a search engine spider is considered fair game. Because of this, a /robots.txt file is only needed when you want to keep the search engines out of particular parts of your site.
It should also be noted that following the directives of the /robots.txt file is voluntary. Although all major search engines comply with robots.txt, there are plenty of spammy spiders that won’t. For that reason, the only sure-fire way to protect content is to put the sensitive areas behind password protection.
If you have something that you want to keep search engine spiders out of while still allowing unrestricted surfer access, the best practice is to create a /robots.txt file in your root directory. That means the URL to it should look like this:
http://www.yourwebsite.com/robots.txt
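Because the file always sits at the root, its address can be worked out from any page URL on the same site. Here is a minimal Python sketch using the standard library's urljoin; the helper name robots_url and the page URL are made up for illustration:

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the /robots.txt URL for the site that serves page_url."""
    # The leading slash makes urljoin drop the page's own path and
    # resolve against the site root instead.
    return urljoin(page_url, "/robots.txt")


# Hypothetical page URL, used only for illustration.
print(robots_url("http://www.yourwebsite.com/blog/some-post.html"))
```

For the sample page above, this should print http://www.yourwebsite.com/robots.txt.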
The following are examples of what should be contained in a /robots.txt file:
If you want to exclude ALL robots from your ENTIRE site:
User-agent: *
Disallow: /
If you want to exclude ALL robots from some content on your site:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
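One way to sanity-check rules like these before uploading them is to run them through a parser that behaves like a well-behaved robot. The sketch below uses Python's standard urllib.robotparser module. The helper name is_allowed, the bot name AnyBot, and the page paths are assumptions made up for the example:

```python
from urllib.robotparser import RobotFileParser

# The partial-exclusion rules, one directive per line.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
"""


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Report whether a compliant robot named `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


site = "http://www.yourwebsite.com"
# A page inside a disallowed directory.
print(is_allowed(RULES, "AnyBot", site + "/tmp/cache.html"))
# A page outside every disallowed directory.
print(is_allowed(RULES, "AnyBot", site + "/index.html"))
```

With these rules the first check should come back False and the second True. Keep in mind that this only models a robot that chooses to obey the file.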
If you want to disallow a single robot:
User-agent: EvilRobot
Disallow: /
If you want to let ONE robot in:
User-agent: GoogleBot
Disallow:

User-agent: *
Disallow: /
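The "let one robot in" case relies on two details: an empty Disallow value means nothing is disallowed, and the * record only applies to robots that did not match a more specific record. A small Python sketch with the standard urllib.robotparser module can confirm how a compliant parser reads it; SomeOtherBot and the page URL are hypothetical names chosen for the example:

```python
from urllib.robotparser import RobotFileParser

# Two records separated by a blank line: one for GoogleBot, one for everyone else.
ONE_ROBOT_IN = """\
User-agent: GoogleBot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ONE_ROBOT_IN.splitlines())

page = "http://www.yourwebsite.com/index.html"
# Ask the parser what each robot is permitted to do with the same page.
results = {
    agent: parser.can_fetch(agent, page)
    for agent in ("GoogleBot", "SomeOtherBot")
}
print(results)
```

GoogleBot should be reported as allowed and SomeOtherBot as blocked, since the latter falls through to the catch-all record.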
The User-agent field identifies the bot you would like to allow or deny access. The * character is a wildcard that stands for “any” robot. Please note that if a part of your site is private, it should be password protected even if you use /robots.txt to disallow crawling of it. /robots.txt is NOT enforceable: legitimate search engine spiders will follow its directives, but not all bots are benign.
It’s also important to know that excluding content from being crawled does not keep it out of search results: the results can still contain the URLs of that content. This is because the spiders may have spotted links to your content on other sites, with anchor text suggesting that your uncrawled URL is a good match for a search term. Below is a great video from Matt Cutts showing how that works.




