Superior Website Development and Design

Edward Rayne is a superior web development and design business in Topeka, Kansas, with over 16 years of professional experience building modern, accessible websites. We specialize in solutions that give small businesses a successful, productive web presence.

Preventing Search Engine access with /robots.txt

The /robots.txt file is simply a text file in your site's root directory that tells all “good” robots (the ones that aren’t fishing for email addresses or pursuing other spammy ends) which parts of your site not to crawl. By default, all content that is accessible to a search engine spider is considered fair game. Because of this, a /robots.txt file is only needed when you want to keep the search engines out of a particular directory.

It should also be noted that following the directives of the /robots.txt file is voluntary. Although all major search engines generally comply with it, there are plenty of spammy spiders that won’t. For that reason, the only surefire way to protect content is to put the sensitive areas behind password protection.

If you have something that you want to keep search engine spiders away from while still allowing unrestricted visitor access, the best practice is to create a /robots.txt file in your root directory. That means the URL to it should look like this:

http://www.yourwebsite.com/robots.txt
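
As a rough illustration of what "root directory" means here, the sketch below derives the /robots.txt address from any page URL using Python's standard urllib.parse module. The function name robots_url and the example page path are made up for this illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the /robots.txt URL for the site that serves page_url.

    robots_url is a hypothetical helper name, not part of any library.
    """
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path, query, and fragment are
    # dropped because robots.txt sits at the top level of the site.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path below is an invented example.
print(robots_url("http://www.yourwebsite.com/blog/some-post.html"))
# prints http://www.yourwebsite.com/robots.txt
```

A robots.txt file placed anywhere else, such as inside /blog/, would not be found at this address.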

The following are examples of what should be contained in a /robots.txt file:

If you want to exclude ALL robots from your ENTIRE site:

User-agent: *
Disallow: /

If you want to exclude ALL robots from some content on your site:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

If you want to disallow a single robot:

User-agent: EvilRobot
Disallow: /

If you want to let ONE robot in:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
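
One way to sanity-check rules like the ones above before uploading them is Python's standard urllib.robotparser module. The sketch below is only an illustration: the helper can_crawl, the bot name SomeBot, and the page paths are assumptions made up for the example, and Python's parser will not necessarily interpret every edge case the way a given search engine does.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(rules: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether agent may fetch url.

    can_crawl is a hypothetical helper written for this example.
    """
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


SITE = "http://www.yourwebsite.com"

# The four rule sets from the examples above.
block_all = "User-agent: *\nDisallow: /"
block_some = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /tmp/\nDisallow: /junk/"
block_one = "User-agent: EvilRobot\nDisallow: /"
allow_one = "User-agent: Googlebot\nDisallow:\n\nUser-agent: *\nDisallow: /"

# "SomeBot" and the page paths are invented for illustration.
print(can_crawl(block_all, "SomeBot", SITE + "/index.html"))     # False
print(can_crawl(block_some, "SomeBot", SITE + "/tmp/file.txt"))  # False
print(can_crawl(block_some, "SomeBot", SITE + "/index.html"))    # True
print(can_crawl(block_one, "EvilRobot", SITE + "/index.html"))   # False
print(can_crawl(block_one, "SomeBot", SITE + "/index.html"))     # True
print(can_crawl(allow_one, "Googlebot", SITE + "/index.html"))   # True
print(can_crawl(allow_one, "SomeBot", SITE + "/index.html"))     # False
```

Note that an empty Disallow: line means "nothing is disallowed", which is why Googlebot is let in by the last rule set.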

The User-agent field identifies the bot you would like to allow or deny access. The * character is a wildcard that stands for “any”. Please note that if a part of your site is private, it should be password protected even if you use /robots.txt to disallow crawling of it. /robots.txt is NOT enforceable, and although the legitimate spiders of search engines will follow its directives, not all bots are benign.
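
To see why /robots.txt cannot be enforced, it helps to notice that the check runs inside the crawler itself. The sketch below is a simplified illustration, not how any particular search engine works: polite_fetch, rude_fetch, fake_fetch, the bot name GoodBot, and the /private/ path are all hypothetical, and no real network request is made.

```python
from typing import Callable, Optional
from urllib.robotparser import RobotFileParser


def fake_fetch(url: str) -> str:
    """Stand-in for a real download; returns a placeholder string."""
    return "contents of " + url


def polite_fetch(parser: RobotFileParser, agent: str, url: str,
                 fetch: Callable[[str], str]) -> Optional[str]:
    """A well-behaved bot consults the rules first and skips blocked URLs."""
    if not parser.can_fetch(agent, url):
        return None
    return fetch(url)


def rude_fetch(url: str, fetch: Callable[[str], str]) -> str:
    """A badly behaved bot never looks at robots.txt at all."""
    return fetch(url)


# Hypothetical rules that ask every robot to stay out of /private/.
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])

secret = "http://www.yourwebsite.com/private/report.html"
print(polite_fetch(rules, "GoodBot", secret, fake_fetch))  # None
print(rude_fetch(secret, fake_fetch))
# prints contents of http://www.yourwebsite.com/private/report.html
```

Nothing on the server stops the second function from running, which is why password protection is the only real safeguard for private content.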

It’s also important to know that even when you have excluded content from being crawled, search engine results can still contain the URLs of that content. This is because the spiders may have spotted links to your content on other sites, with anchor text suggesting that your uncrawled URL is a good match for that term. Below is a great video from Matt Cutts showing how that works.

Uncrawled URLs in Search Results – Matt Cutts
