Every day, hundreds of web crawlers (also known as spiders, bots or robots) search the web, and they will most likely crawl your website as well. It could be Google trying to index your website to show in its search results, or a spam bot trying to find email addresses to send junk mail to.
For most webmasters, it is a very good idea to control which parts of your site crawlers are allowed to visit – and which they are not. To do this, we use a simple text file called robots.txt. By placing it in the main directory of your website (for example: http://sitebeginner.com/robots.txt), you can advise the crawlers which pages and directories they should access or ignore. Robots.txt is purely advisory though – crawlers can ignore it and still crawl your site. To stop the crawlers that ignore it, you can block them using .htaccess instead.
Why you should have a robots.txt file for your website
There are a number of very good reasons why you might want to restrict access to your site for crawlers by using the robots.txt file.
- To prevent wasted server resources. Each time a crawler finds your site, it will attempt to call all of your scripts in the same way a browser would. This means that your images, videos and other media will be loaded, along with search forms and other scripts like contact forms.
This can be a big drain on your server resources and can sometimes cause your server to crash, resulting in downtime for your website. It can be useful to monitor your server logs for crawlers that slow your website down or spider your site in large volume, so that you can block them with robots.txt.
- To save your server bandwidth. While many commercial web hosts now offer unlimited (within reason) bandwidth, you might find that allowing spiders to crawl a very large website unrestricted will cause a spike in bandwidth usage that you may have to pay your web host for. This happens when the spiders access videos, images and other large files which have to be downloaded and indexed.
Having these in folders which are blocked with the robots.txt file can reduce the bandwidth and save you an expensive hosting bill.
- To block a rogue or spam bot you don't want to access your site. It could be that a spider with a very high crawl rate is slowing your site down, or that a spam bot is trying to hit your contact form with unsolicited email spam. In these cases you can block them from accessing your site using robots.txt or .htaccess (or preferably, both).
To do this you'll need to find the name of the bot as it appears in your server log, so you can use it to target and block the bot with robots.txt.
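As a rough sketch of how you might find those bot names, the Python below counts requests per user agent in an access log. It assumes the common "combined" log format, where the user agent is the last quoted field on each line. The sample lines and the bot name ExampleBot are made up for illustration, so swap in lines from your own log.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the user agent is the
# last double-quoted field on each line.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(log_lines):
    """Return a Counter mapping each user-agent string to its request count."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines; in practice, read them from your own access log.
sample_log = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [10/Oct/2023:13:55:37 +0000] "GET /contact HTTP/1.1" 200 734 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [10/Oct/2023:13:56:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

agent_counts = count_user_agents(sample_log)
# Busiest user agents first.
for agent, hits in agent_counts.most_common():
    print(f"{hits:5d}  {agent}")
```

The part of the user-agent string before the slash (here, ExampleBot) is usually the name you would put on the User-agent line.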
Setting up a robots.txt file
It's really easy to get started with a robots.txt file. It's just a plain text file that you keep in the main directory of your site. However, you'll need to be careful to make sure you don't inadvertently block something important. You can easily block your entire site from Google and see your traffic from their search disappear!
To start with, create a plain text document and save it as robots.txt. It needs to be uploaded to the root folder (mysite.com/robots.txt and not mysite.com/folder/robots.txt). By placing it in the root, search engines and other crawlers know exactly where to look before they begin to index your site.
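To make the "root folder" rule concrete, here is a minimal Python sketch that works out where a crawler would look for robots.txt, given any page on a site. The function name robots_txt_url is our own, and the page URL is just an example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://mysite.com/folder/page.html"))
```

Whatever folder the page sits in, the result points at the root of the site, which is why a robots.txt saved inside a subfolder is never found.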
Each entry in the file will have a User-agent line to identify the spider you want to instruct followed by one or more Disallow: lines to tell that crawler what to avoid.
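If you would rather generate entries than type them, the layout just described can be sketched in a few lines of Python. The helper name build_entry is hypothetical; the point is only that the User-agent line comes first and every Disallow: sits on its own line.

```python
def build_entry(user_agent, disallowed_paths):
    """Return one robots.txt entry as text, with each directive on its own line."""
    lines = [f"User-agent: {user_agent}"]
    for path in disallowed_paths:
        lines.append(f"Disallow: {path}")
    return "\n".join(lines)


print(build_entry("*", ["/cgi-bin/", "/admin-area/"]))
```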
So with that in mind let's start with a basic entry into your robots.txt:
Robots.txt example 1
User-agent: *
Disallow: /
In this sample robots.txt, the asterisk (*) is a wildcard which targets all user agents. The entry instructs them not to crawl any of your pages. Don't use it unless you want to completely block Google and all other search engines and spiders.
Disallow: / is useful, however, if you are targeting a specific agent, such as a spam bot or other crawler you don't want on your site.
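If you want to see the effect of example 1 before uploading anything, Python's standard urllib.robotparser module can read the rules from a string. This is only a sketch of how one well-behaved parser interprets the file; AnyBot is a hypothetical crawler name and mysite.com is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
# parse() takes the file's lines, so nothing is fetched over the network.
parser.parse(ROBOTS_TXT.splitlines())

# "AnyBot" is a hypothetical crawler name; the wildcard entry applies to it.
home_allowed = parser.can_fetch("AnyBot", "http://mysite.com/")
page_allowed = parser.can_fetch("AnyBot", "http://mysite.com/about.html")
print(home_allowed, page_allowed)
```

Both checks come back False, because every path on the site starts with the / that the entry disallows.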
Robots.txt example 2
Next, you can block specific directories from being indexed by any bots. This is useful if you have admin areas, password protected areas or testing sections of your site which you don't want the search engines to see:
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin-area/
Disallow: /testing/test.htm
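The rules in example 2 match by path prefix, which is easy to check with a short Python sketch using the standard urllib.robotparser module. The helper blocked_paths, the crawler name AnyBot and the test paths are our own assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin-area/
Disallow: /testing/test.htm
"""


def blocked_paths(robots_txt, user_agent, paths):
    """Return the subset of paths that user_agent may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


# Hypothetical paths to check against the rules above.
candidates = [
    "/index.html",
    "/admin-area/login.html",
    "/testing/test.htm",
    "/testing/other.htm",
]
blocked = blocked_paths(ROBOTS_TXT, "AnyBot", candidates)
print(blocked)
```

Everything under /admin-area/ is blocked, while /testing/other.htm stays crawlable because only the single file /testing/test.htm was listed.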
Robots.txt example 3
A more specific User-agent entry overrides the wildcard entry for the crawler it names. So if, for example, you have blocked your pages from being indexed by all spiders but you do want to give Google access, you can do this:
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /testing/test.htm
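To check which entry a given crawler ends up with in example 3, you can again lean on Python's urllib.robotparser. This sketch shows how that one parser resolves the rules, not a guarantee about every crawler; OtherBot is a hypothetical name standing in for any crawler other than Googlebot.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /testing/test.htm
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own entry, so only the two listed paths are off limits.
google_home = parser.can_fetch("Googlebot", "/index.html")
google_cgi = parser.can_fetch("Googlebot", "/cgi-bin/form.cgi")

# Any other crawler falls back to the wildcard entry and is blocked everywhere.
other_home = parser.can_fetch("OtherBot", "/index.html")

print(google_home, google_cgi, other_home)
```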
Robots.txt example 4
Instead of disallowing pages, you can allow a crawler to access all of them by not putting anything after Disallow:. For example:
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
Robots.txt example 5
Some web crawlers (like Google) also accept Allow: as a way to explicitly tell them which pages and directories to crawl. Here, we are disallowing all crawlers except Google:
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
This is the recommended way for you to set this up according to Google's FAQ page, but the original robots.txt protocol didn't include Allow and not every crawler honors it – so only use it for crawlers you know support it, such as Google.
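Examples 4 and 5 are meant to have the same effect, and a short Python sketch with the standard urllib.robotparser module suggests that they do, at least for a parser that understands both forms. The helper googlebot_only and the crawler name OtherBot are assumptions made for this illustration.

```python
from urllib.robotparser import RobotFileParser


def googlebot_only(googlebot_rule):
    """Build a parser that blocks all crawlers, plus one rule line for Googlebot."""
    robots_txt = "\n".join([
        "User-agent: *",
        "Disallow: /",
        "",
        "User-agent: Googlebot",
        googlebot_rule,
    ])
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


results = {}
for rule in ("Disallow:", "Allow: /"):
    parser = googlebot_only(rule)
    # Record (Googlebot allowed?, other crawler allowed?) for each variant.
    results[rule] = (
        parser.can_fetch("Googlebot", "/index.html"),
        parser.can_fetch("OtherBot", "/index.html"),
    )

print(results)
```

Both variants let Googlebot in and keep OtherBot out. The difference only shows up with crawlers that don't recognize Allow:, which is why the empty Disallow: form is the safer choice for them.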
Common mistakes when using robots.txt
When you first start creating your robots.txt file, there are a few common mistakes you might make:
- Disallowing your site by mistake. It's very easy to leave a lone forward slash on Disallow: /, which would block the specified user agent from accessing any of your content. This is particularly dangerous with search engine spiders, as they may remove your site from their index. Double check to make sure you are blocking the right files for the right bots.
- It doesn't always block the bots. Robots.txt is advisory and some bots – particularly spam bots – will ignore the file and continue to crawl your site. In cases like these, use .htaccess to block those crawlers instead.
- Listing your hidden directories. Your file is available for all to see at yoursite.com/robots.txt, so don't list any files or folders you don't want people or robots to find. Listing the folders in robots.txt will not block them from being accessed directly in the browser or by a bot that ignores the file. If you need to restrict access, use .htaccess to password protect the directory.
- Placing multiple directories on one line. If you place more than one directory to disallow on a line, robots.txt will not function properly and some folders will still be crawled. Always list each directory on a new