A common question asked in webmaster forums is how to block certain web spiders, crawlers or bots from accessing your site. You can do this using robots.txt, but some web crawlers have been known to ignore this request. A more reliable way to block bots is to use your .htaccess file instead.
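For comparison, a robots.txt rule is nothing more than a polite request that well-behaved crawlers honour and bad bots are free to ignore (the bot name here is just a placeholder):

# Ask a crawler identifying itself as "BadBot" not to fetch anything
User-agent: BadBot
Disallow: /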
What are web crawlers?
Web crawlers, often called spiders or bots, are programs that systematically browse the web and perform automated tasks on your site. They can perform tasks such as:
- Check links in your content to other websites
- Validate your HTML code to check for errors
- Save information like the number of sites you link to or are linked from
- Store your site and content in an “archive”
Some bots are more sinister: they harvest email addresses or contact forms so you can be spammed, or probe your code for security vulnerabilities.
What you need to start blocking web crawlers
Before you can start blocking web crawlers using .htaccess you'll need a couple of things first:
- Your site needs to be running on an Apache server. Most commercial web hosting companies will allow you to create or modify a .htaccess file – but free ones usually don't.
- You need access to your site's raw server logs so you can find the names of the web spiders you want to block (unless you already know what they are). Again, commercial hosting will provide this for you; the snippet below shows where to look in each log line.
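Apache servers are typically set up to write the "combined" log format, in which the User-Agent string is the last quoted field on every line and the first field, %h, is the client's IP address. If your host uses a different format the principle is the same; for reference, the combined format is defined like this:

# Apache's standard "combined" log format; the final quoted field,
# %{User-Agent}i, is the User-Agent string you're looking for.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined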
Note: short of blocking every bot that tries to access your website, you'll never fully be able to keep them out. New bots are created, and existing ones modified, all the time to get around anything you put in your .htaccess file. The best you can hope for is to make life more difficult for the bad bots that want to spam or hack you.
Identifying the web crawler you want to block
To block a bot from crawling your site you need to find one of two pieces of information about it: either the IP address the bot uses to access the web, or its "User-Agent string", which identifies the crawler (for example, Googlebot).
This database of 302 web bots might be useful if you already know the name of the bot you want to block with .htaccess.
Alternatively, you'll need to download your log files using FTP and open them with a text editor. The default location for your log files can vary depending on your server setup; if you can't find them yourself, ask your hosting company where they are stored.
To narrow down your search, it helps if you can pinpoint which page the bot visited or when it crawled, so you have less of the log to sift through.
Once you've found the bot(s) you'd like to block, you can add them to your .htaccess file. Blocking an IP or bot name won't necessarily stop the bot forever, as bots can change their names or move to new IP addresses.
Blocking robots in your .htaccess file
To start, you'll need to download your .htaccess file via FTP and take a copy of it in case you need to restore it later. The snippets below will show you how to block bots using either the IP address or the User-Agent string.
- Blocking by IP address. You can easily block specific IPs in .htaccess using the following code:
Order Deny,Allow
Deny from 127.0.0.1
You would obviously need to change 127.0.0.1 to whichever IP you'd like to block.
Order Deny,Allow
simply means that if a request matches a Deny rule, the server will deny it; if it doesn't match any Deny rule, the request is allowed. The second line tells the server to deny any request from 127.0.0.1, which will be served a 403 Forbidden message instead of the actual web page requested.
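One caveat: on Apache 2.4 and later, the Order/Deny/Allow directives are deprecated (they still work where the mod_access_compat module is enabled). The modern equivalent uses the Require directive; here's a minimal sketch of the same rule:

# Apache 2.4+ equivalent: allow everyone except the listed IP
<RequireAll>
    Require all granted
    Require not ip 127.0.0.1
</RequireAll>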
You can block more IPs by adding further Deny from lines to your .htaccess:
Order Deny,Allow
Deny from 127.0.0.1
Deny from 215.146.3.3
Deny from 190.86.1.1
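Deny from also accepts partial addresses and CIDR notation, so you can block a whole range of addresses with one line (the ranges below are only examples):

Order Deny,Allow
# A partial address blocks every IP that starts with it
Deny from 192.168.
# CIDR notation blocks the whole 10.x.x.x range
Deny from 10.0.0.0/8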
- Blocking bots by User-Agent string. The easiest way to block web crawlers by User-Agent string is Apache's mod_rewrite module, switched on with the RewriteEngine directive. You can detect User-Agents and issue a 403 Forbidden error to them. So let's say we want to block some search engine bots:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [OR]
RewriteCond %{HTTP_USER_AGENT} AdsBot-Google [OR]
RewriteCond %{HTTP_USER_AGENT} msnbot [OR]
RewriteCond %{HTTP_USER_AGENT} AltaVista [OR]
RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteRule . - [F,L]
This takes a list of conditions (RewriteCond) chained together with [OR], so a match on any one of the listed User-Agents triggers the rule. In the rule's flags, F stands for Forbidden and L means it's the last rule to process for that request.
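Note that these matches are case-sensitive by default; adding the [NC] flag makes them case-insensitive, and a regex alternation lets you cover several bots in a single condition. A minimal sketch, with made-up bot names:

RewriteEngine On
# [NC] makes the match case-insensitive; the (a|b) alternation matches
# any of the listed names ("badbot" and "evilscraper" are placeholders,
# substitute the bots you actually want to block)
RewriteCond %{HTTP_USER_AGENT} (badbot|evilscraper) [NC]
# Serve a 403 Forbidden instead of the requested page
RewriteRule . - [F,L]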
Once you've made the changes and blocked the bots or IPs you want to, you can save the .htaccess file and upload it to your server, overwriting the original one.
You can keep the file updated as new bots or IPs need to be blocked, and if you make a mistake you can revert by restoring the original .htaccess file or simply deleting the offending rules.