Site Beginner

Learn How to Build Websites

How to Block Web Spiders/Crawlers and Bots from Your Website With .htaccess

A common question asked in webmaster forums is how to block certain web spiders, crawlers or bots from accessing your site. You can do this using robots.txt, but some web crawlers have been known to ignore this request. A more reliable way to block bots is to use your .htaccess file instead.

Contents

  • What are web crawlers?
  • What you need to start blocking web crawlers
  • Identifying the web crawler you want to block
  • Blocking robots in your .htaccess file

What are web crawlers?

Web crawlers, often known as spiders or bots, are programs that systematically browse the web and perform automated tasks on your site. They can perform tasks such as:

  • Check links in your content to other websites
  • Validate your HTML code to check for errors
  • Save information like the number of sites you link to or are linked from
  • Store your site and content in an “archive”

Some bots are more sinister: they search your website for email addresses or forms they can use to spam you, or probe your code for security holes.

What you need to start blocking web crawlers

Before you can start blocking web crawlers using .htaccess you'll need a couple of things first:

  1. Your site needs to be running on an Apache server. Most commercial web hosting companies will allow you to create or modify a .htaccess file – but free ones usually don't.
  2. You need access to your site's raw server logs so you can find the names of the web spiders you want to block (unless you already know what they are). Again, commercial hosting will provide this for you.

Note: you will never be able to block every bot. New bots are created and existing ones are modified all the time to get around whatever you put in your .htaccess file. The best you can hope for is to make life more difficult for the bad bots that want to spam or hack you.

Identifying the web crawler you want to block

To block a bot from crawling your site you need one of two pieces of information about it: either the IP address the bot uses to access the web, or its “User-Agent string”, which is the name the crawler identifies itself with (for example, Googlebot).

This database of 302 web bots might be useful if you already know the name of the bot you want to block with .htaccess.

Alternatively, you'll need to download your log files using FTP and open them with a text editor. The default location for your log files can vary depending on your server setup. If you aren't able to find your logs yourself, ask your hosting company where they are stored.

To narrow down your search, it helps if you can pinpoint which page the bot visited or what time it crawled the page, so that you can search your log for that entry.

Once you've found the bot(s) that you'd like to block, you can add them to your .htaccess file. Blocking the IP or bot name won't necessarily stop the bot forever, as a bot can change its name or move to a new IP address.

Blocking robots in your .htaccess file

To start, you'll need to download your .htaccess file via FTP and take a copy of it in case you need to restore it later. The snippets below will show you how to block bots using either the IP address or the User-Agent string.

  • Blocking by IP address. You can block specific IP addresses in .htaccess easily by using the following code:
    Order Deny,Allow
    Deny from 127.0.0.1

    You would obviously need to change 127.0.0.1 to whichever IP you'd like to block. Order Deny,Allow sets the evaluation order: Deny rules are checked before Allow rules, and any request that matches neither is allowed. With only Deny rules present, a request that matches one is refused and everything else gets through.

    The second line tells the server to deny any request from 127.0.0.1, so that visitor receives a 403 Forbidden message instead of the page being requested.

    You can block more IPs by adding further Deny from lines to your .htaccess:

    Order Deny,Allow
    Deny from 127.0.0.1
    Deny from 215.146.3.3
    Deny from 190.86.1.1
  • Blocking bots by User-Agent string. The easiest way to block web crawlers by User-Agent string is to use Apache's mod_rewrite module, which is switched on with the RewriteEngine directive. It lets you detect User-Agents and issue a 403 Forbidden error to them. So let's say we want to block some search engine bots:
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} Googlebot [OR]
    RewriteCond %{HTTP_USER_AGENT} AdsBot-Google [OR]
    RewriteCond %{HTTP_USER_AGENT} msnbot [OR]
    RewriteCond %{HTTP_USER_AGENT} AltaVista [OR]
    RewriteCond %{HTTP_USER_AGENT} Slurp
    RewriteRule . - [F,L]

    The RewriteCond lines list the conditions to test. Because they are joined with [OR], the RewriteRule that follows applies when any one of them matches the request's User-Agent. The F flag returns a 403 Forbidden response and the L flag tells Apache to stop processing further rewrite rules for that request.

    Once you've made the changes and blocked the bots or IPs you want to, you can save the .htaccess file and upload it to your server, overwriting the original one.

    You can keep the file updated as new bots or IPs need to be blocked. If you make a mistake, you can revert it by restoring the original .htaccess file or by deleting the rules.
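The Order and Deny directives shown above are Apache 2.2 syntax. Apache 2.4 only supports them through the mod_access_compat module, where they are deprecated in favour of Require. As a rough sketch, the two techniques might be combined for Apache 2.4 as follows. It assumes mod_authz_core and mod_rewrite are enabled and that your host permits these directives in .htaccess. The IP addresses come from ranges reserved for documentation, and ExampleBot and AnotherBot are made-up names, so swap in the values you found in your own logs.

```apache
# Sketch for Apache 2.4+ only; not tested against any particular host.

# Block by IP address (placeholder addresses from documentation ranges).
# A negated Require must sit inside <RequireAll> next to a positive rule.
<RequireAll>
    Require all granted
    Require not ip 192.0.2.10
    Require not ip 198.51.100.0/24
</RequireAll>

# Block by User-Agent (hypothetical bot names).
# A single alternation replaces a chain of [OR] conditions,
# and [NC] makes the match case-insensitive.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (ExampleBot|AnotherBot) [NC]
# "^" matches every request path, including the empty path for the site root.
RewriteRule ^ - [F]
```

The F flag already implies L, so it does not need to be listed separately. After uploading, request a page with one of the blocked User-Agent names to check that the server answers with 403 Forbidden, and keep your backup copy handy in case the server returns an error for every page.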

Copyright © 2007-2023 All Rights Reserved.