Control search engine crawlers

How to control search engine crawlers with a robots.txt file

Website owners can instruct search engines on how to crawl their website by using a robots.txt file.

When a search engine crawls a website, it requests the robots.txt file first and then follows the rules within.

It’s important to know that bots are not required to follow robots.txt rules; they are only a guideline. For instance, Google does not support the Crawl-delay directive, so its crawl rate must be set in Google Search Console (formerly Google Webmaster Tools).

For bad bots that abuse your site, see the section below on blocking bad users by User-agent in .htaccess.

Edit or create a robots.txt file

The robots.txt file needs to be at the root of your site. If your domain is example.com, it should be found:

On your website:

http://example.com/robots.txt

On your server:

/home/userna5/public_html/robots.txt

If you don’t already have one, you can simply create a new plain-text file named robots.txt.

Search engine User-agents

The most common rule you’d use in a robots.txt file is based on the User-agent of the search engine crawler.

Search engine crawlers use a User-agent to identify themselves when crawling. Here are some common examples:

Top 3 US search engine User-agents:

Googlebot
Yahoo! Slurp
bingbot

Common search engine User-agents blocked:

AhrefsBot
Baiduspider
Ezooms
MJ12bot
YandexBot

Search engine crawler access via robots.txt file

There are quite a few options when it comes to controlling how your site is crawled with the robots.txt file.

The User-agent: rule specifies which User-agent the rule applies to, and * is a wildcard matching any User-agent.

Disallow: sets the files or folders that are not allowed to be crawled.
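
For example, a minimal robots.txt sketch combining both directives (the /example-directory/ path here is only a placeholder) would keep every crawler out of a single folder:

User-agent: *
Disallow: /example-directory/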

Here are some of the most common uses of the robots.txt file:

Set a crawl delay for all search engines

Allow all search engines to crawl website

Disallow all search engines from crawling website

Disallow one particular search engine from crawling website

Disallow all search engines from particular folders

Disallow all search engines from particular files

Disallow all search engines but one

Set a crawl delay for all search engines:

If you had 1,000 pages on your website, a search engine could potentially index your entire site in a few minutes.

However, this could cause high system resource usage, since all of those pages would be loaded in a short time period.

A Crawl-delay: of 30 seconds would allow crawlers to index your entire 1,000-page website in just 8.3 hours.

A Crawl-delay: of 500 seconds would allow crawlers to index your entire 1,000-page website in 5.8 days.

You can set the Crawl-delay: for all search engines at once with:

User-agent: *
Crawl-delay: 30
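
If you only need to slow down one crawler, the same directive can be scoped to a single User-agent instead of the * wildcard. For example, this sketch (the 10-second value is only an illustration, and keep in mind that not every crawler honors Crawl-delay) would slow only bingbot:

User-agent: bingbot
Crawl-delay: 10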

Allow all search engines to crawl website:

By default, search engines should be able to crawl your website, but you can also explicitly state that they are allowed with:

User-agent: *
Disallow:

Disallow all search engines from crawling website:

You can disallow all search engines from crawling your website with these rules:

User-agent: *
Disallow: /

Disallow one particular search engine from crawling website:

You can disallow just one specific search engine from crawling your website with these rules:

User-agent: Baiduspider
Disallow: /
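
If you wanted to block more than one crawler this way, the robots exclusion standard also allows several User-agent lines to share a single group. A brief sketch (the two bot names are just examples from the list above) would be:

User-agent: AhrefsBot
User-agent: MJ12bot
Disallow: /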

Disallow all search engines from particular folders:

If we had a few directories, such as /cgi-bin/, /private/, and /tmp/, that we didn’t want bots to crawl, we could use this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /tmp/

Disallow all search engines from particular files:

If we had files such as contactus.htm, index.htm, and store.htm that we didn’t want bots to crawl, we could use this:

User-agent: *
Disallow: /contactus.htm
Disallow: /index.htm
Disallow: /store.htm

Disallow all search engines but one:

If we only wanted to allow Googlebot access to our /private/ directory and disallow all other bots, we could use:

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow:

When Googlebot reads our robots.txt file, it will follow the group that specifically names its User-agent rather than the wildcard group, and it will see that it is not disallowed from crawling any directories.

Block bad users based on their User-Agent string

Some malicious users will send requests from different IP addresses but still use the same User-Agent for all of the requests. In these cases, you can also block users by their User-Agent strings.

Block a single bad User-Agent

If you just wanted to block one particular User-Agent string, you could use this RewriteRule:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule .* - [F,L]

Alternatively, you can also use the BrowserMatchNoCase Apache directive like this:

BrowserMatchNoCase "Baiduspider" bots

Order Allow,Deny
Allow from ALL
Deny from env=bots
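
Note that Order, Allow, and Deny are older Apache 2.2-style directives; on Apache 2.4 they only work if mod_access_compat is loaded. A roughly equivalent sketch using the newer Require syntax would be:

BrowserMatchNoCase "Baiduspider" bots
<RequireAll>
    Require all granted
    Require not env bots
</RequireAll>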

Block multiple bad User-Agents

If you wanted to block multiple User-Agent strings at once, you could do it like this:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*(Baiduspider|HTTrack|Yandex).*$ [NC]
RewriteRule .* - [F,L]

Or you can also use the BrowserMatchNoCase directive like this:

BrowserMatchNoCase "Baiduspider" bots
BrowserMatchNoCase "HTTrack" bots
BrowserMatchNoCase "Yandex" bots

Order Allow,Deny
Allow from ALL
Deny from env=bots

Block by referer

Block a single bad referer

If you just wanted to block a single bad referer like example.com you could use this RewriteRule:

RewriteEngine On
RewriteCond %{HTTP_REFERER} example\.com [NC]
RewriteRule .* - [F]

Alternatively, you could also use the SetEnvIfNoCase Apache directive like this:

SetEnvIfNoCase Referer "example\.com" bad_referer

Order Allow,Deny
Allow from ALL
Deny from env=bad_referer

Block multiple bad referers

If you wanted to block multiple bad referers, like example.com and example.net, you could use:

RewriteEngine On
RewriteCond %{HTTP_REFERER} example\.com [NC,OR]
RewriteCond %{HTTP_REFERER} example\.net [NC]
RewriteRule .* - [F]

Or you can also use the SetEnvIfNoCase Apache directive like this:

SetEnvIfNoCase Referer "example\.com" bad_referer
SetEnvIfNoCase Referer "example\.net" bad_referer  

Order Allow,Deny
Allow from ALL
Deny from env=bad_referer

Temporarily block bad bots

In some cases, you might not want to send a 403 response, which is simply an access denied message. A good example of this: let’s say your site is getting a large spike in traffic for the day from a promotion you’re running, and you don’t want good search engine bots like Google or Yahoo to come along and start indexing your site during the same time that the extra traffic might already be stressing the server.

The following code sets up a basic error document page for a 503 response. This is the standard way to tell a search engine that its request is temporarily blocked and it should try back at a later time. This is different from denying access via a 403 response: with a 503 response, Google has confirmed that it will come back and try to index the page again instead of dropping it from the index.

The following code will catch any requests from user-agents that have the words bot, crawl, or spider in them, which most of the major search engines match. The second RewriteCond line still allows these bots to request the robots.txt file to check for new rules, but any other request will simply get a 503 response with the message “Site temporarily disabled for crawling”.

Typically you don’t want to leave a 503 block in place for longer than 2 days. Otherwise, Google might start to interpret it as an extended server outage and could begin to remove your URLs from its index.

ErrorDocument 503 "Site temporarily disabled for crawling"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*(bot|crawl|spider).*$ [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* - [R=503,L]

This method is good to use if you notice new bots crawling your site and causing excessive requests, and you want to block them or slow them down via your robots.txt file, since it lets you 503 their requests until they read your new robots.txt rules and start obeying them. You can read about how to stop search engines from crawling your website for more information.

You should now understand how to use a robots.txt file and .htaccess rules to help control and block bot access to your website in multiple ways.

 
