Creating a robots.txt file
By Sumantra Roy
Some people believe
that they should create different pages for
different search engines, each page optimized
for one keyword and for one search engine.
Now, while I don't recommend that people create
different pages for different search engines,
if you do decide to create such pages, there
is one issue that you need to be aware of.
These pages,
although optimized for different search engines,
often turn out to be pretty similar to each
other. The search engines now have the ability
to detect when a site has created such similar
looking pages and are penalizing or even banning
such sites. In order to prevent your site
from being penalized for spamming, you need
to prevent each search engine's spider from
indexing pages which are not meant for it,
i.e. you need to prevent AltaVista
from indexing pages meant for Google
and vice versa. The best way to do that is
to use a robots.txt file.
You should create
a robots.txt file using a text editor like
Windows Notepad. Don't use your word processor
to create such a file.
Here is the basic
syntax of the robots.txt file:
User-Agent: [Spider Name]
Disallow: [File Name]
For instance,
to tell AltaVista's spider, Scooter, not to
spider the file named myfile1.html residing
in the root directory of the server, you would
write
User-Agent: Scooter
Disallow: /myfile1.html
To tell Google's
spider, called Googlebot, not to spider the
files myfile2.html and myfile3.html, you would
write
User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
You can, of course,
put multiple User-Agent records in the
same robots.txt file, separating the records
with a blank line. Hence, to tell AltaVista
not to spider the file named myfile1.html,
and to tell Google not to spider the files
myfile2.html and myfile3.html, you would write
User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
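If you want to see how such a file is read, here is a minimal sketch in Python that uses the standard library's urllib.robotparser module. The names ROBOTS_TXT, FILES and blocked_files are my own, made up for illustration, and Python's parser is only one interpretation of the rules: a real spider may read the file slightly differently.

```python
from urllib.robotparser import RobotFileParser

# The two records from the example above, separated by a blank line.
ROBOTS_TXT = """\
User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
"""

def blocked_files(spider, files, robots_txt=ROBOTS_TXT):
    """Return the files in the root directory that the spider may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [name for name in files if not parser.can_fetch(spider, "/" + name)]

FILES = ["myfile1.html", "myfile2.html", "myfile3.html"]

for spider in ("Scooter", "Googlebot"):
    print(spider, "is blocked from:", blocked_files(spider, FILES))
```

Each spider obeys only the record that names it, so Scooter is kept away from myfile1.html alone, while Googlebot is kept away from the other two files.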
If you want to
prevent all robots from spidering the file
named myfile4.html, you can use the * wildcard
character in the User-Agent line, i.e. you
would write
User-Agent: *
Disallow: /myfile4.html
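However, the original
robots.txt standard does not define a wildcard
character for the Disallow line. Some spiders
accept one there as an extension, but you should
not rely on it for every spider.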
Once you have
created the robots.txt file, you should upload
it to the root directory of your domain. Uploading
it to any sub-directory won't work - the robots.txt
file needs to be in the root directory.
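To make the root directory rule concrete, here is a small Python sketch that works out where a spider would look for the robots.txt file, given the address of any page on a site. The function name robots_txt_url and the example.com address are assumptions of mine, used only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the root-level robots.txt address for the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, drop the page's own path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.example.com/articles/seo/page.html"))
# prints http://www.example.com/robots.txt
```

However deep the page sits in the directory tree, the answer is always the file in the root directory of the domain, which is why a copy uploaded to a sub-directory is never consulted.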
I won't discuss
the syntax and structure of the robots.txt
file any further - you can get the complete
specifications from http://www.robotstxt.org/wc/norobots.html
Now we come to
how the robots.txt file can be used to prevent
your site from being penalized for spamming
in case you are creating different pages for
different search engines. What you need to
do is to prevent each search engine from spidering
pages which are not meant for it.
For simplicity,
let's assume that you are targeting only two
keywords: "tourism in Australia" and "travel
to Australia". Also, let's assume that you
are targeting only three of the major search
engines: AltaVista, HotBot and Google.
Now, suppose
you name the files using the following
convention: each page is named after the keyword
for which it is optimized, with the individual
words of the keyword separated by hyphens,
followed by a hyphen and the first two letters
of the name of the search engine for which the
page is being optimized.
Hence, the files
for AltaVista are
tourism-in-australia-al.html
travel-to-australia-al.html
The files for
HotBot are
tourism-in-australia-ho.html
travel-to-australia-ho.html
The files for
Google are
tourism-in-australia-go.html
travel-to-australia-go.html
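The naming convention used for these files can be sketched in a few lines of Python. The function name page_name is hypothetical, and lowercasing the keyword is my assumption, based on the file names listed above.

```python
def page_name(keyword, engine):
    """Build a file name from a keyword and the search engine it targets."""
    words = keyword.lower().split()
    # Append the first two letters of the engine name as the final part.
    return "-".join(words + [engine[:2].lower()]) + ".html"

for engine in ("AltaVista", "HotBot", "Google"):
    for keyword in ("tourism in Australia", "travel to Australia"):
        print(page_name(keyword, engine))
```

Run as written, this prints the six file names listed above, in the same order.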
As I noted earlier,
AltaVista's spider is called Scooter and Google's
spider is called Googlebot.
A list of spiders
for the major search engines can be found
at http://www.jafsoft.com/searchengines/webbots.html
Now, we know
that HotBot uses Inktomi and from this list,
we find that Inktomi's spider is called Slurp.
Using this knowledge, here's what the robots.txt
file should contain:
User-Agent: Scooter
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Slurp
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Googlebot
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
When you put
the above lines in the robots.txt file, you
instruct each search engine not to spider
the files meant for the other search engines.
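With more keywords or more search engines, typing these lines by hand becomes error-prone. Here is a rough Python sketch that generates them instead, assuming the file naming convention described earlier. The names SPIDERS, KEYWORDS and build_robots_txt are my own and are not part of any standard.

```python
# Two-letter file suffix for each search engine, mapped to its spider's name.
SPIDERS = {"al": "Scooter", "ho": "Slurp", "go": "Googlebot"}

# The keywords, already hyphenated as in the file names.
KEYWORDS = ["tourism-in-australia", "travel-to-australia"]

def build_robots_txt(spiders, keywords):
    """Return robots.txt text that hides each engine's pages from the others."""
    records = []
    for suffix, spider in spiders.items():
        lines = ["User-Agent: " + spider]
        for other_suffix in spiders:
            if other_suffix == suffix:
                continue  # a spider may fetch the pages meant for it
            for keyword in keywords:
                lines.append(f"Disallow: /{keyword}-{other_suffix}.html")
        records.append("\n".join(lines))
    # Records are separated by a blank line.
    return "\n\n".join(records) + "\n"

print(build_robots_txt(SPIDERS, KEYWORDS))
```

For the three engines and two keywords used here, the output contains the same three records as the file above, with four Disallow lines in each.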
When you have
finished creating the robots.txt file, double-check
to ensure that you have not made any errors
anywhere in it. A small error can have disastrous
consequences - a search engine may spider
files which are not meant for it, in which
case it can penalize your site for spamming,
or it may not spider any files at all, in
which case you won't get top rankings in that
search engine.
A useful tool
to check the syntax of your robots.txt file
can be found at http://www.tardis.ed.ac.uk/~sxw/robots/check/.
While it will help you correct syntactical
errors in the robots.txt file, it won't help
you correct any logical errors, for which
you will still need to go through the robots.txt file
thoroughly, as mentioned above.
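Part of that logical check can be automated. The Python sketch below is an illustration only, not a replacement for reading the file yourself: it uses the standard library's urllib.robotparser to ask, for every spider and every page, whether the rules allow or block the fetch, and reports anything that contradicts your intentions. The names find_logic_errors, ROBOTS_TXT and PAGES are hypothetical, and the sample file deliberately leaves out one Disallow line so that there is something to find.

```python
from urllib.robotparser import RobotFileParser

def find_logic_errors(robots_txt, pages_by_spider):
    """List cases where a spider is blocked from its own pages or can fetch another's."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    errors = []
    for spider in pages_by_spider:
        for owner, pages in pages_by_spider.items():
            for page in pages:
                allowed = parser.can_fetch(spider, page)
                if owner == spider and not allowed:
                    errors.append(f"{spider} is blocked from its own page {page}")
                elif owner != spider and allowed:
                    errors.append(f"{spider} can fetch {page}, meant for {owner}")
    return errors

# A shortened file with a deliberate mistake: Scooter's record lacks
# the line for /travel-to-australia-go.html.
ROBOTS_TXT = """\
User-Agent: Scooter
Disallow: /tourism-in-australia-go.html

User-Agent: Googlebot
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
"""

PAGES = {
    "Scooter": ["/tourism-in-australia-al.html", "/travel-to-australia-al.html"],
    "Googlebot": ["/tourism-in-australia-go.html", "/travel-to-australia-go.html"],
}

for problem in find_logic_errors(ROBOTS_TXT, PAGES):
    print(problem)
```

Here the only problem reported is that Scooter can fetch /travel-to-australia-go.html, which is meant for Googlebot. Bear in mind that this checks the file against Python's reading of the rules, which may not match every spider's behavior exactly.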
Article by Sumantra
Roy. Sumantra is one of the most respected
search engine positioning specialists on the
Internet. To have Sumantra's company place
your site at the top of the search engines,
go to http://www.1stSearchRanking.com/t.cgi?3761
For more advice on how you can take your web
site to the top of the search engines, subscribe
to his FREE newsletter by going to http://www.1stSearchRanking.com/t.cgi?3761&newsletter.htm