I am still amazed at how many web sites don't have a robots.txt file at the root of their web server. Even SEO firms, or people claiming to be SEO experts, are missing one, which I find very funny. There are also countless arguments about whether you still need a robots.txt, but my advice is that if the search engine robots still request it, I'd rather have it there as the welcome mat to the site.
For those of you who don't know the history of the robots.txt file, I'd suggest you look it up on Google or Wikipedia. In short, it's a text file that specifies which parts of a web site search engines should crawl and index, and which parts they should leave alone. You can also get specific and set up rules for particular spiders and crawlers.
To start with, you need to create a text file called robots.txt and place it in the root of your web host. You should be able to access it through your web browser at www.yourdomain.com/robots.txt
You can view other web sites' robots.txt files by requesting robots.txt at the root of their domain.
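If you want to check that the file is actually reachable and being read the way you expect, one quick way is Python's built-in urllib.robotparser module. This is just a small sketch; www.yourdomain.com is a placeholder, so swap in your own domain:

from urllib.robotparser import RobotFileParser

# Point the parser at the robots.txt of the site you want to check.
# www.yourdomain.com is a placeholder - use your own domain here.
rp = RobotFileParser()
rp.set_url("http://www.yourdomain.com/robots.txt")
rp.read()  # fetches and parses the live file

# Ask whether a generic crawler ("*") may fetch the home page.
print(rp.can_fetch("*", "http://www.yourdomain.com/"))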
If you want Google and the other search engines to come into your site and index everything, things are very easy. Simply add the following to your robots.txt file and away you go:
User-agent: *
Disallow:
Alternatively, if you wish to stop all pages on your site from being indexed, the following should be present in your file:
User-agent: *
Disallow: /
To stop robots indexing a folder called images and another called private you would add a Disallow line for each folder:
User-agent: *
Disallow: /images/
Disallow: /private/
The above would still allow the rest of the site to be indexed, but anything in those folders would be excluded from search engine results.
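If you want to double check how a crawler would interpret those rules, you can feed them straight into Python's urllib.robotparser without touching a live site. This is only a sketch of the idea, with made-up file names:

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /images/
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Pages outside the two folders are still crawlable...
print(rp.can_fetch("*", "/index.htm"))          # True
# ...but anything inside /images/ or /private/ is blocked.
print(rp.can_fetch("*", "/images/logo.gif"))    # False
print(rp.can_fetch("*", "/private/notes.htm"))  # False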
To disallow a single file, you specify it in the same way as a folder:
User-agent: *
Disallow: /myPrivateFile.htm
If you only wanted Google to have access to your site, you would specify the following (Google's crawler identifies itself as Googlebot):
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
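To sanity check a per-crawler setup like this, the same urllib.robotparser trick works. Again, this is just an illustrative sketch, and SomeOtherBot is a made-up name standing in for any other crawler:

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot matches its own record, which allows everything...
print(rp.can_fetch("Googlebot", "/about.htm"))      # True
# ...while any other crawler falls through to the catch-all block.
print(rp.can_fetch("SomeOtherBot", "/about.htm"))   # False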
If you are looking to get your site fully indexed, I would put the first example in your robots.txt file.