Is there any relevance of Robots.txt?
To your surprise, one small text file can be responsible for your website's rankings. One small mistake in it can ruin your rankings or stop search engines from crawling your website, i.e. your web pages will not appear on the search results page.
Hence, it's very important to understand how the robots.txt file functions.
What is Robots.txt?
Robots.txt is a text file you put on your website to tell search robots which pages you would like them not to visit. By stating rules in this text file, you can instruct robots not to crawl and index certain files and pages of your site.
Let's say we do not want a search engine to crawl the images section of the site, because doing so uses the site's bandwidth and adds little value; robots.txt informs Google about it. Web developers can use robots.txt to communicate with web robots about which parts of the website should stay private. The robots.txt file is not a compulsion for search engines, but generally search engines follow what they are asked to do.
In simple terms, robots.txt is used when the webmaster doesn’t want the search engines to crawl specific directories, pages or URLs.
Is robots.txt really important? How does it affect the website?
Whether the robots.txt file is important depends entirely on the webmaster. If the webmaster wants to hide some information or web pages from being crawled by the search engines, the file is important; if not, there is no need for a robots.txt file.
A few reasons for using a robots.txt file on your website:
- Saves server bandwidth – The crawler will skip the web pages that hold no important information, so they don't consume your bandwidth.
- Provides protection – Robots.txt is not a real security measure, but it keeps compliant search engines away from the content you don't want them to reach. Individuals would have to visit your site and browse to the directory themselves instead of finding it on search engines such as Google, Yahoo or Bing.
- Better server logs – Every time a search engine crawls your site, it requests robots.txt; if you don't have one, it generates a "404 Not Found" error each time, which makes it hard to spot the genuine errors.
- Prevents spam and penalties – Content prone to spam or duplication can be protected with robots.txt. Keeping such content out of the index avoids its misuse. Some website owners use the robots.txt file to hide the development or confidential areas of a website from public view.
- Google Webmaster Tools – It is useful to have a robots.txt file so that Google can validate your site. Webmaster tools give you important insight into your website.
Some of the uses of robots.txt are:
- To disallow crawlers from visiting private folders or pages that may contain confidential data.
- To give specific web robots permission to crawl your site. Doing this saves bandwidth.
- To tell bots the location of your Sitemap via a directive in the robots.txt file.
The Sitemap directive is independent of the user-agent line, so it can be added anywhere in the robots.txt file. You just need to state the full URL of your sitemap, ending in the sitemap-location.xml part. Don't worry if you have multiple sitemaps: you can specify the location of your sitemap index file instead.
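For example, a minimal sketch of the directive (the domain and file name are placeholders):

```
Sitemap: https://www.example.com/sitemap-location.xml
```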
Creating a robots.txt file a headache? Chill, let's create one!
Creating a robots.txt file is not that difficult. If you don't have one, it's always better to create one:
- Create a new text file: The first step is to create a new text file and save it as "robots.txt", using the Notepad program on Windows PCs or TextEdit on Macs and clicking "Save As".
- Upload it to the root directory: After saving your file, upload it to the root directory of the website. This is the root-level folder, usually called "htdocs" or "www", which makes the file appear directly after the domain name.
- Create a robots.txt file for each subdomain: If you are using subdomains, a separate robots.txt file has to be created for each subdomain.
Two vital components of robots.txt file!
The robots.txt file has two major components: User-agent and Disallow.
- User-agent: The user-agent line is often seen with a wildcard (*), an asterisk sign indicating that the blocking instructions apply to all web robots. You can name a specific web robot under the user-agent directive if you want that bot to be allowed or blocked on certain pages.
- Disallow: When Disallow specifies nothing, web robots can crawl all the pages on a site. If you want to block a certain page, use only one URL prefix per Disallow line; in robots.txt you cannot include multiple folders or URL prefixes under a single Disallow element.
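Putting the two components together, a sketch (the bot name and directory are illustrative; robots.txt treats lines starting with # as comments):

```
# Block one named bot from the images section
User-agent: Googlebot
Disallow: /images/

# All other robots may crawl everything
User-agent: *
Disallow:
```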
Some basic elements are essential to know.
What to include in a robots.txt file?
The "/robots.txt" file is a plain text file containing one or more records. A single record consists of a User-agent line followed by one or more Disallow lines.
Here is an example in which three directories are excluded:
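A sketch of such a record (the directory names are illustrative):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
```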
To exclude a URL prefix you need a separate "Disallow" line for each one; "Disallow: /cgi-bin/ /tmp/" on a single line will not work. Also note that blank lines delimit records, so a record must not contain any blank lines.
Globbing and regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". You cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".
What you exclude depends on your server. Let's go through some examples:
- Excluding all robots from the entire server
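The standard form for blocking every robot from everything:

```
User-agent: *
Disallow: /
```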
- Allowing all robots for complete access
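In explicit form:

```
User-agent: *
Disallow:
```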
(create an empty “/robots.txt” file, or don’t use one at all)
- Excluding all robots from part of the server
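Use one Disallow line per directory (the directory names are illustrative):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
```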
- Excluding a single robot
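Here "BadBot" is a placeholder for the actual user-agent name of the robot you want to exclude:

```
User-agent: BadBot
Disallow: /
```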
- Allowing a single robot
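A sketch that lets one named robot in and excludes the rest (the name is illustrative):

```
User-agent: Google
Disallow:

User-agent: *
Disallow: /
```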
- Excluding all files except one
As there is no "Allow" field in the original standard, this can turn a bit tricky. The best way is to move all files that need to be disallowed into a separate directory, for example "shoes", leaving the one allowed file in the level above this directory:
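With the disallowed files gathered in the "shoes" directory mentioned above, the file would look like this:

```
User-agent: *
Disallow: /shoes/
```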
Alternatively, you can explicitly disallow all the pages you want hidden:
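A sketch, with hypothetical page names:

```
User-agent: *
Disallow: /shoes/summer.html
Disallow: /shoes/winter.html
```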
What not to include in robots.txt file?
Generally, in a website's robots.txt file there is a command:
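That command is the block-everything record shown earlier:

```
User-agent: *
Disallow: /
```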
This command tells all the bots to simply ignore the whole domain, meaning not a single page of that website will be crawled or indexed by the search engines.
What happens if you have no robots.txt file?
If there is no robots.txt file, the search engines are free to crawl anything and everything they find on the website. This is a problem for webmasters who want to hide some information and do not want certain pages crawled and indexed by the search engines, since that information becomes public. Also, it is always better to point to your XML sitemap, as it helps the search engines crawl your new content easily.
Robots.txt can be a shortcoming for websites:
One should really understand the logic behind creating robots.txt and the risks of its URL-blocking method. At times you may want to use other mechanisms to make sure your URLs are not discoverable on the web.
- Robots.txt functions according to directives only.
Robots.txt is a file that works only through the commands given in it. By stating rules in the text file, you ask robots not to crawl and index certain files and directories within your site. Most web crawlers follow the instructions in a robots.txt file, but some might not. Therefore, it is always advised to keep sensitive information secured from web crawlers by using other blocking methods as well.
- Every crawler can have a different syntax.
Generally, web crawlers follow the directives in a robots.txt file, but there are chances that a crawler might interpret the instructions differently. One should have complete knowledge of the syntax for addressing different web crawlers, as some might not understand certain instructions.
- Robots.txt cannot prevent other sites from referencing your URLs.
There are chances that Google won't crawl or index the content blocked by robots.txt, but it might still find and index a disallowed URL from other places on the web. Consequently, the URL address and other openly available information can still appear in Google search results.
A friend in need when you're stuck validating your robots.txt file. Here you go!
Use a robots.txt validation tool, such as the robots.txt Tester in Google Webmaster Tools, to check your file before going live.
The robots.txt file acts as a saviour for your website when you don't want certain web pages to be indexed by the search engines. Used carefully, it keeps private pages out of the index and supports optimum Search Engine Optimisation results.