Kuvilam Blog

robots.txt file

How to Use Robots.txt?

What Is Robots.txt?

A robots.txt file is a text file stored in a website’s root directory that tells web crawlers which pages, folders and/or file types they should or shouldn’t crawl. These instructions can apply to all bots or give guidance to specific user-agents. Robots.txt files use the Robots Exclusion Protocol, developed in 1994 as a way for websites to communicate with crawlers and other internet bots.

When website owners want to tell bots how to crawl their sites, they place the robots.txt file in their root directory, e.g. https://www.example.com/robots.txt. Crawlers arriving on the site will fetch and read the file before trying to fetch any other file from the server. When a website doesn’t have a robots.txt file, or the crawler can’t load it for some reason, a bot will assume the site owner doesn’t have any instructions to give it.
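As a rough illustration of where crawlers look, the Python sketch below derives the robots.txt location from any page URL. The function name robots_url is made up for this example, and only the standard library is used.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # robots.txt sits at the root of the scheme + host, so the page's
    # own path, query string and fragment are discarded.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.example.com/folder/page.html?ref=1"))
```

Whatever page the crawler starts from, the file it asks for first is the one at the root of that host.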

When creating a robots.txt file, it’s absolutely vital to do so using a plain text file. Using HTML or a word processor will include code in the file that crawlers can’t read. This could cause them to ignore directives in the file.


How Does a Robots.txt File Work?

A robots.txt file is made up of blocks of code containing two basic parts: user-agent and directives.

Robots.txt User-Agent

User-agent refers to the name used by a web crawler. When a crawler arrives on a site and opens its robots.txt file, the bot will look for its name in one of the user-agent lines.

Using the user-agent part of robots.txt is relatively simple. User-agent must always be listed before the disallow lines and each user-agent line can specify only one bot (Sort of. More on that in a bit). So, for example, if you have a page you don’t want Google to crawl for some reason, but you’re ok with Bing or Baidu, you’d write your instructions like this:

User-agent: googlebot
Disallow: /page

That would tell Google’s web crawler not to open the page at example.com/page, while the user-agents for other search engines would continue unaffected. If you want to give the same instructions to more than one user-agent, the most straightforward way is to create a set of directives for each one:

User-agent: googlebot
Disallow: /page

User-agent: Bingbot
Disallow: /page

That robots.txt file would tell Google and Bing not to crawl the page at https://www.example.com/page while other bots such as Baidu or Yandex would continue to do so.
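To see how per-bot blocks play out, here is a small sketch using Python’s standard urllib.robotparser module. It assumes the rules are written with a path-only Disallow value (/page), and the crawler names passed to can_fetch are simply the ones mentioned in this article.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /page

User-agent: Bingbot
Disallow: /page
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the text directly, no network call

for agent in ("Googlebot", "Bingbot", "Baiduspider"):
    allowed = parser.can_fetch(agent, "https://www.example.com/page")
    print(agent, "may crawl /page:", allowed)
```

Only the two named bots are turned away; a crawler with no matching block is treated as having no restrictions.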

If you want to provide directives to all web crawlers that access your site, you can use what’s called a wildcard. Wildcards are represented as an asterisk (*) and represent any character or string of characters. So in a robots.txt like this:

User-agent: *
Disallow: /page

Bots that read the robots.txt file will automatically interpret the wildcard as their own user-agent.

These days, most search engines have multiple crawlers to do different things like crawl images, ads, videos or mobile content. In instances where a crawler encounters a robots.txt file that doesn’t specifically include its user-agent, it will follow the instructions for the most specific user-agent that is relevant to it. So, for example, if Googlebot-Image opens a robots.txt file with directives for Googlebot, Bingbot and a wildcard, it will follow the disallow lines for Googlebot since that is the most specific set of instructions that could apply to Googlebot-Image.
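The "most specific user-agent" idea can be sketched in a few lines of Python. This is a simplification, not how any particular crawler is implemented: it assumes a block applies when its name is a prefix of the crawler’s name, and the function name select_group is hypothetical.

```python
def select_group(crawler_name, group_names):
    """Return the user-agent block a crawler would follow, or None."""
    name = crawler_name.lower()
    # Assumption: a block is relevant if its name is a prefix of the
    # crawler's name (e.g. "googlebot" is relevant to "googlebot-image").
    matches = [g for g in group_names if g != "*" and name.startswith(g.lower())]
    if matches:
        # The longest matching name is treated as the most specific one.
        return max(matches, key=len)
    # Fall back to the wildcard block when nothing more specific exists.
    return "*" if "*" in group_names else None


groups = ["Googlebot", "Bingbot", "*"]
print(select_group("Googlebot-Image", groups))
print(select_group("Yandex", groups))
```

With these three blocks, Googlebot-Image ends up under the Googlebot rules, while a crawler that isn’t named at all falls through to the wildcard.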

This is very important to keep in mind while writing a robots.txt file so you don’t accidentally block the wrong user-agents.

Here are the most common search engines, their user-agents and what they search for:

User-Agent            | Search Engine | Field
baiduspider           | Baidu         | General
baiduspider-image     | Baidu         | Images
baiduspider-mobile    | Baidu         | Mobile
baiduspider-news      | Baidu         | News
baiduspider-video     | Baidu         | Video
bingbot               | Bing          | General
msnbot                | Bing          | General
msnbot-media          | Bing          | Images & Video
adidxbot              | Bing          | Ads
Googlebot             | Google        | General
Googlebot-Image       | Google        | Images
Googlebot-Mobile      | Google        | Mobile
Googlebot-News        | Google        | News
Googlebot-Video       | Google        | Video
Mediapartners-Google  | Google        | AdSense
AdsBot-Google         | Google        | AdWords
slurp                 | Yahoo!        | General
yandex                | Yandex        | General

Robots.txt Disallow

The second part of robots.txt is the directive, or disallow, lines. This is the part of the code that controls what pages, folders or file types a user-agent shouldn’t crawl. These lines are usually called ‘disallow’ lines because that’s the most common directive used in robots.txt for SEO.

Technically, you don’t have to put anything in a disallow line; bots will interpret an empty disallow value to mean they’re allowed to crawl the entire site. To block your whole server, use a slash (/) in the disallow line. Otherwise, create a new line for every folder, subfolder or page you don’t want to get crawled. Robots.txt files use relative paths, so you don’t have to include your whole domain in every line. However, you have to use the canonical versions of your URLs, matching the URL structures in your sitemap.
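The difference between an empty disallow value and a lone slash is easy to check with Python’s standard urllib.robotparser. In this sketch the helper may_crawl and the bot name ExampleBot are invented for illustration.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt, url, agent="ExampleBot"):
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


open_site = "User-agent: *\nDisallow:\n"      # empty value: nothing is blocked
closed_site = "User-agent: *\nDisallow: /\n"  # slash: everything is blocked

url = "https://www.example.com/any/page.html"
print("empty Disallow allows crawl:", may_crawl(open_site, url))
print("Disallow: / allows crawl:", may_crawl(closed_site, url))
```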

Take this block of robots.txt code as an example:

User-agent: *
Disallow: /folder/subfolder/page.html
Disallow: /subfolder2/
Disallow: /folder2/

The first disallow line stops all bots (note the wildcard in the user-agent line) from crawling the page https://www.example.com/folder/subfolder/page.html. Since the command specifies the page.html file, the bots will still crawl other pages in that folder, as well as any instances of page.html in other directories. The second line, on the other hand, disallows the entire /subfolder2/ subdirectory, which means any page found in that folder shouldn’t be crawled. However, pages found in a /subfolder3/ directory could still be crawled and indexed. Finally, the third line instructs bots to skip all directories and files found within the /folder2/ directory.
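The walkthrough above can be reproduced with Python’s standard urllib.robotparser, which matches plain path prefixes. This is only a sketch: the helper blocked and the bot name ExampleBot are made up, and the extra test paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /folder/subfolder/page.html
Disallow: /subfolder2/
Disallow: /folder2/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def blocked(path):
    """True if the wildcard block stops a crawler from fetching path."""
    return not parser.can_fetch("ExampleBot", "https://www.example.com" + path)


for path in (
    "/folder/subfolder/page.html",   # named explicitly in the first line
    "/folder/subfolder/other.html",  # same folder, different file
    "/subfolder2/a.html",            # inside a disallowed folder
    "/subfolder3/a.html",            # similar name, but not listed
    "/folder2/deep/file.css",        # nested under a disallowed folder
):
    print(path, "blocked:", blocked(path))
```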

Using your robots.txt file to disallow specific files or folders is the simplest, most basic way to use it. However, you can get more precise and efficient in your code by making use of the wildcard in the disallow lines.

Remember, the asterisk works like an eight card in Crazy Eights: it can represent any string of characters. For disallow, that means you can use a wildcard as a stand-in for any file or folder name to control how bots crawl the site. Here is the wildcard in action:

User-agent: *
Disallow: /*.pdf
Disallow: /images/*.jpg
Disallow: /copies/duplicatepage*.html

The wildcard is very useful here as these commands tell all user-agents not to crawl PDFs anywhere on the site or jpeg files in the ‘images’ folder. The third line stops bots from crawling any file in the ‘copies’ folder that contains ‘duplicatepage’ and ‘.html’. So if your site uses URL parameters for analytics, remarketing or sorting, search engines won’t crawl duplicate URLs such as:

  • /copies/duplicatepage1.html
  • /copies/duplicatepage2.html
  • /copies/duplicatepage.html?parameter=1234

Note that search engine crawlers are just looking for URLs that contain the exclusion parameters. They aren’t looking for direct matches, which is why that last example would be disallowed.

In the example above, a file at ‘/copies/duplicatepage/page.html’ would also be disallowed as the wildcard would expand to become the ‘/page’ part.

Using the rules above, there could be instances of pages unintentionally matching exclusion rules, such as when an excluded file extension is used in the file name, an HTML page called ‘how-to-create-a-.pdf’ for example. Resolve this by adding a dollar sign ($) to tell search engines to exclude only URLs that end in the same way as the disallow line. So Disallow: /copies/duplicatepage*.html$ will exclude only URLs in the ‘copies’ folder that contain ‘duplicatepage’ and end in ‘.html’. A URL with a parameter after ‘.html’ would no longer match.
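Wildcard and dollar-sign matching can be approximated by translating a disallow pattern into a regular expression. The Python sketch below is a hand-written illustration, not a search engine’s actual matcher; the function name rule_matches is hypothetical, and the test paths other than the ones listed above are invented.

```python
import re


def rule_matches(pattern, path):
    """True if a robots.txt path pattern matches a URL path (query included)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Each * stands for any run of characters; everything else is literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"  # $ means the URL must end here
    # re.match anchors at the start only, so an unanchored rule acts as a prefix.
    return re.match(regex, path) is not None


print(rule_matches("/*.pdf", "/docs/guide.pdf"))
print(rule_matches("/copies/duplicatepage*.html", "/copies/duplicatepage.html?parameter=1234"))
print(rule_matches("/copies/duplicatepage*.html$", "/copies/duplicatepage.html?parameter=1234"))
```

The last two calls differ only by the trailing dollar sign, which is what stops the parameterized URL from matching.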

Non-Standard Robots.txt Directives

Disallow is the standard directive recognized by all search engine crawlers (it is the Robots Exclusion Protocol, after all). However, there are other, lesser-known directives recognized by web crawlers.

Allow

If you want to disallow an entire folder except for one page using just the disallow command, you would have to write a line for every page except the one you want crawled. Alternatively, use a disallow line to block the entire folder and then add an ‘Allow’ line specifying only the single page you want crawled. Allow works in much the same way as Disallow, meaning it goes below the User-agent line:

User-agent: *
Disallow: /folder/subfolder/
Allow: /folder/subfolder/page.html

Wildcards and matching rules work the same way with allow as with disallow. Allow is recognized by both Google and Bing.
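One common way for a crawler to settle a conflict between Allow and Disallow is to apply the most specific (longest) matching rule, with Allow winning a tie. The Python sketch below assumes that behavior and ignores wildcards; the function name is_allowed and the tuple format are invented for the example.

```python
def is_allowed(rules, path):
    """Decide whether path may be crawled; rules are (directive, prefix) pairs."""
    best_directive, best_prefix = "allow", ""  # no matching rule: crawl freely
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        longer = len(prefix) > len(best_prefix)
        tie_for_allow = len(prefix) == len(best_prefix) and directive == "allow"
        if longer or tie_for_allow:
            best_directive, best_prefix = directive, prefix
    return best_directive == "allow"


rules = [
    ("disallow", "/folder/subfolder/"),
    ("allow", "/folder/subfolder/page.html"),
]
print(is_allowed(rules, "/folder/subfolder/page.html"))
print(is_allowed(rules, "/folder/subfolder/other.html"))
```

Because the Allow path is longer than the Disallow path, the single page stays crawlable while the rest of the folder remains blocked.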

Other Commands

There are a few other non-standard directives recognized by web crawlers that you can use to further influence the way your site is crawled:

  • crawl-delay: This line uses a numerical value that specifies a number of seconds. It’s recognized by Bing and Yandex but used differently by each. Bing will wait the specified number of seconds before completing its next crawl action, while Yandex waits that number of seconds between reading the robots.txt file and actually crawling the site. This number will limit the number of pages on your site that get crawled, so it’s not really recommended unless you get almost no traffic from those sources and need to save bandwidth.
  • Host: This is only recognized by Yandex and works as a WWW resolve, telling the search engine which is the canonical version of the domain. However, since Yandex is the only search engine that uses it, it’s not recommended. Instead, set your preferred domain in Google Search Console and Bing Webmaster Tools and then set a 301 redirect to implement a WWW resolve.

Finally, while not really a command, you can use the robots.txt file to link to your XML sitemap via the Sitemap: line. This line is interpreted independently of user-agent, so add it at the start or end of the file. If you have multiple sitemaps, such as image and/or video sitemaps, include a line for each, along with a line for your sitemap index file.
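Python’s standard urllib.robotparser can read both crawl-delay and Sitemap lines, which makes for a quick check of how they are picked up. This is a sketch only: the sitemap URLs and the /tmp/ folder are placeholders, crawl_delay() needs Python 3.6 or later and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 10
Disallow: /tmp/

Sitemap: https://www.example.com/sitemap_index.xml
Sitemap: https://www.example.com/sitemap_images.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The delay belongs to the bingbot block only.
print("bingbot delay:", parser.crawl_delay("bingbot"))
print("googlebot delay:", parser.crawl_delay("googlebot"))

# Sitemap lines are collected regardless of which block they sit near.
print("sitemaps:", parser.site_maps())
```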

How Do I Use Robots.txt for SEO?

If SEO’s objective is for your site to get crawled and indexed in order to rank in search results, why would you want to block pages? The fact is, there are a few situations in which you wouldn’t want content to be crawled or appear in search results:

  • Disallowing unimportant folders or pages will help the bots use their crawl budgets more efficiently. Think about it: every second they’re not crawling your temp files is a second they can spend crawling a product page. Adding the Sitemap: line will also help search engines access your sitemap more easily and efficiently.
  • As discussed above, sometimes duplicate and/or thin content is unavoidable. Disallow those pages with your robots.txt to help your website stay on the right side of Google’s Panda algorithm.
  • Disallow user agents from search engines that operate in countries you don’t. If you don’t/can’t ship to Russia or China, it might not make sense to have Yandex and Baidu (the two most popular search engines in those countries, respectively) using bandwidth by crawling your site.
  • You have private pages you don’t want to appear in search results. Remember, though, that robots.txt files are public, so anyone can open yours and see which pages you’ve listed. Plus, robots.txt doesn’t stop direct traffic or people following links.
  • When redesigning or migrating a site, it’s a good idea to disallow the entire server until you’re ready to add redirects to your legacy site. This will prevent search engines from crawling your site before you’re ready, making it look like content copied from your old site. Incurring this ‘penalty’ at the launch of your site is not a good way to start.
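Several of the situations above come down to writing a handful of user-agent blocks, which can be generated rather than typed by hand. The Python sketch below is one possible way to do that; the function build_robots_txt, the /temp/ folder and the sitemap URL are all made up for illustration.

```python
def build_robots_txt(blocked_agents, blocked_paths, sitemap_url=None):
    """Assemble robots.txt text from a simple crawl policy."""
    lines = []
    # Crawlers you don't want at all get a block that disallows everything.
    for agent in blocked_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    # Everyone else is only kept out of the listed paths.
    lines.append("User-agent: *")
    # With no paths to block, an empty Disallow value leaves the site open.
    lines += [f"Disallow: {path}" for path in blocked_paths] or ["Disallow:"]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


print(build_robots_txt(
    blocked_agents=["yandex", "baiduspider"],
    blocked_paths=["/temp/"],
    sitemap_url="https://www.example.com/sitemap.xml",
))
```

For a migration, calling it with no blocked agents and blocked_paths=["/"] produces the "disallow the entire server" file described in the last bullet.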

When using your robots.txt file during a site migration, be extra sure that you update the file when setting your new site live. This is a common mistake and one of the first things you should look at when trying to diagnose a loss of search traffic and/or ranking drop.

Before uploading your robots.txt file, run it through Google’s robots.txt Tester in Google Search Console. To test your file, copy and paste your code into the tester; syntax and logic errors will be highlighted immediately. Once you fix those, test individual URLs you know should be blocked and allowed to see if your robots.txt is correct.

Note that, naturally, Google’s robots.txt tester only applies to Googlebot. To verify that your file works for Bing, use the Fetch as Bingbot feature in Bing Webmaster Tools.
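Alongside the search engines’ own testers, a quick local check can catch obvious mistakes before you upload the file. The Python sketch below uses the standard urllib.robotparser and a table of URLs you expect to be allowed or blocked. It only covers plain path rules, and the function find_mismatches, the /private/ folder and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def find_mismatches(robots_txt, agent, expectations):
    """Return (url, expected, actual) for each URL that behaves unexpectedly."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    mismatches = []
    for url, expected in expectations.items():
        actual = parser.can_fetch(agent, url)
        if actual != expected:
            mismatches.append((url, expected, actual))
    return mismatches


robots_txt = "User-agent: *\nDisallow: /private/\n"
expectations = {
    "https://www.example.com/": True,                  # should stay crawlable
    "https://www.example.com/private/a.html": False,   # should be blocked
}
print(find_mismatches(robots_txt, "Googlebot", expectations))
```

An empty list means every URL behaved as expected; anything else points to a rule worth rereading.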

