How to Configure a Website's robots.txt File

Search engines crawl web pages with web spiders (crawlers) and display the content in relevant search results. However, a site may contain content that we do not want search engines to crawl and index.

Through the robots.txt file, we can declare which directories or pages search engine spiders are allowed or forbidden to crawl, thereby limiting the scope of what search engines index.

This article explains how to configure and use the robots.txt file for a website, and how to write its rules.

What is robots.txt?

robots.txt is a plain text file stored in the root directory of a website. It tells web spiders which content on the site may be crawled and which may not.

When a search engine's spider visits a website, it first checks the site's robots.txt file to determine which parts of the site it is allowed to crawl.

Note that robots.txt is a convention, not an enforced standard, and some crawlers ignore it, so it cannot guarantee that content will or will not be crawled.
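As a minimal illustration, the simplest robots.txt allows every spider to crawl everything:

User-agent: *
Disallow:

An empty Disallow value blocks nothing, so this file imposes no restrictions at all.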

Rules for writing the robots.txt file

# Format and placement

  • The file name must be robots.txt (all lowercase);
  • The file must be a plain text file encoded in UTF-8;
  • It must be placed in the root directory of the website so that it can be accessed at http://www.example.com/robots.txt;
  • There can be only one robots.txt file per site;
  • It applies only to the host, protocol, and port it is served from; a subdomain such as http://blog.example.com needs its own robots.txt;
  • Lines beginning with # are comments;
  • Be careful to use English (halfwidth) characters and punctuation.
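Putting these format rules together, a small commented file might look like this (the path is a placeholder; everything after a # is ignored by spiders):

# Block the staging area for all spiders
User-agent: *
Disallow: /staging/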

# Directive syntax

  • Each record takes the form field: value, for example Sitemap: https://example.com/sitemap.xml;
  • User-agent: specifies, by name, the crawler (web spider) the following rules apply to; the wildcard * matches all crawlers;
  • Disallow: specifies a directory or page that must not be crawled; an empty value means every page may be crawled;
  • Allow: specifies a directory or page that may be crawled;
  • Sitemap: gives the location of the sitemap, which must be an absolute URL;
  • *: a wildcard that matches any sequence of characters;
  • $: matches the end of the URL;
  • /: matches the root directory and every URL below it.
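Combining these directives, a sketch of a complete file might look like this (all paths and the example.com domain are placeholders):

# Rules for all spiders
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Disallow: /*.pdf$

Sitemap: https://example.com/sitemap.xml

Here /admin/ is blocked, its public/ subdirectory is re-allowed, and /*.pdf$ blocks any URL ending in .pdf.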

# Prevent Google from crawling all content under the news tag of the website

User-agent: Googlebot
Disallow: /tag/news
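The same pattern works with wildcards; for example, an illustrative rule that stops Googlebot-Image from crawling any JPEG file on the site:

User-agent: Googlebot-Image
Disallow: /*.jpg$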

Practical robots.txt examples and notes

If the website does not have a robots.txt file, you can create one manually and upload it to the root directory of the site; even if no pages need to be blocked from search engines, it is recommended to add an empty robots.txt file.

Pay close attention to the difference between "only allow", "allow", and "disallow" in the examples below!
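A few common patterns (bot names and paths are illustrative):

# Prohibit all spiders from crawling the entire site
User-agent: *
Disallow: /

# Only allow Googlebot: every other spider is prohibited
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

# Prohibit a directory but allow one page inside it
User-agent: *
Disallow: /private/
Allow: /private/readme.html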

Path matching examples: see the URL matching table in the robots.txt documentation on Google Developers.

Tips

# Common search engine spider (robot) names

  • Google spiders: Googlebot, Googlebot-Mobile, Googlebot-Image;
  • Baidu spiders: Baiduspider, Baiduspider-mobile, Baiduspider-image;
  • Sogou spiders: Sogou web spider, Sogou inst spider, Sogou spider2, Sogou blog, Sogou News Spider, Sogou Orion spider;
  • Bing spider: bingbot;
  • 360 spider: 360Spider;
  • Youdao spider: YoudaoBot;
  • Yahoo spider: Slurp;
  • Yandex spider: Yandex.
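These names go in the User-agent field, one group per spider; for example, giving Baiduspider and Googlebot different rules (paths are placeholders):

User-agent: Baiduspider
Disallow: /en/

User-agent: Googlebot
Disallow: /zh/

User-agent: *
Disallow: /drafts/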