How to Configure a Website's robots.txt File

Search engines crawl web pages with web spiders (crawlers) and display the content in relevant search results. However, a site may contain content that we do not want search engines to crawl and index.

Through the robots.txt file, we can declare which directories or pages search engine spiders are allowed or forbidden to crawl, thereby limiting the scope of what search engines index.

This article explains how to configure and use the robots.txt file for a website, and how to write its rules.

What is robots.txt?

robots.txt is a plain text file stored in the root directory of a website. It tells web spiders which content on the site may be crawled and which may not.

When a search engine's spider visits a website, it first checks the site's robots.txt file to determine which parts of the site it is allowed to crawl.

Note that robots.txt is a convention, not an enforced standard, and some crawlers ignore it, so it cannot guarantee that content will or will not be crawled.
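As a minimal illustration, the simplest robots.txt allows every spider to crawl everything:

User-agent: *
Disallow:

An empty Disallow value blocks nothing, so this file imposes no restrictions at all.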

Rules for writing the robots.txt file

# Format and placement

  • The file name must be robots.txt (all lowercase);
  • The file must be a plain text file encoded in UTF-8;
  • It must be placed in the root directory of the website so that it can be accessed at http://www.example.com/robots.txt;
  • There can be only one robots.txt file per site;
  • It applies only to the host, protocol, and port it is served from; a subdomain such as http://blog.example.com needs its own robots.txt;
  • Lines beginning with # are comments;
  • Be careful to use English (halfwidth) characters and punctuation.
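Putting these format rules together, a small commented file might look like this (the path is a placeholder; everything after a # is ignored by spiders):

# Block the staging area for all spiders
User-agent: *
Disallow: /staging/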

# Directive syntax

  • Each record takes the form field: value, for example Sitemap: https://example.com/sitemap.xml;
  • User-agent: specifies, by name, the crawler (web spider) the following rules apply to; the wildcard * matches all crawlers;
  • Disallow: specifies a directory or page that must not be crawled; an empty value means every page may be crawled;
  • Allow: specifies a directory or page that may be crawled;
  • Sitemap: gives the location of the sitemap, which must be an absolute URL;
  • *: a wildcard that matches any sequence of characters;
  • $: matches the end of the URL;
  • /: matches the root directory and every URL below it.
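Combining these directives, a sketch of a complete file might look like this (all paths and the example.com domain are placeholders):

# Rules for all spiders
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Disallow: /*.pdf$

Sitemap: https://example.com/sitemap.xml

Here /admin/ is blocked, its public/ subdirectory is re-allowed, and /*.pdf$ blocks any URL ending in .pdf.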

# Prevent Google from crawling all content under the news tag of the website

User-agent: Googlebot
Disallow: /tag/news
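The same pattern works with wildcards; for example, an illustrative rule that stops Googlebot-Image from crawling any JPEG file on the site:

User-agent: Googlebot-Image
Disallow: /*.jpg$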

Practical robots.txt examples and notes

If the website does not have a robots.txt file, you can create one manually and upload it to the root directory of the site; even if no pages need to be blocked from search engines, it is recommended to add an empty robots.txt file.

Pay close attention to the difference between "only allow", "allow", and "disallow" in the examples below!
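A few common patterns (bot names and paths are illustrative):

# Prohibit all spiders from crawling the entire site
User-agent: *
Disallow: /

# Only allow Googlebot: every other spider is prohibited
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

# Prohibit a directory but allow one page inside it
User-agent: *
Disallow: /private/
Allow: /private/readme.html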

Path matching examples: see the URL matching table in the robots.txt documentation on Google Developers.

Tips

# Common search engine spider (robot) names

  • Google spiders: Googlebot, Googlebot-Mobile, Googlebot-Image;
  • Baidu spiders: Baiduspider, Baiduspider-mobile, Baiduspider-image;
  • Sogou spiders: Sogou web spider, Sogou inst spider, Sogou spider2, Sogou blog, Sogou News Spider, Sogou Orion spider;
  • Bing spider: bingbot;
  • 360 spider: 360Spider;
  • Youdao spider: YoudaoBot;
  • Yahoo spider: Slurp;
  • Yandex spider: Yandex.
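These names go in the User-agent field, one group per spider; for example, giving Baiduspider and Googlebot different rules (paths are placeholders):

User-agent: Baiduspider
Disallow: /en/

User-agent: Googlebot
Disallow: /zh/

User-agent: *
Disallow: /drafts/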