How to compose a robots.txt file correctly. How to edit your robots.txt file

Robots.txt is a text file that contains site indexing parameters for search engine robots.

Recommendations on the content of the file

Yandex supports the following directives:

Directive - What it does
User-agent * - Indicates the robot to which the rules listed in robots.txt apply.
Disallow - Prohibits indexing of site sections or individual pages.
Sitemap - Specifies the path to the Sitemap file that is posted on the site.
Clean-param - Indicates to the robot that the page URL contains parameters (like UTM tags) that should be ignored when indexing it.
Allow - Allows indexing of site sections or individual pages.
Crawl-delay - Specifies the minimum interval (in seconds) for the search robot to wait after loading one page, before starting to load another. We recommend using the crawl speed setting in Yandex.Webmaster instead of this directive.

* Mandatory directive.

You "ll most often need the Disallow, Sitemap, and Clean-param directives. For example:

User-agent: * # specify the robots that the directives are set for
Disallow: /bin/ # disables links from the shopping cart
Disallow: /search/ # disables links to pages of the search embedded on the site
Disallow: /admin/ # disables links from the admin panel
Sitemap: http://example.com/sitemap # specify for the robot the sitemap file of the site
Clean-param: ref /some_dir/get_book.pl

Robots from other search engines and services may interpret the directives in a different way.

Note. The robot takes into account the case of substrings (file names or paths, robot names) but ignores the case of directive names.
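To illustrate that note, here is a small sketch with hypothetical paths: the two rules target different URLs because path case matters, while the lowercase directive name on the second line is still recognized, since directive names are case-insensitive.

Disallow: /Catalog # blocks URLs starting with /Catalog, but not /catalog
disallow: /catalog # directive name case is ignored, so this still works as Disallow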

Using Cyrillic characters

The use of the Cyrillic alphabet is not allowed in the robots.txt file and server HTTP headers.

For domain names, use Punycode. For page addresses, use the same encoding as that of the current site structure.
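For example, here is a sketch using the placeholder Cyrillic domain пример.рф, whose Punycode form is xn--e1afmkfd.xn--p1ai:

# Incorrect:
Sitemap: http://пример.рф/sitemap.xml

# Correct:
Sitemap: http://xn--e1afmkfd.xn--p1ai/sitemap.xml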

Good afternoon, dear friends! You all know that search engine optimization is a responsible and delicate business. You need to take into account absolutely every little thing in order to get an acceptable result.

Today we'll talk about robots.txt, a file that every webmaster is familiar with. It is where all the most basic instructions for search robots are written. As a rule, robots gladly follow these instructions, but if the file is composed incorrectly, they may refuse to index the web resource. Next, I'll show you how to compose a correct robots.txt and how to set it up.

In the preface I have already described what it is. Now I'll tell you why you need it. Robots.txt is a small text file that is stored at the root of the site. It is used by search engines. It clearly spells out the rules of indexing, that is, which sections of the site need to be indexed (added to the search) and which do not.

Usually, the technical sections of the site are closed from indexing. Occasionally, non-unique pages are also blacklisted (a copy-pasted privacy policy is a typical example). Here, robots are "told" how to work with the sections that do need to be indexed. Very often, rules are prescribed separately for several robots. We will talk about this further.

With proper robots.txt settings, your site is guaranteed to grow in search engine rankings. Robots will only consider useful content, leaving out duplicated or technical sections.

Building robots.txt

To create the file, you just need to use a standard text editor of your operating system and then upload it to the server via FTP. Where it lives on the server is easy to guess: at the root. This folder is usually called public_html.

You can easily get into it using any FTP client or the built-in file manager. Naturally, we will not upload an empty robots file to the server. We will add several basic directives (rules) there.

User-agent: *
Allow: /

By using these lines in your robots file, you address all robots (the User-agent directive) and allow them to index your site fully, including absolutely all pages (Allow: /).

Of course, this option is not particularly suitable for us. The file will not be particularly useful for search engine optimization. It definitely needs to be properly tuned. But before that, we'll go over all the basic directives and robots.txt values.

Directives

User-agent - One of the most important directives, because it indicates which robots should follow the rules listed after it. The rules are taken into account until the next User-agent in the file.
Allow - Allows indexing of any blocks of the resource. For example: "/" or "/tag/".
Disallow - On the contrary, prohibits indexing of sections.
Sitemap - The path to the sitemap (in xml format).
Host - The main mirror (with or without www, or if you have multiple domains). The secure protocol https (if available) is also indicated here. If you have standard http, you do not need to specify it.
Crawl-delay - With its help, you can set the interval at which robots visit and download files of your site. Helps reduce host load.
Clean-param - Allows you to disable indexing of parameters on certain pages (like www.site.com/cat/state?admin_id8883278). Unlike the previous directives, 2 values are specified here (the address and the parameter itself).

These are all the rules that are supported by the flagship search engines. It is with their help that we will create our robots, operating with various variations for the most different types of sites.

Customization

To correctly configure the robots file, we need to know exactly which sections of the site should be indexed and which should not. In the case of a simple one-page site in HTML + CSS, we just need to write a few basic directives, such as:

User-agent: *
Allow: /
Sitemap: site.ru/sitemap.xml
Host: www.site.ru

Here we have specified the rules and values for all search engines. But it's better to add separate directives for Google and Yandex. It will look like this:

User-agent: *
Allow: /

User-agent: Yandex
Allow: /
Disallow: /politika

User-agent: GoogleBot
Allow: /
Disallow: /tags/

Sitemap: site.ru/sitemap.xml
Host: site.ru

Now absolutely all files will be indexed on our html site. If we want to exclude a page or picture, then we need to specify a relative link to this fragment in Disallow.
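A minimal sketch of such exclusions (the page and picture paths below are hypothetical):

User-agent: *
Allow: /
Disallow: /old-page.html # exclude a single page
Disallow: /images/old-picture.jpg # exclude a single picture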

You can use automatic robots.txt generation services. I do not guarantee that with their help you will create a perfectly correct version, but you can try them as a guide. Personally, I do not recommend this option, because it is much easier to write the file manually and tune it for your platform.

Speaking of platforms, I mean all kinds of CMSs, frameworks, SaaS systems, and much more. Next, we will talk about how to customize the WordPress and Joomla robots file.

But before that, let's highlight several universal rules that can be followed when creating and configuring robots for almost any site:

Close from indexing (Disallow):

  • site admin panel;
  • personal account and registration / authorization pages;
  • shopping cart, data from order forms (for an online store);
  • cgi folder (located on the host);
  • service sections;
  • ajax and json scripts;
  • UTM and Openstat tags;
  • various parameters.

Open (Allow):

  • Pictures;
  • JS and CSS files;
  • other elements that should be taken into account by search engines.

In addition, at the end, do not forget to specify the sitemap (path to the sitemap) and host (main mirror) data.
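Put together, such a universal starting point might look roughly like this. Every path below is a placeholder that you would adapt to your own site structure; the UTM and Openstat masks follow the pattern used later in this article.

User-agent: *
Disallow: /admin/ # site admin panel
Disallow: /cgi-bin # cgi folder
Disallow: /cart/ # shopping cart and order forms
Disallow: *utm= # UTM tags
Disallow: *openstat= # Openstat tags
Allow: /*.js # JS files
Allow: /*.css # CSS files
Allow: /images/ # pictures

Sitemap: http://site.ru/sitemap.xml
Host: site.ru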

Robots.txt for WordPress

To create the file, we need to upload robots.txt to the site root in the same way. Its contents can then be changed using the same FTP clients and file managers.

There is also a more convenient option - create a file using plugins. In particular, Yoast SEO has such a feature. It's much more convenient to edit robots directly from the admin area, so I myself use this method of working with robots.txt.

How you decide to create this file is up to you, it is more important for us to understand which directives should be there. On my WordPress sites I use this option:

User-agent: * # rules for all robots, except for Google and Yandex

Disallow: /cgi-bin # folder with scripts
Disallow: /? # query parameters on the home page
Disallow: /wp- # files of the CMS itself (with the wp- prefix)
Disallow: *?s= # \
Disallow: *&s= # everything related to search
Disallow: /search/ # /
Disallow: /author/ # author archives
Disallow: /users/ # and users
Disallow: */trackback # WP notifications that someone is linking to you
Disallow: */feed # feed in xml
Disallow: */rss # and rss
Disallow: */embed # embedded elements
Disallow: /xmlrpc.php # WordPress API
Disallow: *utm= # UTM tags
Disallow: *openstat= # Openstat tags
Disallow: /tag/ # tags (if available)
Allow: */uploads # open uploads (pictures, etc.)

User-agent: GoogleBot # for Google
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: /xmlrpc.php
Disallow: *utm=
Disallow: *openstat=
Disallow: /tag/
Allow: */uploads
Allow: /*/*.js # open JS files
Allow: /*/*.css # and CSS
Allow: /wp-*.png # and pictures in png format
Allow: /wp-*.jpg # \
Allow: /wp-*.jpeg # and in other formats
Allow: /wp-*.gif # /
Allow: /wp-admin/admin-ajax.php # works with plugins

User-agent: Yandex # for Yandex
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: /xmlrpc.php
Disallow: /tag/
Allow: */uploads
Allow: /*/*.js
Allow: /*/*.css
Allow: /wp-*.png
Allow: /wp-*.jpg
Allow: /wp-*.jpeg
Allow: /wp-*.gif
Allow: /wp-admin/admin-ajax.php
Clean-Param: utm_source&utm_medium&utm_campaign # clean UTM tags
Clean-Param: openstat # and don't forget about Openstat

Sitemap: # write the path to your sitemap here
Host: https://site.ru # main mirror

Attention! When copying lines to a file, do not forget to remove all comments (text after the #).

This robots.txt option is the most popular among WP webmasters. Is it perfect? No. You can try to add or remove something. But keep in mind that mistakes are common when editing robots.txt in a text editor. We will talk about them further.

Robots.txt for Joomla

And although in 2018 Joomla is rarely used by anyone, I believe that this wonderful CMS cannot be ignored. When promoting projects on Joomla, you will certainly have to create a robots file; otherwise, how else would you close unnecessary elements from indexing?

As in the previous case, you can create the file manually by simply uploading it to the host, or you can use a module for this purpose. In either case, you will have to configure it properly. This is what a correct version for Joomla looks like:

User-agent: *
Allow: /*.css?*$
Allow: /*.js?*$
Allow: /*.jpg?*$
Allow: /*.png?*$
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/

Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

User-agent: Yandex
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/

Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

User-agent: GoogleBot
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/

Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

Host: site.ru # don't forget to change the address here to yours
Sitemap: site.ru/sitemap.xml # and here

As a rule, this is enough to prevent unnecessary files from ending up in the index.

Configuration errors

Very often people make mistakes when creating and configuring a robots.txt file. Here are the most common ones:

  • The rules are specified only for User-agent.
  • Host and Sitemap are missing.
  • The presence of the http protocol in the Host directive (you only need to specify https).
  • Failure to comply with the nesting rules when opening/closing images.
  • UTM and Openstat tags are not closed.
  • Prescribing host and sitemap directives for each robot.
  • Only a superficial study of the file.

It is very important to configure this little file correctly. If you make gross mistakes, you can lose a significant part of the traffic, so be extremely careful when setting up.

How do I check a file?

For this purpose, it is better to use the special services from Yandex and Google, since these search engines are the most popular and in demand (and most often the only ones used); there is no point in considering search engines such as Bing, Yahoo, or Rambler.

To begin with, consider the option with Yandex. Go to Yandex.Webmaster, then open Tools - Robots.txt analysis.

Here you can check the file for errors, as well as check in real time which pages are open for indexing and which are not. Very convenient.

Google has exactly the same service. Go to Search Console, find the Crawl tab, and select the robots.txt Tester tool.

It has exactly the same functions as the Yandex service.

Please note that it shows me 2 errors. This is because Google does not recognize the parameter-clearing directives that I specified for Yandex:

Clean-Param: utm_source&utm_medium&utm_campaign
Clean-Param: openstat

You shouldn't pay attention to this, since Google robots only use the rules for GoogleBot.

Conclusion

The robots.txt file is very important for your website's SEO optimization. Approach its setup with all responsibility, because if implemented incorrectly, everything can go to dust.

Consider all the instructions I've shared in this article, and don't forget that you don't have to copy my robots exactly. It is quite possible that you will have to additionally understand each of the directives, adjusting the file for your specific case.

And if you want to dig deeper into robots.txt and WordPress website building, then I invite you to. On it you will learn how you can easily create a website, without forgetting to optimize it for search engines.

Robots.txt is a text file located at the root of the site (http://site.ru/robots.txt). Its main purpose is to set certain directives for search engines: what to do on the site and when.

Simplest Robots.txt

The simplest robots.txt, which allows all search engines to index everything, looks like this:

User-agent: *
Disallow:

If the Disallow directive does not have a slash at the end, then all pages are allowed for indexing.

This directive completely prohibits the site from being indexed:

User-agent: *
Disallow: /

User-agent indicates who the directives are intended for; an asterisk means they apply to all search engines, while for Yandex you specify User-agent: Yandex.

Yandex Help says that its search robots process User-agent: *, but if User-agent: Yandex is present, User-agent: * is ignored.
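Here is a short sketch of how this works (the /private/ path is hypothetical): all robots except Yandex skip /private/, while the Yandex robot reads only its own section and therefore indexes everything.

User-agent: *
Disallow: /private/

User-agent: Yandex
Disallow: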

Disallow and Allow directives

There are two main directives:

Disallow - deny

Allow - allow

Example: on a blog, we prohibit indexing of the /wp-content/ folder, where plugin files, the template, etc. are located. But it also contains images that must be indexed by search engines to participate in image search. To do this, you need to use the following scheme:

User-agent: *
Allow: /wp-content/uploads/ # Allow indexing of images in the uploads folder
Disallow: /wp-content/

The order in which directives are used is important for Yandex if they apply to the same pages or folders. If you specify like this:

User-agent: *
Disallow: /wp-content/
Allow: /wp-content/uploads/

Images from the /uploads/ directory will not be downloaded by the Yandex robot, because the first directive is executed, and it denies all access to the wp-content folder.

Google is simpler and follows all the directives in the robots.txt file, regardless of their location.

Also, do not forget that directives with and without a slash perform a different role:

Disallow: /about denies access to the entire site.ru/about/ directory, and pages whose URLs contain about (site.ru/about.html, site.ru/aboutlive.html, etc.) will not be indexed either.

Disallow: /about/ prohibits robots from indexing pages in the site.ru/about/ directory, while pages like site.ru/about.html will remain available for indexing.

Regular expressions in robots.txt

Two characters are supported, these are:

* - stands for any sequence of characters.

Example:

Disallow: /about* will deny access to all pages that contain about; in principle, such a directive works the same way without the asterisk. But in some cases this expression is irreplaceable. For example, if one category contains pages with .html at the end and without it, and we want to close all pages that contain html from indexing, we write the following directive:

Disallow: /about/*.html

Now the page site.ru/about/live.html is closed from indexing, while the page site.ru/about/live is open.

Another example by analogy:

User-agent: Yandex
Allow: /about/*.html # allow indexing
Disallow: /about/

All pages will be closed, except for pages that end with .html

$ - trims the rest and marks the end of the line.

Example:

Disallow: /about - this robots.txt directive prohibits indexing of all pages that start with about, including pages in the /about/ directory.

By adding a dollar sign at the end - Disallow: /about$ - we tell the robots that only the /about page must not be indexed, while the /about/ directory, /aboutlive pages, etc. can still be indexed.

Sitemap directive

This directive specifies the path to the Sitemap, as follows:

Sitemap: http://site.ru/sitemap.xml

Host directive

Indicated in this form:

Host: site.ru

No http://, slashes, and the like. If the main mirror of your site is with www, then write:

Host: www.site.ru

Bitrix robots.txt example

User-agent: *
Disallow: /*index.php$
Disallow: /bitrix/
Disallow: /auth/
Disallow: /personal/
Disallow: /upload/
Disallow: /search/
Disallow: /*/search/
Disallow: /*/slide_show/
Disallow: /*/gallery/*order=*
Disallow: /*?*
Disallow: /*&print=
Disallow: /*register=
Disallow: /*forgot_password=
Disallow: /*change_password=
Disallow: /*login=
Disallow: /*logout=
Disallow: /*auth=
Disallow: /*action=*
Disallow: /*bitrix_*=
Disallow: /*backurl=*
Disallow: /*BACKURL=*
Disallow: /*back_url=*
Disallow: /*BACK_URL=*
Disallow: /*back_url_admin=*
Disallow: /*print_course=Y
Disallow: /*COURSE_ID=
Disallow: /*PAGEN_*
Disallow: /*PAGE_*
Disallow: /*SHOWALL
Disallow: /*show_all=
Host: sitename.ru
Sitemap: https://www.sitename.ru/sitemap.xml

A robots.txt example for WordPress

After all the necessary directives described above have been added, you should get something like this robots file:
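As a rough sketch only (using standard WordPress technical directories and yoursite.ru as a placeholder domain), such a basic file might look like this:

User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /search
Disallow: */trackback
Disallow: */feed

User-agent: Yandex
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /search
Disallow: */trackback
Disallow: */feed
Host: www.yoursite.ru

Sitemap: http://yoursite.ru/sitemap.xml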

This is the basic version of robots.txt for WordPress, so to speak. There are two User-agents: one for everyone and a second for Yandex, where the Host directive is specified.

Meta robots tags

It is possible to block a page or a site from indexing not only with the robots.txt file, but also with a meta tag.

<meta name="robots" content="noindex, nofollow">

It must be placed inside the <head> tag, and this meta tag will prohibit indexing of the site. There are WordPress plugins that let you set such meta tags, for example Platinum SEO Pack. With it, you can close any page from indexing; it uses meta tags.

Crawl-delay directive

Using this directive, you can set the pause that the search bot should take between downloading pages of the site.

User-agent: *
Crawl-delay: 5

The timeout between loading two pages will be 5 seconds. To reduce the load on the server, a value of 15-20 seconds is usually set. This directive is needed for large, frequently updated sites on which search bots practically "live".

For regular sites and blogs, this directive is not needed, but it can be used to limit the behavior of other, less relevant search robots (Rambler, Yahoo, Bing, etc.). After all, they also visit and index the site, thereby creating load on the server.
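For instance, a minimal sketch that slows down only Bing's crawler (its User-agent name is bingbot) while leaving other robots untouched:

User-agent: bingbot
Crawl-delay: 20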

Correct, competent configuration of the root robots.txt file is one of the most important tasks of the webmaster. In the case of unforgivable mistakes, many unnecessary pages of the site may appear in the search results. Or, on the contrary, important documents of your site may be closed to indexing; in the worst case, you can close the entire root directory of the domain to search robots.

Setting up your robots.txt file correctly with your own hands is actually not a very difficult task. After reading this article, you will know the intricacies of the directives and be able to write the rules for the robots.txt file on your site yourself.

A specific, but not complex syntax is used to create a robots.txt file. There are not many directives used. Let's look at the rules, structure and syntax of the robots.txt file step by step and in detail.

General robots.txt rules

First, the robots.txt file itself must be ANSI encoded.

Secondly, you cannot use any national alphabets to write the rules, only the Latin alphabet is possible.

Structurally, a robots.txt file can consist of one or several blocks of instructions, separately for robots from different search engines. Each block or section has a set of rules (directives) for indexing the site by a particular search engine.

No extra headers or symbols are allowed in the directives themselves, within the blocks of rules, or between them.

Directives and rule blocks are separated by line breaks. The only exception is comments.

Robots.txt commenting

The '#' symbol is used for commenting. Everything written after the hash symbol, up to the end of the line, is ignored by search robots.

User-agent: *
Disallow: /css # write a comment
# Write another comment
Disallow: /img

Sections in robots.txt file

When a robot reads the file, it uses only the section addressed to that search engine's robot. That is, if a User-agent section contains the name of the Yandex search engine, its robot will read only the section addressed to it, ignoring the others, in particular the section with the directive for all robots, User-agent: *.

Each section is independent. There can be several sections, for the robots of each or some of the search engines, or one universal section for all robots or for the robots of one particular system. If there is only one section, it starts from the first line of the file and occupies all lines. If there are several sections, they must be separated by at least one empty line.

A section always starts with the User-agent directive, which contains the name of the search engine whose robots it is intended for, unless it is a universal section for all robots. In practice, it looks like this:

User-agent: YandexBot
# user agent for robots of the Yandex system
User-agent: *
# user agent for all robots

It is forbidden to list multiple bot names in one line. A separate section, with its own block of rules, is created for the bots of each search engine. If in your case the rules for all robots are the same, use one universal, common section.

What are the directives?

A directive is a command or rule that informs a search robot of certain information. The directive tells the search bot how to index your site, which directories not to view, where the XML sitemap is located, which domain name is the main mirror, and some other technical details.

A robots.txt section consists of separate commands, i.e. directives. The general syntax of a directive is as follows:

[DirectiveName]: [optional space] [value] [optional space]

A directive is written on one line, without wrapping. According to accepted standards, blank lines are not allowed between the directives of one section; that is, all directives of one section are written on consecutive lines, without additional line gaps.

Let's describe the meanings of the main directives used.

Disallow directive

The most used directive in the robots.txt file is "Disallow". The "Disallow" directive prohibits indexing of the path specified in it. It can be a separate page, pages containing the specified "mask" in their URL (path), a part of the site, a separate directory (folder) or the entire site.

"*" - an asterisk means - "any number of characters". That is, the / folder * path is the same as “/ folders”, “/ folder1”, “/ folder111”, “/ foldersssss” or “/ folder”. Robots, when reading the rules, automatically add the "*" sign. In the example below, both directives are absolutely equivalent:

Disallow: / news
Disallow: / news *

"$" - the dollar sign prohibits robots from automatically appending the "*" character when reading directives(asterisk) at the end of the directive. In other words, the "$" character denotes the end of the comparison string. That is, in our example we prohibit indexing of the “/ folder” folder, but we do not prohibit it in the “/ folder1”, “/ folder111” or “/ foldersssss” folders:

User-agent: *
Disallow: / folder $

"#" - (sharp) comment mark... Everything that is written after this icon, on the same line with it, is ignored by search engines.

Allow directive

The Allow directive of the robots.txt file is the opposite of the Disallow directive: Allow permits. The example below shows that we prohibit indexing of the entire site except for the /folder directory:

User-agent: *
Allow: /folder
Disallow: /

An example of the simultaneous use of "Allow", "Disallow" and priority

When specifying directives, do not forget about how the priority of prohibitions and permissions is determined. Previously, priority was determined by the order in which bans and permissions were declared. Now priority is determined by the longest matching path within one block for the search engine robot (User-agent), in order of increasing path length: the longer the path, the higher its priority:

User-agent: *
Allow: /folders
Disallow: /folder

In the above example, indexing of URLs starting with "/folders" is allowed, but it is prohibited for paths that contain "/folder", "/folderssss" or "/folder2" in their URLs. If the same path falls under both the "Allow" and "Disallow" directives, preference is given to "Allow".

Empty parameter value in the directives "Allow" and "Disallow"

Webmasters sometimes make the mistake of forgetting to include the "/" symbol in the "Disallow" directive of the robots.txt file. This is an incorrect, erroneous interpretation of the meaning of the directives and their syntax. As a result, the prohibiting directive becomes permissive: "Disallow:" is absolutely identical to "Allow: /". The correct prohibition on indexing the entire site looks like this:

User-agent: *
Disallow: /

The same can be said for "Allow:". The "Allow:" directive without the "/" character prohibits indexing of the entire site, just like "Disallow: /".

Sitemap directive

By all the canons of SEO optimization, you must use a sitemap (SITEMAP) in XML format and provide it to search engines.

Despite the "webmaster tools" functionality offered by search engines, it is necessary to declare the presence of sitemap.xml in robots.txt using the "Sitemap" directive. When crawling your site, search robots will see the reference to the sitemap.xml file and will be sure to use it in their next crawls. An example of using the Sitemap directive in a robots.txt file:

User-agent: *
Sitemap: https://www.domainname.zone/sitemap.xml

Host directive

Another important robots.txt directive is HOST.

It is believed that not all search engines recognize it. But Yandex states that it reads this directive, and since Yandex is the main "search provider" in Russia, we will not ignore the "Host" directive.

This directive tells search engines which domain is the main mirror. We all know that a site can have multiple addresses. The site URL may or may not use the WWW prefix, or the site may have several domain names, for example, domain.ru, domain.com, domen.ru, www.domen.ru. It is in such cases that we tell the search engine in the robots.txt file using the host directive which of these names is the main one. The value of the directive is the name of the main mirror itself. Let's give an example. We have several domain names (domain.ru, domain.com, domen.ru, www.domen.ru) and they all redirect visitors to the site www.domen.ru, an entry in the robots.txt file will look like this:

User-agent: *
Host: www.domen.ru

If you want your main mirror to be without a prefix (WWW), then, accordingly, you should specify the site name without a prefix in the directive.
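For example, mirroring the example just above, but for a main mirror without the www prefix:

User-agent: *
Host: domen.ru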

The HOST directive solves the problem of duplicate pages, which webmasters and SEO specialists very often face. Therefore, the HOST directive must be used if you are targeting the Russian-speaking segment and it is important for you to rank your site in the Yandex search engine. To repeat, today only Yandex declares that it reads this directive. To specify the main mirror in other search engines, you must use the settings in their webmaster tools. Do not forget that the name of the main mirror must be specified correctly (correct spelling, adherence to the encoding and the syntax of the robots.txt file). This directive is allowed only once in a file. If you specify it several times by mistake, the robots will only take into account the first occurrence.

Crawl-delay directive

This is a technical directive; it tells search robots how often they should visit your site. More precisely, the Crawl-delay directive specifies the minimum interval between visits to your site by robots (search engine crawlers). Why specify this rule? If robots visit you very often while new information appears on the site much less often, then over time search engines will get used to the rare changes and will visit you much less often than you would like. This is the search-related argument for using the "Crawl-delay" directive. Now the technical argument: too frequent visits by robots create additional load on the server, which you do not need at all. It is better to specify an integer as the directive value, but some robots have now learned to read fractional numbers as well. The time is indicated in seconds, for example:

User-agent: Yandex
Crawl-delay: 5.5

Clean-param directive

The optional "Clean-param" directive instructs crawlers on the site address parameters that do not need to be indexed and should be treated as the same URL. For example, you have the same pages displayed at different addresses that differ in one or more parameters:

www.domain.zone/folder/page/
www.domain.zone/index.php?folder=folder&page=page1/
www.domain.zone/index.php?folder=1&page=1

Search bots will crawl all such pages and notice that the pages are the same and contain the same content. First, this creates confusion in the site structure during indexing. Second, the additional load on the server grows. Third, the crawling speed drops noticeably. To avoid these troubles, the "Clean-param" directive is used. Its syntax is as follows:

Clean-param: param1[&param2&param3&param4& ... &paramN] [path]

The “Clean-param” directive, like “Host”, is not read by all search engines. But Yandex understands it.
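As a sketch tied to the example URLs above, the following would tell Yandex to ignore the folder and page parameters on /index.php (the parameter and path names are taken from that example):

User-agent: Yandex
Clean-param: folder&page /index.php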

Errors common in robots.txt

The robots.txt file is not at the root of the site

Robots.txt should be located at the root of the site, only in the root directory. All other files with the same name, but located in other folders (directories), are ignored by search engines.

Robots.txt filename error

The file name is written in small (lowercase) letters and must be robots.txt. All other variants are considered erroneous, and search engines will report that the file is missing. Common mistakes look like this:

ROBOTS.txt
Robots.txt
robot.txt

Using invalid characters in robots.txt

The robots.txt file must be ANSI encoded and contain only Latin characters. Writing directives and their values in any other national characters is unacceptable, except in the content of comments.

Robots.txt syntax errors

Strictly follow the syntax rules in your robots.txt file. Syntax errors can cause search engines to ignore the contents of the entire file.

Listing several robots in one line in the User-agent directive

A mistake often made by novice webmasters, usually out of laziness, is failing to split the robots.txt file into sections and instead combining commands for several search engines in one section, for example:

User-agent: Yandex, Googlebot, Bing

For each search engine, it is necessary to create a separate section, taking into account the directives that this search engine reads. The exception, in this case, is a single section for all search engines:

User-agent: *

User-agent with empty value

The User-agent directive cannot be empty. Only "Allow" and "Disallow" can be empty, and then only bearing in mind that this changes their meaning. Specifying the User-agent directive with an empty value is a gross mistake.

Multiple values in the Disallow directive

A less common error, which can nevertheless be seen on sites from time to time, is specifying several values in the Allow and Disallow directives, for example:

Disallow: /folder1 /folder2 /folder3

The correct way is:

Disallow: /folder1
Disallow: /folder2
Disallow: /folder3

Failure to prioritize directives in robots.txt

This error has already been described above, but to consolidate the material we will repeat it. Previously, priority was determined by the order in which directives were specified. Today the rules have changed: priority is determined by the length of the path. If the file contains two mutually exclusive directives, Allow and Disallow, with matching paths, then Allow takes precedence.

Search engines and robots.txt

The directives in the robots.txt file are recommendations for search engines. This means that the reading rules can be changed or supplemented from time to time. Also remember that each search engine treats the file's directives differently, and not all directives are read by every search engine. For example, only Yandex reads the "Host" directive today. At the same time, Yandex does not guarantee that the domain name specified as the main mirror in the Host directive will necessarily be assigned as the main one, but it claims that the name specified in the directive will be given priority.

If you have a small set of rules, then you can create a single section for all robots. Otherwise, don't be lazy, create separate sections for each search engine you are interested in. This is especially true for bans if you do not want certain pages to be found in the search.

The modern reality is that in the Runet no self-respecting site can do without a file called robots.txt. Even if you have nothing to prohibit from indexing (although almost every site has technical pages and duplicate content that need to be closed from indexing), it is at the very least worth specifying for Yandex which version, with www or without www, is the main one. That is what the rules for writing robots.txt, discussed below, are for.

What is robots.txt?

The file with this name dates back to 1994, when the robots exclusion standard was introduced so that sites could supply search engines with indexing instructions.

A file with this name must be saved in the root directory of the site; placing it in any other folder is not allowed.

The file performs the following functions:

  1. prohibits any pages or groups of pages from indexing
  2. allows any pages or groups of pages to be indexed
  3. tells the Yandex robot which mirror of the site is the main one (with or without www)
  4. shows the location of the sitemap file

All four points are extremely important for the search engine optimization of a website. Banning indexing allows you to block pages from indexing that contain duplicate content - for example, tag pages, archives, search results, printable pages, and so on. The presence of duplicate content (when the same text, albeit in the size of several sentences, is present on two or more pages) is a disadvantage for the site in the ranking of search engines, so there should be as few duplicates as possible.

The allow directive has no independent meaning, since by default all pages are already available for indexing. It works in conjunction with disallow: when, for example, some category is completely closed from search engines, but you would like to open one particular page in it.
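A minimal sketch of that combination (the category and page paths are hypothetical):

User-agent: *
Disallow: /category/
Allow: /category/page.html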

Pointing to the main mirror of the site is also one of the most important elements of optimization: search engines treat www.yoursite.ru and yoursite.ru as two different resources unless you directly tell them otherwise. As a result, the content is doubled: duplicates appear, the strength of external links is diluted (external links may point both to the www and the non-www version), and this can lead to a lower ranking in the search results.

For Google, the main mirror is registered in the Webmaster's tools (http://www.google.ru/webmasters/), but for Yandex these instructions can be written only in that very robots.txt.

Pointing to an xml file with a sitemap (for example - sitemap.xml) allows search engines to find this file.

User-agent specification rules

User-agent in this case means a search engine. When writing instructions, it is necessary to indicate whether they apply to all search engines (in which case an asterisk, *, is used) or whether they are intended for a particular search engine, for example Yandex or Google.

In order to set a User-agent indicating all robots, write the following line in your file:

User-agent: *

For Yandex:

User-agent: Yandex

For Google:

User-agent: GoogleBot

Disallow and allow rules

First, it should be noted that the robots.txt file must contain at least one Disallow directive to be valid. Now let's consider how these directives are applied, using specific examples.

With this code, you enable indexing of all pages on the site:

User-agent: *
Disallow:

And with this code, on the contrary, all pages will be closed:

User-agent: *
Disallow: /

To prohibit indexing of a specific directory named folder, specify:

User-agent: *
Disallow: /folder

You can also use asterisks to substitute an arbitrary name:

User-agent: *
Disallow: *.php

Important: the asterisk replaces the entire file name; that is, you cannot specify file*.php, only *.php (but then all pages with the .php extension will be prohibited; to avoid this, you can specify the address of a specific page).
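For example, here is a sketch that blocks only one specific .php page instead of every .php page (the path is hypothetical):

User-agent: *
Disallow: /folder/page.php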

The allow directive, as mentioned above, is used to create exceptions in disallow (otherwise it does not make sense, since the pages are already open by default).

For example, we will prohibit indexing of the pages in the archive folder, but leave the index.html page from this directory open:

Allow: /archive/index.html
Disallow: /archive/

Specify the host and sitemap

The host is the main mirror of the site (that is, the domain name plus www or the domain name without this prefix). The host is specified only for the Yandex robot (in this case, there must be at least one disallow command).

To specify the host, robots.txt must contain the following entry:

User-agent: Yandex
Disallow:
Host: www.yoursite.ru

As for the sitemap, in robots.txt, the sitemap is indicated by simply writing the full path to the corresponding file, indicating the domain name:

Sitemap: http://yoursite.ru/sitemap.xml

How to make a sitemap for WordPress is described in a separate article.

A robots.txt example for WordPress

For WordPress, the instructions must be written so as to close all technical directories (wp-admin, wp-includes, etc.) from indexing, as well as the duplicate pages created by tags, RSS feeds, comments, and search.

As an example of robots.txt for WordPress, you can take the file from our site:

User-agent: Yandex
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: /xmlrpc.php
Disallow: /search
Disallow: */trackback
Disallow: */feed/
Disallow: */feed
Disallow: */comments/
Disallow: /?feed=
Disallow: /?s=
Disallow: */page/*
Disallow: */comment
Disallow: */tag/*
Disallow: */attachment/*
Allow: /wp-content/uploads/
Host: www.yoursite.ru

User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-login.php
Disallow: /wp-register.php
Disallow: /xmlrpc.php
Disallow: /search
Disallow: */trackback
Disallow: */feed/
Disallow: */feed
Disallow: */comments/
Disallow: /?feed=
Disallow: /?s=
Disallow: */page/*
Disallow: */comment
Disallow: */tag/*
Disallow: */attachment/*
Allow: /wp-content/uploads/

Sitemap: https://www.yoursite.ru/sitemap.xml

You can download the robots.txt file from our website at.

If after reading this article you still have any questions - ask in the comments!