Best Practices for Setting Up Robots.txt

Getting search engines to treat your website exactly the way you want can be a tough fight.

Still, with a few simple techniques, you can control how search bots crawl and index your website – right down to the page level.

Here we are going to talk about a legitimate Search Engine Optimization technique that can improve your website’s SEO and is easy to implement.

It’s the robots.txt file, also known as the “robots exclusion protocol” or “robots exclusion standard”.

Robots.txt controls which of your web content is available to crawlers, but it doesn’t tell them whether or not to index it.

What Is A Robots.txt File?

A robots.txt file tells search engine robots which web pages, directories, sub-directories, files, folders, or dynamic pages they can or can’t crawl on your website. This robots exclusion protocol is mainly used to manage crawler access to your content and to prevent your website from being overloaded with requests.

It will not keep your web pages out of search engines. If you don’t want search engines to index a page, use a noindex directive (a meta tag or X-Robots-Tag header) or protect the page with a password.

You can tell spiders whether or not to crawl an individual part of the website, per user-agent, by specifying “Disallow” or “Allow” rules.

Location Of Robots.txt File

Keep your robots.txt file in the root directory of your domain or subdomain.

Search bots always look for this file first. If a crawler cannot find it under the root directory – that is, it cannot access the www.xyz.com/robots.txt URL – it assumes the website doesn’t have a robots.txt file.

To locate the robots exclusion standard file on a typical host, go to your cPanel >> public_html web directory.

Basic Format Of Robots.txt File

User-agent: *
Allow: /media/terms-and-conditions.pdf
Disallow: /wp-admin/
Disallow: *.php$
Crawl-delay: 5
Sitemap: https://www.xyz.com/sitemap.xml
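
If you want to sanity-check a file like this before uploading it, Python’s standard urllib.robotparser module can parse it and answer simple crawl questions. Here is a minimal sketch using the sample file above; note that this parser only understands plain prefix rules, so it simply skips the *.php$ wildcard line rather than applying it the way Google would.

from urllib import robotparser

sample = """
User-agent: *
Allow: /media/terms-and-conditions.pdf
Disallow: /wp-admin/
Disallow: *.php$
Crawl-delay: 5
Sitemap: https://www.xyz.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(sample.splitlines())
# Or read the live file instead:
# rp.set_url("https://www.xyz.com/robots.txt"); rp.read()

print(rp.can_fetch("*", "https://www.xyz.com/wp-admin/options.php"))            # False - disallowed
print(rp.can_fetch("*", "https://www.xyz.com/media/terms-and-conditions.pdf"))  # True - explicitly allowed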

User-agent Directive

The user-agent directive names the search engine robot that the rules which follow apply to. Major user-agents are:
Google – googlebot
Bing – bingbot
Yahoo – Slurp
MSN – msnbot
DuckDuckGo – duckduckbot
Baidu – baiduspider
Yandex – yandexbot
Facebook – facebot

Example 
User-agent: googlebot
Disallow: /wp-admin/

In this example, Google’s crawler is told not to crawl the /wp-admin/ directory; bots from other search engines are not affected.

Note: It is important to spell user-agent names correctly; otherwise, your rules may not apply to the bots you intended.
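
To make sure a group really targets the bot you meant, you can test it quickly. A minimal sketch, assuming Python’s standard urllib.robotparser:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""
User-agent: googlebot
Disallow: /wp-admin/
""".splitlines())

# The rule applies to Google's crawler only; bots not named in any group are unrestricted.
print(rp.can_fetch("googlebot", "https://www.xyz.com/wp-admin/"))  # False
print(rp.can_fetch("bingbot", "https://www.xyz.com/wp-admin/"))    # True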

Wildcard (*) Directive

Used in the user-agent line, the asterisk indicates that the directives apply to all search engines. Used inside a path, it matches any sequence of characters, which makes it handy for URLs that share the same pattern. Google and Bing bots support this wildcard in paths.

Example
User-agent: *
Disallow: /plugin/
Disallow: *?

In this example, crawlers are not allowed to crawl the /plugin/ directory or any URL that includes a question mark (?).

Disallow Directive

This robots.txt directive is used to specify which part of a website should not be accessed by all or any individual user-agent. 

Example
User-agent: Slurp
Disallow: /services/

User-agent: bingbot
Disallow: /ebooks/*.pdf
Disallow: /keywords/

Here, Yahoo’s robot (Slurp) will not crawl the /services/ directory, while Bing’s spiders will not crawl the /keywords/ directory or any PDF files in the /ebooks/ directory.

Allow Directive

The Allow directive tells search robots that they may crawl a subdirectory or web page even if its parent folder is disallowed. This directive is supported by Google and Bing. It requires a path; if no path is defined, the directive is ignored.

Example
User-agent: *
Allow: /blog/
Disallow: /blog/permanent-301-vs-temporary-302-redirects-which-one-is-better/

The /blog/ directory will be crawled by bots, but the post ‘permanent-301-vs-temporary-302-redirects-which-one-is-better’ will not.

Wildcards ($) Directive

To specify the end of a URL, add a dollar sign ($) at the end of the path. Google and Bing bots support this wildcard.

Example
User-agent: *
Disallow: /*.php$

This example shows that all search robots are disallowed from accessing any URL that ends with .php. Crawlers can still access URLs that do not end with .php, such as https://xyz.com/services.php?lang=en.
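
Because not every tool supports these extensions, it can help to see what the * and $ wildcards actually mean. Here is a rough, hypothetical sketch in Python; robots_pattern_to_regex is only an illustration of the matching rules described above, not any crawler’s real implementation.

import re

def robots_pattern_to_regex(pattern):
    # '*' matches any sequence of characters; a trailing '$' anchors the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile("^" + body + ("$" if anchored else ""))

rule = robots_pattern_to_regex("/*.php$")
print(bool(rule.match("/services.php")))          # True  - blocked, ends with .php
print(bool(rule.match("/services.php?lang=en")))  # False - crawlable, does not end with .php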

Crawl-delay Directive

The crawl-delay directive defines how many seconds a crawler should wait before fetching the next web page, which prevents the server from being overloaded with many requests at once. Yahoo, Bing, and Yandex support it, but Google ignores crawl-delay; instead, you can limit Google’s crawl rate in Google Search Console.

Example
User-agent: Slurp
Crawl-delay: 5

Here you direct Yahoo’s crawler (Slurp) to wait 5 seconds before crawling the next page.
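
A quick way to confirm that the delay is attached to the right user-agent, assuming Python’s standard urllib.robotparser (crawl_delay() is available from Python 3.6):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""
User-agent: Slurp
Crawl-delay: 5
""".splitlines())

print(rp.crawl_delay("Slurp"))      # 5 - Yahoo's crawler should wait 5 seconds
print(rp.crawl_delay("googlebot"))  # None - no delay declared for this agent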

Sitemap Directive

The Sitemap directive tells search engines where to find your XML sitemap. Alternatively, if you are less familiar with sitemaps, you can submit your URLs directly in Google Search Console.

Example
User-agent: *
Disallow: /media/
Sitemap: https://www.xyz.com/sitemap.xml
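
If you want to confirm the sitemap location is being picked up, Python’s urllib.robotparser exposes it too (site_maps() is available from Python 3.8). A small sketch:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /media/
Sitemap: https://www.xyz.com/sitemap.xml
""".splitlines())

print(rp.site_maps())  # ['https://www.xyz.com/sitemap.xml']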

Note: The robots exclusion standard text file is supported by most search engines, but be aware that some search engines do not support robots.txt at all.

Why Is The Robots.txt File Important?

Google generally crawls and indexes the important pages of your website and ignores pages that are unimportant or duplicated. A robots.txt file is not mandatory for a successful website; you can rank well in search engines without a robots exclusion protocol.

Still, here are some reasons why you should include a robots.txt file:

  • To keep web pages that contain duplicate content from appearing in Search Engine Result Pages.
  • To prevent search robots from crawling your private web folders.
  • To maximize the crawl budget by disallowing less important web pages.
  • To keep an entire section of a website away from search robots.
  • To specify the location of the sitemap.
  • To add a crawl-delay and avoid overloading the server with multiple requests at once.
  • To block images, videos, PDFs, and resource files from appearing in search results.

What Are The Best Practices For A Robots.txt File?

Create A Robots.txt File

As the robots exclusion standard is a plain text file, you can create one using Notepad or Notepad++.

New Line For Each Directive

To avoid confusing search engines, put each directive on its own line.

Example
Not Correct
User-agent: * Disallow: /wp-admin/ Disallow: /wp-admin-new/

Correct
User-agent: * 
Disallow: /wp-admin/ 
Disallow: /wp-admin-new/

Make Robots.txt Easy To Find

You can place the robots.txt file in the root directory of your website. The recommended location is – https://www.xyz.com/robots.txt

Note: The robots.txt filename is case sensitive; make sure the file is named entirely in lowercase (robots.txt, not Robots.txt).

Look For Errors And Mistakes

Setting up robots.txt correctly is EXTREMELY important; one wrong rule could make your entire website disappear from search results.

Suppose you are working on a multilingual website and are still editing the Spanish version under the /es/ sub-directory, so you want to keep search engine robots from crawling it for now. You need a robots.txt rule that disallows spiders from crawling that entire subdirectory – but the way you write it matters.

Example
Not Correct
User-agent: *
Disallow: /es

This will keep bots from crawling any web page or subfolder whose path begins with /es. For example –
/essentials/
/escrow-services.html
/essentials-services.pdf

Correct
User-agent: *
Disallow: /es/

The simple fix is to add a trailing slash after the subdirectory name.
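
You can verify the difference before publishing the file. A minimal sketch, assuming Python’s standard urllib.robotparser (which uses the same simple prefix matching):

from urllib import robotparser

wrong = robotparser.RobotFileParser()
wrong.parse("User-agent: *\nDisallow: /es".splitlines())

right = robotparser.RobotFileParser()
right.parse("User-agent: *\nDisallow: /es/".splitlines())

# Without the trailing slash, unrelated pages get blocked as collateral damage.
print(wrong.can_fetch("*", "https://www.xyz.com/essentials/"))   # False - blocked by accident
print(right.can_fetch("*", "https://www.xyz.com/essentials/"))   # True  - only /es/ is off limits
print(right.can_fetch("*", "https://www.xyz.com/es/page.html"))  # False - the Spanish section is blocked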

Take Advantage Of Comments To Explain Robots.txt File

Comments help developers and other humans understand the robots.txt file. To add a comment, start the line with a hash (#).

Example
# Disallow googlebot from crawling. 
User-agent: googlebot
Disallow: /wp-admin/

Here, Google’s spider is told to skip the /wp-admin/ directory; the comment line itself is ignored by crawlers.

Utilize Each User-agent Only Once

Declare each user-agent only once in robots.txt to avoid confusing search engine spiders.

Example
Not Correct
User-agent: bingbot
Disallow: /blog/

User-agent: bingbot
Disallow: /articles/

Correct
User-agent: bingbot
Disallow: /blog/
Disallow: /articles/

Major crawlers combine multiple groups for the same user-agent, so Bing would still skip both the /blog/ and /articles/ directories either way, but simpler parsers may only read the first group. Merging the rules into a single group keeps the behavior predictable.

Define Wildcards (*) To Streamline Instructions

You can use the wildcard (*) to address all user-agents and to group URLs that follow the same pattern.

Example
Not Correct
User-agent: * 
Disallow: /services/seo=?
Disallow: /services/smm=?
Disallow: /services/smo=?

This is not an efficient way.

Correct
User-agent: * 
Disallow: /services/*=?

With this one-line directive, you block search spiders from crawling every web page under the /services/ directory whose URL includes “=?”.

Create Different Robots.txt File For Each Subdomain

If you have a subdomain, create a separate robots.txt file for it and place that file in the subdomain’s root directory.

For example, if you run a blog on a subdomain such as blog.xyz.com, that blog needs its own robots exclusion standard file.

Take Care of Conflicting Rules

Under the original robots exclusion standard, the first matching directive wins. Google and Bing, however, apply the most specific rule: an Allow directive can win over a Disallow if its path is longer (has more characters).

Example
User-agent: *
Allow: /blog/seo/
Disallow: /blog/

Here Google and Bing bots are not permitted to crawl the /blog/ directory, but they can crawl and index /blog/seo/.

Example
User-agent: *
Disallow: /blog/
Allow: /blog/seo/

Here, crawlers that honor only the first matching rule will treat the whole /blog/ directory, including /blog/seo/, as disallowed. But as mentioned above, Google and Bing bots still follow the Allow directive, because its path has more characters than the Disallow directive.

If the Allow and Disallow paths are equal in length, the least restrictive directive (Allow) wins.
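
Here is a minimal, hypothetical sketch in Python of the “longest rule wins, Allow breaks ties” behavior described above; resolve_rule is only an illustration for plain prefix rules, not a full robots.txt parser.

def resolve_rule(path, rules):
    """rules: list of (directive, pattern) pairs, e.g. ("Allow", "/blog/seo/")."""
    matches = [(directive, pattern) for directive, pattern in rules
               if path.startswith(pattern)]
    if not matches:
        return "Allow"  # no rule applies, so the URL is crawlable
    # Longest matching pattern wins; on a tie, the least restrictive (Allow) wins.
    matches.sort(key=lambda m: (len(m[1]), m[0] == "Allow"), reverse=True)
    return matches[0][0]

rules = [("Disallow", "/blog/"), ("Allow", "/blog/seo/")]
print(resolve_rule("/blog/seo/on-page-checklist/", rules))  # Allow - the longer Allow rule wins
print(resolve_rule("/blog/link-building/", rules))          # Disallow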

Limitations Of The Robots.txt File

Robots.txt file is not supported by all search engines

Robots.txt directives can’t force bots to obey them; it is entirely up to each crawler whether to follow the rules.

Different search bots treat syntax differently

Most reputable crawlers follow robots.txt directives, but each search bot may interpret the syntax in its own way. Check the exact syntax each crawler supports before relying on one file for all of them.

Disallowed pages can still appear in search results

Web pages that are disallowed in robots.txt can still be indexed if other indexed pages link to them. If you don’t want web spiders to index a page, use other methods such as a noindex meta tag, or protect your private files with a password.

By writing your robots.txt directives with the correct syntax, you help search crawlers organize and display your content the way you want it to appear in search engine results pages.

Robots.txt Frequently Asked Questions

Here are some FAQs related to robots.txt; if you have any questions or feedback, leave a comment and we will update the list accordingly.

Will search robots crawl a website that doesn’t have a robots.txt?

Yes. If search robots don’t find a robots.txt file in the root directory, they presume there are no directives and crawl the entire website.

What is the maximum size of a robots.txt file?

Google reads up to 500 KB of a robots.txt file; rules beyond that limit are ignored.

What will happen if I use the noindex directive in robots.txt?

Nothing; search engines do not support a noindex rule inside robots.txt (Google formally ended support for it in 2019). Use a noindex meta tag or X-Robots-Tag header instead.
