On the virtues of the infamous robots.txt file
A Robots.txt File?
Yes. A robots.txt file. It’s your chance to take control and tell web bots (think webcrawlers) how to crawl the pages of your site. The problem is that webcrawlers, if left to their own spidery devices, may end up crawling EVERYTHING on your site.
This might sound like a good thing, at least on the surface. But it’s not…
You don’t want webcrawlers to crawl the very building blocks you use to create the site itself. As a user, if I search for your business and one of the top options is a template you’ve used to create your site, that is going to make for an embarrassing user experience.
The Basic Format
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
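If you’d rather generate that format than type it by hand, here is a minimal Python sketch. The function name build_robots_txt and the example paths /templates/ and /drafts/ are made up for illustration; swap in whatever applies to your own site.

```python
def build_robots_txt(user_agent, disallowed_paths):
    """Return robots.txt text for one user agent.

    Emits one 'User-agent' line followed by one 'Disallow' line
    per path, matching the basic format shown above.
    """
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    return "\n".join(lines) + "\n"


# Hypothetical paths, used only to show the shape of the output.
print(build_robots_txt("*", ["/templates/", "/drafts/"]))
```

Note that every path gets its own Disallow line in the output.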
Want a real world example?
All you have to do is add /robots.txt to the end of your favorite website. For example, if you like pancakes, you might be an International House of Pancakes fan (and who isn’t a fan of pancakes?). Their robots.txt file can be found at https://www.ihop.com/robots.txt, and three things stand out in it:
First, the * indicates that the following rules apply to ALL webcrawlers. Otherwise, they would specify the particular bots they’re targeting with each subset of rules (with separate entries for bots like “msnbot” and “Slurp”, for example).
Second, the Disallow lines state which pages IHOP doesn’t want webcrawlers to crawl. In this case, the pages /LTO and /Initiatives are prohibited.
Lastly, the sitemap addition gives webcrawlers a guide to crawling your site and its associated content.
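To see how a crawler might read rules like these, here is a rough sketch using Python’s standard-library urllib.robotparser. The robots.txt text below is reconstructed from the three points above, not copied from the live IHOP file, and the sitemap URL and the bot name ExampleBot are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Reconstructed from the description above; NOT the live IHOP file.
# The sitemap URL is a placeholder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /LTO
Disallow: /Initiatives
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path, agent="ExampleBot"):
    """Return True if the (hypothetical) bot may crawl the given path."""
    return parser.can_fetch(agent, path)


print(is_allowed("/LTO"))          # False: blocked for every bot
print(is_allowed("/Initiatives"))  # False
print(is_allowed("/menu"))         # True: no rule matches this path
print(parser.site_maps())          # list of sitemap URLs (Python 3.8+)
```

Because the rules sit under User-agent: *, any bot name passed to is_allowed gets the same answers.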
A Quick Robots.txt Rule Guide
Ultimate control over what webcrawlers do and do not crawl for indexing
User-agent: The web crawler you’re feeding instructions to. This will usually be a search engine crawler from Google, Bing, or Yahoo. If you’re wondering what other options exist, go here: https://www.robotstxt.org/db.html
Allow: This rule was not part of the original robots.txt convention, but it is honored by Googlebot (probably the one crawler you should care most about) and by other major crawlers such as Bingbot. It tells the crawler to crawl a specific page, even when the parent folder has been flagged as not for crawling.
Disallow: The thing to remember about this rule is that each URL you want to keep from being crawled needs its own Disallow line. So if you have 5 URLs you’d like webcrawlers not to crawl for indexing, you’ll need 5 Disallow lines in your robots.txt file.
Crawl-delay: This is where you can set a delay, in seconds, before a bot crawls your site. The thing to remember here is that Googlebot doesn’t follow this rule. If you’d like to manually control the crawl rate for Googlebot, you’ll need to do so within Google Search Console. More information on this can be found here: https://support.google.com/webmasters/answer/48620?hl=en
Sitemap: This is used to call out the location of your XML sitemap. Although only Google, Bing, Yahoo, and Ask will follow this rule, do we really care about any of the other bots? Most of the time, probably not.
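Here is one way to see these rules working together, again sketched with Python’s urllib.robotparser. The bot names, paths, and sitemap URL are all invented for the example. One caveat: this particular parser applies the first matching rule in a group, so the Allow line is placed before the Disallow line it carves an exception out of. Other crawlers may resolve conflicts differently.

```python
from urllib.robotparser import RobotFileParser

# Every name, path, and URL below is a made-up placeholder.
ROBOTS_TXT = """\
User-agent: ExampleBot
Allow: /private/press-kit.html
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow: /drafts/
Sitemap: https://www.example.com/sitemap.xml
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# Allow carves one page out of a disallowed folder for ExampleBot.
print(rules.can_fetch("ExampleBot", "/private/press-kit.html"))  # True
print(rules.can_fetch("ExampleBot", "/private/notes.html"))      # False

# Bots without their own group fall back to the * rules.
print(rules.can_fetch("OtherBot", "/drafts/idea.html"))          # False

# Crawl-delay is per group; the * group here sets none.
print(rules.crawl_delay("ExampleBot"))  # 10
print(rules.crawl_delay("OtherBot"))    # None

# Sitemap lines apply to the whole file (site_maps needs Python 3.8+).
print(rules.site_maps())
```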
Great…how do I add a robots.txt file to my site?
Creating a robots.txt file is pretty simple. Open up the text editor of your choice and follow the rules above. Save it as “robots.txt” and you’re golden. The real problem people have is putting it in the right place on their website. This is key, because if you don’t put it in the right place, webcrawlers will assume you don’t have one.
So where is this magical position? It’s the main (root) directory of your domain. If this is confusing, we understand. As a rule of thumb, just remember it should ultimately resolve to this URL: www.yourcoolbananaswebsite.com/robots.txt. If it’s not placed here, all of your hard learning and work will be for naught.
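If you want to double-check where a crawler will look, the rule of thumb can be sketched in a few lines of Python: take any page URL on the site, keep the scheme and host, and replace everything else with /robots.txt. The function name robots_txt_url is just an illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host; drop the path, query string, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.yourcoolbananaswebsite.com/blog/post?id=1"))
# https://www.yourcoolbananaswebsite.com/robots.txt
```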