Earlier this week, I discovered that several of our uploaded images were being assigned a unique URL and indexed in Google. While it’s not likely that someone would be performing a site search of evolvedigitallabs.com, it is still alarming to see pages I didn’t know existed floating around the SERPs.
There are a couple of ways to prevent search bots from showing a website's pages in the results. One is adding the URL(s) to the robots.txt file. Another way, and the more effective way in many cases, is using meta tags to instruct robots what you want them to do. This post explains each route and provides some scenarios so you can better understand which way to go. It can be confusing.
Robots.txt
Similar to a sitemap, robots.txt is a public file that specifies which bots are prohibited from exploring which pages. It can easily be created with a text editor and viewed by adding /robots.txt to the end of a domain. Webmasters can give direction to all bots at once or target a specific one, such as Googlebot-Image, which (as you can imagine) crawls images. Additionally, they can disallow a single URL or an entire directory, such as /secret-posts. Of course, as this is a public file, webmasters don't add URLs they want kept secret.
Pages might stay indexed
The problem with robots.txt is that it doesn't always remove pages from the index, particularly if other pages link to them. Because a blocked robot can't crawl the page, it never sees the meta robots instructions inside and essentially has no idea why it's being blocked. It will therefore consider the page to be relevant to people and allow it to remain in the index.
For example, a sweet tattoo shop here in town has disallowed ALL bots from accessing ALL of its site's pages. Not sure I understand the strategy there, but I feel okay picking on them because they're always booked solid for appointments anyway. As you can see, the homepage still ranks, even though its robots.txt file clearly blocks bots from entering. This is because other websites have linked to the homepage, making the page known to search engines.
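If you want to sanity-check which URLs a robots.txt file actually blocks, Python's standard-library parser can test rules locally. This is a quick sketch using a made-up rule set and example.com URLs, not any particular site's file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents -- placeholder rules for illustration.
rules = """\
User-agent: *
Disallow: /secret-posts/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Blocked: the path falls under the disallowed /secret-posts/ directory.
print(parser.can_fetch("*", "https://example.com/secret-posts/draft"))  # False

# Allowed: no rule matches this path.
print(parser.can_fetch("*", "https://example.com/blog/post"))  # True
```

Remember, though, that a `False` here only means bots can't crawl the page; as described above, it doesn't guarantee the page stays out of the index.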
When is this a good option?
In Evolve's case, we were seeing pages that were never meant to be found. We didn't even know they existed because WordPress generated the pages automatically. The chances of them being linked to are pretty much zero, so adding them to robots.txt will help remove them from Google's index over a few weeks. Other instances for using robots.txt:
- User registration/login/passwords
- Administrative pages
- License pages
- Utility pages
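Taken together, rules for pages like these might look something like the following robots.txt. The paths here are placeholders, not our actual directory names:

```
User-agent: *
Disallow: /wp-admin/
Disallow: /login/
Disallow: /license/
Disallow: /utility/

User-agent: Googlebot-Image
Disallow: /
```

The first block applies to all bots; the second singles out Googlebot-Image, which would keep a site's images out of Google Images entirely.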
Meta Tag: noindex
In the head of any given page, webmasters are able to command robots not to index the URL in the search engines. In addition to being more effective than robots.txt at keeping certain pages out of an index, this method also allows PageRank (i.e., the authority that flows in through links) to benefit the website, because spiders are still able to crawl the page. A page with this parameter will have this snippet in its header:
<meta name="robots" content="noindex, follow">
Use this method if you have a page that you don't want to appear in Google but that can still acquire links (and PageRank) from other sources. It's important to specify "follow" so robots know to continue crawling the links on that page. Examples of using noindex, follow:
- Duplicate content (such as A/B testing pages)
- Any page you want crawled but not indexed (contact form, thank you page, shopping cart, temporary page)
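For any of these pages, the tag sits inside the document's head element. A minimal sketch, with placeholder title and content:

```html
<!DOCTYPE html>
<html>
<head>
  <title>Thank You</title>
  <!-- Keep this page out of the index, but let bots follow its links -->
  <meta name="robots" content="noindex, follow">
</head>
<body>
  <p>Thanks for contacting us!</p>
</body>
</html>
```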
It can be confusing and time-consuming, but there's a strategic purpose in controlling how search engines crawl and index your site's pages. If your site is acquiring a significant amount of traffic to pages you don't want indexed, or if duplicate pages are competing with one another to rank, you should consider regulating which pages the search bots can crawl.