As Google grows more efficient at crawling the web, some believe the days of worrying about Google’s “crawl budget” are effectively over. That may be true for smaller sites, but what about larger ecommerce sites? Without configuration, will Google crawl your site as often and as thoroughly as you would like? With a few short steps, you can diagnose the state of your site’s crawlability and react accordingly.
What is Crawl Budget?
Let’s say you’ve made a large update to your site that affects many of your pages and you’re hoping these changes will improve your SEO. The benefits of these changes cannot begin to percolate until Google crawls these pages again—do you know when that will be or how long it might take? While you can’t force Google to continually crawl these pages to pick up your periodic changes, you can take measures to point Google’s crawlers away from your unimportant URLs in favor of your critical pages.
When Google’s crawler (known as Googlebot) crawls your website, it does not typically crawl your entire site on each visit. Instead, Googlebot crawls portions of your site on each visit to work its way across your entire site. If you have a large site, it can take Googlebot a long time to complete a full crawl of your site (as long as several weeks), and even longer to re-crawl pages it has previously crawled.
The rate at which Googlebot crawls your site is a direct result of your crawl budget: the number of pages Googlebot determines it needs to crawl to effectively evaluate your website. Crawl budget is determined by a number of factors, but is mostly based on your website’s ability to handle the traffic of the crawl, the crawlability of your site, and the relevance of the pages crawled. Given the importance of having Google crawl your website’s critical pages, it is important to maximize the value of each of Googlebot’s visits, since it only crawls a limited number of pages at a time.
Checking Your Crawl Stats
To get an idea of how Googlebot is performing on your site, you can review your crawl stats in Google Search Console (this feature is currently only available in the old version of Google Search Console, under the “Crawl” tab). On this screen, you can view the number of pages Googlebot is crawling per day on your site as well as stats on download speeds and download amounts. Take note of the number of pages Googlebot crawls per day to get an idea of how long a full crawl of your site might take. While the number of crawled pages you see might be higher than expected, remember that Googlebot spiders across your site and can crawl any link it finds unless configured otherwise.
Additionally, you can use the URL Inspection Tool in the new version of Google Search Console to check an individual URL’s crawl status, including when the URL was last crawled as well as how Googlebot reached the page. Checking the last crawl date of important URLs with this tool will give you an idea of how often Googlebot crawls these pages. If the last crawl date for these pages is multiple weeks behind, it may be a sign that Googlebot is not efficiently crawling your site. So how do you ensure that Googlebot crawls your important pages early and often? Thankfully, there are several measures you can take to guide Google’s crawl of your site in the right direction.
A surefire way to ensure Googlebot prioritizes crawling your most important pages is via sitemap submission. Providing Google with the URLs you want crawled will allow you to track if Google is indexing your pages as expected. In Google Search Console, you can check your indexation status by navigating to the “Coverage” menu under the “Index” tab. Once on the Coverage screen, review all four tabs (Error, Valid with warnings, Indexed, Excluded) to see which pages Google has crawled (this will include both pages Google has indexed and has not indexed). Is Google crawling (or even worse, indexing) pages that you believe are irrelevant or duplicate content?
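For reference, a sitemap is just an XML file listing the URLs you want crawled and indexed; the sketch below uses placeholder example.com URLs and dates, not real pages:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- List only canonical, indexable URLs you want Google to prioritize -->
  <url>
    <loc>https://www.example.com/category/shoes</loc>
    <lastmod>2019-06-01</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/product/leather-boot</loc>
    <lastmod>2019-06-01</lastmod>
  </url>
</urlset>
```

Once submitted through Search Console, these URLs are the ones the Coverage report tracks as “submitted.”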
The most important areas to look at first in this report are the “Error” and “Valid with warnings” pages. If a page you submitted to Google via your sitemap has a problem, it will appear here. Review each problem URL and act accordingly: if the errored page is one that should not have been submitted, remove it from the sitemap. If the errored page is one you want indexed (e.g., a page you included in your sitemap carries a noindex tag), review Google’s feedback in Search Console to correct the issue. If Google continually runs into errors when crawling your site, Googlebot will be less likely to crawl your site as often.
In the “Valid” and “Excluded” sections, verify that the indexed/excluded pages match up with your expectations. Are the majority of your valid URLs ones in your sitemap (“submitted and indexed”), or are they pages Googlebot found outside of your sitemap (“indexed, not submitted in sitemap”)? If many pages outside of your sitemap are indexed by Google, make sure these pages are not duplicates of your submitted pages: sometimes Google will crawl one of your category pages, follow a left-hand filter link to a filtered view of that category, and index that filtered page even if it has a canonical tag meant to prevent this. Because Googlebot can crawl any link on your site (provided the link does not have a nofollow attribute), it can crawl deep into your left-hand filters and reach URLs with several parameters that waste your crawl budget (e.g., is there any value in Google crawling a URL like “www.example.com/category?size=s&material=leather&new-arrival=yes”?). Given that Googlebot crawls a limited number of URLs per visit, every irrelevant URL it crawls comes at the expense of a legitimate page going uncrawled.
In reviewing your crawled pages, you likely came across some URLs that were unnecessarily crawled. Thankfully, it is easy to restrict Googlebot (and all search engine crawlers) from crawling irrelevant pages and keep it on the right path. The best way to achieve this is to update your robots.txt directives to specify which pages should not be crawled.
If you have entire subfolders that shouldn’t be crawled, you can use the “Disallow” directive to block crawlers from accessing those directories (e.g., “Disallow: /checkout/” would block crawlers from accessing your checkout pages). Once you have made updates to your robots.txt file, you can immediately test your changes using Google’s robots.txt checker in Google Search Console (old version). The tool is a great way to quickly confirm that the syntax of your directives blocks the correct pages.
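If you also want a quick local sanity check outside of Search Console, Python’s standard library ships a robots.txt parser. This is only a sketch using the /checkout/ directive from above, with made-up example.com URLs; note that `urllib.robotparser` implements the original robots.txt specification and does not understand Google’s “*” wildcard syntax, so use it for plain-path rules only:

```python
import urllib.robotparser

# Build a parser and feed it robots.txt content directly
# (normally you would call set_url(...) and read() against a live site).
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /checkout/",
])

# Pages under /checkout/ are blocked; everything else is allowed.
print(rp.can_fetch("Googlebot", "https://www.example.com/checkout/payment"))  # False
print(rp.can_fetch("Googlebot", "https://www.example.com/category/shoes"))    # True
```

This is no substitute for Google’s own tester, which uses Google’s matching rules, but it catches obvious mistakes before you deploy.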
If there are URL parameters such as left-hand navigation filters eating up Googlebot’s crawl, you can also use the disallow directive with wildcards to prevent all URLs with that parameter present from being crawled (e.g., if Google is crawling pages with your size filters unnecessarily, including “Disallow: *size=*” in your robots.txt file will prevent Googlebot from crawling any URL that includes the size parameter). In addition to your robots.txt file, you should also configure your URL parameters in Google Search Console (currently only in the old version) to specifically communicate to Google the function of each of your parameters.
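Putting the pieces together, a robots.txt file combining directory rules and parameter wildcards might look like the following (the parameter names are placeholders for whatever filters your site actually uses):

```
User-agent: *
# Block crawl-budget waste on faceted-navigation parameters (example names)
Disallow: *size=*
Disallow: *material=*
# Block site sections that should never be crawled
Disallow: /checkout/
```

After any change like this, run the affected URLs through the robots.txt checker to confirm the wildcards match only the pages you intend.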
One important thing to remember: if there are specific pages you do not want indexed, use the meta noindex directive rather than relying on robots.txt alone. Disallowed URLs can still end up in Google’s index if other sites link to them, and Googlebot can only see a noindex tag on pages it is allowed to crawl, so avoid disallowing a page you are trying to noindex.
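The noindex directive is a meta tag placed in a page’s head (it can also be sent as an X-Robots-Tag HTTP response header):

```html
<head>
  <!-- Tells crawlers not to include this page in their index -->
  <meta name="robots" content="noindex">
</head>
```

Unlike a robots.txt disallow, this controls indexing rather than crawling.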
While Google’s improvements to its crawlers have resulted in more frequent and efficient crawls, crawl budgets can still be an issue for sites with large catalogs and category trees. If your important pages go weeks between crawls, it will take Google just as long to pick up any potential optimizations and then reevaluate your page (which in turn further sets back your ability to assess the results of your changes). If you find that Google is wasting crawl budget on pages you see no value in, some quick updates can drastically improve your site’s crawlability and in turn result in better indexation.
About Chris Brown
Chris Brown has nearly 20 years of retail leadership experience, driving impressive results across a diverse range of retail business models, including pure-play ecommerce, brick-and-mortar, omnichannel (with substantial mobile expertise), and merchandising. As Vice President of Omni-channel and eCommerce Strategy, Chris connects with clients to help drive their digital strategy, combining his experience in high-growth retail environments with software solutions to build revenue, increase conversion, and drive retention. Connect with Chris on LinkedIn: