In the recent episode 64 of Search Off the Record, titled “The effect of quality on search,” John Mueller and Gary Illyes talked about the intricacies of web crawling and its impact on search rankings. This article, inspired by their discussion and my personal experience, seeks to elaborate on Google’s crawling mechanisms. A point of interest was their reference to episode 5, where they had previously touched on similar subjects. While this information is important for large or complex websites, it’s worth noting that for small and niche websites, the granular details might be less applicable.
For those new to SEO: This article delves into some advanced concepts related to crawling. If you’re just starting, it might be helpful to familiarize yourself with the basics of crawling and crawl budget first, as the subsequent sections address more nuanced details.
Understanding Crawl Budget in the Context of Web Architecture
The term ‘crawl budget’ is not new to the SEO community. However, as web technologies evolve, understanding the nuances of crawl budget and its interaction with these technologies becomes more crucial. One such evolution is the extensive use of JavaScript in web development.
Crawl Budget Meets JavaScript: In episode 5, Martin Splitt highlighted an essential aspect of crawl budget, shedding light on its interplay with JavaScript. He explained that if a site that is particularly sensitive to crawl budget is structured to load content via numerous API requests through JavaScript, each of those requests counts against the crawl budget.
The Architecture’s Impact on Crawl Budget: He also noted that the structure and design of a web application significantly influence its crawl budget. Notably, if a web application relies heavily on client-side rendering, where JavaScript triggers multiple API requests, those requests can accumulate rapidly. For larger websites, this accumulation can noticeably eat into the available crawl budget, since every request counts against it.
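To make that arithmetic concrete, here is a minimal Python sketch (my own illustration, not anything from the episode) that tallies how many extra fetches a client-side rendered site could add on top of its HTML pages. The page-to-API mapping is made-up example data.

```python
# Illustrative sketch: estimate how many fetches a client-side rendered site
# adds on top of its HTML pages. The page -> API mapping is hypothetical.

pages_to_api_calls = {
    "/products/1": ["/api/product/1", "/api/reviews/1", "/api/related/1"],
    "/products/2": ["/api/product/2", "/api/reviews/2", "/api/related/2"],
    "/about": [],  # static page, no extra API requests needed to render
}

html_requests = len(pages_to_api_calls)
api_requests = sum(len(calls) for calls in pages_to_api_calls.values())

print(f"HTML pages to crawl:      {html_requests}")
print(f"Extra API requests:       {api_requests}")
print(f"Total requests to render: {html_requests + api_requests}")
```

The point of the toy numbers is simply that every page rendered this way costs several requests instead of one, which is what adds up on large sites.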
Identifying Crawl Budget Concerns: Based on these points, I think recognizing issues with a site’s crawl budget is pivotal for its optimal performance. Here are some strategies to pinpoint potential challenges:
- Review Uncrawled URLs: Start by examining URLs that have never undergone crawling.
- Monitor Refresh Rates: Keep an eye on how often different sections of your site get recrawled. If some sections haven’t been refreshed for extended periods despite content changes, it may be a hint that crawl budget is the limiting factor.
- Inspect Server Logs: Server logs provide an authentic snapshot of crawl activity. Consistently checking them can show how often Googlebot visits your site and which sections it’s most interested in (see the sketch after this list).
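For the server-log check, a small script can already surface which sections Googlebot hits most. Below is a minimal sketch assuming a combined-format access.log; the file name, the regex, and the one-segment “section” bucketing are my own assumptions, and a real audit should also verify Googlebot by reverse DNS rather than trusting the user-agent string alone.

```python
import re
from collections import Counter

# Minimal sketch: count Googlebot hits per top-level section of the site,
# assuming a combined-format access log. User agents can be spoofed, so a
# real audit should also verify Googlebot IPs via reverse DNS.
LOG_LINE = re.compile(
    r'"(?:GET|POST) (?P<path>\S+) HTTP/[\d.]+" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

section_hits = Counter()
with open("access.log", encoding="utf-8") as log:
    for line in log:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("ua"):
            # Bucket hits by the first path segment, e.g. /blog/post-1 -> /blog
            section = "/" + match.group("path").lstrip("/").split("/", 1)[0]
            section_hits[section] += 1

for section, hits in section_hits.most_common(10):
    print(f"{hits:6d}  {section}")
```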
Dive into Crawl Scheduling
A term I’ve encountered less often but find particularly intriguing is Crawl Scheduling. Google’s approach to ensuring the web remains accessible and updated is deeply rooted in this concept.
Crawl Scheduling: At its core, this is a system Google has established for deciding what to crawl and when. Through it, the crawl scheduler:
- Estimates which specific pages are due for a recrawl.
- Seeks to identify sections of websites potentially housing undiscovered or new URLs. These efforts are termed “discovery crawls”.
The newer episode elaborates on this: the crawl scheduler is decisive about its crawling priorities and organizes URLs by an assigned priority. That priority is influenced by several factors, including the quality of the content on a page and how frequently it is updated.
For me, understanding this depth of prioritization from Google’s end reinforces that the <priority> tags in sitemaps might have even less significance than I initially perceived.
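To picture what priority-driven scheduling could look like in principle, here is a toy Python sketch. The scoring formula, weights, and URLs are invented purely for illustration; they say nothing about how Google’s actual scheduler works.

```python
import heapq
from dataclasses import dataclass, field

# Toy sketch of priority-driven crawl scheduling, NOT Google's actual system.
# The scoring formula and example URLs below are assumptions for illustration.

@dataclass(order=True)
class ScheduledUrl:
    priority: float                  # lower value = crawled sooner (min-heap)
    url: str = field(compare=False)

def crawl_priority(quality_score: float, days_since_last_change: int) -> float:
    """Combine a quality signal with update recency: fresher, higher-quality
    URLs get a smaller value and therefore come off the heap earlier."""
    return -(quality_score * 10) + days_since_last_change

queue: list[ScheduledUrl] = []
heapq.heappush(queue, ScheduledUrl(crawl_priority(0.9, 1), "/blog/new-guide"))
heapq.heappush(queue, ScheduledUrl(crawl_priority(0.3, 90), "/tag/old-archive"))
heapq.heappush(queue, ScheduledUrl(crawl_priority(0.8, 7), "/products/featured"))

while queue:
    item = heapq.heappop(queue)
    print(f"crawl next: {item.url} (priority {item.priority:.1f})")
```

The takeaway from the toy model is that quality and freshness together decide the order, which is exactly why an externally declared sitemap priority carries little weight.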
Google’s Crawling Capacity and the Importance of “Back Off Signals”
As we delve deeper into the intricate workings of Google’s crawling mechanisms, two aspects particularly stand out: Google’s Crawling Capacity and the concept of “Back Off Signals.” These seemingly separate elements are interconnected and play a significant role in determining how Google interacts with a website.
Understanding Google’s Crawling Capacity:
At its core, the term refers to the sheer capability of Googlebot, Google’s web crawler. If run at full throttle, its intense, continuous activity could overwhelm or even crash segments of the internet. Yet despite this immense power, Google consciously refrains from unleashing Googlebot’s full capacity, primarily to avoid overloading, or inadvertently crashing, the servers it visits.
Delving into “Back Off Signals”:
But how does Googlebot know when to pull back? That’s where “back off signals” come into play. These signals, such as HTTP status codes like 429 (Too Many Requests) or 5xx (server errors), act as communication cues from websites. If a website starts emitting them, it indicates the site is under strain from Googlebot’s activity. In response, Googlebot might dial back its crawl rate, and if the signals persist, it might even temporarily halt crawling. The objective is to strike a balance: keep the site crawled while its servers continue to operate smoothly.
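Seen from the crawler’s side, the back-off idea boils down to treating 429 and 5xx responses as a cue to slow down. The sketch below shows one way a generic polite crawler might do this; the delays, retry limit, and use of the requests library are my assumptions, not a description of Googlebot’s internals.

```python
import time
import requests

# Minimal sketch of back-off from a generic crawler's perspective: treat 429
# and 5xx responses as signals to slow down. The delays and retry limit are
# arbitrary assumptions, not Googlebot's actual behavior.

def polite_fetch(url: str, max_retries: int = 3) -> requests.Response | None:
    delay = 1.0
    for _ in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code == 429 or response.status_code >= 500:
            # The server is signalling strain; honor Retry-After if it is given
            # in seconds, otherwise back off exponentially before retrying.
            retry_after = response.headers.get("Retry-After", "")
            wait = float(retry_after) if retry_after.isdigit() else delay
            time.sleep(wait)
            delay *= 2
            continue
        return response
    return None  # persistent strain: give up (or pause crawling) for now
```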
The Role of URL Structure in SEO:
To many of us in the SEO field, URL structures might seem straightforward, almost a rudimentary element of website architecture. However, digging deeper reveals that, to Google’s algorithms and bots, they are far from trivial.
Influence of Quality on URL Patterns:
URL structures, often referred to as ‘URL patterns’, play a decisive role in how Google perceives the quality of content on a website. For instance, when a distinct URL pattern is consistently associated with high-quality content, Google’s systems take note and may prioritize other URLs that share the same structure. It’s a clear indication of the trust Google places in URL patterns that consistently deliver value.
On the flip side, if a particular URL pattern houses low-quality content more often than not, Google may crawl those URLs less frequently. In essence, it’s not only the contents of a page that matter; the structural patterns of the URLs also play a part in how Google’s bots evaluate them.
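One way to explore this on your own site is to group URLs by a coarse pattern and compare average quality signals per group. The sketch below is purely illustrative: the quality scores are hypothetical, and bucketing by the first path segment is just one simple way to approximate a “URL pattern”.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Illustrative sketch: group URLs by a coarse pattern (first path segment)
# and average a hypothetical per-URL quality score, to spot patterns that
# might be viewed as consistently low quality.

url_quality = {
    "https://example.com/guides/seo-basics": 0.90,
    "https://example.com/guides/crawl-budget": 0.85,
    "https://example.com/tag/misc": 0.20,
    "https://example.com/tag/other": 0.15,
}

pattern_scores = defaultdict(list)
for url, score in url_quality.items():
    first_segment = "/" + urlparse(url).path.lstrip("/").split("/", 1)[0]
    pattern_scores[first_segment].append(score)

for pattern, scores in sorted(pattern_scores.items()):
    avg = sum(scores) / len(scores)
    print(f"{pattern:10s} avg quality {avg:.2f} across {len(scores)} URLs")
```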
Wrapping Up: Key Takeaways on Google’s Crawling Mechanisms
After delving deep into the intricacies of Google’s crawling mechanisms, there are two central points I’d like to emphasize:
- Crawl Budget and Small Websites: If you own or manage a small or niche website, the entire concept of ‘crawl budget’ might not be something you need to stress over. Based on the insights gleaned from the podcast episodes, only a particular segment of the vast web ecosystem genuinely needs to manage its crawl budget meticulously.
- Quality Reigns Supreme: Gary Illyes highlighted a salient point in the recent episode, stating that the primary driver influencing Google’s decisions to crawl and index a webpage is its quality. High-quality content is more likely to be crawled frequently and subsequently indexed. This overarching theme of ‘quality’ influences a myriad of aspects, from crawl frequency to indexing decisions.
In conclusion, I hope you found this article enlightening. While I’ve endeavored to ensure accuracy, it’s worth noting that the landscape of SEO is perpetually evolving. Some concepts discussed, especially those drawn from a 2020 article, may have changed since then. However, based on my experience and in light of the September 2023 podcast episode, these concepts still hold up.
Your feedback is invaluable. I’d love to hear your thoughts on the article. Do drop your insights in the comments section below!