Clarigital · Clarity in Digital Marketing
SEO Foundation · Guide 1

How Google Search Works · The Complete Guide

How Google discovers, processes, and ranks billions of web pages — the complete crawl, index, and ranking pipeline explained using only official Google documentation, academic research, and government data.

What You'll Learn

  • How Googlebot discovers and crawls web pages across the open internet
  • How Google processes content and builds its search index
  • The core principles behind Google's ranking algorithm
  • Key ranking factors confirmed by official Google documentation
  • How SERP features (Featured Snippets, Knowledge Panels) are generated
  • What technical signals Google uses to evaluate page quality

How Google Crawls the Web

Google's discovery process begins with Googlebot, an automated web crawler that systematically browses the internet to find and download web pages. According to Google's official Search Central documentation, crawling is the first stage of a three-part process: crawling, indexing, and serving results.

Googlebot starts from a list of known URLs, then follows links on those pages to discover new content. Google maintains a crawl queue — a prioritised list of URLs waiting to be crawled — managed by a process called crawl scheduling. Pages with higher quality signals, more frequent updates, or more inbound links receive more frequent crawling.
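The idea of a prioritised crawl queue can be sketched with a heap. This is an illustration only — the priority scores below are invented, and Google's actual scheduling logic is not public.

```python
# Illustrative sketch of a prioritised crawl queue using a heap.
# Priority scores are invented; Google's real scheduler is not public.
import heapq

class CrawlQueue:
    def __init__(self):
        self._heap = []

    def add(self, url: str, priority: float):
        # heapq is a min-heap, so negate: higher priority pops first.
        heapq.heappush(self._heap, (-priority, url))

    def next_url(self) -> str:
        return heapq.heappop(self._heap)[1]

q = CrawlQueue()
q.add("https://example.com/old-archive", priority=0.2)
q.add("https://example.com/", priority=0.9)       # popular, frequently updated
q.add("https://example.com/news", priority=0.7)
print(q.next_url())  # https://example.com/
```

Pages with stronger signals sit higher in the queue, which is why popular, frequently updated pages are re-crawled sooner.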

Key figures

  • Daily crawl volume: 20B+ URLs crawled per day globally
  • Index size: 100B+ pages in Google's search index
  • Crawl interval: popular pages are typically re-crawled within 48 hours

How Googlebot Discovers URLs

Google uses several methods to discover new pages:

  • Sitemaps. Webmasters submit XML sitemaps through Google Search Console to directly inform Google of all important URLs on a site.
  • Link following. Googlebot follows hyperlinks on pages it has already crawled, which is why internal linking structure matters for crawl coverage.
  • URL submission. Webmasters can request individual URLs for crawling via the URL Inspection tool in Google Search Console.
  • Redirects. When a crawled URL redirects to another, Google may add the destination URL to its crawl queue.
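Of these discovery methods, sitemaps are the one a site owner controls directly. A minimal sitemap can be generated with nothing but the standard library — the URLs and dates below are placeholders:

```python
# Illustrative sketch: generating a minimal XML sitemap with the
# standard library. URLs and lastmod dates are placeholder values.
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a sitemap XML string for an iterable of (loc, lastmod) pairs."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        ET.SubElement(url, "{%s}loc" % SITEMAP_NS).text = loc
        ET.SubElement(url, "{%s}lastmod" % SITEMAP_NS).text = lastmod
    return ET.tostring(urlset, encoding="unicode")

xml = build_sitemap([
    ("https://www.example.com/", "2024-01-15"),
    ("https://www.example.com/guide", "2024-02-01"),
])
print(xml)
```

The resulting file is submitted through Google Search Console or referenced from robots.txt.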

Crawl Budget and Crawl Rate

Google uses the concept of crawl budget — the number of URLs Googlebot will crawl on a given site within a given time frame. According to Google's official documentation, crawl budget is determined by two main factors: crawl rate limit (how fast Googlebot can crawl without overwhelming a server) and crawl demand (how much Google wants to re-crawl pages based on their popularity and freshness).
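As a toy model, the interaction of those two factors can be thought of as a floor function: Google crawls no more than the server can sustain, and no more than it actually wants. The `min()` here is our simplification, not Google's published formula.

```python
# Toy model only: Google describes crawl budget as the interplay of a
# crawl capacity limit (what the server can sustain) and crawl demand
# (what Google wants to re-crawl). The min() is our simplification.
def daily_crawl_budget(capacity_limit: int, crawl_demand: int) -> int:
    """URLs Googlebot can and wants to crawl in a day (simplified)."""
    return min(capacity_limit, crawl_demand)

# A fast server whose content rarely changes is limited by demand,
# not by capacity:
print(daily_crawl_budget(capacity_limit=50_000, crawl_demand=2_000))  # 2000
```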

Official guidance on crawl budget

Google's documentation states crawl budget matters most for sites with more than one million unique URLs, sites with rapid content turnover, or sites with large numbers of low-quality or duplicate pages. For smaller sites, crawl budget is rarely a limiting factor.

What Googlebot Renders

Modern Googlebot renders pages using a headless version of Chromium. This means JavaScript-dependent content can be indexed, though Google notes that JavaScript rendering is done in a second wave — HTML content is typically indexed faster than content that requires JavaScript execution. Critical content should be accessible in the HTML source for fastest indexing.
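A quick sanity check for this is to confirm that critical phrases appear in the raw HTML the server returns, before any JavaScript runs. The helper and sample markup below are illustrative, not a Google tool:

```python
# Illustrative sketch: check that critical phrases appear in the raw
# server-rendered HTML, since content injected later by JavaScript may
# be indexed more slowly. The sample HTML is made up.
import re

def missing_from_html(html: str, phrases: list[str]) -> list[str]:
    """Return the phrases not found in the server-rendered HTML."""
    text = re.sub(r"<[^>]+>", " ", html)   # crude tag stripping
    text = re.sub(r"\s+", " ", text).lower()
    return [p for p in phrases if p.lower() not in text]

html = "<html><body><h1>Pricing plans</h1><div id='app'></div></body></html>"
print(missing_from_html(html, ["pricing plans", "free trial"]))
# "free trial" would only appear after client-side rendering
```

In practice the URL Inspection tool in Search Console shows the rendered HTML Google actually sees.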

Indexation: How Google Understands Content

After crawling a page, Google processes its content to understand what the page is about and whether it deserves to be added to the index. The index is a massive database of processed page information, not a copy of the web. According to Google, the index stores information about words, their locations on pages, and many other signals about the page's content and quality.

The Indexing Pipeline

  1. Content extraction. Googlebot downloads the page HTML and renders any JavaScript. Text, links, and structured data are extracted.
  2. Language detection. Google identifies the primary language of the page using its language detection systems.
  3. Content analysis. Google processes the text to understand entities, topics, relationships, and the overall subject of the page.
  4. Canonicalisation. Google selects the canonical URL when multiple pages have similar or identical content, consolidating signals to one preferred version.
  5. Index decision. Based on quality signals, Google decides whether to add the page to the index. Low-quality pages may be excluded.

Canonicalisation

When multiple URLs serve the same or very similar content (for example, HTTP vs HTTPS versions, www vs non-www, or pages with and without trailing slashes), Google selects one as the canonical — the preferred version. This is critical because Google consolidates link signals, indexing, and ranking signals to the canonical URL.

Webmasters can signal their preferred canonical using the rel="canonical" link element, the canonical URL in XML sitemaps, or by setting up proper 301 redirects. Google treats these as signals, not directives — it may override them if other evidence contradicts the declaration.
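The link-element signal is simple enough to read programmatically. A small sketch using the standard library's HTML parser — the sample markup is invented:

```python
# Illustrative sketch: reading the rel="canonical" link element from a
# page's HTML with the standard library. The sample markup is made up.
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")

html = """<head>
  <link rel="canonical" href="https://www.example.com/page">
</head>"""

finder = CanonicalFinder()
finder.feed(html)
print(finder.canonical)  # https://www.example.com/page
```

Auditing tools use this kind of extraction to flag pages whose declared canonical points somewhere unexpected.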

Key Indexing Insight

Google's documentation distinguishes between crawled pages and indexed pages. A page can be crawled without being indexed. Google may exclude pages from the index due to thin content, duplicate content, noindex directives, or quality issues identified during processing.

Structured Data and Rich Results

Structured data markup — implemented using JSON-LD, Microdata, or RDFa — helps Google understand the specific type and meaning of content. Schema.org vocabulary is the recommended standard. Structured data can enable Rich Results (formerly Rich Snippets) in the SERP, such as star ratings for reviews, recipe information, FAQ accordions, and event listings. Google's Rich Results Test tool allows webmasters to validate their structured data markup.
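A minimal JSON-LD payload, for instance for an FAQ, looks like the following. The field values are placeholders, and real markup should always be checked with Google's Rich Results Test:

```python
# Illustrative sketch: a minimal JSON-LD structured data payload using
# schema.org vocabulary. Values are placeholders; validate real markup
# with Google's Rich Results Test.
import json

faq = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [{
        "@type": "Question",
        "name": "What is crawl budget?",
        "acceptedAnswer": {
            "@type": "Answer",
            "text": "The number of URLs Googlebot will crawl on a site "
                    "within a given time frame.",
        },
    }],
}

# Embedded in the page as: <script type="application/ld+json">...</script>
print(json.dumps(faq, indent=2))
```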

Google's Ranking Algorithm

When a user types a query into Google, the system retrieves potentially relevant pages from the index and ranks them using hundreds of signals to determine which results best answer the query. Google has repeatedly stated that its ranking algorithm uses more than 200 signals, though only a subset are publicly confirmed.

Google's search algorithm is not a single system but a collection of systems working in sequence. Key systems include the retrieval system (finding candidate pages), the ranking system (ordering them), and numerous quality filters and classifiers that run on top of raw ranking scores.

Core Algorithm Components

Query Understanding

Google interprets the meaning and intent behind a query — not just the keywords — to determine what type of result is most useful.


Relevance Scoring

Content relevance is assessed by matching page topics, entities, and language patterns to the user's query and intent.


Quality Assessment

Google assesses E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) as a framework for quality evaluation.


RankBrain and Neural Matching

Google confirmed in 2015 that RankBrain, a machine learning component, helps Google understand queries it has not seen before. RankBrain maps queries and pages to a vector space of concepts, allowing Google to find relevant results even when query words don't appear literally on a page. Google later introduced Neural Matching (2018), BERT (2019), and MUM (2021) — progressively more powerful language understanding systems that now influence a significant proportion of all searches.
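The underlying idea of vector-space matching can be shown with a toy example. Google has not published RankBrain's internals; the three-dimensional "concept" embeddings below are entirely invented to illustrate why a page can rank for a query whose words it never uses:

```python
# Toy illustration only: Google has not published RankBrain's internals.
# This sketches matching a query to pages by cosine similarity in a
# shared concept-vector space; the embeddings are invented.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical 3-dimensional "concept" embeddings.
query = [0.9, 0.1, 0.0]                     # e.g. "how do search engines work"
pages = {
    "crawling-guide": [0.8, 0.2, 0.1],      # topically close, few shared words
    "cake-recipes":   [0.0, 0.1, 0.9],      # topically unrelated
}
ranked = sorted(pages, key=lambda p: cosine(query, pages[p]), reverse=True)
print(ranked)  # ['crawling-guide', 'cake-recipes']
```

The guide on crawling ranks first because its vector points in nearly the same direction as the query's, even though no literal keyword match is involved.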

BERT processes almost every query

Google announced in 2019 that BERT (Bidirectional Encoder Representations from Transformers) now processes nearly all queries in English and many other languages. BERT understands the nuanced context of words within a sentence, improving Google's ability to match conversational queries with the most relevant pages.

Key Ranking Factors Explained

While Google does not publish a complete list of ranking factors, several have been confirmed directly in Google's documentation or official statements. (The 2024 leak of internal API documentation offered additional corroboration, though Google has not verified how those attributes are actually used in ranking.) Below are the most significant confirmed factors.

Factor | Type | Confirmation source
Relevance to query intent | Content | Google Search Central documentation
PageRank / link authority | Links | Original PageRank patent (1998); confirmed ongoing
E-E-A-T signals | Quality | Google Quality Rater Guidelines (public)
Core Web Vitals (LCP, INP, CLS) | Technical | Google ranking signal announcement (2021)
HTTPS security | Technical | Confirmed as a ranking signal by Google in 2014
Mobile-friendliness | Technical | Google mobile-first indexing documentation
Page content comprehensiveness | Content | Google Search Quality Rater Guidelines
Anchor text of inbound links | Links | Original Google research papers; confirmed in documentation

E-E-A-T: Experience, Expertise, Authoritativeness, Trustworthiness

Google's Search Quality Rater Guidelines — a public document used to train human quality raters who evaluate Google's search results — are built around the E-E-A-T framework. Google added the first "E" (Experience) in 2022 to existing E-A-T, recognising that first-hand experience with a topic is a quality signal.

Trust is the most important component. Google's documentation explicitly states that trust is the most important member of the E-E-A-T family. An untrustworthy page has low E-E-A-T regardless of its apparent expertise.

E-E-A-T is not a direct ranking factor

Google has clarified that E-E-A-T itself is not a direct algorithmic ranking signal. Rather, it is a framework for human quality raters to evaluate pages, and this evaluation data trains machine learning systems that do influence rankings. The distinction matters: you cannot "optimise" for E-E-A-T as a metric — you need to genuinely demonstrate it.

SERP Features: Beyond the Blue Links

Modern Google search results pages (SERPs) are far more complex than a simple list of blue links. Google serves numerous special result formats called SERP features, each designed to answer different query types more directly.

Featured Snippets

A selected search result displayed at the top of the SERP, above the organic results ("position zero"). Can be paragraph, list, or table format. Triggered when Google determines a page directly answers a question query.

Knowledge Panels

Information boxes on the right side (desktop) or top (mobile) of the SERP, drawn from Google's Knowledge Graph. Triggered for well-known entities: people, organisations, locations, films, and more.

People Also Ask

An expandable box of related questions, each revealing a Featured Snippet-style answer when clicked. Google dynamically generates additional questions as users interact with the box.

Local Pack

A set of three local business results, typically with a map, shown for queries with local intent. Rankings are based on relevance, distance, and prominence in Google's local algorithm.

Technical Quality Signals

Technical signals affect whether pages can be properly crawled and indexed, and whether they meet Google's quality bar for ranking. Google has confirmed several technical factors as direct ranking signals.

Core Web Vitals

Google's Core Web Vitals became a ranking signal in June 2021 as part of the Page Experience update. The three current Core Web Vitals metrics are:

  • Largest Contentful Paint (LCP): Measures loading performance. Google's target: under 2.5 seconds for a "Good" score.
  • Interaction to Next Paint (INP): Measures responsiveness to user interactions. Replaced First Input Delay in March 2024. Google's target: under 200ms.
  • Cumulative Layout Shift (CLS): Measures visual stability — how much the page layout shifts unexpectedly. Google's target: under 0.1.

HTTPS

Google confirmed HTTPS as a ranking signal in 2014. All pages served over HTTPS receive a lightweight boost compared to equivalent HTTP pages. With HTTPS now ubiquitous, this is largely a baseline requirement rather than an advantage — sites not on HTTPS are at a disadvantage.

Mobile-First Indexing

Google switched to mobile-first indexing by default for all new sites in 2020, and completed the transition for all sites by 2023. This means Google predominantly uses the mobile version of a page for indexing and ranking decisions. Pages that serve significantly different content to mobile and desktop users may have indexing issues if the mobile version has less content or fewer links than the desktop version.

Confirmed vs. Speculative Factors

There are many claimed SEO ranking factors that circulate in the industry. The factors covered in this guide are those confirmed by official Google documentation, Google's own statements, or academic research. Practitioners should be sceptical of claimed factors that lack primary source confirmation, especially those from SEO tool vendors with a commercial interest in promoting certain metrics.

Authentic Sources Used in This Guide

Only official documentation, academic research, and verified industry data. No blogger opinions.

Official · Google Search Central documentation

The primary official source for how Google Search works: crawling, indexing, and structured data.

Academic · "The Anatomy of a Large-Scale Hypertextual Web Search Engine"

Brin & Page (1998), Stanford University. The original Google research paper describing PageRank and the search architecture.

Official · Google Search Quality Rater Guidelines

Public document outlining E-E-A-T and the criteria used to evaluate search result quality.

Official · Core Web Vitals (web.dev)

Google's official guidance on Core Web Vitals metrics, measurement, and optimisation.
