What You'll Learn
- How Googlebot discovers and crawls web pages across the internet
- How Google processes content and builds its search index
- The core principles behind Google's ranking algorithm
- Key ranking factors confirmed by official Google documentation
- How SERP features (Featured Snippets, Knowledge Panels) are generated
- What technical signals Google uses to evaluate page quality
How Google Crawls the Web
Google's discovery process begins with Googlebot, an automated web crawler that systematically browses the internet to find and download web pages. According to Google's official Search Central documentation, crawling is the first stage of a three-part process: crawling, indexing, and serving results.
Googlebot starts from a list of known URLs, then follows links on those pages to discover new content. Google maintains a crawl queue — a prioritised list of URLs waiting to be crawled — managed by a process called crawl scheduling. Pages with higher quality signals, more inbound links, or more frequent updates receive more frequent crawling.
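As a rough illustration of how a prioritised crawl queue might work, here is a minimal Python sketch using a max-heap keyed on quality, inbound links, and staleness. The scoring weights are invented for the example; Google does not publish its scheduling formula.

```python
import heapq
import itertools

_counter = itertools.count()  # tie-breaker so heapq never compares URLs directly

class CrawlQueue:
    """Hypothetical priority crawl queue; the priority formula is an assumption."""

    def __init__(self):
        self._heap = []

    def add(self, url: str, quality: float, inbound_links: int, staleness_days: float):
        # Higher-quality, better-linked, staler pages are crawled sooner.
        # heapq is a min-heap, so the score is negated.
        score = quality + 0.1 * inbound_links + 0.05 * staleness_days
        heapq.heappush(self._heap, (-score, next(_counter), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

queue = CrawlQueue()
queue.add("https://example.com/popular", quality=0.9, inbound_links=120, staleness_days=14)
queue.add("https://example.com/obscure", quality=0.3, inbound_links=2, staleness_days=3)
print(queue.pop())  # https://example.com/popular
```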
How Googlebot Discovers URLs
Google uses several methods to discover new pages:
- Sitemaps. Webmasters submit XML sitemaps through Google Search Console to directly inform Google of all important URLs on a site.
- Link following. Googlebot follows hyperlinks on pages it has already crawled, which is why internal linking structure matters for crawl coverage; a minimal extraction sketch follows this list.
- URL submission. Individual URLs can be requested for crawling via the URL Inspection tool in Google Search Console.
- Redirects. When a crawled URL redirects, Google may add the destination URL to its crawl queue.
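To make the link-following step concrete, here is a small sketch that extracts and resolves hyperlinks from fetched HTML using only the Python standard library. A production crawler would also honour robots.txt, rel="nofollow", and canonical hints.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects every <a href> on a page, resolved against the page's own URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

extractor = LinkExtractor("https://example.com/guide/")
extractor.feed('<a href="/pricing">Pricing</a> <a href="faq.html">FAQ</a>')
print(extractor.links)
# ['https://example.com/pricing', 'https://example.com/guide/faq.html']
```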
Crawl Budget
Google uses the concept of crawl budget — the number of URLs Googlebot will crawl on a given site within a given time frame. According to Google's official documentation, crawl budget is determined by crawl rate limit (how fast Googlebot can crawl without overwhelming a server) and crawl demand (how much Google wants to re-crawl pages based on popularity and freshness).
Google's documentation states crawl budget is most significant for sites with more than one million unique URLs, sites with rapid content turnover, or sites with large numbers of low-quality or duplicate pages. For most smaller sites, crawl budget is rarely a limiting factor.
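The crawl-rate-limit half of crawl budget is, on the crawler's side, essentially polite rate limiting. The token-bucket sketch below is an illustrative assumption about how a crawler might cap per-host request rates, not a description of Googlebot's internals; the rate and burst values are placeholders.

```python
import time

class HostRateLimit:
    """Token bucket: a fetch may proceed only when a token is available."""

    def __init__(self, requests_per_second: float, burst: int = 5):
        self.rate = requests_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def acquire(self) -> None:
        while True:
            now = time.monotonic()
            # Refill tokens in proportion to elapsed time, capped at burst size.
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

limiter = HostRateLimit(requests_per_second=2)
for url in ["https://example.com/a", "https://example.com/b", "https://example.com/c"]:
    limiter.acquire()
    print("fetching", url)  # at most ~2 fetches per second once the burst is spent
```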
What Googlebot Renders
Modern Googlebot renders pages using a headless version of Chromium. JavaScript-dependent content can be indexed, though Google notes that JavaScript rendering happens in a second wave — HTML content is indexed faster. Critical content should be accessible in the HTML source for fastest indexing.
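A quick way to act on this advice is to check whether critical text appears in the raw, pre-JavaScript HTML that Googlebot receives on first fetch. The sketch below does exactly that; the URL and phrase are placeholders.

```python
import urllib.request

def in_raw_html(url: str, phrase: str) -> bool:
    """True if the phrase appears in the server-rendered HTML, before any JS runs."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return phrase.lower() in html.lower()

# Example usage (placeholder URL and phrase):
# if not in_raw_html("https://example.com/product", "Add to basket"):
#     print("Critical content is JavaScript-rendered; expect slower indexing.")
```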
Indexation: How Google Understands Content
After crawling a page, Google processes its content to understand what the page is about and whether it deserves to be added to the index. The index is a massive database of processed page information, not a copy of the web: Google stores the words on each page, their locations, and many other signals.
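The classic data structure behind this is an inverted index: a map from each word to the pages and positions where it occurs. The toy sketch below shows the core idea; a real search index stores vastly more signals per entry.

```python
from collections import defaultdict

# word -> list of (url, position) pairs
index: dict[str, list[tuple[str, int]]] = defaultdict(list)

def add_page(url: str, text: str) -> None:
    """Record where each word occurs, rather than storing the page itself."""
    for position, word in enumerate(text.lower().split()):
        index[word].append((url, position))

add_page("https://example.com/a", "google crawls and indexes pages")
add_page("https://example.com/b", "pages are ranked after indexing")

print(index["pages"])
# [('https://example.com/a', 4), ('https://example.com/b', 0)]
```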
The Indexing Pipeline
1. Content extraction. Googlebot downloads the HTML, renders any JavaScript, and extracts text, links, and structured data.
2. Language detection. Google identifies the page's primary language using its language detection systems.
3. Content analysis. Google processes the text to understand entities, topics, relationships, and the overall subject of the page.
4. Canonicalisation. Google selects the canonical URL when multiple pages have similar or identical content, consolidating signals to one preferred version (see the normalisation sketch after this list).
5. Index decision. Based on quality signals, Google decides whether to add the page to the index; low-quality pages may be excluded.
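URL-level normalisation is one small piece of canonicalisation. The sketch below shows a hypothetical normaliser that strips common tracking parameters and unifies case and trailing slashes; Google's actual system also compares page content and link signals when choosing a canonical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative set of parameters to drop; real systems use longer lists.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid"}

def normalise(url: str) -> str:
    """Reduce common URL variants to one comparable form."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, urlencode(sorted(query)), ""))

print(normalise("HTTPS://Example.com/Shoes/?utm_source=x&size=9"))
# https://example.com/Shoes?size=9
```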
Key insight
Google distinguishes between crawled pages and indexed pages. A page can be crawled without being indexed. Google may exclude pages due to thin content, duplicate content, noindex directives, or quality issues identified during processing.
Structured Data and Rich Results
Structured data markup — implemented using JSON-LD, Microdata, or RDFa — helps Google understand the specific type and meaning of content. Schema.org vocabulary is the recommended standard. Structured data can enable Rich Results in the SERP, such as star ratings, recipe information, FAQ accordions, and event listings.
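JSON-LD is typically emitted as a JSON object inside a `<script type="application/ld+json">` tag. Below is a minimal Python sketch that builds Recipe markup using Schema.org vocabulary; the recipe values are placeholders, and the output should be validated with Google's Rich Results Test.

```python
import json

# Placeholder Recipe structured data using Schema.org types.
recipe = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Simple Banana Bread",
    "author": {"@type": "Person", "name": "Example Author"},
    "aggregateRating": {"@type": "AggregateRating",
                        "ratingValue": 4.7, "ratingCount": 312},
}

# Embed the output in the page as:
# <script type="application/ld+json"> ... </script>
print(json.dumps(recipe, indent=2))
```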
Google's Ranking Algorithm
When a user types a query, Google retrieves potentially relevant pages from the index and ranks them using hundreds of signals to determine which results best answer the query. Google has stated its ranking algorithm uses more than 200 signals, though only a subset are publicly confirmed.
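Conceptually, "ranking with many signals" means combining per-page signal values into a single comparable score. The sketch below is purely illustrative: the signal names and weights are invented, and Google does not disclose how its signals are actually combined.

```python
def score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Toy ranking score: a weighted sum of (hypothetical) signal values."""
    return sum(weights.get(name, 0.0) * value for name, value in signals.items())

weights = {"relevance": 0.5, "link_authority": 0.3, "page_experience": 0.2}
page_a = {"relevance": 0.9, "link_authority": 0.4, "page_experience": 0.8}
page_b = {"relevance": 0.7, "link_authority": 0.9, "page_experience": 0.6}

print(score(page_a, weights), score(page_b, weights))  # 0.73 vs 0.74
```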
Query Understanding
Google interprets meaning and intent behind a query — not just keywords — to determine what type of result is most useful.
Relevance Scoring
Content relevance is assessed by matching page topics, entities, and language patterns to the user's query and intent.
Quality Assessment
Google assesses E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) as a framework for evaluating quality.
RankBrain, BERT, and Neural Matching
Google confirmed in 2015 that RankBrain, a machine learning component, helps it understand queries it has not seen before. Google later introduced Neural Matching (2018), BERT (2019), and MUM (2021), progressively more powerful language understanding systems that now influence a significant proportion of all searches.
Google announced that BERT now processes nearly all queries in English and many other languages. BERT understands nuanced context of words within a sentence, improving Google's ability to match conversational queries to the most relevant pages.
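To give a feel for what "matching a query to page text" means at its simplest, here is a toy bag-of-words cosine similarity in Python. This is deliberately crude; systems like RankBrain and BERT operate on learned representations of meaning, not raw word counts.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts as bags of words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

query = "how does google crawl websites"
page = "googlebot is the crawler google uses to crawl websites and pages"
print(round(cosine(query, page), 3))  # higher means more word overlap
```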
Key Ranking Factors Explained
While Google does not publish a complete list of ranking factors, several have been confirmed through official documentation, Google's public statements, or verified internal documentation. Below are the most significant confirmed factors.
| Factor | Type | Source |
|---|---|---|
| Relevance to query intent | Content | Google Search Central docs |
| PageRank / link authority | Links | Original PageRank patent (1998) |
| E-E-A-T signals | Quality | Google Quality Rater Guidelines |
| Core Web Vitals (LCP, INP, CLS) | Technical | Google ranking signal announcement 2021 |
| HTTPS security | Technical | Google confirmed ranking signal 2014 |
| Mobile-friendliness | Technical | Google mobile-first indexing docs |
| Anchor text of inbound links | Links | Original Google research papers |
Google has clarified that E-E-A-T itself is not a direct ranking metric. It is a framework used by human quality raters whose evaluations train machine learning systems that do influence rankings. You cannot optimise for E-E-A-T as a score — you need to genuinely demonstrate it.
SERP Features: Beyond the Blue Links
Modern Google SERPs are far more complex than a simple list of links. Google serves numerous special result formats called SERP features, each designed to answer different query types more directly.
Featured Snippets
A selected result shown above organic listings ("position zero"). Triggered when Google determines a page directly answers a question query.
Knowledge Panels
Information boxes drawn from Google's Knowledge Graph. Triggered for well-known entities: people, organisations, locations, films.
People Also Ask
Expandable box of related questions, each revealing a Featured Snippet-style answer. Google dynamically adds more questions as users interact.
Local Pack
Three local business results with a map, shown for queries with local intent. Ranked by relevance, distance, and prominence.
Technical Quality Signals
Technical signals affect whether pages can be properly crawled and indexed, and whether they meet Google's quality bar for ranking. Google has confirmed several technical factors as direct ranking signals.
Core Web Vitals
Google's Core Web Vitals became a ranking signal in June 2021. The three current metrics are below; a threshold-classification sketch follows the list.
- Largest Contentful Paint (LCP): Loading performance. Google's target: under 2.5 seconds for a "Good" score.
- Interaction to Next Paint (INP): Responsiveness to user interactions. Replaced First Input Delay in March 2024. Target: under 200ms.
- Cumulative Layout Shift (CLS): Visual stability — how much the page layout shifts unexpectedly. Target: under 0.1.
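The published thresholds map onto Google's Good / Needs Improvement / Poor buckets. The helper below encodes them; the threshold values come from Google's documentation, while the function itself is just a convenience sketch.

```python
# (good, poor) boundaries per metric, from Google's published guidance.
THRESHOLDS = {
    "LCP": (2.5, 4.0),      # seconds
    "INP": (0.200, 0.500),  # seconds (200 ms / 500 ms)
    "CLS": (0.1, 0.25),     # unitless layout-shift score
}

def rate(metric: str, value: float) -> str:
    """Classify a metric value into Google's three buckets."""
    good, poor = THRESHOLDS[metric]
    if value <= good:
        return "Good"
    return "Needs Improvement" if value <= poor else "Poor"

print(rate("LCP", 2.1))   # Good
print(rate("INP", 0.35))  # Needs Improvement
print(rate("CLS", 0.31))  # Poor
```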
HTTPS and Mobile-First Indexing
Google confirmed HTTPS as a ranking signal in 2014; it is now a baseline requirement, not a competitive advantage. Google made mobile-first indexing the default for new sites in 2019 and completed the transition for all sites in 2023. Pages that serve significantly different content to mobile and desktop users may have indexing issues if the mobile version is less complete.
Confirmed vs speculative factors
Many claimed ranking factors circulate in the SEO industry. Every factor covered in this guide is confirmed by official Google documentation, Google's own statements, or academic research; be sceptical of factors claimed by SEO tool vendors without primary-source confirmation.
Authentic Sources Used in This Guide
Only official documentation, academic research, and verified data. No blogger opinions.
- Google Search Central documentation. Primary official source for crawling, indexing, and structured data.
- "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Brin & Page (1998), Stanford University. The original Google research paper.
- Google Search Quality Rater Guidelines. Public document outlining E-E-A-T and quality evaluation criteria.
- Google's Core Web Vitals documentation (web.dev). Official guidance on Core Web Vitals metrics and optimisation.