What You'll Learn
- How Googlebot discovers and crawls web pages across the internet
- How Google processes content and builds its search index
- The core principles behind Google's ranking algorithm
- Key ranking factors confirmed by official Google documentation
- How SERP features (Featured Snippets, Knowledge Panels) are generated
- What technical signals Google uses to evaluate page quality
How Google Crawls the Web
Google's discovery process begins with Googlebot, an automated web crawler that systematically browses the internet to find and download web pages. According to Google's official Search Central documentation, crawling is the first stage of a three-part process: crawling, indexing, and serving results.
Googlebot starts from a list of known URLs, then follows links on those pages to discover new content. Google maintains a crawl queue — a prioritised list of URLs waiting to be crawled — managed by a process called crawl scheduling. Pages with higher quality signals, more inbound links, or more frequent updates receive more frequent crawling.
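As a rough illustration of how a prioritised crawl queue might work, here is a minimal Python sketch using a max-heap keyed on quality, inbound links, and staleness. The scoring weights are invented for the example; Google does not publish its scheduling formula.

```python
import heapq
import itertools

_counter = itertools.count()  # tie-breaker so heapq never compares URLs directly

class CrawlQueue:
    """Hypothetical priority crawl queue; the priority formula is an assumption."""

    def __init__(self):
        self._heap = []

    def add(self, url: str, quality: float, inbound_links: int, staleness_days: float):
        # Higher-quality, better-linked, staler pages are crawled sooner.
        # heapq is a min-heap, so the score is negated.
        score = quality + 0.1 * inbound_links + 0.05 * staleness_days
        heapq.heappush(self._heap, (-score, next(_counter), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

queue = CrawlQueue()
queue.add("https://example.com/popular", quality=0.9, inbound_links=120, staleness_days=14)
queue.add("https://example.com/obscure", quality=0.3, inbound_links=2, staleness_days=3)
print(queue.pop())  # https://example.com/popular
```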
How Googlebot Discovers URLs
Google uses several methods to discover new pages:
- Sitemaps. Webmasters submit XML sitemaps through Google Search Console to directly inform Google of all important URLs on a site.
- Link following. Googlebot follows hyperlinks on pages it has already crawled, which is why internal linking structure matters for crawl coverage; a minimal extraction sketch follows this list.
- URL submission. Individual URLs can be requested for crawling via the URL Inspection tool in Google Search Console.
- Redirects. When a crawled URL redirects, Google may add the destination URL to its crawl queue.
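To make the link-following step concrete, here is a small sketch that extracts and resolves hyperlinks from fetched HTML using only the Python standard library. A production crawler would also honour robots.txt, rel="nofollow", and canonical hints.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects every <a href> on a page, resolved against the page's own URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

extractor = LinkExtractor("https://example.com/guide/")
extractor.feed('<a href="/pricing">Pricing</a> <a href="faq.html">FAQ</a>')
print(extractor.links)
# ['https://example.com/pricing', 'https://example.com/guide/faq.html']
```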
Crawl Budget
Google uses the concept of crawl budget — the number of URLs Googlebot will crawl on a given site within a given time frame. According to Google's official documentation, crawl budget is determined by crawl rate limit (how fast Googlebot can crawl without overwhelming a server) and crawl demand (how much Google wants to re-crawl pages based on popularity and freshness).
Google's documentation states crawl budget is most significant for sites with more than one million unique URLs, sites with rapid content turnover, or sites with large numbers of low-quality or duplicate pages. For most smaller sites, crawl budget is rarely a limiting factor.
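The crawl-rate-limit half of crawl budget is, on the crawler's side, essentially polite rate limiting. The token-bucket sketch below is an illustrative assumption about how a crawler might cap per-host request rates, not a description of Googlebot's internals; the rate and burst values are placeholders.

```python
import time

class HostRateLimit:
    """Token bucket: a fetch may proceed only when a token is available."""

    def __init__(self, requests_per_second: float, burst: int = 5):
        self.rate = requests_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def acquire(self) -> None:
        while True:
            now = time.monotonic()
            # Refill tokens in proportion to elapsed time, capped at burst size.
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

limiter = HostRateLimit(requests_per_second=2)
for url in ["https://example.com/a", "https://example.com/b", "https://example.com/c"]:
    limiter.acquire()
    print("fetching", url)  # at most ~2 fetches per second once the burst is spent
```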
What Googlebot Renders
Modern Googlebot renders pages using a headless version of Chromium. JavaScript-dependent content can be indexed, though Google notes that JavaScript rendering happens in a second wave — HTML content is indexed faster. Critical content should be accessible in the HTML source for fastest indexing.
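A quick way to act on this advice is to check whether critical text appears in the raw, pre-JavaScript HTML that Googlebot receives on first fetch. The sketch below does exactly that; the URL and phrase are placeholders.

```python
import urllib.request

def in_raw_html(url: str, phrase: str) -> bool:
    """True if the phrase appears in the server-rendered HTML, before any JS runs."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return phrase.lower() in html.lower()

# Example usage (placeholder URL and phrase):
# if not in_raw_html("https://example.com/product", "Add to basket"):
#     print("Critical content is JavaScript-rendered; expect slower indexing.")
```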
Indexation: How Google Understands Content
After crawling a page, Google processes its content to understand what the page is about and whether it deserves to be added to the index. The index is a massive database of processed page information, not a copy of the web: Google stores the words on each page, their locations, and many other signals.
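The classic data structure behind this is an inverted index: a map from each word to the pages and positions where it occurs. The toy sketch below shows the core idea; a real search index stores vastly more signals per entry.

```python
from collections import defaultdict

# word -> list of (url, position) pairs
index: dict[str, list[tuple[str, int]]] = defaultdict(list)

def add_page(url: str, text: str) -> None:
    """Record where each word occurs, rather than storing the page itself."""
    for position, word in enumerate(text.lower().split()):
        index[word].append((url, position))

add_page("https://example.com/a", "google crawls and indexes pages")
add_page("https://example.com/b", "pages are ranked after indexing")

print(index["pages"])
# [('https://example.com/a', 4), ('https://example.com/b', 0)]
```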
The Indexing Pipeline
1. Content extraction. Googlebot downloads the HTML, renders any JavaScript, and extracts text, links, and structured data.
2. Language detection. Google identifies the page's primary language using its language detection systems.
3. Content analysis. Google processes the text to understand entities, topics, relationships, and the overall subject of the page.
4. Canonicalisation. Google selects the canonical URL when multiple pages have similar or identical content, consolidating signals to one preferred version (see the normalisation sketch after this list).
5. Index decision. Based on quality signals, Google decides whether to add the page to the index; low-quality pages may be excluded.
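URL-level normalisation is one small piece of canonicalisation. The sketch below shows a hypothetical normaliser that strips common tracking parameters and unifies case and trailing slashes; Google's actual system also compares page content and link signals when choosing a canonical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative set of parameters to drop; real systems use longer lists.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid"}

def normalise(url: str) -> str:
    """Reduce common URL variants to one comparable form."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, urlencode(sorted(query)), ""))

print(normalise("HTTPS://Example.com/Shoes/?utm_source=x&size=9"))
# https://example.com/Shoes?size=9
```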
Key insight
Google distinguishes between crawled pages and indexed pages. A page can be crawled without being indexed. Google may exclude pages due to thin content, duplicate content, noindex directives, or quality issues identified during processing.
Structured Data and Rich Results
Structured data markup — implemented using JSON-LD, Microdata, or RDFa — helps Google understand the specific type and meaning of content. Schema.org vocabulary is the recommended standard. Structured data can enable Rich Results in the SERP, such as star ratings, recipe information, FAQ accordions, and event listings.
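JSON-LD is typically emitted as a JSON object inside a `<script type="application/ld+json">` tag. Below is a minimal Python sketch that builds Recipe markup using Schema.org vocabulary; the recipe values are placeholders, and the output should be validated with Google's Rich Results Test.

```python
import json

# Placeholder Recipe structured data using Schema.org types.
recipe = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Simple Banana Bread",
    "author": {"@type": "Person", "name": "Example Author"},
    "aggregateRating": {"@type": "AggregateRating",
                        "ratingValue": 4.7, "ratingCount": 312},
}

# Embed the output in the page as:
# <script type="application/ld+json"> ... </script>
print(json.dumps(recipe, indent=2))
```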
Google's Ranking Algorithm
When a user types a query, Google retrieves potentially relevant pages from the index and ranks them using hundreds of signals to determine which results best answer the query. Google has stated its ranking algorithm uses more than 200 signals, though only a subset are publicly confirmed.
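Conceptually, "ranking with many signals" means combining per-page signal values into a single comparable score. The sketch below is purely illustrative: the signal names and weights are invented, and Google does not disclose how its signals are actually combined.

```python
def score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Toy ranking score: a weighted sum of (hypothetical) signal values."""
    return sum(weights.get(name, 0.0) * value for name, value in signals.items())

weights = {"relevance": 0.5, "link_authority": 0.3, "page_experience": 0.2}
page_a = {"relevance": 0.9, "link_authority": 0.4, "page_experience": 0.8}
page_b = {"relevance": 0.7, "link_authority": 0.9, "page_experience": 0.6}

print(score(page_a, weights), score(page_b, weights))  # 0.73 vs 0.74
```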
Query Understanding
Google interprets meaning and intent behind a query — not just keywords — to determine what type of result is most useful.
Relevance Scoring
Content relevance is assessed by matching page topics, entities, and language patterns to the user's query and intent.
Quality Assessment
Google assesses E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) as a framework for evaluating quality.
RankBrain, BERT, and Neural Matching
Google confirmed in 2015 that RankBrain, a machine learning component, helps it understand queries it has not seen before. Google later introduced Neural Matching (2018), BERT (2019), and MUM (2021), progressively more powerful language understanding systems that now influence a significant proportion of all searches.
Google announced that BERT now processes nearly all queries in English and many other languages. BERT understands nuanced context of words within a sentence, improving Google's ability to match conversational queries to the most relevant pages.
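To give a feel for what "matching a query to page text" means at its simplest, here is a toy bag-of-words cosine similarity in Python. This is deliberately crude; systems like RankBrain and BERT operate on learned representations of meaning, not raw word counts.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts as bags of words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

query = "how does google crawl websites"
page = "googlebot is the crawler google uses to crawl websites and pages"
print(round(cosine(query, page), 3))  # higher means more word overlap
```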
Key Ranking Factors Explained
While Google does not publish a complete list of ranking factors, several have been confirmed through official documentation, Google's public statements, or verified internal documentation. Below are the most significant confirmed factors.
| Factor | Type | Source |
|---|---|---|
| Relevance to query intent | Content | Google Search Central docs |
| PageRank / link authority | Links | Original PageRank patent (1998) |
| E-E-A-T signals | Quality | Google Quality Rater Guidelines |
| Core Web Vitals (LCP, INP, CLS) | Technical | Google ranking signal announcement 2021 |
| HTTPS security | Technical | Google confirmed ranking signal 2014 |
| Mobile-friendliness | Technical | Google mobile-first indexing docs |
| Anchor text of inbound links | Links | Original Google research papers |
Google has clarified that E-E-A-T itself is not a direct ranking metric. It is a framework used by human quality raters whose evaluations train machine learning systems that do influence rankings. You cannot optimise for E-E-A-T as a score — you need to genuinely demonstrate it.
SERP Features: Beyond the Blue Links
Modern Google SERPs are far more complex than a simple list of links. Google serves numerous special result formats called SERP features, each designed to answer different query types more directly.
Featured Snippets
A selected result shown above organic listings ("position zero"). Triggered when Google determines a page directly answers a question query.
Knowledge Panels
Information boxes drawn from Google's Knowledge Graph. Triggered for well-known entities: people, organisations, locations, films.
People Also Ask
Expandable box of related questions, each revealing a Featured Snippet-style answer. Google dynamically adds more questions as users interact.
Local Pack
Three local business results with a map, shown for queries with local intent. Ranked by relevance, distance, and prominence.
Technical Quality Signals
Technical signals affect whether pages can be properly crawled and indexed, and whether they meet Google's quality bar for ranking. Google has confirmed several technical factors as direct ranking signals.
Core Web Vitals
Google's Core Web Vitals became a ranking signal in June 2021. The three current metrics are below; a threshold-classification sketch follows the list.
- Largest Contentful Paint (LCP): Loading performance. Google's target: under 2.5 seconds for a "Good" score.
- Interaction to Next Paint (INP): Responsiveness to user interactions. Replaced First Input Delay in March 2024. Target: under 200ms.
- Cumulative Layout Shift (CLS): Visual stability — how much the page layout shifts unexpectedly. Target: under 0.1.
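The published thresholds map onto Google's Good / Needs Improvement / Poor buckets. The helper below encodes them; the threshold values come from Google's documentation, while the function itself is just a convenience sketch.

```python
# (good, poor) boundaries per metric, from Google's published guidance.
THRESHOLDS = {
    "LCP": (2.5, 4.0),      # seconds
    "INP": (0.200, 0.500),  # seconds (200 ms / 500 ms)
    "CLS": (0.1, 0.25),     # unitless layout-shift score
}

def rate(metric: str, value: float) -> str:
    """Classify a metric value into Google's three buckets."""
    good, poor = THRESHOLDS[metric]
    if value <= good:
        return "Good"
    return "Needs Improvement" if value <= poor else "Poor"

print(rate("LCP", 2.1))   # Good
print(rate("INP", 0.35))  # Needs Improvement
print(rate("CLS", 0.31))  # Poor
```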
HTTPS and Mobile-First Indexing
Google confirmed HTTPS as a ranking signal in 2014; it is now a baseline requirement, not a competitive advantage. Google made mobile-first indexing the default for new sites in 2019 and completed the transition for all sites in 2023. Pages that serve significantly different content to mobile and desktop users may have indexing issues if the mobile version is less complete.
Confirmed vs speculative factors
Many claimed ranking factors circulate in the SEO industry. Every factor covered in this guide is confirmed by official Google documentation, Google's own statements, or academic research; be sceptical of factors claimed by SEO tool vendors without primary-source confirmation.
Authentic Sources Used in This Guide
Only official documentation, academic research, and verified data. No blogger opinions.
- Google Search Central documentation. Primary official source for crawling, indexing, and structured data.
- "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Brin & Page (1998), Stanford University. The original Google research paper.
- Google Search Quality Rater Guidelines. Public document outlining E-E-A-T and quality evaluation criteria.
- Google's Core Web Vitals documentation (web.dev). Official guidance on Core Web Vitals metrics and optimisation.