What You'll Learn
- What crawl budget is and when it actually matters for your site
- How to write and audit a robots.txt file correctly
- How to create and submit XML sitemaps that help Google prioritise crawling
- How canonicalisation works and how to avoid self-inflicted duplicate content problems
- When to use noindex and how it differs from blocking crawling
- How to find and fix crawl coverage issues in Google Search Console
Crawl Budget: What It Is and When It Matters
Crawl budget is the number of URLs on your site that Googlebot will crawl and process within a given time frame. Google's official documentation defines it as the combination of two factors: the crawl rate limit (how fast Googlebot can crawl without overloading your server) and crawl demand (how much Google wants to crawl your pages based on their perceived value and freshness).
The term "crawl budget" is frequently misunderstood in the SEO industry. Google has been explicit that for the vast majority of websites, crawl budget is not a concern. According to Google's official Search Central blog, crawl budget only becomes a meaningful factor for sites with more than one million unique URLs, sites where large portions of the site are updated on a rapid basis, or sites with significant numbers of low-quality, duplicate, or redirected pages consuming crawl resources without adding index value.
- When budget matters: the number of URLs before crawl budget is a real concern
- Crawl rate limit: the typical pause between Googlebot requests
- Re-crawl frequency: the average re-crawl interval for established pages
What Consumes Crawl Budget Inefficiently
Google identifies several categories of URLs that waste crawl budget without contributing to indexing or ranking:
- Faceted navigation URLs. E-commerce sites often generate thousands of filter combination URLs (e.g. /shoes?colour=red&size=42&brand=nike). Most of these pages are near-duplicates and provide little unique value.
- Session IDs in URLs. If your site appends session tokens to URLs, Googlebot sees each session as a unique URL even though the content is identical.
- Soft 404 pages. Pages that return a 200 HTTP status but display "no results found" or similar messages trick Googlebot into treating empty pages as crawlable content.
- Redirect chains. Long chains of redirects consume crawl budget at each hop. Google recommends keeping redirects to a single hop where possible.
- Low-quality or thin content pages. Pages with very little unique content that Google is unlikely to index consume budget without return.
Google's Gary Illyes stated in 2017: "If you're a small or medium site (say, a few thousand URLs), you likely don't need to worry about crawl budget." The concern is primarily for large-scale sites. Source: Google Search Central Blog.
How to Improve Crawl Efficiency
Even if crawl budget is not a crisis for your site, improving crawl efficiency is good practice. The most effective actions are: consolidating duplicate content through canonicalisation and redirects, preventing the crawling of URL parameters that don't produce unique content (note that Google Search Console's URL Parameters tool was retired in 2022, so this is now handled with robots.txt rules and canonical tags), reducing redirect chains to single hops, and ensuring your server responds quickly (slow server responses cause Googlebot to crawl less aggressively to avoid overloading it).
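Redirect chains are easy to audit programmatically. The sketch below is a minimal chain-walker, not a production crawler: the `fetch` callable is injected (here simulated with a dictionary of hypothetical URLs) so the hop-counting logic is testable without network access.

```python
from typing import Callable, Optional
from urllib.parse import urljoin

def redirect_chain(url: str,
                   fetch: Callable[[str], tuple[int, Optional[str]]],
                   max_hops: int = 10) -> list[str]:
    """Follow redirects hop by hop and return every URL in the chain.

    `fetch` takes a URL and returns (status_code, Location header or None);
    injecting it keeps the walker testable without network access.
    """
    chain = [url]
    for _ in range(max_hops):
        status, location = fetch(chain[-1])
        if status in (301, 302, 307, 308) and location:
            # Location may be relative; resolve it against the current URL.
            chain.append(urljoin(chain[-1], location))
        else:
            return chain  # final destination reached
    raise RuntimeError(f"more than {max_hops} redirects starting at {url!r}")

# A chain longer than one hop wastes crawl budget at every intermediate URL:
hops = {
    "https://example.com/old": (301, "/interim"),
    "https://example.com/interim": (301, "/new"),
    "https://example.com/new": (200, None),
}
chain = redirect_chain("https://example.com/old", lambda u: hops[u])
print(len(chain) - 1)  # 2 hops: worth collapsing to a single 301
```

Any chain of length two or more found this way is a candidate for collapsing: point the first URL directly at the final destination.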
robots.txt: Controlling Crawl Access
The robots.txt file is a plain text file placed at the root of a domain (e.g. https://example.com/robots.txt) that tells crawlers which parts of the site they should not access. It is governed by the Robots Exclusion Protocol, first introduced in 1994 and formalised as an IETF proposed standard (RFC 9309) in 2022.
It is critical to understand what robots.txt does and does not do. It controls crawling, not indexing. If you block a URL in robots.txt, Googlebot will not crawl it — but if that URL is linked to from other pages, Google may still index it (as a URL without content). This distinction is important: blocking a page in robots.txt does not guarantee it will not appear in search results. To prevent a page from appearing in search results, you need a noindex directive.
robots.txt Syntax
A robots.txt file consists of one or more groups, each starting with a User-agent line followed by Disallow or Allow directives. Google supports these directives:
# Block all crawlers from the /admin/ directory
User-agent: *
Disallow: /admin/
# Block Googlebot specifically from staging content
User-agent: Googlebot
Disallow: /staging/
# Allow all crawlers access to everything
User-agent: *
Disallow:
# Point to the XML sitemap location
Sitemap: https://example.com/sitemap.xml
A common historical mistake was blocking CSS and JavaScript directories to save crawl budget. Google now renders pages with Chromium and needs access to these resources to properly evaluate and render your pages. Blocking them causes Google to see a broken version of your page, which can negatively impact rankings. Source: Google Search Central documentation.
Common robots.txt Mistakes
- Blocking the entire site. A Disallow: / directive blocks all crawling. This is sometimes accidentally left in place from development environments.
- Case sensitivity. robots.txt path matching is case-sensitive: Disallow: /Admin/ does not block /admin/.
- Using robots.txt for confidential pages. robots.txt is publicly readable. Do not list confidential URLs there hoping to hide them — this is counterproductive. Use server authentication instead.
- Blocking pages you want ranked. Any page you want to appear in Google search results must be crawlable. Verify your robots.txt does not accidentally block important content.
Testing your robots.txt
Google Search Console provides a robots.txt report (which replaced the standalone robots.txt Tester tool in 2023) showing which robots.txt files Google found for your site, when they were last fetched, and any parsing errors or warnings. Check it after any change to your robots.txt.
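You can also sanity-check rules locally with Python's standard-library urllib.robotparser, which implements the same exclusion protocol. One behaviour worth testing for is group selection: per the protocol, a crawler obeys only the most specific matching group, so a Googlebot-specific group replaces the * group rather than adding to it.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/

User-agent: Googlebot
Disallow: /staging/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Crawlers without their own group fall back to the * group:
print(parser.can_fetch("SomeBot", "https://example.com/admin/"))          # False

# Googlebot matches its own group, so only /staging/ is blocked for it.
# Groups do not combine: /admin/ is crawlable by Googlebot under these rules.
print(parser.can_fetch("Googlebot", "https://example.com/staging/page"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/admin/"))        # True
```

If you want a named crawler to inherit the generic rules, you must repeat them inside that crawler's group.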
XML Sitemaps: Helping Google Discover and Prioritise
An XML sitemap is a file that lists the URLs on your site you want search engines to crawl and index. It serves as a direct communication channel to Google — telling it which pages exist, when they were last updated, and optionally how frequently they change. The XML sitemap format is standardised at sitemaps.org and is supported by Google, Bing, and other search engines.
Sitemaps are particularly valuable for large sites with deep URL structures, new sites with few external inbound links (which would otherwise limit Google's ability to discover pages through crawling), and sites with pages that are not easily discovered through internal linking (such as pages accessible only via search forms or heavy use of JavaScript).
XML Sitemap Structure
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/seo/how-google-works/</loc>
    <lastmod>2026-04-04</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
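For generated sites, the safest way to produce this structure is with an XML library rather than string concatenation. A minimal sketch using Python's standard library (it emits only loc and lastmod, since Google has said it ignores changefreq and priority):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages: list[tuple[str, str]]) -> str:
    """Serialise (url, lastmod) pairs into a minimal sitemap document."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod  # W3C date, e.g. 2026-04-04
    return ET.tostring(urlset, encoding="unicode", xml_declaration=True)

xml_text = build_sitemap([("https://example.com/seo/how-google-works/", "2026-04-04")])
print(xml_text)
```

Using a serialiser guarantees correct escaping of URLs containing & or other reserved characters, a common cause of sitemap parse errors.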
What to Include and Exclude
Your sitemap should only include URLs you want indexed. This means:
- Include: Canonical versions of your important pages, paginated pages (the first page at minimum), and alternate language versions using hreflang.
- Exclude: Pages blocked by robots.txt (a contradiction Google flags as an error), pages with a noindex meta tag, redirect URLs (list the final destination only), URLs with parameters that produce duplicate content, and pages you do not want indexed.
Sitemap Index Files
Sites with more than 50,000 URLs in a single sitemap, or sitemaps exceeding 50MB uncompressed, should use a sitemap index file — a sitemap of sitemaps. This allows you to split your sitemap into logical groups (e.g. one for blog posts, one for product pages) and submit a single index file to Google Search Console.
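Generating the index is mechanical once you know the URL count. The sketch below assumes child files are published as sitemap-1.xml, sitemap-2.xml, and so on; the naming scheme is illustrative, not required by the protocol.

```python
import xml.etree.ElementTree as ET

MAX_URLS = 50_000  # per-sitemap URL limit from the sitemaps.org protocol

def build_sitemap_index(base_url: str, total_urls: int) -> str:
    """Emit a sitemap index referencing one child sitemap per 50,000 URLs."""
    n_files = -(-total_urls // MAX_URLS)  # ceiling division
    index = ET.Element("sitemapindex",
                       xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for i in range(1, n_files + 1):
        sm = ET.SubElement(index, "sitemap")
        # Hypothetical naming scheme: sitemap-1.xml, sitemap-2.xml, ...
        ET.SubElement(sm, "loc").text = f"{base_url}/sitemap-{i}.xml"
    return ET.tostring(index, encoding="unicode", xml_declaration=True)

index_xml = build_sitemap_index("https://example.com", 120_000)
# 120,000 URLs need 3 child sitemaps
```

Splitting by content type instead of by count (one child sitemap per section) makes the GSC "Submitted vs Indexed" comparison far more diagnostic.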
Submitting Your Sitemap
Submit your sitemap via Google Search Console under Indexing → Sitemaps. You should also include a Sitemap: directive in your robots.txt file pointing to the sitemap location — this allows any crawler, not just Google, to discover it automatically.
Google treats sitemaps as a signal about what you want crawled, not as a guaranteed crawl list. Including a URL in your sitemap does not guarantee it will be crawled or indexed. Exclusion signals (noindex, robots.txt disallow, low-quality content) can still prevent indexing even for sitemapped pages.
Canonicalisation: Solving Duplicate Content
Canonicalisation is the process of selecting the preferred version of a URL when multiple URLs serve the same or very similar content. Google consolidates ranking signals such as links and PageRank onto the canonical URL and indexes that single preferred version. Without canonicalisation, sites with duplicate content dilute their own link equity and send confusing signals to Google about which page should rank.
Common Sources of Duplicate Content
- HTTP vs HTTPS versions of the same page
- www vs non-www versions (e.g. www.example.com vs example.com)
- Trailing slash vs no trailing slash (/page/ vs /page)
- URL parameters that don't change content (e.g. tracking parameters like ?utm_source=email)
- Printer-friendly or AMP versions of pages
- Paginated content where page 1 and the /page/1/ URL are both accessible
- Product pages accessible under multiple category paths
Canonicalisation Signals Google Uses
Google uses multiple signals to determine the canonical URL. In order of typical influence:
1. 301/308 redirects. The strongest signal. If you permanently redirect one URL to another, Google will treat the destination as canonical.
2. rel="canonical" link element. A <link rel="canonical" href="..."> tag in the <head> of a page. Google treats this as a strong hint but not an absolute directive.
3. Sitemap inclusion. URLs listed in your sitemap are treated as preferred versions. Including only canonical URLs in your sitemap reinforces the signal.
4. Internal link consistency. The URL you most frequently link to internally is treated as a signal that it is the preferred version.
Google has explicitly stated that the canonical tag is treated as a "hint" rather than a strict instruction. Google may choose a different canonical if other signals (such as internal linking patterns or external link profiles) contradict the stated canonical. If you need to enforce canonicalisation, use 301 redirects.
Self-Referential Canonicals
Best practice is to include a self-referential canonical tag on every page — a page pointing to itself as the canonical. This prevents Googlebot from treating URL parameter variants as separate pages if your server accidentally serves them. A self-referential canonical also makes your canonical declarations consistent and easier to audit.
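Checking for self-referential canonicals is easy to automate. This is a simplified audit sketch using the standard-library HTML parser: it assumes the attribute is written exactly rel="canonical" and compares URLs as literal strings, so it won't catch multi-valued rel attributes or trivially different but equivalent URLs.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> in a page."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical" and self.canonical is None:
            self.canonical = a.get("href")

def is_self_canonical(page_url: str, html: str) -> bool:
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical == page_url

page = '<html><head><link rel="canonical" href="https://example.com/page/"></head></html>'
print(is_self_canonical("https://example.com/page/", page))                   # True
# The tracking-parameter variant declares the clean URL as canonical,
# which is exactly the intended behaviour (canonicalised elsewhere):
print(is_self_canonical("https://example.com/page/?utm_source=email", page))  # False
```

In a real audit you would run this against every crawled URL and flag pages where no canonical was found at all.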
Noindex and Blocking: Keeping Pages Out of the Index
Noindex is a directive that tells Google not to include a page in its search index. Unlike robots.txt, which controls crawling, a noindex directive requires Google to crawl the page — it reads the directive and then removes or declines to add the page to the index. This is an important distinction: if you both block a page in robots.txt and add a noindex meta tag, Google cannot read the noindex tag because it cannot crawl the page.
Methods for Implementing Noindex
| Method | Where Added | Use Case |
|---|---|---|
| <meta name="robots" content="noindex"> | HTML <head> | Most common. Prevents indexing of an individual page. |
| X-Robots-Tag: noindex | HTTP response header | For non-HTML files (PDFs, images) or when you can't edit the HTML. |
| Google Search Console removal tool | GSC interface | Temporary removal of URLs already in the index (lasts 6 months). |
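The two noindex channels above can be checked together in an audit. The sketch below is my own simplification, not a Google algorithm: it combines the X-Robots-Tag response header and the robots meta tag, and ignores crawler-specific variants such as a googlebot-scoped tag.

```python
def is_indexable(status: int, headers: dict[str, str], meta_robots: str = "") -> bool:
    """Rough indexability check for a page Googlebot was able to crawl.

    `meta_robots` is the content of the page's <meta name="robots"> tag, if any.
    Simplified heuristic: ignores user-agent-specific directives.
    """
    if status != 200:
        return False  # non-200 pages are not indexed as normal content
    directives = (headers.get("X-Robots-Tag", "") + "," + meta_robots).lower()
    return "noindex" not in directives

print(is_indexable(200, {}))                                   # True
print(is_indexable(200, {"X-Robots-Tag": "noindex"}))          # False
print(is_indexable(200, {}, meta_robots="noindex, nofollow"))  # False
```

Remember the prerequisite from above: this check only means anything for pages Googlebot can actually crawl; a robots.txt block hides both signals.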
When to Use Noindex
Noindex is appropriate for pages that should exist on your site but should not appear in search results:
- Thank-you pages after form submissions
- Internal search results pages
- Login and account management pages
- Staging or preview pages accessible publicly but not intended for search
- Low-value tag or archive pages on blog platforms
- Paginated pages beyond page 2 (optional — depends on content value)
Adding noindex to a page that is already indexed does not immediately remove it. Googlebot must re-crawl the page, read the directive, and process the removal. For large sites this can take weeks. For urgent removal, use the URL Removal tool in Google Search Console alongside a noindex tag.
Index Coverage Issues: What They Mean and How to Fix Them
Google Search Console's Index Coverage report (now called the Pages report under Indexing) shows how Google has classified all discovered URLs on your site. Understanding these status categories is essential for diagnosing crawlability and indexation problems.
| GSC Status | Meaning | Action |
|---|---|---|
| Indexed | Page is in Google's index and eligible to appear in search results | No action needed |
| Crawled – currently not indexed | Google crawled the page but chose not to index it (quality signal) | Improve page quality or consolidate with canonical |
| Discovered – currently not indexed | Google knows the URL exists but hasn't crawled it yet | Improve internal linking; check crawl budget |
| Duplicate, Google chose different canonical | Google selected a different URL as the canonical than what you declared | Audit canonical signals; check redirect chains and internal linking |
| Excluded by noindex | Page was not indexed because of a noindex directive | Intentional — verify it is correct; remove noindex if page should be indexed |
| Blocked by robots.txt | Googlebot was prevented from crawling by robots.txt | Verify intentional; if not, update robots.txt |
| Not found (404) | Page returned a 404 error | Redirect to relevant page if content has moved; leave as 404 if genuinely deleted |
| Soft 404 | Page returns 200 but displays no useful content | Return proper 404/410, or add meaningful content |
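Soft 404s can be flagged in a crawl with a crude heuristic: a 200 response whose body is nearly empty or contains a not-found phrase. The phrase list and word-count threshold below are my own illustrative choices, not Google's criteria.

```python
NOT_FOUND_PHRASES = ("no results found", "page not found", "0 items")

def looks_like_soft_404(status: int, body_text: str, min_words: int = 30) -> bool:
    """Flag 200 responses that behave like a 404. Heuristic only."""
    if status != 200:
        return False  # a real 404/410 is correct behaviour, not a soft 404
    text = body_text.lower()
    if any(phrase in text for phrase in NOT_FOUND_PHRASES):
        return True
    # Very short bodies on a 200 are suspicious too
    return len(text.split()) < min_words

print(looks_like_soft_404(200, "Sorry, no results found for your search."))  # True
print(looks_like_soft_404(404, "Gone."))                                     # False
```

Pages flagged this way should either return a proper 404/410 status or be given genuinely useful content, as the table above recommends.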
The "Crawled — Currently Not Indexed" Problem
This is one of the most common and misunderstood GSC statuses. When Google crawls a page but declines to index it, the most common reasons are: the content is too thin or provides little value beyond other indexed pages, the page is near-duplicate of another already-indexed page, the page has very few or no inbound internal links, or Google's quality systems have flagged the page as low-quality.
The fix depends on the cause. For thin content pages, either substantially improve the content quality or consolidate multiple thin pages into one comprehensive page. For near-duplicates, implement canonical tags pointing to the preferred version. For pages with poor internal linking, add contextually relevant internal links from high-authority pages on your site.
Running a Crawl and Indexation Audit
A crawl and indexation audit gives you a complete picture of how Google sees your site's structure. Here is a systematic approach using free tools — primarily Google Search Console.
1. Check your robots.txt. Visit yourdomain.com/robots.txt directly, and use GSC's robots.txt report to verify no critical pages or resources are accidentally blocked.
2. Review the GSC Pages report. Under Indexing → Pages, review the breakdown of indexed vs non-indexed URLs. Export the full list and categorise the reasons for non-indexing.
3. Audit your sitemap. Submit your sitemap in GSC and check the "Submitted" vs "Indexed" count. A significant gap signals either quality issues or canonicalisation problems.
4. Check for canonical conflicts. Look for pages where GSC shows "Duplicate, Google chose different canonical than user". These indicate a mismatch between your declared canonicals and the signals Google is reading.
5. Crawl your site with a crawler. Tools like Screaming Frog (free up to 500 URLs) or Sitebulb allow you to crawl your site and identify redirect chains, broken internal links, missing canonical tags, and pages returning incorrect status codes.
6. Check server response time. Slow server responses reduce crawl rate. Use GSC's Crawl Stats report (Settings → Crawl Stats) to see Google's average response time for your server over the past 90 days.
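The submitted-vs-indexed gap from step 3 can be computed directly once you have your sitemap XML and a page-level export from GSC. A standard-library sketch; note the "URL" column name is an assumption, since GSC export headers vary by report and language.

```python
import csv
import io
import xml.etree.ElementTree as ET

def sitemap_urls(xml_text: str) -> set[str]:
    """Extract every <loc> from a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return {loc.text.strip()
            for loc in ET.fromstring(xml_text).findall(".//sm:loc", ns)}

def indexed_urls(gsc_csv: str, column: str = "URL") -> set[str]:
    """Read a page-level GSC export; the column name is an assumption."""
    return {row[column] for row in csv.DictReader(io.StringIO(gsc_csv))}

sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a/</loc></url>
  <url><loc>https://example.com/b/</loc></url>
</urlset>"""
export = "URL\nhttps://example.com/a/\n"

gap = sitemap_urls(sitemap) - indexed_urls(export)
print(gap)  # submitted but not indexed: {'https://example.com/b/'}
```

Each URL in the gap set then needs a diagnosis from the Pages report: quality issue, duplicate with a different chosen canonical, or simply not yet crawled.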
GSC crawl stats report
Google Search Console's Crawl Stats report (under Settings) shows exactly how many pages Googlebot crawled per day over the last 90 days, the average response time, and the breakdown of crawl requests by file type. This is the most direct way to observe your actual crawl budget in use.
Authentic Sources Used in This Guide
Official documentation, academic standards, and verified technical sources only.
- Google Search Central: official documentation on Googlebot, crawl rate, and crawl budget.
- Google Search Central: official guidance on robots.txt syntax and best practices.
- Google Search Central: official sitemap documentation including format and submission guidance.
- Google Search Central: official guidance on duplicate content and canonical URL selection.
- RFC 9309: IETF proposed standard formalising the robots.txt specification (2022).