What You Will Learn
- When XML sitemaps genuinely help and when they do not
- Correct XML sitemap format including optional tags
- How to submit sitemaps to Google Search Console and monitor errors
- Robots.txt syntax, how Googlebot reads it, and common mistakes
- All robots.txt directives — Allow, Disallow, Crawl-delay, Sitemap
- The critical difference between robots.txt disallow and noindex meta tag
XML Sitemaps
An XML sitemap is a file listing URLs on your site that you want Google to crawl and index. Submitting a sitemap does not guarantee indexing — Google decides which submitted URLs to crawl and index based on its own quality and relevance assessments. What a sitemap does is ensure Googlebot knows these URLs exist and has a direct path to discover them.
When sitemaps genuinely help
- Large sites. Sites with thousands of pages benefit most from sitemaps — Googlebot may not discover all pages through link crawling alone, especially deep pages with few internal links.
- New sites. A new site with few external backlinks may not be discovered quickly through normal crawling. A sitemap submitted to Search Console accelerates initial discovery.
- Sites with pages not linked internally. Orphan pages — those with no internal links — will not be discovered through crawling. A sitemap listing them gives Google a path to them (though fixing the orphan status is better long-term).
- Sites with frequently updated content. News sites, blogs with high publishing frequency, and e-commerce with frequently changing inventory use sitemaps to signal freshness and prioritise recrawl.
When sitemaps are less important
For small sites (under 500 pages) with solid internal linking structures and existing external backlinks, Google typically discovers and crawls all pages without a sitemap. A sitemap adds no harm but has minimal incremental value in these cases.
Sitemap Format
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/page/</loc>
    <lastmod>2026-04-04</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://www.example.com/another-page/</loc>
    <lastmod>2026-03-15</lastmod>
  </url>
</urlset>
Tag notes:
- <loc> (required): must be the canonical URL, written consistently (including the trailing slash if your canonical URLs use one)
- <lastmod> (recommended): W3C Datetime format, e.g. 2026-04-04; use each page's actual last-modified date, not today's date for every page
- <changefreq> and <priority> (optional): Google largely ignores these in practice
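The structure above can be generated with Python's standard library. This is a minimal sketch; the URLs and dates are placeholders, and it emits <lastmod> only when a real date is available:

```python
# Minimal sitemap generator sketch using only the standard library.
import xml.etree.ElementTree as ET

def build_sitemap(entries):
    """entries: iterable of (loc, lastmod_or_None) pairs."""
    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:  # only emit <lastmod> when a real date is known
            ET.SubElement(url, "lastmod").text = lastmod
    # Prepend the XML declaration the sitemap protocol expects
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            + ET.tostring(urlset, encoding="unicode"))

sitemap = build_sitemap([
    ("https://www.example.com/page/", "2026-04-04"),
    ("https://www.example.com/another-page/", None),
])
print(sitemap)
```

In a real pipeline the entries would come from your CMS or database, with lastmod taken from each page's stored modification timestamp.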
Sitemap index files for large sites
A single sitemap file can contain up to 50,000 URLs and must be no larger than 50 MB uncompressed. Large sites use a sitemap index file that points to multiple individual sitemaps:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-posts.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-products.xml</loc>
  </sitemap>
</sitemapindex>
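Splitting a large URL list into files that respect the 50,000-URL limit is straightforward. A sketch, assuming hypothetical file names of the form sitemap-N.xml:

```python
# Sketch: split a large URL list into sitemap-sized chunks and derive the
# file names a sitemap index would reference. Names are hypothetical.
MAX_URLS = 50_000  # per-file limit from the sitemap protocol

def chunk(urls, size=MAX_URLS):
    """Split a URL list into lists of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

urls = [f"https://www.example.com/p/{i}" for i in range(120_000)]
chunks = chunk(urls)
# File names the index file would point to, one per chunk:
names = [f"https://www.example.com/sitemap-{i + 1}.xml"
         for i in range(len(chunks))]
print(len(chunks))  # 120,000 URLs need 3 files
print(names)
```

Each chunk would then be rendered as its own urlset file, and the names list as the <loc> entries of the sitemapindex.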
Submission and Monitoring
Submit your sitemap to Google via Search Console: Indexing → Sitemaps → Enter sitemap URL → Submit. Once submitted, Google shows the sitemap's discovered URL count, indexed URL count, and any errors. The gap between discovered and indexed URLs is informative: a large gap suggests some URLs are being rejected due to quality issues, canonical conflicts, or crawl budget constraints.
Sitemap best practices
- Only include canonical, indexable URLs — do not include noindex pages, URLs with canonical tags pointing elsewhere, or URLs blocked by robots.txt
- Keep sitemaps current — remove deleted pages, add new pages promptly
- Reference your sitemap location in robots.txt:
Sitemap: https://www.example.com/sitemap.xml
- Keep the sitemap URL consistent; changing it requires resubmission
Robots.txt
Robots.txt is a plain text file located at the root of your domain (https://www.example.com/robots.txt) that uses the Robots Exclusion Protocol to communicate crawling instructions to web robots including Googlebot. Googlebot fetches and reads robots.txt before crawling any other page on a site.
# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /search?
Allow: /public/

User-agent: Googlebot
Disallow: /staging/

User-agent: Googlebot-Image
Disallow: /proprietary-images/

Sitemap: https://www.example.com/sitemap.xml
Robots.txt is read as groups of rules introduced by User-agent lines. User-agent: * applies to all crawlers; specific user agents (Googlebot, Googlebot-Image, Googlebot-Video) can have separate groups, and a crawler follows the most specific group that names it. Within a group, Google applies the most specific (longest) matching rule to each URL: Disallow: /admin/ blocks every URL whose path starts with /admin/, while a longer Allow rule such as Allow: /admin/public/ overrides it for that subpath. When an Allow and a Disallow rule match with equal specificity, Allow wins.
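You can test a robots.txt policy locally with Python's built-in parser. One caveat, noted in the sketch: urllib.robotparser applies rules first-match rather than Google's longest-match, so the Allow line is listed before the broader Disallow here to get equivalent behaviour:

```python
# Sketch: check what a robots.txt policy allows, using the standard library.
# Rules are fed in directly rather than fetched from a live site.
import urllib.robotparser

rules = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Note: Python's parser is first-match; Google uses longest-match.
# Ordering Allow before Disallow makes the two agree for these rules.
print(rp.can_fetch("*", "https://www.example.com/admin/panel"))        # False
print(rp.can_fetch("*", "https://www.example.com/admin/public/page"))  # True
print(rp.can_fetch("*", "https://www.example.com/blog/post"))          # True
```

For production auditing, a parser that implements Google's matching semantics exactly (such as Google's open-source robots.txt parser) is the safer choice.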
Robots.txt Directives
| Directive | Supported by Google | Meaning |
|---|---|---|
| User-agent | Yes | Specifies which crawler the following rules apply to |
| Disallow | Yes | Blocks the crawler from accessing URLs matching this path |
| Allow | Yes | Explicitly allows access to a path within a broader disallowed path |
| Sitemap | Yes | Points crawlers to the site's XML sitemap |
| Crawl-delay | No | Not supported by Google; Googlebot sets its crawl rate automatically (the Search Console crawl rate limiter was retired in 2024) |
| Noindex | No longer supported | Was informally supported; Google officially dropped support in September 2019 |
Noindex vs Disallow — A Critical Distinction
This is one of the most commonly confused technical SEO concepts — and getting it wrong can have serious consequences:
- robots.txt Disallow blocks crawling. Googlebot will not request the blocked URL. However, if external sites link to a disallowed URL, Google can still know about the URL and show it in search results — with no title or description (just the URL), because it cannot crawl the page to understand its content.
- noindex meta tag blocks indexing. The page can be crawled normally, but Google will not include it in its search index. The page will not appear in search results. However, Googlebot must be able to crawl the page to see the noindex tag — if the page is also blocked by robots.txt, the noindex tag cannot be read.
A page that is both blocked by robots.txt and has a noindex meta tag creates a contradiction: Google cannot read the noindex tag because robots.txt prevents access. The page may still appear in search results as a URL-only result because external links reveal its existence. The correct approach: allow crawling (do not disallow in robots.txt) and use noindex in the page's meta robots tag.
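Concretely, the correct setup is a page that robots.txt leaves crawlable, carrying the tag in its <head>. A minimal fragment:

```html
<!-- The page must NOT be disallowed in robots.txt,
     or Googlebot can never fetch the page and see this tag -->
<meta name="robots" content="noindex">
```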
Authentic Sources
Official sitemap documentation including format, submission, and best practices.
Official documentation on robots.txt syntax, supported directives, and how Googlebot reads it.
The noindex meta tag and how it differs from robots.txt disallow.
The official sitemap protocol maintained by Google, Microsoft, Yahoo, and Ask.