What You Will Learn
- When XML sitemaps genuinely help and when they do not
- Correct XML sitemap format including optional tags
- How to submit sitemaps to Google Search Console and monitor errors
- Robots.txt syntax, how Googlebot reads it, and common mistakes
- All robots.txt directives — Allow, Disallow, Crawl-delay, Sitemap
- The critical difference between robots.txt disallow and noindex meta tag
XML Sitemaps
An XML sitemap is a file listing URLs on your site that you want Google to crawl and index. Submitting a sitemap does not guarantee indexing — Google decides which submitted URLs to crawl and index based on its own quality and relevance assessments. What a sitemap does is ensure Googlebot knows these URLs exist and has a direct path to discover them.
When sitemaps genuinely help
- Large sites. Sites with thousands of pages benefit most from sitemaps — Googlebot may not discover all pages through link crawling alone, especially deep pages with few internal links.
- New sites. A new site with few external backlinks may not be discovered quickly through normal crawling. A sitemap submitted to Search Console accelerates initial discovery.
- Sites with pages not linked internally. Orphan pages — those with no internal links — will not be discovered through crawling. A sitemap listing them gives Google a path to them (though fixing the orphan status is better long-term).
- Sites with frequently updated content. News sites, blogs with high publishing frequency, and e-commerce with frequently changing inventory use sitemaps to signal freshness and prioritise recrawl.
When sitemaps are less important
For small sites (under 500 pages) with solid internal linking structures and existing external backlinks, Google typically discovers and crawls all pages without a sitemap. A sitemap adds no harm but has minimal incremental value in these cases.
Sitemap Format
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/page/</loc>
    <lastmod>2026-04-04</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://www.example.com/another-page/</loc>
    <lastmod>2026-03-15</lastmod>
  </url>
</urlset>
Tag notes:
- <loc> (required): must be the canonical URL, written consistently (including the trailing slash if your canonical URLs use one)
- <lastmod> (recommended): W3C Datetime format, e.g. 2026-04-04; use each page's actual last-modified date, not today's date for every page
- <changefreq> and <priority> (optional): Google largely ignores these in practice
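The structure above can be generated with Python's standard library. This is a minimal sketch; the URLs and dates are placeholders, and it emits <lastmod> only when a real date is available:

```python
# Minimal sitemap generator sketch using only the standard library.
import xml.etree.ElementTree as ET

def build_sitemap(entries):
    """entries: iterable of (loc, lastmod_or_None) pairs."""
    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:  # only emit <lastmod> when a real date is known
            ET.SubElement(url, "lastmod").text = lastmod
    # Prepend the XML declaration the sitemap protocol expects
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            + ET.tostring(urlset, encoding="unicode"))

sitemap = build_sitemap([
    ("https://www.example.com/page/", "2026-04-04"),
    ("https://www.example.com/another-page/", None),
])
print(sitemap)
```

In a real pipeline the entries would come from your CMS or database, with lastmod taken from each page's stored modification timestamp.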
Sitemap index files for large sites
A single sitemap file can contain up to 50,000 URLs and must be no larger than 50 MB uncompressed. Large sites use a sitemap index file that points to multiple individual sitemaps:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-posts.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-products.xml</loc>
  </sitemap>
</sitemapindex>
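Splitting a large URL list into files that respect the 50,000-URL limit is straightforward. A sketch, assuming hypothetical file names of the form sitemap-N.xml:

```python
# Sketch: split a large URL list into sitemap-sized chunks and derive the
# file names a sitemap index would reference. Names are hypothetical.
MAX_URLS = 50_000  # per-file limit from the sitemap protocol

def chunk(urls, size=MAX_URLS):
    """Split a URL list into lists of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

urls = [f"https://www.example.com/p/{i}" for i in range(120_000)]
chunks = chunk(urls)
# File names the index file would point to, one per chunk:
names = [f"https://www.example.com/sitemap-{i + 1}.xml"
         for i in range(len(chunks))]
print(len(chunks))  # 120,000 URLs need 3 files
print(names)
```

Each chunk would then be rendered as its own urlset file, and the names list as the <loc> entries of the sitemapindex.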
Submission and Monitoring
Submit your sitemap to Google via Search Console: Indexing → Sitemaps → Enter sitemap URL → Submit. Once submitted, Google shows the sitemap's discovered URL count, indexed URL count, and any errors. The gap between discovered and indexed URLs is informative: a large gap suggests some URLs are being rejected due to quality issues, canonical conflicts, or crawl budget constraints.
Sitemap best practices
- Only include canonical, indexable URLs — do not include noindex pages, URLs with canonical tags pointing elsewhere, or URLs blocked by robots.txt
- Keep sitemaps current — remove deleted pages, add new pages promptly
- Reference your sitemap location in robots.txt:
Sitemap: https://www.example.com/sitemap.xml
- Keep the sitemap URL consistent; changing it requires resubmission
Robots.txt
Robots.txt is a plain text file located at the root of your domain (https://www.example.com/robots.txt) that uses the Robots Exclusion Protocol to communicate crawling instructions to web robots including Googlebot. Googlebot fetches and reads robots.txt before crawling any other page on a site.
# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /search?
Allow: /public/

User-agent: Googlebot
Disallow: /staging/

User-agent: Googlebot-Image
Disallow: /proprietary-images/

Sitemap: https://www.example.com/sitemap.xml
Robots.txt is read as groups of rules introduced by User-agent lines. User-agent: * applies to all crawlers; specific user agents (Googlebot, Googlebot-Image, Googlebot-Video) can have separate groups, and a crawler follows the most specific group that names it. Within a group, Google applies the most specific (longest) matching rule to each URL: Disallow: /admin/ blocks every URL whose path starts with /admin/, while a longer Allow rule such as Allow: /admin/public/ overrides it for that subpath. When an Allow and a Disallow rule match with equal specificity, Allow wins.
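You can test a robots.txt policy locally with Python's built-in parser. One caveat, noted in the sketch: urllib.robotparser applies rules first-match rather than Google's longest-match, so the Allow line is listed before the broader Disallow here to get equivalent behaviour:

```python
# Sketch: check what a robots.txt policy allows, using the standard library.
# Rules are fed in directly rather than fetched from a live site.
import urllib.robotparser

rules = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Note: Python's parser is first-match; Google uses longest-match.
# Ordering Allow before Disallow makes the two agree for these rules.
print(rp.can_fetch("*", "https://www.example.com/admin/panel"))        # False
print(rp.can_fetch("*", "https://www.example.com/admin/public/page"))  # True
print(rp.can_fetch("*", "https://www.example.com/blog/post"))          # True
```

For production auditing, a parser that implements Google's matching semantics exactly (such as Google's open-source robots.txt parser) is the safer choice.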
Robots.txt Directives
| Directive | Supported by Google | Meaning |
|---|---|---|
| User-agent | Yes | Specifies which crawler the following rules apply to |
| Disallow | Yes | Blocks the crawler from accessing URLs matching this path |
| Allow | Yes | Explicitly allows access to a path within a broader disallowed path |
| Sitemap | Yes | Points crawlers to the site's XML sitemap |
| Crawl-delay | No | Not supported by Google; Googlebot sets its crawl rate automatically (the Search Console crawl rate limiter was retired in 2024) |
| Noindex | No longer supported | Was informally supported; Google officially dropped support in September 2019 |
Noindex vs Disallow — A Critical Distinction
This is one of the most commonly confused technical SEO concepts — and getting it wrong can have serious consequences:
- robots.txt Disallow blocks crawling. Googlebot will not request the blocked URL. However, if external sites link to a disallowed URL, Google can still know about the URL and show it in search results — with no title or description (just the URL), because it cannot crawl the page to understand its content.
- noindex meta tag blocks indexing. The page can be crawled normally, but Google will not include it in its search index. The page will not appear in search results. However, Googlebot must be able to crawl the page to see the noindex tag — if the page is also blocked by robots.txt, the noindex tag cannot be read.
A page that is both blocked by robots.txt and has a noindex meta tag creates a contradiction: Google cannot read the noindex tag because robots.txt prevents access. The page may still appear in search results as a URL-only result because external links reveal its existence. The correct approach: allow crawling (do not disallow in robots.txt) and use noindex in the page's meta robots tag.
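Concretely, the correct setup is a page that robots.txt leaves crawlable, carrying the tag in its <head>. A minimal fragment:

```html
<!-- The page must NOT be disallowed in robots.txt,
     or Googlebot can never fetch the page and see this tag -->
<meta name="robots" content="noindex">
```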
Authentic Sources
Official sitemap documentation including format, submission, and best practices.
Official documentation on robots.txt syntax, supported directives, and how Googlebot reads it.
The noindex meta tag and how it differs from robots.txt disallow.
The official sitemap protocol maintained by Google, Microsoft, Yahoo, and Ask.