Sitemap + Robots.txt

Sitemaps are a key part of SEO, but they are often misunderstood and blamed for issues they do not actually control.

This document explains what a sitemap actually is, how we handle sitemaps and robots.txt in WordPress (Yoast) and in Sanity + Astro, and where dev responsibility ends and SEO responsibility begins.


A sitemap is a machine-readable list of canonical URLs, usually an XML file, that helps search engines discover URLs faster.

Typical sitemap URLs include:

  • /sitemap.xml
  • /sitemap_index.xml

A sitemap should only contain:

  • URLs that are meant to be indexed
  • Pages returning a 200 status code
  • Final canonical versions of each URL
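For reference, a minimal sitemap has the following shape (the URL and date below are placeholders):

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://www.example.com/about</loc>
      <lastmod>2024-05-01</lastmod>
    </url>
  </urlset>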

The robots.txt file is used to control crawling, not indexing. It tells search engine bots which parts of a site they are allowed or not allowed to access and crawl.

If a URL is blocked in robots.txt, Google will not crawl it, even if:

  • The URL appears in the sitemap
  • The URL is linked internally
  • The URL has external links pointing to it
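For reference, robots.txt is a plain-text file of crawl rules. A minimal illustrative example (the paths are placeholders):

  User-agent: *
  Disallow: /wp-admin/
  Allow: /wp-admin/admin-ajax.php

  Sitemap: https://www.example.com/sitemap_index.xml

Disallow rules stop compliant crawlers from fetching the listed paths; the Sitemap line simply points crawlers at the sitemap location.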

WordPress (Yoast)

The Yoast plugin automatically generates sitemaps based on WordPress configuration and the SEO rules defined in the admin.

Yoast does not decide what should rank or be indexed, but it reflects the decisions already made in WordPress and SEO settings.

It generates:

  • sitemap_index.xml
  • One sitemap per content type
  • A virtual robots.txt file

Regarding behavior:

  • URLs marked as noindex are automatically excluded from the sitemap
  • Indexable URLs are included using their canonical versions
  • Basic allow / disallow crawling rules are added to robots.txt
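As an illustration (the child sitemaps depend on which content types exist and are indexable on the site), the generated sitemap_index.xml looks roughly like this:

  <?xml version="1.0" encoding="UTF-8"?>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
      <loc>https://www.example.com/post-sitemap.xml</loc>
    </sitemap>
    <sitemap>
      <loc>https://www.example.com/page-sitemap.xml</loc>
    </sitemap>
  </sitemapindex>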

A very common scenario is the following: a change is deployed and, suddenly, the sitemap URL no longer returns valid XML.

In WordPress, the sitemap is generated dynamically. This means that any PHP fatal error, even in unrelated code, can completely break the sitemap output and result in invalid XML.

Start by remembering that a sitemap is an XML file, not an HTML page. Always check the raw source of the response instead of the rendered view.

From there, debug step by step:

  • review recently changed functions
  • check the server error logs for PHP fatal errors
  • temporarily disable plugins to rule out conflicts

Sanity + Astro: Sitemap

In Astro, sitemaps are generated manually: if you create a file called sitemap.xml.ts, that TypeScript file is responsible for generating the XML sitemap output.

Create a function that returns an array with the CPTs we know should be indexed. Here’s an example:

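A minimal sketch of that function in sitemap.js; the CPT names below are placeholders, the real list depends on the project:

  // sitemap.js
  // CPTs (Sanity document types) that should be included in the sitemap.
  export function sitemapTypes() {
    return ['page', 'post', 'caseStudy'];
  }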

Create the file sitemap.xml.ts. Astro will automatically expose a /sitemap.xml URL on the site, with whatever content you generate in that file.

This file references the functions defined in sitemap.js:

  • sitemapTypes(), which returns the array of all CPTs to include in the sitemap.
  • getSitemapPages(), which calls Sanity to fetch all pages/documents that should be indexed.

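A minimal sketch of what sitemap.xml.ts can look like, assuming the file lives in src/pages/ so Astro exposes it at /sitemap.xml. The import path, site URL and field names are illustrative, and getSitemapPages() is sketched below:

  // src/pages/sitemap.xml.ts
  import type { APIRoute } from 'astro';
  import { sitemapTypes, getSitemapPages } from '../lib/sitemap.js'; // illustrative path

  type SitemapPage = { slug: string; lastModified: string };

  export const GET: APIRoute = async () => {
    // All indexable pages for the configured CPTs, fetched from Sanity.
    const pages: SitemapPage[] = await getSitemapPages(sitemapTypes());

    // One <url> entry per page, using its canonical URL and last modified date.
    const urls = pages
      .map(
        (page) =>
          `<url><loc>https://www.example.com/${page.slug}</loc><lastmod>${page.lastModified}</lastmod></url>`
      )
      .join('');

    const xml =
      '<?xml version="1.0" encoding="UTF-8"?>' +
      `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">${urls}</urlset>`;

    return new Response(xml, { headers: { 'Content-Type': 'application/xml' } });
  };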

Loop over all the CPTs and query Sanity to get the correct information for each page. This returns all pages that should be indexed, with their slug and last modified date.

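A sketch of what getSitemapPages() in sitemap.js can look like, here using @sanity/client directly; the project ID, dataset and field names are illustrative, and the project may already have a shared client helper:

  // sitemap.js
  import { createClient } from '@sanity/client';

  const client = createClient({
    projectId: 'your-project-id', // illustrative
    dataset: 'production',
    apiVersion: '2024-01-01',
    useCdn: true,
  });

  // Fetch the slug and last modified date of every published document
  // whose type is in the list of indexable CPTs.
  export async function getSitemapPages(types) {
    const query = `*[_type in $types && defined(slug.current) && !(_id in path("drafts.**"))]{
      "slug": slug.current,
      "lastModified": _updatedAt
    }`;

    return client.fetch(query, { types });
  }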

Sanity + Astro: Robots

In Sanity, we create a textarea field under Settings / SEO where we define the required robots.txt configuration.

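A sketch of what that field can look like in the Sanity schema; the field name and placement are illustrative:

  // Field inside the Settings / SEO document schema in Sanity Studio.
  import { defineField } from 'sanity';

  export const robotsTxtField = defineField({
    name: 'robotsTxt',
    title: 'robots.txt',
    type: 'text', // multi-line text, shown as a textarea in the Studio
    rows: 10,
    description: 'Contents served at /robots.txt',
  });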

In Astro, we read that information in a file called robots.txt.ts, located at the root. This file fetches the data from Sanity and uses it to dynamically generate the robots.txt file.

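A minimal sketch of robots.txt.ts, assuming it lives in src/pages/ and that a shared Sanity client is available; the import path, query and field names are illustrative:

  // src/pages/robots.txt.ts
  import type { APIRoute } from 'astro';
  import { client } from '../lib/sanityClient'; // illustrative shared client helper

  export const GET: APIRoute = async () => {
    // Read the robots.txt textarea defined under Settings / SEO in Sanity.
    const robotsTxt = await client.fetch(
      '*[_type == "settings"][0].seo.robotsTxt'
    );

    // Serve it as plain text at /robots.txt.
    return new Response(robotsTxt ?? '', {
      headers: { 'Content-Type': 'text/plain' },
    });
  };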


Submitting a sitemap in Google Search Console helps Google discover and re-crawl URLs more efficiently.

Google automatically revisits submitted sitemaps periodically, but manually submitting or re-submitting one can help accelerate the crawling process after significant structural changes.

  • Log in to Google Search Console using the Terra dev account.
  • Select the correct project (property) from the top-left selector.
  • Go to Sitemaps
  • Enter the sitemap path: sitemap.xml
  • Click Submit

If the sitemap was already submitted, you can resubmit it to signal Google to review it again.



  • Decide early whether a piece of content should be indexable, so you can include or exclude it and avoid future issues
  • Don’t use sitemaps to “fix” bad linking
  • Don’t include filtered URLs, pagination, tracking params, or non-canonical URLs in the sitemap
  • A sitemap should reflect intentional decisions, not defaults
  • Use robots.txt to control crawling, not indexing
  • Avoid blocking URLs that are meant to be indexed, even temporarily
  • Do not include URLs in the sitemap that are blocked in robots.txt
  • Keep robots.txt rules minimal and explicit to avoid accidental over-blocking
  • Avoid blocking assets (CSS, JS, images) required to correctly render indexable pages
  • Always check the final rendered robots.txt and sitemap URLs after a deploy
  • In practice:
    • Devs control what is exposed
    • SEO decides what should be indexed
    • Regenerating the sitemap rarely fixes the problem; the fix is usually in the logic or in indexability
