The documentation for the DatoCMS Search API says that it supports sitemap index files. What I'm trying to figure out is the best way to make sure Dato's crawler is aware of the location and name of my site's sitemap index file. For my site, the index lives at /sitemap-index.xml, and the following link tag is added to my site's head tag: <link rel="sitemap" type="application/xml" href="/sitemap-index.xml">. However, the number of pages indexed during a crawl is way lower than the number of pages in my sitemap. Any guidance on how I can ensure that the crawler sees my sitemap index file and crawls all the pages in it?
Hey @sdunham,
Sorry for the lack of clarity there! Could you please try naming the index sitemap.xml, like our own https://www.datocms.com/sitemap.xml?
In it, you can still link to other sitemaps, like:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.datocms.com/sitemap-blog.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.datocms.com/sitemap-docs.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.datocms.com/sitemap-partners.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.datocms.com/sitemap-product-updates.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.datocms.com/sitemap-static.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.datocms.com/sitemap-marketplace.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.datocms.com/sitemap-academy.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.datocms.com/sitemap-compare.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.datocms.com/sitemap-user-guides.xml</loc>
  </sitemap>
</sitemapindex>
Does that work?
A dev looked into the situation, and you're right: the default is to only look for sitemap.xml, not any variants.
However, you should also be able to specify your sitemap(s) in your robots.txt, like:
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap2.xml
Could you please let me know if that works?
Great, thanks for the clarification. The update that adds our sitemap URL to our robots.txt file on deploy is pretty much done. Once it's through code review, I can try a deploy and see the result. I'll report back.
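For reference, the deploy step is essentially just writing the file at build time. A simplified sketch (the output path and the example.com domain are placeholders, not our actual setup):

// Write a robots.txt at build time that points the crawler at the sitemap index.
// Hypothetical output path; example.com stands in for our real domain.
import { writeFileSync } from "node:fs";

const robots = [
  "User-agent: *",
  "Allow: /",
  "Sitemap: https://example.com/sitemap-index.xml",
  "", // trailing newline
].join("\n");

writeFileSync("public/robots.txt", robots);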
In case you’re able to check before I’m able to deploy and test: My assumption is that I can add a single Sitemap directive which points to my sitemap index file, and that I don’t need to add each individual sitemap URL. Is that a correct assumption?
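In other words, the robots.txt I'm planning to ship would look something like this (example.com standing in for our actual domain):

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap-index.xml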
Checking for you! I'll report back as soon as I find out.
Just merged my update and did a test build/crawl.
Pages indexed before sitemap added to robots.txt: 355
Pages indexed after sitemap added to robots.txt: 996 (pretty much the number of pages in the sitemap)
Seems like good evidence that the crawler is picking up the sitemap data now. Thanks for the help with this!
Great, thank you @sdunham! The devs just confirmed that behavior too, but you already figured it out.