We build applications that combine many dynamic pages (product catalogues) with regular CMS content (blog, news, guides). For some sites that means roughly 50 CMS pages versus 10,000 product pages.
The DatoCMS search indexer will crawl all pages, respecting the robots.txt settings.
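If the product URLs all shared an obvious prefix, a simple robots.txt rule would already solve this. Something like the following, assuming a /products/ prefix that our URLs don't necessarily have (and I'm not sure of the exact user agent the DatoCMS crawler sends, so I've used a wildcard):

```
# Hypothetical rule, assuming product pages lived under /products/
User-agent: *
Disallow: /products/
```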
But if we can't create rules like that in robots.txt (because content and product URLs may not be distinguishable), is there another way to limit which pages get crawled and indexed? We could add data-datocms-noindex to the product pages (see the snippet below), but then the crawler would still have to fetch every page first.
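To clarify what I mean by that, the idea would be wrapping the entire page body of each product page, roughly like this (the markup is just a placeholder for our templates):

```html
<!-- Product page template: everything inside data-datocms-noindex is
     excluded from the index, but the crawler still has to fetch the
     page, which is exactly the overhead I'd like to avoid -->
<body>
  <main data-datocms-noindex>
    <h1>Product title</h1>
    <!-- ...rest of the product detail markup... -->
  </main>
</body>
```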
I'd rather not have the crawler check 10,050 pages only to keep 50 of them.
I can't find anything about this in the documentation, so it's probably not an option, but I wanted to double-check. Being able to point the crawler at a sitemap containing only the CMS content (with recursive crawling disabled, obviously) would be great; I'm imagining something like the sketch below.
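That is, a small, CMS-only sitemap we'd generate ourselves (domain and paths here are just placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sitemap listing only the ~50 CMS pages, so the crawler never has to
     discover the 10,000 product pages through recursive crawling -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/blog/some-post</loc></url>
  <url><loc>https://www.example.com/guides/some-guide</loc></url>
  <!-- ...remaining CMS pages... -->
</urlset>
```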