We build applications that combine many dynamic pages (product catalogues) with regular CMS content (blog, news, guides). For some sites that means roughly 50 CMS pages versus 10,000 product pages.
The DatoCMS search indexer will crawl all pages, respecting the robots.txt settings.
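If the product URLs all shared an obvious prefix, a simple robots.txt rule would already solve this. Something like the following, assuming a /products/ prefix that our URLs don't necessarily have (and I'm not sure of the exact user agent the DatoCMS crawler sends, so I've used a wildcard):

```
# Hypothetical rule, assuming product pages lived under /products/
User-agent: *
Disallow: /products/
```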
But if we can't create rules like that in robots.txt (because content and product URLs may not be distinguishable), is there another way to limit which pages get crawled and indexed? We could add data-datocms-noindex to the product pages (see the snippet below), but then the crawler would still have to fetch every page first.
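To clarify what I mean by that, the idea would be wrapping the entire page body of each product page, roughly like this (the markup is just a placeholder for our templates):

```html
<!-- Product page template: everything inside data-datocms-noindex is
     excluded from the index, but the crawler still has to fetch the
     page, which is exactly the overhead I'd like to avoid -->
<body>
  <main data-datocms-noindex>
    <h1>Product title</h1>
    <!-- ...rest of the product detail markup... -->
  </main>
</body>
```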
I'd rather not have the crawler check 10,050 pages only to keep 50 of them.
I can't find anything about this in the documentation, so it's probably not an option, but I wanted to double-check. Being able to point the crawler at a sitemap containing only the CMS content (with recursive crawling disabled, obviously) would be great; I'm imagining something like the sketch below.
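That is, a small, CMS-only sitemap we'd generate ourselves (domain and paths here are just placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sitemap listing only the ~50 CMS pages, so the crawler never has to
     discover the 10,000 product pages through recursive crawling -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/blog/some-post</loc></url>
  <url><loc>https://www.example.com/guides/some-guide</loc></url>
  <!-- ...remaining CMS pages... -->
</urlset>
```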