Only a single page is getting indexed after search spidering

mohamed.younes · January 31, 2024, 5:01pm

Whenever I trigger a reindexing through the DatoCMS API (/reindex), I get a success in the build trigger log. However, the number of indexed pages is ALWAYS 1.
Any explanation of why this happens?

I believe this is a duplicate of “Get only one page after search spidering” (#2 by m.finamor),
but as the solution hasn’t been published to the community, I am asking again here.
Thanks.

roger · January 31, 2024, 11:02pm

Hi @mohamed.younes, welcome to the forum and sorry about that! This is the kind of thing we’d need to investigate on a case-by-case basis. Could you please email us at support@datocms.com with the site URL you’re trying to index (or just post it here if you don’t mind sharing it with the public)?

We’ll look into it for you after that. Thanks!

mohamed.younes · February 1, 2024, 8:47am

OK, I have now been informed by email (thanks for the quick support) that the origin of the issue was a subtle redirect.
In fact, I set https://www.my-website.com/en and it was redirected to https://www.my-website.com/en/. That trailing slash required a 301, which “blocked” the crawler from working.

My suggestion is:

Couldn’t the crawler be improved so that it ignores 301 as long as the domain stays the same?
In any case, I think it would be nice to include this detail in the documentation.
Thanks again.

roger · February 1, 2024, 10:29pm

Sorry, it looks like I got back to this thread late and Marcelo already helped you there, so I’m not sure how it was set up previously. Do you mean our crawler failed to follow a legitimate 301 that should’ve taken it to the right place?

…or did you mean that the crawler successfully followed a 301, but that took it somewhere else and it couldn’t find your sitemap after that?

I don’t think “ignore 301s to the same domain” is a sensible rule, because sometimes people do use that to redirect within the same site (just a different page).

Maybe I’m misunderstanding what you meant?

mohamed.younes · February 5, 2024, 3:05pm

Well, right now the missing trailing slash triggers a 301, but it also causes the crawler to stop/fail, and it ends up indexing only a single page.

What I would’ve expected is that it follows the redirect and keeps indexing as long as the redirected resource is a subroute of the initial route.
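
To make the expectation above concrete, here is a minimal sketch of such a rule in Python. It is only an illustration of the idea, not how the DatoCMS crawler actually works; the function name `is_subroute_redirect` is made up for this example.

```python
from urllib.parse import urlsplit


def is_subroute_redirect(start_url: str, redirected_url: str) -> bool:
    """Hypothetical rule: accept a redirect only if it stays on the same
    scheme/host/port and lands on the start path or somewhere below it."""
    start = urlsplit(start_url)
    target = urlsplit(redirected_url)

    # A change of scheme, host, or port is never treated as a subroute.
    if (start.scheme, start.hostname, start.port) != (
        target.scheme,
        target.hostname,
        target.port,
    ):
        return False

    # Ignore trailing slashes so that /en and /en/ compare as equal.
    base = start.path.rstrip("/")
    path = target.path.rstrip("/")

    # Either the same route, or a deeper one (base + "/" avoids matching /english).
    return path == base or path.startswith(base + "/")


if __name__ == "__main__":
    print(is_subroute_redirect(
        "https://www.my-website.com/en",
        "https://www.my-website.com/en/",
    ))
```

Under this rule, the trailing-slash redirect from /en to /en/ would be accepted, while a redirect to /fr/ or to a different host would not.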

roger · February 5, 2024, 10:07pm

Gotcha, thanks for clarifying! I’ll report it to the devs.

roger · February 6, 2024, 3:24pm

@mohamed.younes,

The devs looked into it and said that we do follow redirects, as long as they stay within the same domain/subdomain. They believe that in your case, it was redirecting from my-example.com (no www) to www.my-example.com (with the www), which is a bit different from what you said in post #3 (adding the www would be different from just modifying the trailing slash).

Can you please confirm if that was indeed the case (i.e., whether you also redirected to the www subdomain)?

Technically, my-example.com would be a different host from www.my-example.com in most implementations, including ours, and that’s probably not something we would change, because there are some cross-site security concerns here. However, if you redirect from www.my-example.com/page1 to www.my-example.com/page2, we should be able to follow that.
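
As a rough illustration of the same-host rule described above (not our actual crawler code), the sketch below follows a simulated redirect chain and gives up as soon as a hop leaves the starting host. The function names and the `redirects` dictionary are hypothetical and exist only for this example.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit


def host_of(url: str) -> Optional[str]:
    """Return the lowercased hostname of a URL (www.x.com and x.com differ)."""
    return urlsplit(url).hostname


def resolve_start_url(
    start_url: str, redirects: Dict[str, str], max_hops: int = 5
) -> Optional[str]:
    """Follow simulated redirects while the host stays the same.

    `redirects` maps a URL to the URL it redirects to (a stand-in for real
    HTTP 301 responses). Returns the final URL, or None if a hop crosses
    to another host or the chain is longer than `max_hops`.
    """
    current = start_url
    for _ in range(max_hops):
        target = redirects.get(current)
        if target is None:
            return current  # no further redirect: this is the final URL
        if host_of(target) != host_of(start_url):
            return None  # cross-host redirect: stop here
        current = target
    return None  # too many hops


if __name__ == "__main__":
    same_host = {"https://www.my-example.com/page1": "https://www.my-example.com/page2"}
    cross_host = {"https://my-example.com/": "https://www.my-example.com/"}
    print(resolve_start_url("https://www.my-example.com/page1", same_host))
    print(resolve_start_url("https://my-example.com/", cross_host))
```

In this sketch the page1 to page2 redirect is followed, while the redirect from my-example.com to www.my-example.com is rejected because the hostnames differ.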

I hope that clarifies this behavior? If we are mistaken and you weren’t redirecting across hosts, please let us know!