We've been dealing with a connection issue we haven't been able to track down on our own. Would anyone from Dato's team be able to help us debug the issue?
We are currently using the same content model across multiple projects. Each project serves two different application instances. Currently, one instance of one project (instance A) fails when it attempts to form a connection with Dato. We don't get very descriptive details as to why, and we've not been able to increase logging to highlight a cause.
More specifically, we are generating static pages with Next. The static generation fails by timing out, and it appears to be due to requests we are making to Dato. Next does not list exact causes, but Dato is the only data source we use for static pages.
We are having a hard time reproducing the issue in any context other than instance A. We've attempted to build the Docker image locally and rerun our test environments, and as mentioned, the build worked for instance B, the other instance that uses the same Dato project as instance A. We can't get the content/model combination to break in any other context. For these reasons, we are beginning to think that the issue is not in the content model or the content, but maybe in the pipes that move the data.
Are there scenarios where Dato would block requests from a source on a per-environment basis? In this case, instance A would be blocked from requesting data from the new version of the primary environment. This seems unlikely. Would it be possible to check logs from Dato's end in case they highlight issues we are not seeing?
Using LogLevel.BODY_AND_HEADERS should make the client print out verbose messages that might help you troubleshoot what's going on.
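For reference, a minimal configuration sketch, assuming the `@datocms/cma-client-node` package (if you're using a different Dato client, the option and enum names may differ):

```javascript
// Sketch only: enable verbose request/response logging on the DatoCMS client.
// Assumes @datocms/cma-client-node; the env var name is hypothetical.
const { buildClient, LogLevel } = require('@datocms/cma-client-node');

const client = buildClient({
  apiToken: process.env.DATOCMS_API_TOKEN,
  logLevel: LogLevel.BODY_AND_HEADERS, // logs headers and bodies of every call
});
```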
Or, inside your containers, are you able to make a simple curl request to the endpoints, outside of the Next context? (You can drop into a shell to run the command manually, or just make a simple JS fetch request from command-line node or a small script.) Lastly, can you access a shell with the same IP address as the Docker container, but outside the container itself (i.e. can you run a regular shell/laptop from the same WiFi)? That should at least let you determine whether it's an issue with Next, something in the container, or something in the network layer.
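As a sketch of that kind of probe from command-line node (the endpoint URL and test query are assumptions; substitute your own token):

```javascript
// Sketch: probe the GraphQL CDA endpoint directly, bypassing Next entirely,
// with an explicit timeout so a hang is distinguishable from a fast failure.
async function probe(url, token, timeoutMs = 10000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${token}`,
        'Content-Type': 'application/json',
      },
      // Minimal query just to confirm connectivity; adjust as needed.
      body: JSON.stringify({ query: '{ _site { locales } }' }),
      signal: controller.signal,
    });
    console.log('HTTP', res.status);
    return res.status;
  } finally {
    clearTimeout(timer);
  }
}

// Example call (assumed endpoint and env var):
// probe('https://graphql.datocms.com/', process.env.DATOCMS_API_TOKEN);
```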
Oh, and… you did set the API keys in all the containers, right? (Just making sure there wasn't a forgotten env var or such.)
If you're able to give us a few specific IP addresses (that you're trying to reach us from), we can also check to see if any of them were inadvertently blocked by a firewall or such. We don't often hear of that, but we can check for you. Also, knowing which specific endpoints you're trying to reach would help (our asset CDN? GraphQL CDA API? REST CMA API? some combination of them?)
We have logging enabled for the requests we are making to Dato. Normally, errors do get posted in the logs when requests to Dato error out. However, in this specific instance they seem to be swallowed by Next's build process. We are in a difficult position because instance A is a production instance, and as such we have limited ability to do discovery with it. We've been trying our hardest to reproduce the issue in some other environment where we would have free rein to test out ideas.
We are making requests to the GraphQL CDA through our own client. The client handles rate limiting.
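For context, the rate-limit handling is roughly of this shape (a generic, simplified sketch; the function names are ours, and the `X-RateLimit-Reset` header is an assumption about the API's rate-limit headers):

```javascript
// Simplified sketch of retry-on-429 handling of the kind described above.
// Header name is an assumption; adjust to the API's actual headers.
async function withRateLimitRetry(doRequest, maxRetries = 3) {
  for (let attempt = 0; ; attempt++) {
    const res = await doRequest();
    if (res.status !== 429 || attempt >= maxRetries) return res;
    // Wait for the advertised reset window, falling back to one second.
    const waitSecs = Number(res.headers?.get?.('X-RateLimit-Reset')) || 1;
    await new Promise((resolve) => setTimeout(resolve, waitSecs * 1000));
  }
}
```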
One detail I didn't include, which I now understand is relevant, is that it's only a specific version ("newest version") of the content model that doesn't seem to work. We have currently rolled back the content model of the project instance A depends on. The older content model works; with the new content model, requests fail. This would suggest that API keys and other configuration should be correct, in principle. As said, we were quite convinced that the issue was in the content or the model, but we just haven't been able to use the combination of content and content model to reproduce the issue anywhere other than instance A, so we are trying to look at other causes.
Unfortunately, we do not track the IP address of the build pipeline, and we likely can't find it after the fact. Would it be possible to provide some other means by which you could find the relevant logs? We have details about the requests such as the environment, the user agent, key timestamps and so forth. Apologies, I do understand that this is a long shot and a tricky ask.
We have possibly discovered one symptom. When the content and content model are accessed through a sandbox (in other words, when the environment is not set as the primary environment), the problem does not occur. The problem only seems to occur when the environment is set as the primary environment.
We were wondering whether this could indicate that the combination of content and content model is in some way incompatible with primary environments specifically. We are under the impression that there are some key differences between the primary environment and sandboxes in how requests are handled, one of which is how content is cached. Does this sound at all plausible? As pure speculation, we see it as possible that instance B managed to build correctly because it was the first to request the data and as such got fresh, non-cached data, while instance A asked for the data later and received a cached response that is in some way unexpected on our end.
We are still working to create a reproduction which we hope will reveal more details.
We were able to reproduce the error in another production-like instance. In an attempt to create a reproduction completely separate from our production services, we duplicated the project instance A was hooked up to, migrated it to match the latest content model, and attempted a build against it. "Unfortunately", that build worked.
So far, we've been able to reproduce the issue with two different instances and two different environments (that used the same content model and had very similar content). Based on our experience, the build only fails when we request content from the primary environment.
When using the same content and content model through another project, the build no longer seems to fail.