We've been dealing with a connection issue we haven't been able to track down on our own. Would anyone from Dato's team be able to help us debug the issue?
We are currently using the same content model across multiple projects. Each project serves two different application instances. Currently, one instance of one project (instance A) fails when it attempts to form a connection with Dato. We don't get very descriptive details as to why, and we've not been able to increase logging to highlight a cause.
More specifically, we are generating static pages with Next. The static generation fails by timing out, and it appears to be due to requests we are making to Dato. Next does not list exact causes, but Dato is the only data source we use for static pages.
We are having a hard time reproducing the issue in any context other than instance A. We've attempted to build the Docker image locally and rerun our test environments, and as mentioned, the build worked for instance B, the other instance that uses the same Dato project as instance A. We can't get the content/model combination to break in any other context. For these reasons, we are beginning to think that the issue is not in the content model or the content, but maybe in the pipes that move the data.
Are there scenarios where Dato would block requests from a source on a per-environment basis? In this case, instance A would be blocked from requesting data from the new version of the primary environment. This seems unlikely. Would it be possible to check logs from Dato's end in case they highlight issues we are not seeing?
Using LogLevel.BODY_AND_HEADERS should make the client print out verbose messages that might help you troubleshoot what's going on.
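For reference, a minimal configuration sketch, assuming the `@datocms/cma-client-node` package (if you're using a different Dato client, the option and enum names may differ):

```javascript
// Sketch only: enable verbose request/response logging on the DatoCMS client.
// Assumes @datocms/cma-client-node; the env var name is hypothetical.
const { buildClient, LogLevel } = require('@datocms/cma-client-node');

const client = buildClient({
  apiToken: process.env.DATOCMS_API_TOKEN,
  logLevel: LogLevel.BODY_AND_HEADERS, // logs headers and bodies of every call
});
```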
Or, inside your containers, are you able to make a simple curl request to the endpoints, outside of the Next context? (You can drop into a shell to run the command manually, or just make a simple JS fetch request from command-line node or a small script.) Lastly, can you access a shell with the same IP address as the Docker container, but outside the container itself (i.e. can you run a regular shell/laptop from the same WiFi)? That should at least let you determine whether it's an issue with Next, something in the container, or something in the network layer.
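As a sketch of that kind of probe from command-line node (the endpoint URL and test query are assumptions; substitute your own token):

```javascript
// Sketch: probe the GraphQL CDA endpoint directly, bypassing Next entirely,
// with an explicit timeout so a hang is distinguishable from a fast failure.
async function probe(url, token, timeoutMs = 10000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${token}`,
        'Content-Type': 'application/json',
      },
      // Minimal query just to confirm connectivity; adjust as needed.
      body: JSON.stringify({ query: '{ _site { locales } }' }),
      signal: controller.signal,
    });
    console.log('HTTP', res.status);
    return res.status;
  } finally {
    clearTimeout(timer);
  }
}

// Example call (assumed endpoint and env var):
// probe('https://graphql.datocms.com/', process.env.DATOCMS_API_TOKEN);
```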
Oh, and… you did set the API keys in all the containers, right? (Just making sure there wasn't a forgotten env var or such.)
If you're able to give us a few specific IP addresses (that you're trying to reach us from), we can also check to see if any of them were inadvertently blocked by a firewall or such. We don't often hear of that, but we can check for you. Also, knowing which specific endpoints you're trying to reach would help (our asset CDN? GraphQL CDA API? REST CMA API? some combination of them?)
We have logging enabled for the requests we are making to Dato. Normally, errors do get posted in the logs when requests to Dato error out. However, in this specific instance they seem to be swallowed by Next's build process. We are in a difficult position because instance A is a production instance, and as such we have limited ability to do discovery with it. We've been trying our hardest to reproduce the issue in some other environment where we would have free rein to test out ideas.
We are making requests to the GraphQL CDA through our own client. The client handles rate limiting.
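For context, the rate-limit handling is roughly of this shape (a generic, simplified sketch; the function names are ours, and the `X-RateLimit-Reset` header is an assumption about the API's rate-limit headers):

```javascript
// Simplified sketch of retry-on-429 handling of the kind described above.
// Header name is an assumption; adjust to the API's actual headers.
async function withRateLimitRetry(doRequest, maxRetries = 3) {
  for (let attempt = 0; ; attempt++) {
    const res = await doRequest();
    if (res.status !== 429 || attempt >= maxRetries) return res;
    // Wait for the advertised reset window, falling back to one second.
    const waitSecs = Number(res.headers?.get?.('X-RateLimit-Reset')) || 1;
    await new Promise((resolve) => setTimeout(resolve, waitSecs * 1000));
  }
}
```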
One detail I didn't include, which I now understand is relevant, is that it's only a specific version ("newest version") of the content model that doesn't seem to work. We have currently rolled back the content model of the project instance A depends on. The older content model works; with the new content model, requests fail. This would suggest that API keys and other configuration should be correct, in principle. As said, we were quite convinced that the issue was in the content or the model, but we just haven't been able to use the combination of content and content model to reproduce the issue anywhere other than instance A, so we are trying to look at other causes.
Unfortunately, we do not track the IP address of the build pipeline, and we likely can't find it after the fact. Would it be possible to provide some other means by which you could find the relevant logs? We have details about the requests such as the environment, the user agent, key timestamps and so forth. Apologies, I do understand that this is a long shot and a tricky ask.
We have possibly discovered one symptom. When the content and content model are accessed through a sandbox (in other words, when the environment is not set as the primary environment), the problem does not occur. The problem only seems to occur when the environment is set as the primary environment.
We were wondering whether this could indicate that the combination of content and content model is in some way incompatible with primary environments specifically. We are under the impression that there are some key differences between the primary environment and sandboxes in how requests are handled, one of which is how content is cached. Does this sound at all plausible? As pure speculation, we see it as possible that instance B managed to build correctly because it was the first to request the data and as such got fresh, non-cached data, while instance A asked for the data later and received a cached response that is in some way unexpected on our end.
We are still working to create a reproduction which we hope will reveal more details.
We were able to reproduce the error in another production-like instance. In an attempt to create a reproduction completely separate from our production services, we duplicated the project instance A was hooked up to, migrated it to match the latest content model, and attempted a build against it. "Unfortunately", that build worked.
So far, we've been able to reproduce the issue with two different instances and two different environments (that used the same content model and had very similar content). Based on our experience, the build only fails when we request content from the primary environment.
When using the same content and content model through another project, the build no longer seems to fail.