404 on sandbox environment from CI

We’re seeing an intermittent error.
Our theory is that it’s related to some cache layer, but hopefully you can provide some more insight.

We’re building websites using Gatsby and gatsby-source-graphql because we prefer to keep the schema provided by DatoCMS.
We create a sandbox for each merge request from our CI process, which runs on private GitLab runners.

Partial build log:

gatsby build --log-pages
🎩 Connecting to DatoCMS environment: https://graphql.datocms.com/environments/chore-ch6609/
success open and validate gatsby-configs - 0.035s
/bin/sh: lscpu: not found
/bin/sh: lscpu: not found
success load plugins - 0.847s
success onPreInit - 0.051s
success initialize cache - 0.007s
success copy gatsby files - 0.024s
success onPreBootstrap - 0.547s
/bin/sh: lscpu: not found
success createSchemaCustomization - 0.009s
error "gatsby-source-graphql" threw an error while running the sourceNodes lifecycle:
Response not successful: Received status code 404
  ServerError: Response not successful: Received status code 404
  
  - index.ts:114 Object.exports.throwServerError
    [website]/[apollo-link-http-common]/src/index.ts:114:17
  
  - index.ts:145 
    [website]/[apollo-link-http-common]/src/index.ts:145:11
  
  - task_queues.js:97 processTicksAndRejections
    internal/process/task_queues.js:97:5
  
not finished source and transform nodes - 0.461s

But when I run gatsby build from my local machine I don’t get this error, and when I navigate with my browser to https://graphql.datocms.com/environments/chore-ch6609/ I get the expected auth error.

After some time and many retries of the pipelines it does start working. This makes me believe the issue is not related to our setup.

Any insights into why this is happening will be greatly appreciated.

hey @ramon.gebben not sure if we can get back to you today, but we’ll be in touch as soon as possible!

hey @ramon.gebben,

  • If the API token is invalid, you would get a 401 error (INVALID_AUTHORIZATION_HEADER error);
  • If the API token does not have permission to access the environment, you would get a 401 error (INSUFFICIENT_PERMISSIONS error);
  • If the environment exists but it’s not ready to be used yet (that is, the fork operation is still pending and content is still being copied), you would get a 401 error (ENVIRONMENT_NOT_READY error);

If you’re receiving 404 it just means that the environment does not exist :confused:

Are you using our CLI to create the sandbox environment? Could you paste the exact command you’re executing within the CI, and its output, please?

@s.verna Thanks for the context, but as you can read in the build log we’re actually getting 404s on sandboxes that I confirmed existed.

The exact command used to create sandboxes:

dato migrate --destination=$CI_COMMIT_REF_SLUG

CI_COMMIT_REF_SLUG is a “slug-safe” variable provided by GitLab, derived from the branch name. So feature/ch1234 would become feature-ch1234.
The command without variable:

dato migrate --destination=feature-ch1234

# ✔ Creating a fork of `main` called `feature-ch1234`...
# ✔ Running <migration-file-name>.js...
# Done!
# ✨  Done in 20.66s.
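
For anyone reproducing this outside CI: GitLab documents CI_COMMIT_REF_SLUG as the ref name in lowercase, shortened to 63 bytes, with everything except 0-9 and a-z replaced by -, and no leading or trailing -. The sketch below approximates that rule locally so the same environment name can be derived from a branch name. The function name slugify_ref is hypothetical and is not a GitLab or DatoCMS tool.

```shell
# Hypothetical helper: approximate GitLab's CI_COMMIT_REF_SLUG rule.
# LC_ALL=C makes tr/sed/cut operate on bytes, matching the 63-byte limit.
slugify_ref() {
  printf '%s\n' "$1" \
    | LC_ALL=C tr '[:upper:]' '[:lower:]' \
    | LC_ALL=C sed -e 's/[^a-z0-9]/-/g' \
    | LC_ALL=C cut -c1-63 \
    | LC_ALL=C sed -e 's/^-*//' -e 's/-*$//'
}
```

With this, `dato migrate --destination="$(slugify_ref feature/ch1234)"` would target the same sandbox name the pipeline uses.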

Then we point the gatsby-source-graphql to URL https://graphql.datocms.com/environments/feature-ch1234/ and when we start, sometimes, we get 404.

Today I discovered that if I add the suffix /preview/ to the URL, making the full URL https://graphql.datocms.com/environments/feature-ch1234/preview/, I no longer get the 404.

@ramon.gebben, this is even stranger now. Are you able to obtain the body of the 404 response? It should be something like

{"data":[{"id":"XXX","type":"api_error","attributes":{"code":"YYY","details":{}}}]}

Hopefully this will help us figure out what’s happening.
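
One possible way to capture that body in a CI job is sketched below. It assumes curl, mktemp and sed are available on the runner, and the function names probe_environment and extract_error_code are hypothetical helpers, not part of the DatoCMS CLI. The first prints the HTTP status followed by the raw body; the second pulls the error code out of a body shaped like the example above.

```shell
# Hypothetical helper: POST a trivial query, then print the HTTP status
# on the first line followed by the raw response body.
# Usage: probe_environment <endpoint-url> <api-token>
probe_environment() {
  url=$1
  token=$2
  body_file=$(mktemp)
  status=$(curl -s -o "$body_file" -w '%{http_code}' \
    -H "Authorization: Bearer $token" \
    -d '{ "query": "{ _site { favicon { url } } }" }' \
    "$url")
  printf 'HTTP %s\n' "$status"
  cat "$body_file"
  rm -f "$body_file"
}

# Hypothetical helper: read a JSON error body on stdin and print the value
# of its "code" field (the last one, if several appear on the same line).
# A plain sed match is used here to avoid depending on jq.
extract_error_code() {
  sed -n 's/.*"code":"\([^"]*\)".*/\1/p'
}
```

Running `probe_environment "https://graphql.datocms.com/environments/$CI_COMMIT_REF_SLUG/" XXX | extract_error_code` right before gatsby build would log the code whenever the API answers with an error body.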

Sadly no, I’m not able to get the body from the 404 response.
When I use curl to make a request to the sandbox it works fine. But when we try to connect to it using gatsby-source-graphql we, sometimes, get the error.

The sometimes part is also making it more difficult to get this information. I modified my local version of gatsby-source-graphql to print error responses but now I’m not able to “force” this condition.

I can look up the sandbox IDs and the timestamps of when these 404s happened; maybe you’ll be able to see something in the logs?

Did you try to run a simple query like this one within your CI?

curl -v -X POST 'https://graphql.datocms.com/environments/feature-ch1234/' -H 'Authorization: Bearer XXX' -d '{ "query": "{ _site { favicon { url } } }" }'

This should give us what we need

Also, just printing out the env variable would help us make sure that it’s properly filled in:

echo $CI_COMMIT_REF_SLUG

The string printed in the initial log already shows that the variable is working as expected.

But the output is chore-ch6609

Also:

Thanks @ramon.gebben, I just wanted to make sure there were no surprises.

Are you making the request within your CI job, just after launching the migrations through the CLI? I’ll wait for the response output then, so we can continue investigating!


@s.verna It happened again this morning and I was able to fire a curl at it. Sadly this output suggests that the environment exists but we still get the 404 from the source plugin.

$ curl -X POST 'https://graphql.datocms.com/environments/feature-ch5474/preview/' -H 'Authorization: Bearer XXX' -d '{ "query": "{ _site { favicon { url } } }" }' -v
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 104.18.39.139...
* TCP_NODELAY set
* Connected to graphql.datocms.com (104.18.39.139) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-ECDSA-CHACHA20-POLY1305
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=CA; L=San Francisco; O=Cloudflare, Inc.; CN=sni.cloudflaressl.com
*  start date: Jul  6 00:00:00 2020 GMT
*  expire date: Jul  6 12:00:00 2021 GMT
*  subjectAltName: host "graphql.datocms.com" matched cert's "*.datocms.com"
*  issuer: C=US; O=Cloudflare, Inc.; CN=Cloudflare Inc ECC CA-3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fd5cf008200)
> POST /environments/feature-ch5474/preview/ HTTP/2
> Host: graphql.datocms.com
> User-Agent: curl/7.64.1
> Accept: */*
> Authorization: Bearer xxxxx
> Content-Length: 44
> Content-Type: application/x-www-form-urlencoded
>
* We are completely uploaded and fine
* Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
< HTTP/2 200
< date: Mon, 14 Sep 2020 08:07:51 GMT
< content-type: application/json; charset=utf-8
< set-cookie: __cfduid=d94b57309922c16d4696d5b69787420fb1600070871; expires=Wed, 14-Oct-20 08:07:51 GMT; path=/; domain=.datocms.com; HttpOnly; SameSite=Lax; Secure
< cf-ray: 5d289de1ea2be65c-LHR
< access-control-allow-origin: *
< age: 64
< cache-control: no-cache, no-store, must-revalidate
< expires: 0
< vary: Authorization, Accept-Encoding
< via: 1.1 vegur, 1.1 varnish
< cf-cache-status: DYNAMIC
< access-control-allow-credentials: true
< access-control-allow-headers: authorization, content-type, x-environment, x-site-domain, x-api-version, user-agent, x-session-id
< access-control-allow-methods: GET, POST, PUT, OPTIONS, DELETE
< access-control-expose-headers: x-ratelimit-limit, x-ratelimit-remaining, x-ratelimit-reset
< access-control-max-age: 1728000
< cf-request-id: 052d41012c0000e65c5ab1e200000001
< expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
< pragma: no-cache
< x-cache: HIT
< x-cache-hits: 1
< x-complexity: 12
< x-environment: feature-ch5474
< x-queue-time: 4ms
< x-request-id: 5a87a7e4-d67b-4bcf-88b3-6d749f7bfd16
< x-runtime: 0.077614
< x-served-by: cache-lcy19256-LCY
< x-timer: S1600070871.356496,VS0,VE0
< x-worker-method: get
< server: cloudflare
<
* Connection #0 to host graphql.datocms.com left intact
{"data":{"_site":{"favicon":{"url":"https://www.datocms-assets.com/33413/1598427034-favicon.gif"}}}}* Closing connection 0

I’m trying to add a sleep now to see if waiting for the sandbox might fix it. Maybe the sandbox isn’t ready by the time we try to connect to it.

If this theory holds any merit, we should consider having the command wait for the environment to be ready before it completes.

Adding sleep 20 seems to resolve the issue. So indeed I think the sandbox did not register as “ready” yet.
I think we can add a check to the migration process to wait for the sandbox to hit a ready state before completing the command.
@s.verna What do you think?
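
Until something like that exists, a fixed sleep 20 can be swapped for polling. The sketch below is only a stopgap under assumptions: retry and environment_ready are hypothetical helpers, DATO_API_TOKEN is an assumed variable name for the API token, and it is not confirmed here that a 200 from curl means gatsby-source-graphql will also get a 200.

```shell
# Hypothetical helper: run a command up to MAX_ATTEMPTS times, sleeping
# DELAY seconds between failed attempts. Returns 0 on the first success.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Only sleep when another attempt is still to come.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Hypothetical check: succeeds only when the endpoint answers the trivial
# query from this thread with HTTP 200. DATO_API_TOKEN is an assumed name.
environment_ready() {
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $DATO_API_TOKEN" \
    -d '{ "query": "{ _site { favicon { url } } }" }' \
    "$1")
  [ "$status" = "200" ]
}
```

In the pipeline this could read `retry 10 5 environment_ready "https://graphql.datocms.com/environments/$CI_COMMIT_REF_SLUG/" && gatsby build --log-pages`, which gives up after ten attempts instead of waiting a fixed time.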

Update: talking privately with @ramon.gebben we found that our GraphQL API correctly returns the right data, with the right timing… the issue seems to be related to gatsby-source-graphql… we’re still investigating, will keep this topic updated!


After a lot of debugging, @s.verna found two bugs related to stale cache.
When a sandbox got removed, the cache didn’t get updated properly.

Fixes have been applied and we are monitoring the problem but all factors indicate the problem has been resolved. :tada:
