[Update: Postmortem added Oct 23]: CMA and UI outage (Oct 8 - 9, 2024)

Latest update:

Heroku claims the new issue is resolved (but that’s what they said yesterday too, before it happened again). We are continuing to monitor and explore workarounds.

We will write our own postmortem as soon as we can.

Status updates are always at https://status.datocms.com/


EDIT 2024-10-09:

The incident has happened again today, despite a false resolution yesterday. We have reopened the incident and are continuing to monitor it:

https://status.datocms.com/incidents/2024-10-08-high-response-times-on-cma-and-cda-admin-interface-not-loading/


Previous post:

We are currently investigating an outage affecting the CMA and admin area UI:

https://status.datocms.com/incidents/2024-10-08-high-response-times-on-cma-and-cda-admin-interface-not-loading/

We will report back as soon as we have more information.


We are affected by an upstream provider outage, and are working with them on this: https://status.heroku.com/incidents/2684

Still awaiting a fix from our upstream provider. We are sorry :frowning:

Still waiting. We are trying to get any sort of ETA and will report back as soon as we get one.

Services (CMA, UI, CDA) appear to be coming back up and are accessible at normal performance again.

However, we still have not yet received an official confirmation of resolution from the upstream provider. We will continue to monitor until an official resolution is announced.

The upstream issue has been resolved and services are fully restored. Their last update:

Starting at 4:27 PM UTC on October 8th, 2024, customers experienced increased error rates and latencies on customer applications hosted on common runtime in EU Region. Heroku engineers investigated and mitigated the impact at 9:52 PM UTC on October 8th, 2024. All application have fully recovered by 10:10 PM UTC.

Hey, so I don’t want to be a pain, but I did want to just ask the question about what is going to be done to avoid a problem like this in the future. It seems like some redundancy or resiliency to outages has to be planned for architecturally.

Luckily, we are in the middle of transitioning to Dato with a bridge to our legacy system. So, we were able to quickly fall back to our old system to publish timely content for Amazon Prime Day coverage. If we had not had this bridge in place, this could have cost us tens of thousands of dollars just from the content production side alone.

If we were on our new site we are developing that relies on the CDA, it could have been even worse. Overall, it is nice to hear what caused the problem, but we need to understand how this kind of issue is being mitigated for the future. I get if all cloud services are down due to a broad attack or something, but it appeared isolated to Heroku. It seems like architecturally, all the eggs are in that basket.

@nroth… not a pain at all. 100% agreed with everything you said. This is unacceptable.

It is happening again today =(

We are extremely displeased with the way Heroku is handling this, but we also 100% take responsibility for our own architectural limitations. Our developers are currently racing to implement & test a possible workaround. I’ll share info whenever I can.

Sorry, this is definitely an all-hands-on-deck situation for us, and the devs have been working around the clock to try to get something working.

Longer-term (if we can resolve the immediate crisis) we absolutely want to provide a proper postmortem and discuss how we can improve reliability going forward. But the devs need to put out the current fire first. We are very sorry.

Will update as soon as I can.


Heroku is claiming the incident is resolved (again). We are still awaiting a postmortem.

Our incident: https://status.datocms.com/incidents/2024-10-08-high-response-times-on-cma-and-cda-admin-interface-not-loading/

Theirs: https://status.heroku.com/incidents/2685


We are looking at the situation on our end too to see what we can do about it ourselves, regardless of Heroku. We will update with our own postmortem as soon as we can.

Sorry for the delay on this. We haven’t forgotten; we’re still in a back-and-forth with Heroku, trying to understand what actually happened.

Heroku has finally posted their own postmortem on this: https://status.heroku.com/incidents/2684


TL;DR: it was a DDoS attack on one of their regions. They say they are taking steps to prevent future occurrences. That’s all the detail we have.

On our side, we have no new architectural changes to report yet. We examined a few potential workarounds during the outage, but none were ultimately suitable.

I’ll report back here if any plans change or we have something actionable to announce…