Summary - What Happened?
On October 8, 2024 from 16:27 Coordinated Universal Time (UTC) to 22:10 UTC, and again on October 9, 2024 from 13:06 UTC to 16:36 UTC, a Distributed Denial of Service (DDoS) attack caused routing failures in a single partition of Heroku’s Common Runtime infrastructure in the EU Region. This resulted in increased error rates and latencies for some clients connecting to some customer applications hosted in that partition. The Salesforce Technology team worked with our upstream infrastructure provider to mitigate the immediate impact of this event and to put additional network-level protections in place to improve resilience.
How Did This Issue Impact Heroku’s Services?
Heroku customers can choose to deploy their Common Runtime applications in Europe (EU) or the United States (US). Heroku deploys applications on one of several partitions within these regions. During both periods of impact, some clients of some Heroku customer applications using custom domains and hosted on a single partition in the EU region saw increased error rates and latencies. Customers whose applications were hosted on any other partition were unaffected. Web dynos of impacted customer applications continued to run, but received fewer requests than expected because some clients had difficulty connecting.
Connections via default ({app}.herokuapp.com) domains were not impacted. Customer applications hosted on other Common Runtime partitions in the EU region, in dedicated Private Spaces in the EU, or in regions outside the EU were similarly not impacted.
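Because default domains stayed reachable, a client with fallback logic could have routed around the custom-domain failures. The sketch below is purely illustrative (the probe callable, host names, and fallback policy are assumptions for the example, not part of Heroku's tooling):

```python
# Illustrative client-side fallback: prefer the custom domain, but fall back
# to the default {app}.herokuapp.com host if the custom domain is failing.
# The probe callable and host names are hypothetical.

def pick_endpoint(probe, custom_host, default_host):
    """Return the first host whose health probe succeeds, preferring custom."""
    for host in (custom_host, default_host):
        if probe(host):
            return host
    return None  # both unreachable

# Example with a stubbed probe that simulates the custom domain failing:
reachable = {"example-app.herokuapp.com"}
host = pick_endpoint(reachable.__contains__,
                     "www.example.com",
                     "example-app.herokuapp.com")
print(host)  # example-app.herokuapp.com
```

In practice the probe would be an HTTP health check; note that an application whose code inspects the Host header would also need to accept requests addressed to the default domain.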
Technical Details
Detection and Initial Impact
On October 8, 2024 at 16:42 UTC, monitoring alerted the Salesforce Technology team about errors connecting to some Heroku-deployed applications hosted on one Common Runtime partition in the EU region. Routing capacity in the partition automatically scaled up, leading to a temporary recovery, before alerts fired again at 17:10 UTC. Metrics revealed a sharp increase in both the rate and total number of connections to the network routers handling custom domain traffic for the partition beginning at 16:27 UTC. This increase in connections saturated the network, causing connection attempts to time out or fail.
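The saturation mechanism described above can be shown with a toy queueing model: once the arrival rate of new connections exceeds the routers' service capacity, the backlog of pending connections grows without bound, so wait times climb past client timeouts and attempts fail. All numbers here are illustrative, not measurements from the incident:

```python
def backlog_over_time(arrival_per_sec, capacity_per_sec, seconds):
    """Toy model: pending connection count each second under a fixed load."""
    backlog = 0
    history = []
    for _ in range(seconds):
        backlog += arrival_per_sec                 # new connection attempts
        backlog -= min(backlog, capacity_per_sec)  # attempts the routers can serve
        history.append(backlog)
    return history

# Within capacity: the backlog stays at zero and connections succeed promptly.
print(backlog_over_time(800, 1000, 3))   # [0, 0, 0]
# Saturated: the backlog grows linearly, so waits soon exceed client timeouts.
print(backlog_over_time(1500, 1000, 3))  # [500, 1000, 1500]
```

This also shows why adding capacity produced only temporary recovery: scaling raises `capacity_per_sec`, but the backlog resumes growing as soon as the attack rate climbs past the new limit.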
Remediation
During the course of the initial event, the Salesforce Technology team and our upstream infrastructure provider worked together to add capacity to the impacted partition, as well as to identify and mitigate the source(s) of the increase in connections. At 22:10 UTC, following implementation of one such mitigation, the rate of new connection attempts returned to baseline levels and the impacted partition recovered, resolving the impact.
The following day, on October 9 at 13:06 UTC, another Distributed Denial of Service (DDoS) attack was observed, escalating quickly and causing a recurrence of the impact. Once again, the Salesforce Technology team was automatically alerted, and again worked with our upstream infrastructure provider to increase capacity and to identify and mitigate the new source of the connections. At 16:36 UTC, following implementation of one such mitigation, the rate of new connection attempts again returned to baseline levels, and impact subsided.
Root Cause Analysis
The Salesforce Technology team’s post-incident investigation and analysis determined that the incident was triggered by a flood of connections to the network for the impacted Common Runtime partition from a widely-distributed set of source IP addresses, sufficient to overload the partition’s network.
The Technology team is addressing this by continuing to work with our upstream infrastructure provider to proactively bring additional traffic inspection and shaping capabilities online worldwide to ensure we can mitigate similar events more quickly in the future.
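Traffic shaping at the network edge is commonly built on per-source rate limiting, for example a token bucket per client IP, so that each flooding source exhausts its own allowance while well-behaved clients are unaffected. The sketch below is a generic illustration of that technique, not Heroku's or the provider's actual mechanism; the rates and the injected clock are assumptions for the example:

```python
class TokenBucket:
    """Generic rate limiter: allow `rate` requests/sec with bursts up to `burst`."""

    def __init__(self, rate, burst, clock):
        self.rate = rate      # tokens replenished per second
        self.burst = burst    # maximum bucket size
        self.tokens = burst
        self.clock = clock    # injectable time source (e.g. time.monotonic)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket would be kept per source IP; here a single flooding source
# quickly exhausts its own bucket without touching anyone else's.
t = [0.0]
bucket = TokenBucket(rate=1.0, burst=3, clock=lambda: t[0])
print([bucket.allow() for _ in range(5)])  # [True, True, True, False, False]
t[0] += 2.0                                # two seconds pass -> two tokens refill
print(bucket.allow(), bucket.allow(), bucket.allow())  # True True False
```

The weakness this incident highlights is that per-source limits alone are blunted when the flood comes from a widely distributed set of addresses, which is why broader inspection and shaping capabilities are also needed.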
Next Steps
To maintain the performance level that our customers expect from Salesforce and to prevent similar incidents from recurring, our focus is on continuous improvement. The Technology team has identified and is implementing the following actions:
- In Progress: Increasing global edge network capabilities to further improve network traffic performance for all regions.
- In Progress: Enhancing network measures used to facilitate scaling to further improve responsiveness to load spikes.
We sincerely apologize for the impact this incident may have caused you and your business; Salesforce is fully committed to minimizing downtime when incidents do occur. We also continually assess and improve our tools, processes, and architecture to provide you with the best service possible. If your application is the victim of a Distributed Denial of Service (DDoS) attack or any other attack, we encourage you to open a support ticket or contact your Salesforce account team.