Summary - What Happened?
On October 8, 2024 from 16:27 Coordinated Universal Time (UTC) to 22:10 UTC, and again on October 9, 2024 from 13:06 UTC to 16:36 UTC, a Distributed Denial of Service (DDoS) attack caused routing failures in a single partition of Heroku’s Common Runtime infrastructure in the EU Region. This resulted in increased error rates and latencies for some clients connecting to some customer applications hosted in that partition. The Salesforce Technology team worked with our upstream infrastructure provider to mitigate the immediate impact of this event and to put additional network-level protections in place to improve resilience.
How Did This Issue Impact Heroku’s Services?
Heroku customers can choose to deploy their Common Runtime applications in Europe (EU) or the United States (US). Heroku deploys applications on one of several partitions within these regions. During both periods of impact, some clients of some Heroku customer applications using custom domains and hosted on a single partition in the EU region saw increased error rates and latencies. Customers whose applications were hosted on any other partition were unaffected. Web dynos of impacted customer applications continued to run, but received fewer requests than expected because some clients had difficulty connecting.
Connections via default ({app}.herokuapp.com) domains were not impacted. Customer applications hosted on other Common Runtime partitions in the EU region, in dedicated Private Spaces in the EU, or in regions outside the EU were similarly not impacted.
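Because default domains stayed reachable, a client with fallback logic could have routed around the custom-domain failures. The sketch below is purely illustrative (the probe callable, host names, and fallback policy are assumptions for the example, not part of Heroku's tooling):

```python
# Illustrative client-side fallback: prefer the custom domain, but fall back
# to the default {app}.herokuapp.com host if the custom domain is failing.
# The probe callable and host names are hypothetical.

def pick_endpoint(probe, custom_host, default_host):
    """Return the first host whose health probe succeeds, preferring custom."""
    for host in (custom_host, default_host):
        if probe(host):
            return host
    return None  # both unreachable

# Example with a stubbed probe that simulates the custom domain failing:
reachable = {"example-app.herokuapp.com"}
host = pick_endpoint(reachable.__contains__,
                     "www.example.com",
                     "example-app.herokuapp.com")
print(host)  # example-app.herokuapp.com
```

In practice the probe would be an HTTP health check; note that an application whose code inspects the Host header would also need to accept requests addressed to the default domain.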
Technical Details
Detection and Initial Impact
On October 8, 2024 at 16:42 UTC, monitoring alerted the Salesforce Technology team about errors connecting to some Heroku-deployed applications hosted on one Common Runtime partition in the EU region. Routing capacity in the partition automatically scaled up, leading to a temporary recovery, before alerts fired again at 17:10 UTC. Metrics revealed a sharp increase in both the rate and total number of connections to the network routers handling custom domain traffic for the partition beginning at 16:27 UTC. This increase in connections saturated the network, causing connection attempts to time out or fail.
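The saturation mechanism described above can be shown with a toy queueing model: once the arrival rate of new connections exceeds the routers' service capacity, the backlog of pending connections grows without bound, so wait times climb past client timeouts and attempts fail. All numbers here are illustrative, not measurements from the incident:

```python
def backlog_over_time(arrival_per_sec, capacity_per_sec, seconds):
    """Toy model: pending connection count each second under a fixed load."""
    backlog = 0
    history = []
    for _ in range(seconds):
        backlog += arrival_per_sec                 # new connection attempts
        backlog -= min(backlog, capacity_per_sec)  # attempts the routers can serve
        history.append(backlog)
    return history

# Within capacity: the backlog stays at zero and connections succeed promptly.
print(backlog_over_time(800, 1000, 3))   # [0, 0, 0]
# Saturated: the backlog grows linearly, so waits soon exceed client timeouts.
print(backlog_over_time(1500, 1000, 3))  # [500, 1000, 1500]
```

This also shows why adding capacity produced only temporary recovery: scaling raises `capacity_per_sec`, but the backlog resumes growing as soon as the attack rate climbs past the new limit.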
Remediation
During the course of the initial event, the Salesforce Technology team and our upstream infrastructure provider worked together to add capacity to the impacted partition, as well as to identify and mitigate the source(s) of the increase in connections. At 22:10 UTC, following implementation of one such mitigation, the rate of new connection attempts returned to baseline levels and the impacted partition recovered, resolving the impact.
The following day, on October 9 at 13:06 UTC, another Distributed Denial of Service (DDoS) attack was observed, escalating quickly and causing a recurrence of the impact. Once again, the Salesforce Technology team was automatically alerted, and again worked with our upstream infrastructure provider to increase capacity and to identify and mitigate the new source of the connections. At 16:36 UTC, following implementation of one such mitigation, the rate of new connection attempts again returned to baseline levels, and impact subsided.
Root Cause Analysis
The Salesforce Technology team’s post-incident investigation and analysis determined that the incident was triggered by a flood of connections to the network for the impacted Common Runtime partition from a widely-distributed set of source IP addresses, sufficient to overload the partition’s network.
The Technology team is addressing this by continuing to work with our upstream infrastructure provider to proactively bring additional traffic inspection and shaping capabilities online worldwide to ensure we can mitigate similar events more quickly in the future.
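Traffic shaping at the network edge is commonly built on per-source rate limiting, for example a token bucket per client IP, so that each flooding source exhausts its own allowance while well-behaved clients are unaffected. The sketch below is a generic illustration of that technique, not Heroku's or the provider's actual mechanism; the rates and the injected clock are assumptions for the example:

```python
class TokenBucket:
    """Generic rate limiter: allow `rate` requests/sec with bursts up to `burst`."""

    def __init__(self, rate, burst, clock):
        self.rate = rate      # tokens replenished per second
        self.burst = burst    # maximum bucket size
        self.tokens = burst
        self.clock = clock    # injectable time source (e.g. time.monotonic)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket would be kept per source IP; here a single flooding source
# quickly exhausts its own bucket without touching anyone else's.
t = [0.0]
bucket = TokenBucket(rate=1.0, burst=3, clock=lambda: t[0])
print([bucket.allow() for _ in range(5)])  # [True, True, True, False, False]
t[0] += 2.0                                # two seconds pass -> two tokens refill
print(bucket.allow(), bucket.allow(), bucket.allow())  # True True False
```

The weakness this incident highlights is that per-source limits alone are blunted when the flood comes from a widely distributed set of addresses, which is why broader inspection and shaping capabilities are also needed.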
Next Steps
To maintain the performance level that our customers expect from Salesforce and to prevent similar incidents from recurring, our focus is on continuous improvement. The Technology team has identified and is implementing the following actions:
- In Progress: Increasing global edge network capabilities to further improve network traffic performance for all regions.
- In Progress: Enhancing network measures used to facilitate scaling to further improve responsiveness to load spikes.
We sincerely apologize for the impact this incident may have caused you and your business; Salesforce is fully committed to minimizing downtime when incidents do occur. We also continually assess and improve our tools, processes, and architecture to provide you with the best service possible. If your application is the victim of a Distributed Denial of Service (DDoS) attack or any other attack, we encourage you to open a support ticket or contact your Salesforce account team.