Yesterday we experienced some downtime.
Here is a short postmortem with as many details as I can give.
Yesterday we started seeing timeouts on many endpoints at irregular intervals. We began monitoring, and at some point the timeouts seemed to stop. We then traced the cause to an automated user script that was updating a large number of items.
The script stayed within our rate limits, but one query launched by each update was still running when the following requests arrived. This caused a queue of calls to build up and progressively brought the DB to a halt.
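The dynamic is easy to reproduce. Here is a minimal sketch (the arrival rate and query duration are illustrative numbers, not our real traffic) of a database serving queries one at a time: as soon as a query takes longer than the interval between requests, the backlog grows with traffic instead of draining.

```python
def backlog_at_last_arrival(n_requests, arrival_interval, query_duration):
    """Model a DB that runs queries strictly one at a time.

    Returns how many requests have arrived but not finished at the
    moment the last request comes in. All numbers are illustrative.
    """
    finish_times = []
    prev_finish = 0.0
    for i in range(n_requests):
        arrival = i * arrival_interval
        start = max(arrival, prev_finish)  # wait for the DB to free up
        prev_finish = start + query_duration
        finish_times.append(prev_finish)
    last_arrival = (n_requests - 1) * arrival_interval
    return sum(1 for f in finish_times if f > last_arrival)

# Queries faster than the arrival rate: the queue stays short.
print(backlog_at_last_arrival(100, 1.0, 0.5))  # -> 1
# Queries slower than the arrival rate: the backlog grows with traffic.
print(backlog_at_last_arrival(100, 1.0, 2.0))  # -> 51
```

The second case is what we hit: the script itself was well-behaved, but each call held the database long enough for the next calls to pile up behind it.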
We are starting to see issues related to the high number of record versions we store, and this is a new challenge for us. Normally we spot a slow query and optimise it; the difference this time was a high volume of calls hitting slow queries over a relatively long period.
We’ve added a shorter timeout for queries to minimise the effect of issues like this, which temporarily fixed the problem for everyone. We then optimised the query itself, so the issue is also fixed for that particular script and for similar cases in the future.
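For the curious, the query-timeout mitigation amounts to giving every statement a time budget and cancelling it when the budget runs out, so one slow query can no longer monopolise the database. A small sketch of the idea (our production database is not SQLite; this only illustrates the cancel-on-deadline mechanism, and the 100 ms budget is made up):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
deadline = time.monotonic() + 0.1  # illustrative 100 ms budget per statement

# SQLite calls this handler periodically while a statement runs;
# returning a non-zero value aborts the statement.
conn.set_progress_handler(
    lambda: 1 if time.monotonic() > deadline else 0, 1000)

try:
    # A deliberately never-ending query: counting an unbounded recursive CTE.
    conn.execute(
        "WITH RECURSIVE n(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM n) "
        "SELECT count(*) FROM n").fetchone()
    result = "finished"
except sqlite3.OperationalError:
    result = "query cancelled by timeout"

print(result)
```

Most databases expose the same idea natively (for example, a server-side statement timeout setting), which is the shape of the change we made.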
We are very sorry for this, and we are doing our best to prevent events like it from happening again.
We are here if you need any further clarification.