On Thursday, 3:28AM MST, our engineers noticed a sharp increase in traffic on one of our API clusters in the EU region. This resulted in API downtime for customers on that shared cluster. At 3:34AM MST, this issue was resolved automatically by our self-healing architecture.
We use an auto-scaling mechanism that instantiates new servers as the traffic increases. However, the configuration was not able to anticipate the rapid increase in traffic, and new servers were added at a rate slower than required to maintain uptime. As a result, customers on that shared cluster faced API downtime for a few minutes. We have now tweaked our auto-scaling mechanism to initiate much quicker to avoid such an issue in the future.