All times within this document are PDT
At approximately 14:01, a Redis instance acting as the primary for a highly-available cluster used by our API services was migrated automatically by Google’s Cloud Platform. This migration caused the node to incorrectly drop offline, forcing the cluster to rebalance and trigger known issues with the way our API instances handle Redis failover. After resolving this partial outage, unnoticed issues on other services caused a cascading failure through Discord’s real time system. These issues caused enough critical impact that Discord’s engineering team was forced to fully restart the service, reconnecting millions of clients over a period of 20 minutes.
Sequence of Events
- 14:01 - Discord engineers are alerted to anomalies via our real time metrics and monitoring
- 14:01 - Initial investigation points to a Redis instance for a highly available cluster having failed.
- 14:02 - Engineers identify that the previous primary for this Redis cluster is completely unavailable, and that the cluster has properly failed over to the secondary node.
- 14:02 - Engineers identify that the nodes issues were caused by an automatic migration executed by Google’s Cloud Platform and escalate a P1 ticket with GCP support.
- 14:03 - Engineers identify that some API instances are not properly handling the failover and begin investigation.
- 14:11 - The decision is made to execute a rolling restart of all API instances in an attempt to resolve Redis connection issues.
- 14:15 - The rolling restart of API instances completes, and engineers inform our other teams that issues should be resolved.
- 14:16 - Engineers continue to observe abnormal latency on our API endpoints, and continue to investigate.
- Between 14:18 and 14:57:
- A misconfigured edge caching rule for an expensive API route is discovered, causing clients to aggressively call this route, degrading overall API performance.
- The misconfiguration is corrected.
- Engineers discover that the number of requests to this API endpoint has caused a non-critical Cassandra cluster to fall over under load and begin working to fix it.
- The API endpoint causing problems is disabled on our upstreams edge network, as its contents cannot be cached while the endpoint is unavailable.
- The Cassandra cluster is recovered, and resumes normal service.
- Engineers yet again execute a rolling restart of all API instances to resolve latency issues experienced due to the Cassandra cluster issues.
- 14:57 - All previous known issues are resolved and the API returns to a nominal state.
- 15:41 - Another anomaly is noticed on monitoring data and engineers yet again begin to investigate the cause.
- 15:42 - End user reports are escalated to engineering as they investigate some servers appearing offline.
- 15:46 - A node from the “guilds” cluster is identified to be misbehaving, causing a portion of all Discord’s “guilds” to appear offline for users. Engineers begin to investigate the cause and possible mitigations for this node.
- 15:50 - Engineers forcefully remove the node from the services hashring, which they hope will force the server to properly decommission itself.
- 15:52 - Engineers decide to forcefully restart the misbehaving node, which succeeds.
- 15:55 - Engineers discover that some nodes from another cluster, the “sessions” cluster appear to be misbehaving, and being to investigate.
- 16:07 - Engineers decide that the best path forward is to institute a full service restart, forcing all users to reconnect and resetting the invalid state various services are experiencing.
- 16:09 - Engineers begin to prepare various components of Discord’s system for a full reboot.
- 16:10 - Engineers being the process of rebooting various components of Discord’s system.
- 16:19 - Engineers are alerted to another system failure in a component called “push” and begin to attempt fixing through multiple methods;
- One engineer begins to investigate the problems with the system, and attempt to restore service.
- Another engineer begins to blacklist various pieces of traffic that rely on the system in an attempt to reduce load.
- Between 16:20 and 16:32
- Engineers track the performance of our API while they expect service to recover.
- Engineers realize that due to human error, an important API setting for allowing users to quickly and safely reconnect to Discord is improperly set.
- This improperly configured API setting is corrected.
- 16:33 - Service begins to slowly recover as engineers monitor and address instabilities in various system components.
- 16:52 - Service is fully restored at this point.
Investigation and Analysis
The root cause of this issue was something we expect our services to handle correctly—an outage of one node within a highly available cluster. Despite this expectation, the node failure triggered a bug that we’ve known about for sometime, and something we’ve previously scheduled for fixing in the next few weeks. This bug caused the overall integrity of our API service to degrade heavily, and uncovered a previously unknown cache misconfiguration which was resolved during the incident.
The exact cause of why this initial API partial outage cascaded into a full system failure is not known at this time. We do know that the initial failure caused some nodes of other services to behave incorrectly, and eventually caused those nodes to run out of memory and fail. These nodes triggered a cascading failure within the system, despite the various safeguards we have in place. While we have various theories and will continue to investigate the root cause, there are other goals and objectives we have planned which should resolve these problems in the future.
Action Items / Response
- We’ve increased the priority of fixing the known issues with Redis failovers. We hope to accomplish this within the coming weeks to avoid further problems caused by individual node failures.
- We've modified the behavior of the improperly cached route to handle this type of error properly in the future, without degrading other services.
- We’ve added monitoring around some of the signals that would have alerted us to the cascading failure with enough time for engineers to investigate and prepare, possibly avoiding any issues altogether.
- We continue to plan for improving the overall reliability and failure handling of the services that were affected by this outage.
As always with our post-mortems, we like to stress the fact that the uptime and availability of Discord is our highest priority. We know and experience the pain and frustration that comes along with the instability or downtime of a communication platform. While we try our best to avoid all outages, we’re proud of the fact that we’re able to completely restart and reconnect Discord’s millions of concurrent users within the span of twenty minutes (when things get real hairy). We’ll always be working to improve the reliability of Discord’s platform, and we’re thankful for the support and feedback we get during these outages. Finally, we’d like to apologize for any inconvenience Today's outages caused you, and thank everyone for continuing to fly Discord.
After engineers were alerted to similar issues on Saturday (by alerts added due to the findings of this post-mortem), they were able to safely decommission and investigate a node experiencing the behavior which caused this outage. After investigation we discovered potential flaws with a library this service uses, and we’re currently working on rolling out an update we believe should resolve the problems. While we’re still not entirely sure why the flaws in this library caused the cascading failures we saw, we’ll continue to investigate and are confident these fixes should improve the situation for the future.