At Discord, one of our primary concerns is the uptime and reliability of our service. We’ve all experienced the frustration and annoyance that comes along with the failure of services we rely and depend on day to day. We’re extremely sorry for the issues anyone using our service experienced Monday, and we’ve set out various action items which we believe will help avoid incidents like this in the future. What follows is a transparent and encompassing review of the incident.
All times within this document are PDT
At 11:38 on March 20, a server for one of the internal services (presence) at Discord which tracks real time state and presence information for all connected users, disconnected from its other cluster members. Shortly thereafter the nodes of another service (sessions) which handles the real time state and processing for all websocket connections to Discord attempted a reconnection, triggering a massive “thundering herd” towards the existing members of the presence cluster. Due to an implicit requirement in sessions on data in presence, this caused 1/3rd of all connected sessions on Discord to effectively stall, queuing events in memory (an intended behavior of our sessions service, see below). Once the event queues on these session nodes reached a limit, they ran out of available memory and crashed. After engineers recovered the situation around 13:56, a subsequent disconnection on the presence service caused the same incident to occur again at 14:52.
At 11:39 Discord engineers notice a dramatic degradation in service and immediately began investigation. By 11:50, while engineers continue to investigate the issue, members of the sessions cluster began to OOM (run out of memory) and crash. Around 11:51 engineers alert our support team to a high-impact incident and the status page is updated to investigating. At 12:30 after a period of investigation, engineers work on two code patches to our API servers. These patches introduce limitations on connections and message sending meant to assist in “rebooting” and reconnecting all of Discord’s clients. Due to an issue in newly deployed configuration code, the process of creating and deploying these patches takes longer than expected. At 13:34 the presence server which initially caused the outage experiences more soft-lock issues, which trigger more cascading failures through Discord’s real time infrastructure. By 13:35 rollout of the aforementioned API patches has completed, and the decision to fully “reboot” Discord was made. At 13:41 engineers globally disable sending messages on Discord. At 13:44 our status page is updated to reflect our decision to “reboot” Discord, and at 13:45 engineers began the reboot process. By 13:56 service has mostly recovered, the majority of clients have reconnected, and engineers were able to enable sending of messages. At 14:07 the status page is updated to say that service has recovered. At 14:52 the presence server which had been previously acting up yet again experiences a series of CPU soft-locks causing it to disconnect and netsplit. At 14:54 engineers yet again notice issues with memory usage on our sessions servers, and the broken state of our presence cluster. The decision is made to ignore the presence cluster for the time being, and work on correcting the state of sessions. For the most part, Discord functions for users at this point. Due to a side effect of the way direct messages work, DMs are able to be sent but are not delivered to recipients at this time. At 15:47 our status page is updated to reflect the current issues with DM sending. At 16:04 engineers attempt to reboot and recover the presence service, however this causes further issues within the sessions cluster, effectively repeating the same issues from the first incident. Engineers yet again globally disable message sending in an attempt to shed load. At 16:10 the issue with the presence service is isolated to the CPU soft locks experienced on the one server. This misbehaving server is rebooted, which forces the VM to land on another physical host and resolves the soft-lock issues. At 16:23 engineers yet again prepare to fully “reboot” and reconnect all of Discord’s clients, and start this process at 16:24. By 16:40 service is mostly recovered and message sending is enabled. The status page is updated at 16:54 to indicate the incident has been resolved.
The initial issue which caused this outage can be boiled down to the CPU soft lockups that we observed on the individual presence server. Although we’ve previously seen these CPU soft lockups on other servers and services, this incident caused around 40 20-30 second lockups. Although we had previously opened an issue which Google has actively been investigating, they’ve escalated it to P0 after this incident. These soft lockups fully stalled the network stack of that machine for upwards of 20 seconds. Unfortunately due to a bug that was fixed but not yet deployed, the presence cluster did not properly handle the lost node, and instead caused the entire cluster to split brain.
The failure of the individual presence node caused all the Erlang processes living in our session service to attempt a reconnection to the aforementioned presence service. The effect of millions of processes reconnecting quickly overloaded the remaining presence servers and slowed the system to a halt. Our session servers were built to help users maintain persistent connections over choppy or unreliable internet connections. As such, they are built to buffer and retransmit messages to clients when a client disconnects for brief periods. This buffering mechanism works great, and is one of the reasons Discord is so responsive even when participating in many large servers. However, during this incident ⅓ of all user connections halted the sending of messages to clients due to the dependency on our presence service. The events and messages for these connections quickly filled the in-memory buffers of our session servers and caused many of them to run out of memory and fail. Due to the way Erlang works, the throughput and contention on these servers also prevented our engineers from being able to thoroughly investigate or mitigate these effects.
While we’ve invested immense effort and engineering in scaling and improving the performance of many core services and database over the past months, we’ve also seen immense growth. As such, our engineers were concerned with simply letting the thundering herd that is millions of Discord clients reconnect to our service at once. This caused us to spend time during the incident to add and roll out changes to our API which limited the potential of this effect. Although we’re pleased with the result and believe the initial choice was correct, issues with recently added configuration code caused this process to take much longer than expected.
During the outage, engineers were able to roll out upgrades to our presence server (which had previously been untouched for almost a year). These changes fix the split-brain bug previously mentioned, and include other planned upgrades. We’ve added a hard limit to the number of in-flight connections from our sessions cluster to our presence server, and have also added a fast-fail mechanism to this dependency which should prevent a cascading failure like this from occurring in the future while also resolving the problems developers experienced accessing debug information. These changes were deployed at 1AM on Tuesday. As previously mentioned, we’re actively working with Google Cloud Platform’s hypervisor team to ensure the root cause of the CPU soft lockups is investigated and resolved. We’re working on adding additional tooling which should help automate the process of detecting and alerting around anomalous and hard to track problems like CPU soft lockups.
The last time Discord had an outage which forced us to effectively “reboot” Discord as a whole, was September 2016. Since then the improvements and time we’ve invested in improving the performance of our core services and databases caused this experience to be much less painful. While we’re generally unhappy with any outage, and always strive to do better, our services took under 16 minutes to reconnect millions of concurrent Discord users.