We take all incidents that affect the reliability and quality of Discord’s service very seriously, and we’d like to apologize for the issues our users experienced today. We hope the transparency of our public post-mortems shows that we share our users’ frustration and disappointment during outages, and that we work hard every day to improve our service as a whole.
All times in this document are PDT.
At 15:28 a node for a service (the “guilds” service) that manages the real-time state and data for Discord’s millions of “guilds” (user-owned servers) experienced a host error, which caused it to immediately reboot. This reboot triggered many millions of events, which were sent to the cluster of another service (the “sessions” service). Despite the backpressure and circuit breakers within this service’s code, the influx of events triggered what we believe to be a bug within the Erlang VM. This bug ground the VM’s lower-level functions to a halt, causing nodes of the sessions cluster to become completely unresponsive.
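To illustrate the kind of application-level protection mentioned above, here is a minimal sketch (in Elixir, which runs on the Erlang VM) of a circuit breaker that sheds load when event volume spikes within a window. The module name, thresholds, and the `forward/1` placeholder are illustrative assumptions, not our actual sessions code.

```elixir
defmodule SketchBreaker do
  @moduledoc """
  Minimal circuit-breaker sketch around event dispatch.
  All names and numbers are illustrative assumptions.
  """
  use GenServer

  @max_events_per_window 10_000   # hypothetical trip threshold
  @window_ms 1_000                # length of the counting window

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  # Dispatch an event through the breaker; returns :ok or {:error, :open}.
  def dispatch(breaker, event), do: GenServer.call(breaker, {:dispatch, event})

  @impl true
  def init(:ok), do: {:ok, %{count: 0, window_start: now_ms(), open: false}}

  @impl true
  def handle_call({:dispatch, event}, _from, state) do
    state = maybe_reset_window(state)

    cond do
      state.open ->
        # Breaker is open: shed load instead of forwarding the event.
        {:reply, {:error, :open}, state}

      state.count >= @max_events_per_window ->
        # Too many events in the current window: trip the breaker.
        {:reply, {:error, :open}, %{state | open: true}}

      true ->
        forward(event)
        {:reply, :ok, %{state | count: state.count + 1}}
    end
  end

  # Start a fresh window (and close the breaker) once the old one expires.
  defp maybe_reset_window(%{window_start: start} = state) do
    if now_ms() - start >= @window_ms do
      %{state | count: 0, window_start: now_ms(), open: false}
    else
      state
    end
  end

  # Placeholder for the real work of handing an event to the sessions service.
  defp forward(_event), do: :ok

  defp now_ms, do: System.monotonic_time(:millisecond)
end
```

Safeguards like this live in application code, which is why they could not contain a stall inside the VM’s own lower-level machinery.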
The root cause of this outage was the guilds node experiencing a host error and rebooting. Despite the effort we put into building services that remain fault tolerant and reliable in the face of individual node failures, this single-node issue caused a cascading failure within another cluster. The eventual decision to fully reboot the sessions cluster and reconnect all users was based on experience gained from previous incidents, and it allowed service to recover completely.
One of our engineering team’s primary focuses is the constant effort of scaling our internal services to keep up with the immense growth we’ve experienced. This incident exposed what we believe to be a bug in the Erlang VM that is only triggered at extremely high event velocity, such as when a node dies and forces a cluster to rebalance. In this case we believe the best path forward is to refactor our code to avoid low-level Erlang utilities that we cannot control or limit with traditional circuit breakers and backpressure.
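As one example of the kind of control we can apply above the VM, the sketch below shows a session process that checks its own mailbox length and drops guild events once a hypothetical threshold is exceeded, rather than letting the queue grow without bound. The module name, threshold, and message shape are assumptions for illustration only.

```elixir
defmodule SketchSession do
  @moduledoc """
  Sketch of mailbox-aware load shedding in a session process.
  The threshold and message shapes are assumptions, not real code.
  """
  use GenServer

  @max_queue_len 5_000  # hypothetical cap before we start dropping events

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok), do: {:ok, %{dropped: 0}}

  @impl true
  def handle_cast({:guild_event, event}, state) do
    # How many messages are already waiting in this process's mailbox?
    {:message_queue_len, queue_len} = Process.info(self(), :message_queue_len)

    if queue_len > @max_queue_len do
      # Under a flood (e.g. a guilds node rebooting), drop the event
      # and count it instead of letting the mailbox grow unboundedly.
      {:noreply, %{state | dropped: state.dropped + 1}}
    else
      handle_guild_event(event)
      {:noreply, state}
    end
  end

  # Placeholder for the real per-event work.
  defp handle_guild_event(_event), do: :ok
end
```

This sort of shedding works because it operates on ordinary process mailboxes, which application code can inspect and bound; the refactor described above is about keeping hot paths on primitives like these rather than on VM internals we cannot limit.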
In total this outage lasted around 20 minutes, with around 30 minutes of service degradation in various forms. As usual, although we’re happy with our team’s ability to quickly diagnose and repair issues, we’re always looking to do better, and we will be improving our process and infrastructure around the items above.