At 4:26 AM an internal service called push, used to dispatch APNS and GCM push notifications to our users, began to fail, causing all three servers in the cluster to become unavailable at the same time. This failure caused our API to erroneously return a 500 HTTP status code along with an exception message, despite messages being sent successfully. Because of this error code, our clients retried sending the message, which caused duplicate messages to be sent across Discord.
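The postmortem doesn't include client code, but the retry behavior it describes is easy to sketch. The snippet below is a hypothetical illustration, not Discord's client: a client that retries any 5xx response will resend a message that the server actually delivered, producing one duplicate per retry.

```python
# Hypothetical sketch of the client behavior described above (not Discord's
# actual client code): retry the send whenever the server reports a 5xx.
# If the server delivered the message but still returned 500, every retry
# produces a duplicate.

def send_with_retry(post, payload, max_attempts=3):
    """post(payload) -> HTTP status code; retry on any 5xx response."""
    status = None
    for _ in range(max_attempts):
        status = post(payload)
        if status < 500:
            return status  # success (or a non-retryable client error)
    return status  # gave up after max_attempts
```

Because the client cannot distinguish "failed before delivery" from "delivered, then errored", a 500 on a successful send turns a safety mechanism into a duplication machine.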
At 4:28 a page was dispatched to our on-call engineer at the time (JH), who was notified that our push service had failed. Shortly thereafter, at 4:29, users reported issues directly to another engineer (AZ). At 4:31 our engineers had confirmed the issue causing duplicate messages and updated our status page to reflect this. By 4:32 both engineers had coordinated and confirmed that the issues with our push service (which they noted had been deployed the previous day at 22:30) were for some reason causing send-message API requests to return failure status codes. At 4:34 engineers decided to reboot two of the failing push servers, which brought a temporary resolution until service degraded again at 4:36. At this point they confirmed that the push service continued to report a failed status despite being restarted. At 4:38 engineers attempted to force the push services to report a healthy status, in theory allowing the API to talk to them again. After multiple attempts, this succeeded at 4:39 and the incident was fully resolved. Engineers continued to work on rolling back the previous push deploy to ensure the issue would not be triggered again, which was completed at 4:46.
Further investigation after the incident revealed that two bugs in separate services (one in push, one in our API) triggered this sequence of events and caused the duplicate-message issues. The first bug was related to a recent upgrade to the aforementioned push service. The upgrade, which had been deployed at 22:30 on January 5th, changed the code that monitors and reports the health of our push service so that it can be discovered and used by our API. A bug in this code caused the health check to get permanently stuck, de-announcing the push service from our service discovery system. Because this bug was deployed to all three of our redundant, clustered push instances, they failed at almost the same time, causing a full service outage for push.
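The post doesn't show the health-check code, but the failure mode (a check that gets permanently stuck and stops reporting) suggests an unbounded probe. As a minimal sketch, assuming a hypothetical `probe` callable rather than Discord's actual implementation, one common defense is to bound each probe with a timeout so a hung check is reported as unhealthy instead of wedging the whole reporting loop:

```python
import threading

# Hypothetical sketch (not Discord's actual health-check code): run the
# probe in a worker thread and bound it with a timeout. A probe that hangs
# forever is treated as "unhealthy" rather than silently stopping all
# health reports, which is what led to the de-announcement described above.

def check_health(probe, timeout=2.0):
    """Return True only if probe() completes within `timeout` and returns truthy."""
    result = {}

    def _run():
        try:
            result["healthy"] = bool(probe())
        except Exception:
            result["healthy"] = False

    worker = threading.Thread(target=_run, daemon=True)
    worker.start()
    worker.join(timeout)
    # If the worker is still running, the probe is stuck: report unhealthy
    # instead of blocking the reporting loop indefinitely.
    return result.get("healthy", False)
```

With this shape, a stuck probe degrades one node's health status rather than freezing the loop that announces it to service discovery, though it wouldn't by itself have prevented all three instances from going unhealthy at once.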
Generally, we build our interdependent services to continue working through failures where possible. However, due to a known bug in a portion of our message-sending API code (which was slated to be fixed over the next couple of days), the issues with our push service caused exceptions to be raised within our API, returning 500s to the clients and causing those clients to retry sending the message.
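The fix described here amounts to isolating a best-effort side effect from the critical path. The sketch below is a hypothetical illustration of that pattern, with made-up `store` and `push` interfaces rather than Discord's actual API code: push dispatch failures are logged and swallowed, so a successful message save is never reported to the client as a 500.

```python
import logging

log = logging.getLogger("api.messages")

# Hypothetical sketch (not Discord's actual API code): push notification is
# a best-effort side effect of sending a message. If push is down, log it
# and move on; raising here would turn a successful send into a 500 and
# trigger the client retries that caused the duplicate messages.

def send_message(store, push, channel_id, content):
    message = store.save(channel_id, content)  # the authoritative write
    try:
        push.notify(channel_id, message)       # best-effort notification
    except Exception:
        # Push being unavailable must not fail a send that already succeeded.
        log.exception("push dispatch failed for channel %s", channel_id)
    return {"status": 200, "message": message}
```

The design choice is that the client's success signal tracks the authoritative write (the message store), not downstream fan-out; any failure after the write is an operational problem to alert on, not an error to return.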
The reliability of Discord's service is of the utmost importance to us, and we treat all issues, even those causing little to no service impact, with high priority. Both bugs causing these issues were fixed shortly after we fully resolved the incident. Thanks for flying Discord, and we apologize for any issues this incident caused.