At around 12 noon yesterday, Google began rolling back an update they had recently pushed to the GCE networking stack. Google believes this change is unrelated to the networking issues Discord was seeing. The rollback completed at approximately 3PM yesterday, and since then we have not observed a single networking issue. At this time we believe the change may have either directly or indirectly been causing the issues Discord saw, and thus we'll be tentatively calling an all clear.
We'll be continuing to dive into this issue both with Google and within our own engineering team to make sure we understand the full scope of the problem, and working to push out updates that will improve Discord's resiliency as a whole. We hope to follow up with a full post-mortem in the next few days.
As always, we're very sorry for any interruption these issues brought you and we'd like to thank everyone for their patience and understanding as we work through these issues with Google.
Nov 19, 12:38 PST
At this time we believe the majority of service has recovered for users. That said, we'd like to provide a more in-depth update on the issues users have been experiencing over the past few days.
We're currently working with Google on a priority 0 networking ticket for their Google Cloud Platform (which we use to bring you Discord). Over the past day we've observed multiple major network partitions and other networking issues on the nodes of our real-time system responsible for keeping your Discord clients up to date. These networking "blips" are causing issues within various layers of our software, and many of the issues we've diagnosed will require development and testing to improve our resiliency (something we will be focusing on).
Unfortunately, despite the dialog we've had with Google throughout this process, they haven't yet narrowed the problem down to a clear root cause. We deem the quality of service our users are getting through this process unacceptable, and have communicated this to Google's support and SRE teams. We're working around the clock to ensure Google properly diagnoses and resolves the issues we're seeing, while also monitoring and supporting our infrastructure in the hopes we can quickly catch these issues and prevent them from spreading.
As always, apologies for the interruptions you've experienced, and thanks for using Discord in your day to day. We hope you understand how much the performance and reliability of our service matters to us, and we hope you see improvements as we work through these issues with Google.
Nov 18, 12:56 PST
We've restarted some core services to assist in getting users online, and we're simultaneously working on implementing and deploying some code changes that should improve the reconnect process for users. Additionally, we're actively communicating with members of Google's SRE team while they diagnose and debug the networking problems we're seeing. Finally, we're hoping to have a full update for users within the next 30 minutes to help explain the severity and frequency of the issues they've been seeing this week.
Nov 18, 12:28 PST
We're yet again investigating a major outage causing offline guilds and connection issues. We're still working both internally and externally with Google to resolve this issue.
Nov 18, 11:42 PST