All Systems Operational
API   Operational (99.64% uptime over the past 90 days)
Gateway   Operational (99.9% uptime over the past 90 days)
CloudFlare   Operational
Media Proxy   Operational (100.0% uptime over the past 90 days)
Voice Operational
EU West   Operational
EU Central   Operational
Singapore   Operational
Sydney   Operational
US Central   Operational
US East   Operational
US South   Operational
US West   Operational
Brazil   Operational
Hong Kong   Operational
Russia   Operational
Japan   Operational
Operational
Degraded Performance
Partial Outage
Major Outage
Maintenance
Past Incidents
Jan 22, 2018
Resolved - Service is fully restored. We believe the root cause was a network layer event within Google Cloud and we are talking to their oncall engineers to try to understand what happened.
Jan 22, 00:13 PST
Monitoring - Services are restored and users have reconnected. We're monitoring the situation and working to understand the root cause.
Jan 21, 23:35 PST
Update - Services are coming back online. This will take a few minutes and things may be a little slow while the systems warm back up.
Jan 21, 23:23 PST
Identified - We're performing a restart of Discord core services. This will completely take Discord offline while we perform the restart.
Jan 21, 23:15 PST
Investigating - We're investigating another issue now. It's looking likely that we're going to need to perform a restart of core Discord services to get things back to a steady state.
Jan 21, 23:08 PST
Monitoring - We've identified the issue and taken steps to resolve it. Service should be returning to normal and all guilds should be back online.
Jan 21, 22:56 PST
Investigating - We're investigating some issues with two of our sessions nodes causing some impact to usage.
Jan 21, 22:50 PST
Jan 20, 2018
Resolved - At this time we're not aware of any further issues causing elevated latency or abnormal error rates.
Jan 20, 16:05 PST
Monitoring - We've isolated the issue and instituted a fix. We're continuing to monitor as error rates and latency recover.
Jan 20, 15:58 PST
Investigating - We're investigating abnormal error rates and elevated latency.
Jan 20, 15:56 PST
Jan 19, 2018
Resolved - This won't be a long postmortem by our usual standards, but since this is the fourth time this has happened in the past two weeks, I wanted to let you know what's been going on.

In summary, there is a known issue in our stack when certain user behavior happens. We end up seeing Cassandra performance tank on the nodes that are serving the affected partitions. This leaves our API servers tied up sending very slow requests to those partitions. The API servers then back up with requests, which ends up affecting even users who aren't on the slow partitions.

The fix for this generally is to implement what is called a 'circuit breaker': a timeout mechanism that blacklists the offending partition so further requests fail quickly and don't impact other users. We had rolled out some code to do that earlier in the week, but due to a bug with the implementation it didn't trigger during today's outage. We're fixing that now.
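
For readers unfamiliar with the pattern, here is a minimal sketch of a per-partition circuit breaker in Python. It is an illustration only, not the code we run: the class name, thresholds, and cooldown are assumptions made up for this example, and it measures how long each call took after the fact rather than cancelling a slow call in flight.

```python
import time
from typing import Any, Callable, Dict, Hashable


class CircuitOpenError(Exception):
    """Raised when a request is rejected because its partition is blocked."""


class PartitionCircuitBreaker:
    """Hypothetical breaker: blocks a partition after repeated slow or failed calls."""

    def __init__(
        self,
        slow_threshold_s: float = 0.5,   # assumed: calls slower than this count as slow
        max_slow_calls: int = 3,         # assumed: consecutive slow calls before blocking
        cooldown_s: float = 30.0,        # assumed: how long a partition stays blocked
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.slow_threshold_s = slow_threshold_s
        self.max_slow_calls = max_slow_calls
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._slow_counts: Dict[Hashable, int] = {}
        self._blocked_until: Dict[Hashable, float] = {}

    def is_blocked(self, partition: Hashable) -> bool:
        until = self._blocked_until.get(partition)
        return until is not None and self._clock() < until

    def call(self, partition: Hashable, request: Callable[[], Any]) -> Any:
        # Fail fast while the partition is blocked, so callers do not pile up.
        if self.is_blocked(partition):
            raise CircuitOpenError(f"partition {partition!r} is temporarily blocked")
        # Cooldown has elapsed (or was never set): let the request through.
        self._blocked_until.pop(partition, None)

        start = self._clock()
        try:
            result = request()
        except Exception:
            self._record_slow(partition)
            raise
        if self._clock() - start > self.slow_threshold_s:
            self._record_slow(partition)
        else:
            # A healthy call resets the consecutive-slow counter.
            self._slow_counts[partition] = 0
        return result

    def _record_slow(self, partition: Hashable) -> None:
        count = self._slow_counts.get(partition, 0) + 1
        if count >= self.max_slow_calls:
            # Block the partition and start counting afresh after the cooldown.
            self._blocked_until[partition] = self._clock() + self.cooldown_s
            count = 0
        self._slow_counts[partition] = count
```

A caller would wrap each storage read or write in `breaker.call(partition_key, fn)` and treat `CircuitOpenError` as an immediate failure for that partition only. Requests to other partitions are unaffected, which is the property that keeps one slow partition from backing up every API server.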

Furthermore, we're also updating our procedures to make sure that when we implement these kinds of things going forward we'll have a manual testing step to ensure it actually works as intended. We wrote unit tests and everything looked good, but nobody actually verified that the functionality worked.

We use Discord a lot and we know you do too. We're sorry for the interruptions that this has caused.
Jan 19, 10:37 PST
Monitoring - All graphs look normal. We're continuing to observe and will post more information about what happened when we've finished analyzing the root cause.
Jan 19, 10:02 PST
Identified - We've identified an issue with our Cassandra store for messages and are remediating. We expect service to be restored shortly.
Jan 19, 09:53 PST
Investigating - We're aware of an issue sending messages at the moment. The team is investigating.
Jan 19, 09:48 PST
Jan 18, 2018

No incidents reported.

Jan 17, 2018
Resolved - We're not aware of any other connection or offline server issues at this time.
Jan 17, 10:26 PST
Monitoring - We believe the aforementioned fix has resolved the majority of offline server issues. We're continuing to monitor as service stabilizes.
Jan 17, 07:53 PST
Identified - We've isolated the issue causing offline servers and have instituted a fix.
Jan 17, 07:44 PST
Investigating - We're aware of some users experiencing issues when connecting and observing an abnormal number of offline servers. We're currently investigating the issue.
Jan 17, 07:37 PST
Jan 16, 2018

No incidents reported.

Jan 15, 2018

No incidents reported.

Jan 14, 2018

No incidents reported.

Jan 13, 2018

No incidents reported.

Jan 12, 2018

No incidents reported.

Jan 11, 2018
Resolved - At this time we're not aware of any further issues with message latency or error rates. We're working to roll out a more permanent fix that should help the overall stability of messages.
Jan 11, 12:50 PST
Identified - We've isolated the problem causing high latency and error rates when sending and loading messages, and instituted a fix.
Jan 11, 12:17 PST
Investigating - We're aware of some issues sending messages and are currently investigating.
Jan 11, 12:12 PST
Jan 10, 2018

No incidents reported.

Jan 9, 2018

No incidents reported.

Jan 8, 2018

No incidents reported.