All Systems Operational
API   Operational (99.96 % uptime over the past 90 days)
Gateway   Operational (100.0 % uptime over the past 90 days)
CloudFlare   Operational
Media Proxy   Operational (100.0 % uptime over the past 90 days)
Voice   Operational
EU West   Operational
EU Central   Operational
Singapore   Operational
Sydney   Operational
US Central   Operational
US East   Operational
US South   Operational
US West   Operational
Brazil   Operational
Hong Kong   Operational
Russia   Operational
Japan   Operational
South Africa   Operational
API Response Time (metrics graph; data not captured)
Past Incidents
Feb 16, 2019

No incidents reported today.

Feb 15, 2019

No incidents reported.

Feb 14, 2019

No incidents reported.

Feb 13, 2019

No incidents reported.

Feb 12, 2019
Resolved - We've rolled out fixes that should address the platform instability. We observed a few interesting failure modes here and have rolled out mitigations for each of them; however, the root cause of the initial failures is still under investigation.

Discord runs a cluster of Redis nodes that we use for caching, namely of push notification settings and partial user objects. This cache cluster sees peak throughput on the order of multiple million queries per second to power Discord. During peak hours today and throughout the past few hours, throughput on 4 of our cache nodes dropped and stayed in a degraded state. We ejected some cache nodes from the cluster in an attempt to remediate the issue; however, after ejecting 2 and continuing to observe similar failures on other nodes, we decided not to eject any more, as further reducing cache-cluster capacity would have adversely affected overall service availability.
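For illustration only, here is a minimal sketch of a cache-aside read path of the kind described above, assuming a hypothetical redis-py cache client and a PyMongo fallback to the upstream data store; the key names, database, and collection are invented for the example and are not Discord's actual code.

```python
# Hypothetical sketch of a cache-aside read; not Discord's implementation.
import json

import pymongo
import redis

cache = redis.Redis(host="cache-node-1", port=6379)          # one node of the cache cluster
db = pymongo.MongoClient("mongodb://db-host:27017")["app"]   # upstream data store (MongoDB)

def get_push_settings(user_id: str) -> dict:
    """Return push notification settings, preferring the cache."""
    key = f"push_settings:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    # Cache miss: fall back to the upstream store, then repopulate the
    # cache with a short TTL so a degraded node can heal on its own.
    doc = db.push_settings.find_one({"_id": user_id}) or {}
    cache.set(key, json.dumps(doc, default=str), ex=300)
    return doc
```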

Generally, when a cache node is degraded or unavailable, we have circuit breakers in place that degrade low-priority tasks while preserving platform availability. However, there was a bug in our circuit-breaking code, and 2 bugs downstream of the circuit breaker, that caused a thundering herd on the upstream data store for these objects (MongoDB), causing it to become overloaded and flap in availability. We've since deployed fixes to our API service layer to prevent the thundering-herd issue that adversely affected platform stability.
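As a rough sketch of the circuit-breaker pattern referred to above (the thresholds, names, and shedding policy here are assumptions for illustration, not Discord's actual service code), the breaker opens after repeated failures and sheds low-priority database lookups instead of letting cache misses stampede the upstream store:

```python
# Hypothetical circuit breaker that sheds low-priority work when the cache
# path is failing, so misses cannot stampede the upstream database.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        # While open, reject calls until the reset timeout elapses,
        # then let one attempt through (a simple half-open state).
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return False
            self.opened_at = None
            self.failures = 0
        return True

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

    def record_success(self) -> None:
        self.failures = 0


breaker = CircuitBreaker()

def fetch_partial_user(user_id, cache_get, db_get):
    """Serve from cache; only fall through to the database while the
    breaker is closed, shedding the low-priority lookup otherwise."""
    value = cache_get(user_id)
    if value is not None:
        return value
    if not breaker.allow():
        return None  # degrade gracefully instead of hitting the database
    try:
        value = db_get(user_id)
        breaker.record_success()
        return value
    except Exception:
        breaker.record_failure()
        raise
```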

We are, however, still observing issues with our cache nodes that are causing intermittent blips; hopefully those blips will not result in service disruption or degraded performance. In parallel, we are investigating the root cause of the Redis node throughput issues with the Google Cloud Platform team.
Feb 12, 17:11 PST
Update - Service is restored; engineers are working to address the root cause of the failure.
Feb 12, 14:50 PST
Identified - Root cause has been identified and remediations are being applied.
Feb 12, 13:42 PST
Update - Engineers are actively responding and restoring service.
Feb 12, 13:20 PST
Investigating - We are currently investigating an increased error rate and latency for API requests.
Feb 12, 12:41 PST
Feb 11, 2019

No incidents reported.

Feb 10, 2019

No incidents reported.

Feb 9, 2019

No incidents reported.

Feb 8, 2019

No incidents reported.

Feb 7, 2019

No incidents reported.

Feb 6, 2019

No incidents reported.

Feb 5, 2019

No incidents reported.

Feb 4, 2019

No incidents reported.

Feb 3, 2019

No incidents reported.

Feb 2, 2019

No incidents reported.