Resolved -
The root cause was that one of our guild servers crashed unexpectedly, and the automated recovery hit a bug we were working on last week but whose fix has not yet been deployed. That bug stalled the recovery, which led to the extended degradation of service.
We will expedite deploying the fix for the recovery issue. We will also continue last week's investigation into why a few of our systems infrequently enter an escalating memory allocation spiral and ultimately run out of memory.
Apr 23, 09:21 PDT
Monitoring -
We've identified the problem (a crashed server) and brought the server back online. We will now investigate why automatic remediation/failover didn't occur and why these guilds stayed offline much longer than they should have.
Apr 23, 08:57 PDT
Investigating -
We are receiving reports that some guilds are showing as offline (temporary outage) for users. We are investigating now.
Apr 23, 08:29 PDT