Turns out we didn't need to disable typing events (this time); we were able to get the cluster stable just by reducing the number of API servers by about 10%.
Presence is back up and running & happy, DMs are flowing like normal.
We've got a fix in the works for the underlying root cause (overstressing our service discovery system, which prevented the presence system from restarting with all nodes). We should have this deployed next week, but in the meantime, things should be stable.
Thank you very much for your patience as we worked through this, and we are super sorry for the inconvenience!
Posted Oct 05, 2019 - 14:57 PDT
We discovered an underlying issue: our etcd cluster was overloaded, which was having knock-on effects on restarting our presence cluster.
In order to reduce load on etcd, we've had to scale down our API cluster, and to ensure we can still provide service we've had to globally disable typing events temporarily. We are working on a fix to the etcd load issue, but the good news is that DMs look to be recovering.
Posted Oct 05, 2019 - 14:45 PDT
Unfortunately we continue to have issues with the service. We've brought in more engineers (the Jake) and are working on it.
Posted Oct 05, 2019 - 14:28 PDT
The cluster has been restarted successfully and is recovering. It usually takes 10 to 15 minutes for all online users to re-establish connections to the presence cluster, at which point service will be restored.
Posted Oct 05, 2019 - 14:17 PDT
We are having issues with our presence cluster (which handles direct messages) and are currently in the process of restarting it.
Posted Oct 05, 2019 - 14:11 PDT
The team is aware of a major impact on sending direct messages right now. We are online and working to fix the issue.