Making sure Shortcut is there when you need it
Update April 1st @ 12PM ET
We have completed our internal post-mortem of the five database connectivity incidents that plagued Shortcut throughout March and are confident that we have identified their root cause. Our technical response working group has a solution in place, along with additional protections that will prevent the set of circumstances that created the incidents. The working group will now resume its regular responsibilities.
What’s the summary of what happened and how you have resolved it?
During multiple high-load periods, the database caching layer failed, causing read timeouts that made the app unusable. Getting the system back to a usable state took multiple hours across several incidents. The immediate issue was resolved by creating a new database caching cluster with larger capacity.
Who was affected? Was there any permanent loss of data?
All users were affected: Shortcut was unusable on and off during this time. Using the app during these periods would have been painful, and it would have been clear that any attempt to save or update information was failing. There was no permanent loss of data for any customer.
What was the root cause of the failure?
We exceeded the network capacity of one database caching layer node. This caused peers to time out on GETs from the cache, fall back to FETCHing from storage, and WRITE the results back to the cache layer. Those write-backs pushed network load on the cache node even further beyond its capacity, so the node could never recover from the read/write stampede. The stampede also pushed storage usage above its provisioned level, because storage reads increased more than 10x.
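To make the GET / FETCH / WRITE-back cycle above concrete, here is a minimal read-through cache sketch in Python. It is illustrative only: the `cache` and `storage` client interfaces, the TTL, and the single-flight locking are assumptions for this example, not our actual implementation. The sketch shows why every miss or timeout turns into an extra storage read plus a write back to the cache (the traffic that kept the saturated node from recovering), and how coalescing concurrent misses for the same key is one common way to blunt that amplification.

```python
import threading

class ReadThroughCache:
    """Minimal read-through cache sketch: GET from the cache, FETCH from
    storage on a miss, then WRITE the value back so later reads hit the cache."""

    def __init__(self, cache, storage, ttl_seconds=300):
        self.cache = cache      # hypothetical Redis-like client (get/set)
        self.storage = storage  # hypothetical primary-database client (fetch)
        self.ttl = ttl_seconds
        self._guard = threading.Lock()
        self._inflight = {}     # key -> lock, to coalesce concurrent misses
                                # (cleanup of this dict is omitted for brevity)

    def get(self, key):
        value = self.cache.get(key)            # GET from cache
        if value is not None:
            return value

        # Coalesce concurrent misses for the same key so only one caller goes
        # to storage; the others wait, then re-read the freshly filled cache.
        with self._guard:
            lock = self._inflight.setdefault(key, threading.Lock())
        with lock:
            value = self.cache.get(key)        # another caller may have filled it
            if value is None:
                value = self.storage.fetch(key)          # FETCH from storage
                self.cache.set(key, value, ex=self.ttl)  # WRITE back to cache
        return value
```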
Can you share more details on what you actually did?
Completed:
- Reprovisioned caching layer.
- Added per-node metrics and alarms for when network usage approaches the threshold.
- Added a procedure for proactively reprovisioning if approaching the limit.
- Added more logging, metrics, and alarms related to network usage at all layers to observe network usage and max queue depth of connections.
- Tightened limits to shed load at the API layer to avoid unbounded spikes in open requests (see the sketch below these lists).
- Implemented routing changes to improve cache locality at the API layer.
- Updated internal processes to form a working group for root cause analysis on Severity 1 incidents.
Ongoing:
- Working with our database vendor on alternative caching layer strategies.
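To illustrate the load-shedding item in the list above, here is a minimal sketch of capping in-flight requests at an API layer. The cap, the handler shape, and the 503-style response are assumptions for the example rather than our actual configuration; the idea is simply that once open requests hit the cap, new requests fail fast instead of queueing without bound.

```python
import threading

class LoadShedder:
    """Illustrative cap on concurrent requests: admit up to max_in_flight
    and immediately reject the rest with a 503-style response."""

    def __init__(self, max_in_flight=200):   # illustrative limit only
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, request, handler):
        if not self._slots.acquire(blocking=False):
            # Shed load: fail fast rather than letting open requests pile up.
            return {"status": 503, "body": "temporarily overloaded, please retry"}
        try:
            return handler(request)            # normal request processing
        finally:
            self._slots.release()
```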
What did you learn? What went wrong?
- We should have formed the “Tactical Response Working Group” much sooner. As a result, we have changed our process to include the creation of the TRWG by default for Severity 1 incidents when we have an unresolved root cause.
- An instance's advertised resources are not guaranteed to be fully available under load. Capacity calculations alone are insufficient; observing and testing actual capacity is the only way to know for sure. We have added metrics, alarms, and additional observability to ensure we have accurate information for provisioning and planning.
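As one hedged illustration of that lesson, the check below compares a node's observed network throughput with a ceiling measured under test load rather than the instance's advertised figure. The threshold and numbers are assumptions for the example, not values from our monitoring stack.

```python
def check_network_headroom(observed_bytes_per_sec, measured_capacity_bytes_per_sec,
                           warn_fraction=0.7):
    """Compare live throughput against the capacity we actually measured under
    test load (not the advertised spec) and report the remaining headroom."""
    utilization = observed_bytes_per_sec / measured_capacity_bytes_per_sec
    if utilization >= warn_fraction:
        # In a real system this would page on-call or trigger proactive reprovisioning.
        return f"ALARM: network utilization at {utilization:.0%} of measured capacity"
    return f"OK: network utilization at {utilization:.0%} of measured capacity"

# Illustrative numbers only:
print(check_network_headroom(observed_bytes_per_sec=900_000_000,
                             measured_capacity_bytes_per_sec=1_100_000_000))
```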
----------------------------------------------------
Update March 29th @ 5PM ET
As of this afternoon, our back-end engineering team has high confidence that it has uncovered the root cause of the database connectivity issues, and it has already put short-term solutions in place to reduce the likelihood of further service disruptions. This builds upon the over-provisioning we started last week.
We have also put new monitoring systems in place to help us identify any similar issues, and we should be able to respond more quickly than we have over the past few weeks. That said, we are not ready to declare victory and want to monitor performance over the next few days. We plan to provide another update at that point.
Finally, we want you to know that when the issue is fully resolved we're planning to do a full and public root cause analysis and will share even more of the nerdy details than we already have.
----------------------------------------------------
Update March 25th @ 5PM ET
The Shortcut app has experienced five database connectivity issues this month resulting in more than four hours of total downtime.
This kind of performance is unacceptable and we have re-routed significant portions of our engineering teams to determine the root cause and remedy the situation as quickly as possible.
As part of our commitment to transparency, we wanted to share what is happening and what we’re doing to make sure Shortcut performs as the rock-solid application we’ve all come to expect. (Heck we use Shortcut to build Shortcut!)
What are you doing to remedy the situation?
In the short term, our focus is on over-provisioning our servers and sub-systems to avoid having this situation come up again. As we do this, we’re undertaking work to ensure we handle sub-system capacity spikes more gracefully. We’ve increased the number of servers by 70%, doubled individual server capacity, and significantly increased the memory available to our databases.
We believe that these actions will resolve the symptoms (downtime) while we eradicate the root cause. To do this, we’ve allocated the majority of our back-end engineering team to address the issues. They are wholly dedicated to this problem until it is resolved. That means a handful of features we planned to ship will be slightly delayed, but we know those delays pale in comparison to application availability.
When will I know more?
As anyone who builds software fundamentally understands, it’s almost impossible to predict exactly when this issue will be resolved. What we can offer, however, is frequent and consistent communication.
Until this issue is fully resolved, we promise to update this page with every major status update. Transparency is one of our core values, and that means not just internally but with you, our customers, partners, and prospects, as well. And just for good measure, even if we have nothing new to say, we’ll update this page every few days and say exactly that, so you can be confident you always have the most up-to-date information.
You can expect the next update on or before 5PM PT Monday March 29th.
What really happened? Give me the nerdy details!
So far in March, roughly once a week, Shortcut has encountered a cascading server failure event that resulted in a loss of service for users. During each of these periods, server requests slowed, with many timing out, degrading service to the point that the application became unresponsive and unusable.
All of these events share a common thread: the backend services hit an unusually high volume of reads from database disk, followed by cascading sub-service failures. Disk read throughput spiked and oscillated between 4x and 270x normal throughput for similar load in similar periods. These read stampedes were brought about by a contributing condition that caused memory cache misses. The end result was that our API server p95+ response times increased dramatically and a significant portion of requests timed out with 50x errors. Many users experienced this as an extremely slow or unavailable service.
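To give a feel for why cache misses amplify disk reads so sharply, here is a small back-of-the-envelope calculation. The hit rates are illustrative assumptions, not measurements from our system; the point is that even a modest drop in cache hit rate multiplies the read load that falls through to the database disk.

```python
def storage_read_amplification(healthy_hit_rate, degraded_hit_rate):
    """How many times more reads reach storage when the cache hit rate drops.
    Misses (1 - hit rate) are the fraction of requests that fall through."""
    return (1.0 - degraded_hit_rate) / (1.0 - healthy_hit_rate)

# Illustrative numbers only: a cache that normally absorbs 99% of reads
# but degrades to a 90% hit rate sends ~10x as many reads to storage...
print(storage_read_amplification(0.99, 0.90))   # ~10x
# ...and one that degrades to a 60% hit rate sends ~40x as many.
print(storage_read_amplification(0.99, 0.60))   # ~40x
```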
If you have specific questions, concerns, or comments, please reach out. We want to make sure everyone is confident Shortcut will be there when you need us.