(Note: This is the third article in a series about how we scaled Shortcut's backend. Here are the links to previous posts in the series: 1. How we didn't rewrite Shortcut; 2. Monolith, meet mono-repo.)
When it came to splitting services out of our backend monolith, one of the first areas we wanted to tackle was incoming webhooks. Shortcut makes extensive use of webhooks for integration with other services, most notably GitHub.
The original GitHub integration in Shortcut had webhooks arriving at the same cluster of machines as regular API traffic, a.k.a. the monolith. As discussed in the first article in this series, the monolith was convenient for Shortcut developers, especially in the early startup days. But as Shortcut grew and traffic increased, the webhook requests started to cause problems.
At first glance, a webhook seems much like any other API. It is, after all, just another HTTP request. But on closer examination, webhooks have a few important differences that set them apart. Shortcut is primarily a user-facing application: “Normal” API requests are driven by user interaction, whereas webhook requests are driven by events in other systems. Even though it’s all HTTP on the wire, normal APIs and webhooks have different behavior and expectations:
The unpredictable size and rate of webhook requests sometimes caused us problems when they were hitting the same servers as “normal” API requests. A single user action in GitHub — merging a pull request, say — might reference hundreds or even thousands of commits. Parsing that webhook consumes time and resources on the server that received it, leaving less for other customers. A flood of webhook events could briefly overwhelm a majority of the backend cluster, degrading performance for everyone.
Then there’s the problem of lost webhooks. Most webhook senders are fire-and-forget; if the request does not succeed, it is never retried. The backend API server is a complex beast that could fail for any number of reasons, from hardware in a datacenter all the way up to a bug in our code. If one of those failures happens while handling a webhook request, then that’s it, we’ve lost that webhook event for good. If the backend API is down (hey, it happens) then we lose all the webhook events sent during that period.
As a general rule, simpler systems are more reliable than complex ones. But here we have a great counter-example where adding more moving parts can, counterintuitively, make the overall system more reliable.
If we think about the constraints placed on us by incoming webhooks, we see they are mostly about availability: We need to be able to receive a webhook at all times, with minimal interruptions. But importantly, the response to a webhook does not need to include the results of processing that webhook. We can acknowledge the webhook request and then process it later.
So our webhook processing architecture can trade responsiveness for better availability and reduced data loss. Fortunately, there’s a well-known technology that already has these properties: queues! We decided to introduce a queue between the process that receives the webhook and the process that handles it. In our AWS-based infrastructure, it looks like this:
The “Webhook Receiver” process runs an HTTP server to receive webhook requests. It does almost nothing with the request, just writes it to S3, puts a message on an SQS queue, and responds with
HTTP 200 OK to the sender.
Another process on a different machine, “Webhook Writer,” pulls messages from the queue and processes them. Most of this code was lifted from the old, monolithic API Server. Essentially, we created a new process that emulated just enough of the monolith to support the webhook-processing code. Ultimately it was this shim that made the whole project feasible, enabling a major architectural change without the cost of rewriting everything.
Why is this design better? The introduction of Webhook Receiver, SQS, and Webhook Writer adds a set of guarantees that were not possible in the old architecture:
These guarantees move the burden of availability to components that are intrinsically more reliable. The Webhook Receiver is so simple that we almost never think about it: It just works. SQS and S3, of course, are backed by Amazon. The retry/DLQ behavior is configured in SQS, so the Webhook Writer doesn’t need any code to implement retries.
Our trust in the Receiver stack allows us to take more risks with Webhook Writer. We can deploy test new versions with low risk: If Webhook Writer goes down, messages just pile up in the queue until we fix the bug or roll back to a stable release. Having the messages available in the Dead Letter Queue and S3 is a huge help when debugging. This was valuable when working on a major addition to our webhook integrations, as my colleague Toby Crawley described in How we use clojure.spec, Datomic and webhooks to integrate with version control systems.
Best of all, collecting webhook traffic at the “edge” of our system means we can shut down everything else for maintenance and still not lose any incoming data. We have taken advantage of this to perform critical database maintenance — more about that later in this series.
As with any architecture choice, of course, there are still downsides. Our infrastructure is more complicated, with more things to deploy and more AWS permissions to sort out. Debugging misconfigured webhook integrations is slightly harder, because Webhook Receiver does not provide any feedback to the sender about whether the request was successfully processed.
Overall, though, the added complexity to our systems made our lives as developers easier. This architecture change both made our webhook handling more robust and reduced the load on other backend servers. Removing webhook integrations from the API server monolith allowed us to iterate more rapidly with less risk. Problems with webhook processing have less impact on other features. Finally, this architecture can be adapted to other kinds of asynchronous event processing. Adding support for new version control services (GitLab and Bitbucket) was just the first step. We plan to build more integrations with this architecture in the future, details to come.
Nothing to see here...