Earlier this year, Shortcut launched new integrations for both Bitbucket and GitLab. These integrations work similarly to our existing GitHub integration, allowing you to attach commits and branches to Stories within Shortcut, and to transition those Stories between workflow states based on state changes of your merge/pull requests (see our Help Center for details). These integrations are powered by webhook events sent to us from Bitbucket/GitHub/GitLab on your behalf.
In this post, we’re going to take a look at the design of the subsystem that powers these integrations, and which will also serve as the basis for the next version of our GitHub integration (currently in development).
We'll cover some of the requirements that we considered when designing the subsystem that receives and processes these webhook events, give an overview of what we built, and share some of the lessons that we learned in the process.
When designing this subsystem, we had a few non-functional requirements that we wanted to meet:
With our initial GitHub integration implementation, we handled all of the webhooks during a request cycle within our API server instance. There were a couple of notable issues with this approach:
We wanted to address these issues as part of the design of this new subsystem, so we decided to decouple the receipt of an event from its processing, and to bring retries under our own control.
At a service level, this new design has this structure:
This service is relatively simple - it receives the webhook payload from the provider, writes it to an Amazon S3 bucket, then emits a message to an Amazon Simple Queue Service (SQS) queue with some metadata about the payload, including the path to the S3 object that contains the payload itself.
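The store-then-enqueue flow described above can be sketched as follows. This is a minimal illustration in Python rather than our actual implementation: `InMemoryBucket` and `InMemoryQueue` are hypothetical stand-ins for S3 and SQS, and the key layout and message fields are assumptions made for the example.

```python
import json
import uuid


class InMemoryBucket:
    """Hypothetical stand-in for an S3 bucket: maps object keys to raw bytes."""

    def __init__(self):
        self.objects = {}

    def put(self, key, body):
        self.objects[key] = body


class InMemoryQueue:
    """Hypothetical stand-in for an SQS queue: an ordered list of message bodies."""

    def __init__(self):
        self.messages = []

    def send(self, body):
        self.messages.append(body)


def receive_webhook(provider, raw_body, bucket, queue):
    """Store the raw payload, then enqueue a small message that points to it."""
    event_id = str(uuid.uuid4())
    key = f"{provider}/{event_id}.json"  # assumed key layout
    # Write the payload first so the queued pointer never refers to a missing object.
    bucket.put(key, raw_body)
    message = {"event_id": event_id, "provider": provider, "payload_key": key}
    queue.send(json.dumps(message))
    return event_id
```

The queued message stays small regardless of payload size, because it carries only metadata and the object key.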
We chose to use a queue here to gain a few benefits:
There is a trade-off for using a queue here, however: SQS provides at-least-once delivery (we will see every message), but does not guarantee at-most-once delivery (we may see the same message multiple times). To handle this, the subsystem is designed to process messages idempotently, so there is no harm in processing the same message multiple times.
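One common way to get idempotent processing is to key every write by a natural identifier, so that a repeated message overwrites identical data instead of appending to it. The sketch below illustrates that idea in Python; the message fields and the in-memory `store` are assumptions for the example, not a description of our code.

```python
def handle_message(store, message):
    """Record each commit under its story, keyed by commit hash.

    Writes are "set this key to this value" rather than "append", so handling
    the same message again rewrites identical data instead of adding duplicates.
    """
    commits = store.setdefault(message["story_id"], {})
    for commit in message["commits"]:
        commits[commit["hash"]] = commit["message"]


store = {}
message = {"story_id": 12, "commits": [{"hash": "a1", "message": "Fix login"}]}

handle_message(store, message)
after_first = {story: dict(commits) for story, commits in store.items()}

handle_message(store, message)  # simulated duplicate delivery
assert store == after_first  # state is unchanged by the second delivery
```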
We store the payload in S3 instead of including it in the SQS message because the maximum allowed SQS message size is 256 KB, and we can't guarantee that every payload will be smaller than that, especially since we don't control how those payloads are generated. There are other benefits to using S3 for the payloads, as we'll see in a bit.
One drawback of this pattern is that, since we are no longer processing the event as part of the delivery request, we cannot signal the status of processing that event back to the provider that sent it. This means that you can’t use the provider’s UI to determine whether events were actually processed, since every event that was enqueued successfully is reported as delivered. In practice, this hasn’t been an issue, since a processing failure would usually be a fault in our implementation that we would detect and fix quickly.
The Handler service is where all the work happens. It is single-threaded, and there is only ever one instance of the service running. This ensures that the same message is never processed twice concurrently, which removes the need for any coordination or locking mechanisms and simplifies the subsystem.
The Handler is divided into three layers:
Each layer takes data as input and emits data to the next layer (except for the Entity Processing layer, which writes to the database). The data inputs for each layer have a specification (via clojure.spec) that ensures the data is of the expected shape.
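To illustrate the idea of checking data at each layer boundary, here is a rough Python analogue. It is not clojure.spec and is far less expressive; `conforms`, `checked`, and `COMMON_EVENT_SHAPE` are hypothetical names invented for this sketch.

```python
def conforms(data, shape):
    """Return True if `data` is a dict holding a value of the listed type for every key in `shape`."""
    return isinstance(data, dict) and all(
        key in data and isinstance(data[key], expected)
        for key, expected in shape.items()
    )


def checked(shape, layer):
    """Wrap a layer function so that it rejects input of the wrong shape."""

    def wrapper(data):
        if not conforms(data, shape):
            raise ValueError(f"{layer.__name__}: input does not match {sorted(shape)}")
        return layer(data)

    return wrapper


# Hypothetical input shape for the event-processing layer.
COMMON_EVENT_SHAPE = {"kind": str, "branch": str, "commits": list}


def process_event(event):
    """Toy layer body: emit one declaration per commit hash."""
    return [{"entity": "commit", "id": c} for c in event["commits"]]


process_event = checked(COMMON_EVENT_SHAPE, process_event)
```

With the wrapper in place, a malformed event fails loudly at the boundary instead of producing a confusing error deeper in the layer.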
This layer receives the webhook payloads from the providers (with one implementation per provider), and converts the event to an internal, common format, using the provider's API to resolve any missing data. For example, Bitbucket and GitLab don't send all of the commits in a push webhook if the number of commits is over a certain threshold - in those cases, the translator uses the API to get the full list of commits.
We use a common format as the output of the translation layer because the Bitbucket and GitLab webhook events have different shapes, and represent some concepts in very different ways. Having a common format lets us have one implementation for the rules that define what actions we will take based on events, enabling us to implement new features once, with all three providers getting the feature at the same time. To help achieve this, we encapsulate all of the provider-specific code in the translation layer, and the layer's sole responsibility is converting the event into the common format.
Once the translator has all of the data it needs, it emits an event in the common format, which is passed on to the next layer for further processing.
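A translator for push events might look roughly like the Python sketch below. The payload field names are abridged approximations of the providers' push webhooks and should be checked against their documentation; the common format and the injected `fetch_all_commits` callable (which stands in for a provider API call and returns commits in the provider's own shape) are assumptions made for illustration.

```python
def translate_gitlab_push(payload, fetch_all_commits):
    """Convert an (abridged) GitLab push payload to the assumed common format."""
    commits = payload["commits"]
    # If the payload lists fewer commits than the push contained, fetch the rest.
    if payload["total_commits_count"] > len(commits):
        commits = fetch_all_commits(payload)
    ref, prefix = payload["ref"], "refs/heads/"
    branch = ref[len(prefix):] if ref.startswith(prefix) else ref
    return {
        "provider": "gitlab",
        "kind": "push",
        "branch": branch,
        "commits": [{"hash": c["id"], "message": c["message"]} for c in commits],
    }


def translate_bitbucket_push(payload, fetch_all_commits):
    """Convert an (abridged) Bitbucket push payload to the assumed common format."""
    change = payload["push"]["changes"][0]  # sketch: handles only the first change
    commits = change["commits"]
    if change.get("truncated"):
        commits = fetch_all_commits(payload)
    return {
        "provider": "bitbucket",
        "kind": "push",
        "branch": change["new"]["name"],
        "commits": [{"hash": c["hash"], "message": c["message"]} for c in commits],
    }
```

Both functions return dictionaries with the same keys, so everything downstream can ignore which provider sent the event.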
This layer implements the rules for processing events, converting the events it receives into a set of declarations about the state the database should be in once we are done processing the event. Each declaration includes a hint indicating whether the data should always be transacted to the database (an upsert) or only if the entity already exists (an update). The layer doesn't apply these declarations against the database; it just passes them on to the next layer for further processing.
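As a rough illustration of "events in, declarations out", the Python sketch below turns a common-format push event into a list of declarations. The declaration fields, the `mode` hint values, and the `[sc-123]` reference pattern are assumptions chosen for the example rather than our actual rules.

```python
import re

# Assumed syntax for referencing a Story from a commit message, for illustration only.
STORY_REF = re.compile(r"\[sc-(\d+)\]")


def declarations_for(event):
    """Turn a common-format push event into declarations about desired database state.

    This function only describes the desired state; it never touches a database.
    """
    declarations = []
    for commit in event["commits"]:
        story_ids = sorted({int(n) for n in STORY_REF.findall(commit["message"])})
        if not story_ids:
            continue  # commits that reference no Story produce no declarations
        # upsert: the commit should exist afterwards whether or not we had seen it before.
        declarations.append({
            "mode": "upsert",
            "entity": "commit",
            "id": commit["hash"],
            "attrs": {"message": commit["message"], "stories": story_ids},
        })
        # update: only touch Stories that already exist.
        for story_id in story_ids:
            declarations.append({
                "mode": "update",
                "entity": "story",
                "id": story_id,
                "attrs": {"has_commits": True},
            })
    return declarations
```

Because the function is pure, the rules can be tested by comparing its output against expected declarations, with no database involved.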
There is a single implementation of this layer that is used for events from all of the providers, since the translators convert the webhook events to a common format.
This layer takes the declarations and turns them into a Datomic transaction, then applies that transaction. It takes advantage of a few handy Datomic features:
When this layer has finished processing an event, we're done with it, and we delete the message from SQS.
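The upsert/update distinction and the "delete the message only once the write is done" ordering can be sketched in Python as follows. A plain dictionary stands in for the database and a list stands in for the queue, so this shows the semantics only and says nothing about how Datomic transactions are built; all names are hypothetical.

```python
def apply_declarations(db, declarations):
    """Apply every declaration to `db` (a dict keyed by (entity, id)) as one all-or-nothing step."""
    # Stage changes on a copy so that a failure partway through leaves `db` untouched.
    staged = {key: dict(attrs) for key, attrs in db.items()}
    for decl in declarations:
        key = (decl["entity"], decl["id"])
        if decl["mode"] == "update" and key not in staged:
            continue  # update-only declarations never create entities
        staged.setdefault(key, {}).update(decl["attrs"])
    db.clear()
    db.update(staged)


def finish(db, queue, message, declarations):
    """Write the declared state, then remove the message from the queue."""
    apply_declarations(db, declarations)
    # Reached only if the write succeeded; otherwise the message stays queued for a retry.
    queue.remove(message)
```

If `apply_declarations` raises, the message is never removed, so it will be seen again later; that is safe as long as processing is idempotent.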
We learned a few things while building this subsystem:
This system is the first significant project on which we have used clojure.spec, and it proved useful in a few ways:
Overall, having specs has increased our confidence in making changes. The specs for the webhook payloads have been especially valuable, since it is difficult to capture all the permutations of the providers' behavior in tests.
However, we found a few trade-offs when using clojure.spec:
We use structured logging throughout our backend subsystems, and log both branches of crucial decision points within the subsystem. We tie all of the log messages for an event together with a key that is generated when we first receive the webhook payload, so that we can see all the messages for a given payload with one query. That lets us easily see how we processed a given payload, which helps our support team explain the system's behavior when a customer has a question about it.
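A minimal Python sketch of this logging pattern is shown below. The field names, the `make_logger` helper, and the list used as a log sink are assumptions for illustration; a real system would write to its logging pipeline instead.

```python
import json


def make_logger(sink, event_key):
    """Return a log function that stamps every record with the same event key."""

    def log(message, **fields):
        record = {"event_key": event_key, "message": message, **fields}
        sink.append(json.dumps(record, sort_keys=True))

    return log


def link_commit(log, commit):
    """Toy decision point that logs both of its branches."""
    if commit.get("story_id") is None:
        log("commit skipped", reason="no story reference", commit=commit["hash"])
        return False
    log("commit linked", story_id=commit["story_id"], commit=commit["hash"])
    return True
```

Because every record carries the same `event_key`, filtering on that one field returns the full history of how a payload was handled, including the decisions that resulted in no action.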
Storing the payloads in S3 made debugging faults much more straightforward - when we logged an error or observed unexpected behavior, we could inspect the S3 object containing the payload that triggered it to identify the cause. We could then turn that payload into a test input to ensure the fault was corrected.
Much of the work here was an experiment with techniques and technologies that were new to our overall system. Overall, we consider this a successful experiment and have already applied some of the lessons learned to other subsystems within Shortcut. We also now have a subsystem that is easier to understand (both operationally and at the code level), easier to debug, and easier to maintain. If you have any thoughts or questions on this system, we'd love to hear them in the comments, or on Twitter!