Scaling Faire’s CI horizontally with Buildkite, Kubernetes, and multiple pipelines


Breaking apart monolithic continuous integration in our Kotlin monorepo

Ben Poland


Continuous Integration (CI) is a critical part of developers’ workflows, and at Faire it’s no different: the speed and reliability of our CI system directly affects how quickly our engineering team can deliver new features to our wholesale marketplace. As Faire has expanded to serve more customers, so has the Kotlin monorepo used by our backend engineers. The number of services in the repo and number of automated tests have both grown by over 10x over the past five years, and we’ve faced our share of CI problems trying to keep up. 😅

In this article I’ll dive into the challenges we were facing with our Jenkins CI instances and monolithic pipeline architecture, lay out the changes to process and tooling that we are implementing to solve them, look at the impact we’ve had so far on our developer experience (and ultimately, our engineering velocity), and share what we’ve learned while scaling our CI horizontally with Buildkite, Kubernetes, and multiple pipelines.

Our CI beginnings

In the early days at Faire, a single Jenkins instance was more than enough to handle things. We used its Kubernetes plugin to run jobs in a single cluster and had a single pipeline for our whole backend Kotlin monorepo to build, test, and lint the handful of services inside.

But monorepos don’t tend to get smaller as time goes on.

They just. Keep. Growing…


📈 The number of commits in our backend repo (since a git history rewrite in early 2018).

Jenkins couldn’t keep up

Eventually, we hit the limits of what one Jenkins controller could handle — managing too many builds with too many Kubernetes pods at the same time slowed it down to a crawl. So we set up a second instance (one for main branch builds and another for pull requests), and eventually we needed a third one.

Sure, this worked — but more Jenkins controllers meant more updates to install and more downtime to schedule (since Jenkins isn’t highly available 😭). We built Terraform code to set up and manage these Jenkins instances, but it became clear that this wasn’t the right solution for us in the long term.

Our monster monolithic pipeline

At the same time, our one Jenkinsfile to rule them all had grown and grown. Making changes to it was complex and error-prone, and with more services and more checks it became harder and harder to follow what was going on in the Jenkins UI. When something did fail and our developers couldn’t tell what had happened, they had a habit of just retrying it and hoping for the best (okay, yes — we all do it from time to time, but it’s certainly not ideal 😉). And a breakage in one service would block deployments for all of them (😱).


This is what our Jenkins pipeline looked like, even after some simplifications (some steps blurred to protect the innocent). That’s a lot to sort through if something goes wrong…

Time for a change

We recognized that we needed to make some big changes to set ourselves up for the next five years and beyond. We set out to tackle two key issues:

  1. Level up our CI platform to provide a great developer experience and keep up with our growth for the foreseeable future, without the maintenance burden
  2. Break down our monolithic CI pipeline architecture into more manageable pieces, making it more efficient and understandable and avoiding shared fate (unlocking more frequent deployment of our services)

Let’s jump in and explore each one in more detail!

Project Farpoint

We knew Jenkins wasn’t cutting it, and kicked off a project to look at alternatives. The goal was to “pilot the next generation of CI at Faire” and as a big Star Trek fan, I named it Project Farpoint after “Encounter at Farpoint,” the pilot episode of Star Trek: The Next Generation.

The first step was to nail down all our requirements (as always, easier said than done 😀). We reached out to development teams across Faire to get input and ended up with a long list, but here are some of the highlights:

Technology/Infrastructure:

  • Good Kubernetes support (and Macs for our iOS builds)
  • Ability to run tasks in parallel and configure dependencies between them

Reliability/Maintainability:

  • Highly available; any updates can be performed with zero downtime
  • Scalable far beyond our current needs without manual intervention
  • Fault-tolerant, with auto-retry/self-healing capabilities
  • Configurable with infrastructure as code tools we already use (e.g. Terraform)
  • Good support available in case of issues

Observability

  • Metrics available for job/step times, ideally integrating easily with our existing observability tooling (Datadog)
  • Ability to instrument custom metrics and tagging

Developer Experience

  • Modern, customizable UI which makes it easy to find the cause of any failures
  • Good documentation and IDE support when writing/editing pipelines

Security

  • SSO support with differentiated access control
  • Good isolation of main builds from pull requests
  • Audit logging and change management controls

With a long list of criteria in hand, we started with the age-old question: build or buy? Faire isn’t (yet!) at the scale of the huge players out there who default to building it themselves — we tend to leave that to situations where what we need isn’t available, or when we need more control and customization than something off-the-shelf can provide. In this case, CI is a common enough need that we decided to focus on third-party solutions.

Buildkite

After taking a look at several options out there, we homed in on Buildkite as a great fit. It checked all our boxes and we’d heard good things from folks at companies like Shopify, Uber, Pinterest, Airbnb, and PagerDuty. We got a test repo set up to see how the pieces fit together, validate our assumptions, and make sure that Buildkite worked the way we expected.

We already used Kubernetes for CI with Jenkins, and decided to use a similar approach for Buildkite using their agent-stack-k8s. While not everything is the same, we got things working relatively quickly by reusing container images and some configuration from Jenkins.


We got the test repo working well and we were excited to keep going on our journey. But big changes like this can be scary, and it’s important to be able to make incremental progress without a huge upfront investment in time and effort. So we started planning out a more in-depth proof of concept (POC), which brought us back to the second issue above: our massive single Jenkins pipeline.

The pipeline that couldn’t (any more)

We had already identified the monolithic CI pipeline as an issue, and knew we needed to break it apart into smaller pieces — which we named “poly-CI.” It made sense to tackle both problems by moving one small piece of the puzzle over to Buildkite as our POC. We chose a service in our monorepo to be the guinea pig (let’s call it foo), and got to work.

Our Kotlin monorepo uses Gradle as a build tool, and we first needed to reconfigure it and update our custom plugins to support a new poly-CI world: instead of building, testing, and linting everything all at once, we needed to be able to do it only for the foo service. We made use of Gradle's --project-dir to focus commands on a part of our repo where possible. For aggregate tasks, we set up a Gradle system property to accept filters and a Project extension function to check whether to include each project or not:

import org.gradle.api.Project

val Project.shouldIncludeFromMonorepo: Boolean
    get() {
        val directoryFilters: List<String> = providers
            .systemProperty("monorepoDirectoryFilters")
            .orElse("")
            .get()
            .split(",")
            .filter { it.isNotEmpty() }

        if (directoryFilters.isEmpty()) {
            return true
        }

        val inclusions = directoryFilters.filter { !it.startsWith("-") }
        val exclusions = directoryFilters.filter { it.startsWith("-") }
            .map { it.removePrefix("-") }
        val projectRelativePath = path.replace(":", "/").removePrefix("/")

        // Exclusions take precedence
        if (exclusions.any { projectRelativePath.startsWith(it) }) {
            return false
        }

        if (inclusions.isNotEmpty()) {
            // If there is any inclusion then the project must match
            // one of the inclusions to be included.
            return inclusions.any { projectRelativePath.startsWith(it) }
        }

        return true
    }

By configuring the relevant tasks to check this filter, we could include only the foo service in Buildkite with a -DmonorepoDirectoryFilters="foo" flag on our Gradle commands, and include everything EXCEPT foo in Jenkins with -DmonorepoDirectoryFilters="-foo". (Note: This filtering worked during the migration but we’re hoping to eliminate it after we’re done, since it can increase Gradle configuration time by bloating the number of tasks.)
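
The exact wiring varies by task, but here is a rough sketch of what “configuring the relevant tasks to check this filter” can look like in a Gradle Kotlin DSL convention plugin. This is an illustration rather than our exact plugin code, and it assumes the extension property above is on the build classpath:

// Illustrative convention plugin applied to each module
val included = project.shouldIncludeFromMonorepo

tasks.withType<Test>().configureEach {
    // Skip this module's tests entirely when the monorepo filter excludes it
    onlyIf { included }
}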

Starting slow

With that out of the way, we started shadow runs of our new Buildkite pipeline alongside Jenkins. Although these builds were kicked off automatically, they didn’t report anything back to GitHub and nobody was watching the results except our team. This allowed us to validate the end-to-end flow, see how different parts of the system behaved under more load, and iterate quickly on the solution without any chance of blocking developers.

We also took this time to set up CI metrics and alerting that were in line with, or better than, what we had with Jenkins, using Buildkite’s integration with Datadog for our observability tooling. This gave us a bunch of data out of the box, and allowed us to extend that with custom tags, measures, and spans.

We kept a close eye on all the metrics to ensure that non-code-related failures, or what we call “CI infrastructure problems,” were no higher than with Jenkins (in fact, they were significantly lower — more details below 😍). Once we were ready, we enabled reporting Buildkite status checks to GitHub but kept them non-blocking, still running in parallel with Jenkins. We asked the foo service developers to try Buildkite when tracking down any CI failures, and continued to make adjustments based on their feedback.

After ironing out the last kinks, we did a final check-in with the foo service developers and got strongly positive feedback. We made the decision to switch off Jenkins for the foo service and make Buildkite its primary CI system. This was an exciting milestone but we still had plenty of work ahead!

Ramping up, but staying focused

After a successful POC, we decided to roll out Buildkite to more and more of our services, implementing the poly-CI approach as we went. But since most code changes are only to a single service, running the poly-CI pipelines for all services on every change would be a waste of time and money.

So, we designed our pipelines with a single “trigger” build that gets kicked off by GitHub webhooks, which then:

  1. Determines which child pipelines are affected by the code change, using git merge-base and Gradle to handle changes to dependencies within the repo
  2. Triggers the necessary pipelines
  3. Aggregates the status of all those triggered pipelines into a single GitHub status check, which is required before PRs can merge

This provides a nice overview of all the pipelines that are running in the Buildkite UI, and allows diving into each one for more details. If one child fails due to flakiness (yes, it does still happen every once in a while…) then that single pipeline (or actually the single failing step within that pipeline) can be retried, and the trigger pipeline reflects the updated status automatically. This is a huge improvement from our monolithic Jenkins pipeline where any flaky failures caused the entire build to fail, meaning a rerun of everything (including things that had already passed).
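
Under the hood, the trigger build is just a step that figures out what changed and dynamically generates Buildkite trigger steps for the affected child pipelines, which can then be piped into buildkite-agent pipeline upload. Here’s a rough Kotlin sketch of the idea; the pipeline slugs, helper names, and directory-to-pipeline mapping are illustrative, and the real implementation also consults Gradle’s dependency graph to catch changes to shared libraries (which this sketch skips):

// Hypothetical mapping from top-level service directories to Buildkite pipeline slugs
val pipelineForService = mapOf(
    "foo" to "backend-foo",
    "bar" to "backend-bar",
)

fun exec(vararg cmd: String): String =
    ProcessBuilder(*cmd).start().inputStream.bufferedReader().readText().trim()

fun changedServices(baseBranch: String = "origin/main"): Set<String> {
    // Diff against the merge base with main to see which top-level directories changed
    val mergeBase = exec("git", "merge-base", baseBranch, "HEAD")
    return exec("git", "diff", "--name-only", mergeBase, "HEAD")
        .lines()
        .mapNotNull { it.substringBefore("/", missingDelimiterValue = "").ifEmpty { null } }
        .toSet()
}

fun main() {
    // Emit one Buildkite trigger step per affected pipeline; in a real pipeline this
    // output would be piped to `buildkite-agent pipeline upload`
    val steps = changedServices()
        .mapNotNull { pipelineForService[it] }
        .joinToString("\n") { slug ->
            """
            |  - trigger: "$slug"
            |    build:
            |      commit: "${'$'}BUILDKITE_COMMIT"
            |      branch: "${'$'}BUILDKITE_BRANCH"
            """.trimMargin()
        }
    println("steps:\n$steps")
}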


In Buildkite, each affected child pipeline is shown with its result. Clicking on one of them jumps into its build page, showing more details about that specific pipeline. Much simpler!

Hiccups along the way

The groundwork we had laid with the foo service allowed us to move quickly, but (of course) it wasn’t all smooth sailing. The rollout hit plenty of smaller issues that we resolved without much trouble, but two more complex ones stood out: scaling problems with our Kubernetes cluster and rate limiting from GitHub. Here’s how we worked through both of them.

Scaling Kubernetes for a CI workload

We use a managed cloud Kubernetes offering and ended up stretching the limits of what a single cluster can handle. A ton of pods start and stop all the time, and the number of nodes needed swings quite a bit throughout the day, dropping to zero overnight. One consequence of the new poly-CI architecture is that we went from a smaller number of large pods running the monolithic CI to many more small, short-lived pods. While this didn’t cause any issues during our POC with just the foo service, we did start to run into problems as we rolled Buildkite out more widely.

The first problem ended up being Kubernetes rate limiting itself — the sheer number of Kubernetes API requests from the flurry of activity at high load was slowing down the whole cluster and delaying pod creation. We started looking at the API Priority and Fairness options, but then we hit a space issue with the etcd database (which Kubernetes uses as its backing data store) in one of our clusters. As a result, we’ve reworked our poly-CI strategy a bit to reduce the number of pods and also spread the load across several clusters to get things in much better shape.

GitHub rate limiting

A clean checkout of our Kotlin monorepo is several gigabytes, and while we could help shrink it by rewriting history to remove large files or move them to git LFS, that’s disruptive and would come with bandwidth and storage costs to worry about. We used Buildkite’s git mirrors functionality to keep a local copy of our git repo on each Kubernetes node, but found that during high load, checking out the mirror copy of the repo was taking up to 15 minutes (!?) instead of the normal 60–90 seconds.

We started looking at performance bottlenecks inside our Kubernetes cluster — disk, network, et cetera. But we also reached out to GitHub to see if they had any data that could help track down the problem. As it turns out, we were hitting some hidden throttling on their side.

To solve this problem, we ended up keeping a copy of the repo in object storage (updated daily) and pre-seeding new Kubernetes nodes from there when they start. This allows the Buildkite mirror to start with 95% or more of the repo already present, and vastly reduces the amount of data fetched from GitHub to avoid the throttling entirely. Our median git mirror checkout time on new Kubernetes nodes is now 10 seconds, even faster than it was before the issue. 🥹
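
The pre-seeding itself runs as part of node bootstrap (normally a shell script), but the logic is simple enough to sketch. Here’s an illustrative Kotlin version, assuming a daily tarball of the bare repo sitting in an object storage bucket; the paths, bucket name, and tooling are assumptions, not our actual setup:

import java.io.File

// Illustrative locations; the real mirror path and snapshot tooling differ
val mirrorDir = File("/var/lib/buildkite/git-mirrors/backend.git")
val seedUrl = "s3://example-ci-artifacts/git-seeds/backend.tar.gz"

fun run(vararg cmd: String) {
    val exit = ProcessBuilder(*cmd).inheritIO().start().waitFor()
    check(exit == 0) { "Command failed: ${cmd.joinToString(" ")}" }
}

fun main() {
    if (mirrorDir.resolve("HEAD").exists()) return // mirror already seeded on this node

    mirrorDir.mkdirs()
    // Pull down the daily snapshot of the bare repo and unpack it in place
    run("aws", "s3", "cp", seedUrl, "/tmp/backend-seed.tar.gz")
    run("tar", "-xzf", "/tmp/backend-seed.tar.gz", "-C", mirrorDir.absolutePath)

    // Fetch only what has landed since the snapshot was taken: a small delta
    // from GitHub instead of a multi-gigabyte clone
    run("git", "-C", mirrorDir.absolutePath, "fetch", "--prune", "origin")
}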

Impact

While we still have a few pieces of our CI left to migrate from Jenkins, we’re already seeing some great results from the combination of Buildkite and our poly-CI strategy. To start, our CI infrastructure failure rate (the share of builds that fail for reasons unrelated to the code being changed) is significantly lower on Buildkite than on Jenkins.


Infrastructure failure percentage. Yes, Jenkins has had a few rough patches (and Buildkite did too when we hit the Kubernetes and GitHub rate limiting issues in October and December 🙁)

This means that when CI fails, our developers can start to believe that it might actually be due to their change. 😉

Another way to see this impact is in the number of manually retried builds, which is far lower on Buildkite than Jenkins (even though the volume of builds is far higher). Our developers are learning that just mashing the retry button isn’t a good first step anymore, and are better able to find the cause of the issue because the poly-CI pipeline is easier to understand.


Number of manually retried builds. As we’ve shifted more things off Jenkins, its usage has dropped and the number of retries has dropped too.

We’re also seeing average pull request wait times on CI drop by 50% or more. This means a faster feedback loop and less sword fighting. ⚔️


Average pull request CI wait time. Buildkite has increased a little as we’ve migrated some of our larger projects, but has less overhead and allows us to parallelize better!

Lastly, the feedback has been very positive, like this from one of our developers:

I’ve been blocked by bugs in Jenkins / waiting on Jenkins several times this week. Can we drop it yet 😅.

We’re almost there! We plan to hold a retirement party for Jenkins any week now. 😜

Lessons learned

Looking back, here are some big things we’re taking away from this project that may be helpful for other teams considering a similar migration:

CI pipelines always involve some emergent design. Without focused effort, they will become bloated and confusing over time as more things get bolted on.

Monorepo doesn’t mean monolithic. It’s important to define and enforce sensible divisions so developers stay productive, able to build and test smaller pieces of the repo independently in CI (and locally too!).

Don’t be afraid to take big swings. Sometimes incremental change just isn’t good enough to keep up. Don’t be constrained by the sunk cost fallacy.

Get feedback on your changes as early as possible. Find a proof of concept or MVP that starts adding value early and builds confidence in the overall project.

While this article focused on our backend Kotlin monorepo, work is also well underway to move our frontend and mobile CI to Buildkite (following a similar poly-CI strategy). We’re excited to get these improvements in all our developers’ hands! Stay tuned for more updates from the Faire Engineering team coming soon. 📣

Big thanks to Jack Wang and our entire developer productivity team for all their contributions to this project!

Want to join an Engineering team working to develop solutions to complex, interesting problems, while also supporting the unique character of local neighborhoods? Apply for open roles at Faire: https://www.faire.com/careers