You Should Make Your Build Agents Bigger

Someone once told me that a lot of blog posts are about arguments the author already lost. This is one.

Here is my argument: if your build agents do not have enough surplus compute to handle your biggest pipeline, it is a false economy to resist scaling them vertically. If even one team in your company needs heavy compute at build or deployment time, the whole company is better served by vertically scaling the build agents than by having that team allocate its own servers to absorb transient load. Idle machines meant to soak up load spikes could instead be apportioned to the build agent cluster, speeding up builds for every team rather than one. And once you have two such teams, you’ve reduced the number of autoscaling groups you need to tune to one.

As a bit of background: my previous employer, who runs a large-scale SaaS system, had a reasonably large operations department. The company had been around since the DotCom Era, and as you might expect they had invented a lot of tools in house for problems that had no industry answer at the time, or only an incomplete one. Even after extensive consolidation there were about a dozen teams all sharing a dramatically large CI/CD installation. My department alone had a couple hundred build plans. It’s… a lot.

At my boss’s encouragement, I had been trying to slay a dragon for a while. Which turned into two dragons, one relatively easy to slay and one that was not. Our team was responsible for three clusters of servers, and we were interested in improving cost per request. One of them was a cache. That one was not going anywhere, since it reduced the size of our main cluster (and several other teams’ clusters) and improved our SLAs. If anything, that cluster should have been growing and it wasn’t. Then there was our main cluster. I don’t feel the need to explain that much at all. And the last cluster was our accidental dragons.

The problem was that we had spent years trying to decouple the UI and customization folks so they could fix UI and typesetting concerns on their own timeline instead of being beholden to when operational and backend deployments were happening. Those sorts of signoffs don’t scale. The more people you need in the room to accept or veto a deployment, the worse it gets. So we had built them a cluster to avoid Thundering Herd problems if they deployed during peak or even moderate traffic, which could slam our systems calculating things like templates or CSS. The solution to these problems is always to calculate everything you can at build time, at deploy time if you can’t, and lazily if neither is an option. That last one is Plan C.

We were trying to get from Plan C to Plan B, which essentially meant generating our own Thundering Herd and offloading it onto another cluster that wasn’t serving online traffic. But through gyrations of intended-but-unsupportable functionality, we ended up with a cluster running one service that served live traffic and another that served this batch processing system. There are any number of problems with that. If a batch service takes out a production service, that’s an outage. So the first thing I did was heavily traffic-shape the batch process so that the cluster was never overtaxed, which as a consequence also guaranteed that the load it put on other services never exceeded 10% of their normal organic traffic. That saved our butts when those teams started experimenting with autoscaling.
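If you’ve never had to do this, the traffic shaping itself doesn’t have to be fancy; something as dumb as a token bucket in front of the request loop will do. Here’s a minimal sketch of that idea in Python. The rate numbers and the stubbed-out submit call are made up for illustration, not what we actually ran.

```python
import time

class TokenBucket:
    """Token-bucket throttle: roughly `rate` ops/sec on average, bursts capped at `capacity`."""
    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self) -> None:
        # Refill based on elapsed time, then take one token or sleep until one is available.
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

def submit(job: int) -> None:
    # Stand-in for the real batch request against the downstream service.
    print(f"submitting job {job}")

bucket = TokenBucket(rate=5, capacity=2)  # illustrative limits, not the real ones
for job in range(20):
    bucket.acquire()
    submit(job)
```

The point is simply that the batch producer, not the downstream cluster, decides the ceiling.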

But the problem of mixing two classes of traffic on one machine was still there, and since the amount of data that service was actually returning was embarrassingly small in the first place, I ended up replacing the whole thing with a long-polled Consul lookup, so that we weren’t fetching the data at a rate proportional to inbound requests. That also cut TTFB by 5%, because many, many requests were predicated on the result of that one query, and every one of them stalled while it was in flight.
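For anyone who hasn’t used it, the trick is Consul’s blocking queries: you pass along the last index you saw and the HTTP call just parks until the value changes or the wait window expires, so you fetch at the rate the data changes rather than the rate requests arrive. A rough sketch of the shape of it, not our actual code; the key name, agent address, and timings here are invented.

```python
import base64
import requests

CONSUL = "http://127.0.0.1:8500"        # assumes a local Consul agent
KEY = "config/render-settings"          # hypothetical key name

def watch(key: str):
    """Long-poll a Consul KV key: each GET blocks until the value changes
    (or the wait window expires), yielding the new value when it does."""
    index = 0
    while True:
        resp = requests.get(
            f"{CONSUL}/v1/kv/{key}",
            params={"index": index, "wait": "5m"},
            timeout=6 * 60,  # a bit longer than the server-side wait
        )
        resp.raise_for_status()
        index = int(resp.headers["X-Consul-Index"])
        value = base64.b64decode(resp.json()[0]["Value"])
        yield value

for value in watch(KEY):
    print("config changed:", value)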

Between those two operations I was able to shrink the cluster by 75%. And yet still it sat around using about 4% of our CPU count. Taunting us. Taunting me.

The cluster only existed for two reasons. First, while it was a relatively Embarrassingly Parallel problem, we needed a lot more CPUs than the build agent had. Second, the first step of the task involved downloading a fairly large tarball and having the contents available to answer requests, so Lambda just wasn’t going to cut it, and autoscaling would only kick in once the task was a third to a half done. Not a lot of benefit for a lot more churn. But even here, the code that needed to run was all in libraries. I could in theory run it on the build agent, only have one machine grabbing the files instead of a bunch, and the lack of network overhead and the improvements in cache consolidation would reduce the cost per job. So I tried. And tried.
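For the curious, the shape of “one machine grabs the tarball, then fans the work out across local cores” is roughly this. It’s a hedged sketch, not the real job: the URL, the per-file work, and the file layout are all placeholders.

```python
import tarfile
import tempfile
import urllib.request
from multiprocessing import Pool
from pathlib import Path

TARBALL_URL = "https://example.com/assets.tar.gz"  # placeholder, not the real artifact

def process_item(path: Path) -> int:
    # Stand-in for the real per-file work (the library calls the service used to make).
    return len(path.read_bytes())

def main() -> None:
    workdir = Path(tempfile.mkdtemp())
    archive = workdir / "assets.tar.gz"
    # One machine downloads and unpacks the tarball exactly once...
    urllib.request.urlretrieve(TARBALL_URL, str(archive))
    with tarfile.open(archive) as tf:
        tf.extractall(workdir)
    files = [p for p in workdir.rglob("*") if p.is_file() and p != archive]
    # ...then the embarrassingly parallel part fans out across every core the agent has.
    with Pool() as pool:
        results = pool.map(process_item, files)
    print(f"processed {len(results)} files")

if __name__ == "__main__":
    main()
```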

I turned out to be right, but not right enough. One build agent had a fraction of the CPUs of my cluster, and once I eliminated some dumb mistakes that made the PoC look better than reality, it took about 2.5x as long to complete the task. I kept trying different tricks and tuning parameters but was never able to climb the hill. It barely budged. I needed those extra cores to hit the time limits built into our SLAs.

Operations indicated that they wouldn’t provide multiple kinds of build agents for the general compute pool because it was too much overhead and the size they had chosen was Good Enough.

Here’s the thing I figured out too late. My cluster was essentially the same size as the build agent pool. If I ‘donated’ those CPUs to the build pool, everyone else’s builds could start sooner or run faster: twice as many agents means less queuing, or doubling the size of each agent means more memory per build, which speeds things up even when the build isn’t parallelized per core. Either way we would have a Race to Idle situation, and the cost of that hardware would be amortized across the entire division rather than borne by my team alone.
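If you want to put even rough numbers on the queuing half of that claim, the classic Erlang C formula makes the point: the same amount of build demand against a pool twice the size queues dramatically less often. The numbers below are invented purely for illustration.

```python
from math import factorial

def erlang_c(servers: int, offered_load: float) -> float:
    """M/M/c probability that an arriving job has to wait in queue."""
    a = offered_load  # offered load in Erlangs (arrival rate x mean build time)
    assert a < servers, "system must be stable"
    top = (a ** servers / factorial(servers)) * (servers / (servers - a))
    bottom = sum(a ** k / factorial(k) for k in range(servers)) + top
    return top / bottom

# Invented numbers: ~7 Erlangs of build demand against the old pool versus one
# twice the size after 'donating' my cluster's CPUs to it.
demand = 7.0
for agents in (10, 20):
    print(f"{agents} agents: P(a build has to queue) = {erlang_c(agents, demand):.1%}")
```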

Advantages

A larger build instance will encourage people to do more work earlier in the build-test-deploy cycle, especially if they can parallelize that work. And in the limit, it may remove the need for a team to spin up servers or lambdas just to access more compute or memory than the build agent has. Instead those resources can be shared across multiple teams, making everyone’s build pipelines a bit faster instead of only one team’s. Any batch job that runs exclusively on the build agent represents a substantial reduction in moving parts in the system. Such builds are more reproducible locally, since all, or at least much more, of the activity happens on a single machine. And they look self-similar to their siblings, which reduces the amount of Surprise in the system.

Build servers already know how to schedule tasks, to prevent multiple jobs from running at once, to chain dependent jobs on the completion of previous steps. To send email on failure. To send chat alerts when something goes wrong. To track success rates. What are you gaining by squeezing out a few measly dollars an hour and making people build their own bespoke implementation, or forcing a single task to become several parallel ones that have to be juggled? Not a damn thing.

If I were still there, I would have had this conversation a long time ago, and likely would have won. Particularly if I could find anyone else in our same boat, hoarding servers they were using in fits and starts.
