Introducing Faceoff

Faceoff 1.1

I created faceoff to fill the hole left by the dormant benchmark-regression project.

The 1.1.0 release fulfills most of the goals I set out to achieve, though there are a few new things I hope to land in the next month or two.

Performance is a Feature

Faceoff, like benchmark-regression before it, is focused on preventing regressions in pull requests by comparing the code in your branch to the same code in previous releases. Where Faceoff differs is that it can also compare a feature branch to release branches, or to your trunk build. This is particularly handy on projects where releases collect a number of PRs into a single build instead of practicing Continuous Integration.

But collecting this data also has value after the fact. How often have you been told that the application got slower “sometime in the last couple of weeks,” and now you are tasked with figuring out which of the last 45 commits or 120 dependency upgrades may have introduced the regression? Being able to look back through build history for evidence of the problem can take a half-day investigation down to minutes.

The output looks a little something like this:

constructors ⇒ new Registry()

Summary (vs. baseline):
 ⇒ latest                    ▏█████████████████████████▕ 14,300,744 ops/sec | 13 samples (baseline)
 ⇒ trunk                     ▏██████████████████████▌──▕ 12,946,295 ops/sec | 11 samples (1.10x slower)
 ⇒ #perf/keys                ▏██████████████████████▌──▕ 12,965,191 ops/sec | 10 samples (1.10x slower)

constructors ⇒ new Counter()

Summary (vs. baseline):
 ⇒ latest                    ▏████████████████████████─▕ 1,137,460 ops/sec | 119 samples (baseline)
 ⇒ trunk                     ▏████████████████████████▌▕ 1,153,108 ops/sec | 96 samples (1.01x faster)
 ⇒ #perf/keys                ▏█████████████████████████▕ 1,170,779 ops/sec | 95 samples (1.03x faster)

util ⇒ LabelMap.keyFrom()

Summary (vs. baseline):
 ⇒ trunk                     ▏█████████████████████████▕ 7,195,488 ops/sec | 13 samples (baseline)
 ⇒ #perf/keys                ▏████████████████████████▌▕ 7,180,307 ops/sec | 11 samples (1.00x slower)


Inconclusive Tests:
------------------------

constructors ⇒ new Counter()
 ⇒ latest                    ▏████████████████████████─▕ 1,137,460 ops/sec | 119 samples (baseline)
 ⇒ trunk                     ▏████████████████████████▌▕ 1,153,108 ops/sec | 96 samples (1.01x faster)
 ⇒ #perf/keys                ▏█████████████████████████▕ 1,170,779 ops/sec | 95 samples (1.03x faster)


Performance Regressions:
------------------------

constructors ⇒ new Registry()
 ⇒ latest                    ▏█████████████████████████▕ 14,300,744 ops/sec | 13 samples (baseline)
 ⇒ trunk                     ▏██████████████████████▌──▕ 12,946,295 ops/sec | 11 samples (1.10x slower)
 ⇒ #perf/keys                ▏██████████████████████▌──▕ 12,965,191 ops/sec | 10 samples (1.10x slower)

Background

Last winter, I became involved in the prom-client project. I did a lot of telemetry work at my last job, and while looking for ways to contribute back, I noticed that many of the open issues in their backlog were about performance and memory issues. Things I know quite a bit about. So I started filing PRs to address some of them.

prom-client’s code base introduced me to benchmark-regression, which wraps benchmark.js, a tool I was quite familiar with, as it was instrumental in helping me break a performance log-jam that had existed for most of my tenure at my last job. With benchmark.js’s help, I was able to make new feature development substantially cheaper and let my team start making net improvements in response time, instead of constantly having our wins zeroed out by the next major initiative.

I quickly discovered that benchmark.js had been end-of-lifed in April of ’24, and that benchmark-regression hadn’t landed a PR in 7 years. They both essentially come from a time before async code, and like many such libraries they struggled to adapt.

I didn’t mean to write a replacement for these tools, but as helpful as benchmark-regression was, I was running into logistical issues working on prom-client. The problem is that prom-client is a fairly mature project. Mature codebases don’t generally respond quickly to some new nutbar filing half a dozen PRs in as many weeks. And this experience was shining a pretty harsh spotlight on the areas where benchmark-regression was good, but not good enough.

Even though they graciously accepted a number of my PRs rather quickly, they weren’t issuing new releases. benchmark-regression assumes a world where we only want to compare our branch to a specific release of the module, whereas on a project that sees many more commits than releases, what we want is to compare our branch to HEAD. To the trunk build. That way we aren’t undoing work by our fellow contributors, nor are we double-counting their improvements.

I was in the middle of a rather enthusiastic campaign of optimizations, and I discovered very quickly that benchmark-regression was struggling to keep my PR descriptions objectively honest. It was getting harder and harder to keep track of how branch A affected performance versus branch B, which had since been merged and which my branch was now rebased on top of to resolve conflicts. These are the same problems any two or three contributors would encounter trying to file PRs in parallel, and it wasn’t long before someone ran into just that.

These are the sorts of ergonomics issues that cause developers to simply stop trying. It’s a source of learned helplessness I’ve been trying to fight my entire career. It’s one of the reasons I lean so heavily on CI/CD, with an emphasis on being able to reproduce CI errors locally. You can clean up after yourself. You have all of the tools you need to do this. Please use them.

Not finding any alternative to benchmark-regression with feature parity, I figured it was time for me to do something about it. I poked around for alternatives to benchmark.js and stumbled upon bench-node, which was still under active development but already supported async tests, and was created by someone established in the Node.js ecosystem. It already had much better charts for my purposes, and it looked like it could be bent to fit the model I was using. As of a couple of weeks ago, I am now also a maintainer on bench-node. So I’ve got that going for me. Which is nice.
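
For a sense of why it appealed to me, here is a minimal sketch of an async benchmark written against bench-node’s Suite API as I understand it; the timed body below is just a stand-in, not prom-client code:

  // With benchmark.js-era tooling, async work needed "deferred" callbacks;
  // with bench-node the benchmark body can simply be an async function.
  const { Suite } = require('bench-node');

  const suite = new Suite();

  suite.add('resolve a promise', async () => {
    await new Promise((resolve) => setImmediate(resolve));
  });

  suite.run();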

First Customer

I designed faceoff to run prom-client’s existing regression tests with only minor changes. It is a nearly drop-in replacement for benchmark-regression, and that has gone pretty well. A PR to use it for prom-client was merged the day before Thanksgiving. It expands on benchmark-regression to support git URLs for version numbers, and it focuses on three-way comparisons between your working directory, the project’s trunk, and your current branch.
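
By git URLs I mean the same specifiers npm already understands for dependencies, where a branch, tag, or commit rides after the hash. The repository below is prom-client’s, and the branch names are only there to show the shape of the specifier, not faceoff’s documented configuration:

  "prom-client": "github:siimon/prom-client#master"
  "prom-client": "git+https://github.com/siimon/prom-client.git#my-feature-branch"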

There were several areas where prom-client worked differently from my integration and smoke tests, and in fact 1.0 is a little broken in this regard. So I dropped my PR to upgrade prom-client to 1.0 and will file another for 1.1.0 later this week.

What’s New

1.1.0 fixes a few issues with ESM modules, supports the t-test feature added in bench-node 0.14, and narrows the display a little to make the results easier to scan, particularly in CI/CD tools that like to clip horizontal scrolling. It also has some API changes to facilitate support for worker threads, which is still an experimental feature in bench-node that I am currently working to solidify. I hope to deliver that, along with faster t-test support, in a 1.2 or 1.3 release.
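
The t-test is the statistics behind calling a result inconclusive rather than a win or a regression: if the difference between two sets of samples cannot be distinguished from run-to-run noise, it should not be reported as a change. Here is a rough sketch of the idea, as an illustration only rather than bench-node’s actual implementation:

  // Welch's t-statistic for two independent sample sets (ops/sec per run).
  function mean(xs) {
    return xs.reduce((a, b) => a + b, 0) / xs.length;
  }

  function variance(xs) {
    const m = mean(xs);
    return xs.reduce((a, b) => a + (b - m) ** 2, 0) / (xs.length - 1);
  }

  function welchT(a, b) {
    const se = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
    return (mean(a) - mean(b)) / se;
  }

  const baseline = [14.1e6, 14.4e6, 14.3e6, 14.2e6];
  const branch = [12.9e6, 13.0e6, 12.8e6, 13.1e6];
  // A small |t| means the difference is lost in the noise; "2" here stands in
  // for the proper critical value at the chosen confidence level.
  console.log(Math.abs(welchT(baseline, branch)) > 2 ? 'significant' : 'inconclusive');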

Next Steps

As I mentioned above, most of the 1.2 feature set is predicated on landing improvements to bench-node, and on making more use of the confidence intervals in both the output summary and the results JSON file.
