Introducing Faceoff
Faceoff 1.1
I created faceoff to fill the hole left by the dormant benchmark-regression project.
The 1.1.0 release fulfills most of the goals I set out to achieve, though there are a few new things I hope to land in the next month or two.
Performance is a Feature
Faceoff, and benchmark-regression before it, is focused on preventing regressions in pull requests by comparing the code in your branch to the same code in previous releases. Where Faceoff differs is that it can also compare a feature branch to release branches or to your trunk build. This is particularly handy on projects where releases collect a number of PRs into a single build instead of practicing Continuous Integration.
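To make that concrete, here is a minimal sketch of what a benchmark entry point might look like. The import, function names, and options are illustrative assumptions rather than Faceoff’s actual API (the project README has the real interface); the three results it would produce correspond to the labels in the sample output below: the latest published release, the trunk build, and the branch under test.

// Hypothetical entry point -- the import, names, and options here are
// illustrative assumptions, not Faceoff's actual API.
import { createFaceoff } from 'faceoff';

// Compare the working tree against the last published release; the trunk
// build and the current branch round out the three-way comparison.
const faceoff = createFaceoff('prom-client', ['prom-client@latest']);

faceoff.add('constructors ⇒ new Registry()', (client) => {
  // `client` is whichever copy of the module is currently being measured.
  return new client.Registry();
});

await faceoff.run();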
But collecting this data also has value after the fact. How often have you been told that the application got slower “sometime in the last couple of weeks,” and now you are tasked with figuring out which of the last 45 commits or 120 dependency upgrades may have introduced the regression? Being able to look back through build history for evidence of the problem can take a half-day investigation down to minutes.
The output looks a little something like this:
constructors ⇒ new Registry()
Summary (vs. baseline):
⇒ latest ▏█████████████████████████▕ 14,300,744 ops/sec | 13 samples (baseline)
⇒ trunk ▏██████████████████████▌──▕ 12,946,295 ops/sec | 11 samples (1.10x slower)
⇒ #perf/keys ▏██████████████████████▌──▕ 12,965,191 ops/sec | 10 samples (1.10x slower)
constructors ⇒ new Counter()
Summary (vs. baseline):
⇒ latest ▏████████████████████████─▕ 1,137,460 ops/sec | 119 samples (baseline)
⇒ trunk ▏████████████████████████▌▕ 1,153,108 ops/sec | 96 samples (1.01x faster)
⇒ #perf/keys ▏█████████████████████████▕ 1,170,779 ops/sec | 95 samples (1.03x faster)
util ⇒ LabelMap.keyFrom()
Summary (vs. baseline):
⇒ trunk ▏█████████████████████████▕ 7,195,488 ops/sec | 13 samples (baseline)
⇒ #perf/keys ▏████████████████████████▌▕ 7,180,307 ops/sec | 11 samples (1.00x slower)
Inconclusive Tests:
------------------------
constructors ⇒ new Counter()
⇒ latest ▏████████████████████████─▕ 1,137,460 ops/sec | 119 samples (baseline)
⇒ trunk ▏████████████████████████▌▕ 1,153,108 ops/sec | 96 samples (1.01x faster)
⇒ #perf/keys ▏█████████████████████████▕ 1,170,779 ops/sec | 95 samples (1.03x faster)
Performance Regressions:
------------------------
constructors ⇒ new Registry()
⇒ latest ▏█████████████████████████▕ 14,300,744 ops/sec | 13 samples (baseline)
⇒ trunk ▏██████████████████████▌──▕ 12,946,295 ops/sec | 11 samples (1.10x slower)
⇒ #perf/keys ▏██████████████████████▌──▕ 12,965,191 ops/sec | 10 samples (1.10x slower)
Background
Last winter, I became involved in the prom-client project. I did a lot of telemetry work at my last job, and while looking for ways to contribute back I noticed that many of the open issues in their backlog were about performance and memory. Things I know quite a bit about. So I started filing PRs to address some of them.
prom-client’s code base introduced me to benchmark-regression, which wraps benchmark.js, a tool I was quite familiar with, as it was instrumental in helping me break a performance log-jam that had existed for most of my tenure at my last job. With benchmark.js’s help, I was able to make new feature development substantially cheaper and let my team start making net improvements in response time, instead of constantly having our wins zeroed out by the next major initiative.
I quickly discovered that benchmark.js had been end-of-lifed in April of ’24, and that benchmark-regression hadn’t landed a PR in seven years. They both essentially come from a time before async code, and like many such libraries they struggled to adapt.
I didn’t mean to write a replacement for these tools, but as helpful as benchmark-regression was, I was running into logistical issues working on prom-client. The problem is that prom-client is a fairly mature project. Mature codebases don’t generally respond quickly to some new nutbar filing half a dozen PRs in as many weeks. And this experience was shining a pretty harsh spotlight on the areas where benchmark-regression was good, but not good enough.
Even though the prom-client maintainers graciously accepted a number of my PRs rather quickly, they weren’t issuing new releases. benchmark-regression assumes a world where we only want to compare our branch to a specific release of the module, whereas what we want to do on a project that sees many more commits than releases is compare our branch to HEAD. To the trunk build. That way we aren’t undoing work by our fellow contributors, nor are we double-counting their improvements. While engaging in a rather enthusiastic campaign of optimizations, I discovered very quickly that benchmark-regression was struggling to keep my PR descriptions objectively honest. It was getting harder and harder to keep track of how branch A affects performance versus branch B, which has since been merged and which my branch is now rebased on top of to resolve conflicts. These are the same problems any two or three contributors would encounter trying to file PRs in parallel, and it wasn’t long before someone ran into just that problem.
These are the sorts of ergonomics issues that cause developers to simply stop trying. It’s a source of learned helplessness I’ve been trying to fight my entire career. It’s one of the reasons I lean so heavily on CI/CD, with an emphasis on being able to reproduce CI errors locally. You can clean up after yourself. You have all of the tools you need to do this. Please use them.
Not finding any alternative to benchmark-regression with feature parity, I figured it was time for me to do something about it. I poked around for alternatives to benchmark.js and stumbled upon bench-node, which was still under active development but already supported async tests, and was created by someone established in the Node.js ecosystem. It already had much better charts for my purposes, and it looked like it could be bent to fit the model I was using. And as of a couple of weeks ago I am now also a maintainer on bench-node. So I’ve got that going for me. Which is nice.
First Customer
I designed faceoff to run prom-client’s existing regression tests with only minor changes. It is a nearly drop-in replacement for benchmark-regression, and that has gone pretty well. A PR to use it for prom-client was merged the day before Thanksgiving. Faceoff expands on benchmark-regression by supporting git URLs in place of version numbers, and it focuses on three-way comparisons between your working directory, the project’s trunk, and your current branch.
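Concretely, “git URLs in place of version numbers” means the list of things to compare against can mix npm version specifiers with git references. The specifiers below are only meant to illustrate the idea; they are not the exact strings from the prom-client PR.

// Illustrative comparison targets -- anywhere benchmark-regression took a
// version number, faceoff can also take a git URL.
const targets = [
  'prom-client@latest',                                       // a published npm release
  'git+https://github.com/siimon/prom-client.git#master',     // the trunk build
  'git+https://github.com/siimon/prom-client.git#perf/keys',  // a feature branch
];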
There were several areas where prom-client worked differently from my integration and smoke tests,
and in fact 1.0 is a little broken in this regard. So I dropped my PR to upgrade prom-client to
1.0 and will file another for 1.1.0 later this week.
What’s New
1.1.0 fixes a few issues with ESM modules, supports the t-test feature added in bench-node 0.14, and narrows the display a little to make the results easier to scan, particularly in CI/CD tools that clip long lines rather than scroll horizontally. It also has some API changes to facilitate support for worker threads, which is still an experimental feature in bench-node that I am currently working to solidify. I hope to deliver that and faster t-test support in a 1.2 or 1.3 release.
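If you are wondering why the t-test matters: it is what lets a report distinguish noise from a real change, the kind of call the “Inconclusive Tests” section in the sample output above is making. Conceptually it looks like the sketch below, written out by hand for illustration; this is just the idea, not bench-node’s implementation.

// Conceptual sketch of a two-sample (Welch's) t-test deciding whether two
// sets of ops/sec samples really differ. Not bench-node's implementation.
function mean(xs) { return xs.reduce((a, b) => a + b, 0) / xs.length; }
function variance(xs) {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
}

// How many standard errors apart are the two means?
function looksLikeARealDifference(baselineOps, candidateOps) {
  const se = Math.sqrt(
    variance(baselineOps) / baselineOps.length +
      variance(candidateOps) / candidateOps.length
  );
  const t = (mean(baselineOps) - mean(candidateOps)) / se;
  // ~2 approximates the 95% two-sided critical value for moderate sample
  // counts; a real implementation looks up the t distribution properly.
  return Math.abs(t) > 2;
}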
Next Steps
As I mentioned above, most of the 1.2 feature set will be predicated on landing improvements to bench-node, making more use of the confidence intervals in both the output summary and the results JSON file.