Background: Making the Most of Async in Node.js

If you’re trying to get a lot of unrelated tasks done in Node.js or similar environments, the normal way to write async code is often close enough to the best case that most people don’t bother doing more than a little bit of tuning. But as the workload grows, and particularly if you are attempting any sort of offline (batch) processing in Node.js, the gap between simple and fast can stretch into orders of magnitude, which makes it worth digging a lot deeper.

Background

This is the beginning of a series covering both my existing knowledge and further lessons learned while working on batch processing tasks for a high availability SaaS service. Most of the lessons came from a job that ran on gobs of hardware and took half an hour to complete. I got it to run in less than a third of the time on a quarter of the hardware (roughly a 15x improvement). A second batch job was much simpler and saw around a 25x improvement. Additionally, the first job came with a long list of caveats about running it during the daytime, including interpreting Splunk queries to determine whether the system could handle the load. It failed around 10% of the time, and occasionally caused brownouts of other services if you ignored the warnings. The redesigned system had only one caveat: don’t run this job in the middle of a production outage.

Before we get going, I feel it’s important to discuss some key concepts.

Offline vs Online Processes

“Online” communication expects an immediate answer, while “offline” work has a much more forgiving completion date. Consider, for instance, a fast food restaurant. The cashier taking your order is expected to enter it immediately, while the person washing dishes only needs to finish sooner rather than later. In fact, if a busload of people shows up at the store, the person washing dishes may be asked to pause that work to help out with making food.

As a general rule, offline tasks should not interfere with the timeliness of online ones. So you will need some sort of mechanism to throttle your requests.

Throttling

Throttling generally means limiting the number of requests that a client can fire at a service per unit of time. In some cases it instead limits the number of simultaneous requests, which is superior for this sort of work.
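
To make the distinction concrete, here is a rough sketch of both styles. The helper names, the 50 requests per second, and the limit of 10 are illustrative, not from any particular library:

```typescript
// Rate-based throttling: fire a fixed number of requests per second,
// regardless of how the service is actually coping. The rate is a guess
// chosen ahead of time.
async function sendAtFixedRate(tasks: Array<() => Promise<void>>, perSecond = 50) {
  for (const task of tasks) {
    task().catch(console.error);                  // fire and forget
    await new Promise((r) => setTimeout(r, 1000 / perSecond));
  }
}

// Concurrency-based throttling: never have more than `limit` requests in
// flight. Throughput now follows the service's actual response times.
async function sendWithConcurrencyLimit(tasks: Array<() => Promise<void>>, limit = 10) {
  const inFlight = new Set<Promise<void>>();
  for (const task of tasks) {
    const p = task().catch(console.error);
    inFlight.add(p);
    void p.finally(() => inFlight.delete(p));     // free the slot when it settles
    if (inFlight.size >= limit) {
      await Promise.race(inFlight);               // wait for a slot to open up
    }
  }
  await Promise.all(inFlight);                    // drain the stragglers
}
```

The second form is the one the rest of this series builds on; libraries like p-limit package the same idea up more ergonomically.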

Throttling usually means something like: I can send, for instance, 50 requests per second at your service. This is another point of coordination, however, because the limit is frequently chosen either by the sender or by an ingress load balancer, which creates extra points of failure when the ambient load or the autoscaling settings change. That often won’t create production outages, but it can if the target services are not dialed in properly. Even when it doesn’t cause outages, it can trip alarms, which creates a constant threat of logistics headaches for the on-call folks. At minimum, it leads to a batch process with a reputation of needing to be run only by very experienced engineers who can predict when firing it up will cause problems. You shouldn’t have to consult a Grafana dashboard to find out whether a deployment task is safe to run at 2 in the afternoon. It should either work or be locked out, so that any senior team member can run it with impunity. As the number of tasks scales, this becomes critically important.

When the client chooses the number, it has to pick a compromise rate: slow enough that it won’t overload the system most of the time, but therefore slower than it could be, because it can’t speed up when the service has capacity to spare. That removes any benefit of running the offline process during low-traffic hours. And when things are very bad, the compromise number can still end up triggering alerts, because it isn’t quite slow enough to never cause issues.

An alternative is to use back pressure: limit the number of outstanding requests as a function of the service’s ability to process them.

Back Pressure

A lot of async code ends up being used to talk to other services, which means you are working with a distributed system. Two of the more important features of performant distributed systems are back pressure and work stealing. Work stealing is about poaching tasks from the queues of other processes or threads in order to clear the overall work queue sooner. Without it, the very last task in a group can end up starting only after every other task has finished, greatly increasing the latency between starting a batch of tasks and finishing it. So when a process runs out of work, it may pull tasks from the queue of another process rather than sit idle.
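
True work stealing lives in schedulers and distributed queues, but the single-process analog in Node.js is simply to have every worker pull its next task from one shared queue instead of pre-assigning fixed chunks. A minimal sketch, with illustrative names:

```typescript
// Every worker pulls its next task from the same shared cursor, so a fast
// worker effectively "steals" work that a slow worker would otherwise have
// been stuck with, and nobody sits idle while a backlog remains.
async function runWithSharedQueue<T, R>(
  items: T[],
  work: (item: T) => Promise<R>,
  workers = 4,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;                              // shared index into the queue

  async function drain(): Promise<void> {
    while (next < items.length) {
      const i = next++;                      // claim the next task (single-threaded, so no race)
      results[i] = await work(items[i]);
    }
  }

  await Promise.all(Array.from({ length: workers }, () => drain()));
  return results;
}
```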

The other trick is back pressure. Back pressure is a way to force one step in a workflow to slow down so that it doesn’t outrun a subsequent step. That keeps partially completed work from stacking up, which would otherwise create CPU or memory contention that reduces the number of tasks completed per second. Sometimes going slower allows you to go faster.
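
Node.js streams already implement this idea: pipeline() propagates back pressure, pausing upstream stages whenever a downstream buffer (highWaterMark) fills up. A minimal sketch, with generateTaskIds, fetchRecord, and saveRecord as hypothetical stand-ins for the real work:

```typescript
import { pipeline } from 'node:stream/promises';
import { Readable, Transform, Writable } from 'node:stream';

// Hypothetical stand-ins for the real work in this sketch.
function* generateTaskIds() { for (let i = 0; i < 10_000; i++) yield `task-${i}`; }
const fetchRecord = async (id: string) => ({ id, payload: 'big blob of data' });
const saveRecord = async (_record: { id: string }) => { /* slow write goes here */ };

async function runJob() {
  await pipeline(
    Readable.from(generateTaskIds()),                // fast producer
    new Transform({
      objectMode: true,
      transform(id: string, _enc, done) {
        fetchRecord(id).then((rec) => done(null, rec), done);
      },
    }),
    new Writable({
      objectMode: true,
      highWaterMark: 16,                             // at most ~16 records buffered
      write(record, _enc, done) {
        saveRecord(record).then(() => done(), done); // the slowest step sets the pace
      },
    }),
  );
}

runJob().catch(console.error);
```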

With back pressure, you keep a bounded amount of work in flight and let the service’s own responsiveness tell you when that is too much. In practice, this means you might have 10 requests running in parallel, and if they retire in 100 ms apiece, then you get to make 100 requests/sec. If the service is slow, that may drop to 50 req/s. If load is light, that may jump to 130 req/s. If they upgrade the service to fix a performance bottleneck, that might jump to 200 req/s, 300, or more. You don’t have to do much on your end because the natural capacity of the system is built into the back pressure mechanism.
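
The arithmetic behind those numbers is just concurrency divided by latency, which is why a fixed concurrency limit adapts on its own:

```typescript
// Back-of-the-envelope relationship for a fixed concurrency limit:
//   throughput ≈ concurrency / average latency
const throughput = (concurrency: number, latencySeconds: number) =>
  concurrency / latencySeconds;

throughput(10, 0.100); // 100 req/s when responses take 100 ms
throughput(10, 0.200); //  50 req/s when the service is struggling
throughput(10, 0.050); // 200 req/s after a performance fix
```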

Memory Pressure

In general, a task that synthesizes data from multiple sources starts with a small amount of data describing the task, pulls in a large amount of data to calculate the answer, and ends with a moderate amount of data as the answer. When this is true, a task is better off not being in progress at all until earlier tasks have finished, rather than sitting half-done holding its large intermediate data. That way the system spends less effort juggling available memory. It also means you don’t create a Thundering Herd at the beginning of the process.
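
To make that concrete, here is a sketch of the trap and one way out; loadInputs and summarize are hypothetical stand-ins for the real fetch and computation:

```typescript
// Hypothetical stand-ins: a large fetch and a moderate-sized answer.
const loadInputs = async (taskId: string) => [Buffer.alloc(10_000_000, taskId)];
const summarize = (inputs: Buffer[]) => `${inputs.length} chunks`;

// Don't: this starts every task immediately, so every task's large inputs can
// be resident in memory at the same time (and the first burst is a thundering
// herd against whatever you're fetching from).
//   await Promise.all(taskIds.map(async (id) => summarize(await loadInputs(id))));

// Do: start a task only when a slot frees up, so at most `limit` sets of large
// inputs exist at any moment. Same shared-queue shape as the work-stealing sketch.
async function processAll(taskIds: string[], limit = 5): Promise<string[]> {
  const answers: string[] = new Array(taskIds.length);
  let next = 0;
  const slot = async () => {
    while (next < taskIds.length) {
      const i = next++;
      const bigInputs = await loadInputs(taskIds[i]); // lives only while in flight
      answers[i] = summarize(bigInputs);              // keep only the small result
    }
  };
  await Promise.all(Array.from({ length: limit }, slot));
  return answers;
}
```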

Next Steps

In the next installment, I will introduce the p-limit library and then enumerate techniques for leveraging it for everything it’s worth. Stay tuned.
