Part 3: Making the Most of Async in Node.js
In Part 2 we discussed how to use p-limit to handle common errors and chains of simple parent-child requests. Today we will discuss more heterogeneous use cases, such as complex call graphs and high-variance task costs.
Shuffling Tasks to Prevent Clustering
Pretty early on I discovered two problems with our bigger batch process. First, we had one customer group for whom the job generated about 10x as much work. And because of the way customers were handled, all of the customers in the same group were returned consecutively in the data (in our case, because of a shared prefix on the account ID, but JOIN operations can create a similar situation). More than three times out of four, if the batch job saturated our S3 services, it was in the middle of processing this group of customers.
All I had to do was sort the IDs by a different criterion. In this case, I pruned off the group prefix and sorted by the rest of the ID. There was a side benefit, too: because each group is an arbitrary size, you previously could not eyeball the console logs of the batch to estimate how close you were to done. We are in the M's now. Is that halfway done? A third? 60%? Who knows. With the new sorting, that estimate was much closer to accurate.
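As a rough sketch of the fix (the "GROUPA-000123" ID shape here is made up for illustration; our real IDs looked different):

const interleaved = [...customerIds].sort((a, b) => {
  // Hypothetical ID shape: "GROUPA-000123". Dropping the group prefix and
  // sorting by the remainder interleaves customers from different groups
  // instead of processing each group as one long consecutive run.
  const suffixA = a.slice(a.indexOf("-") + 1);
  const suffixB = b.slice(b.indexOf("-") + 1);
  return suffixA.localeCompare(suffixB);
});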
Shared Dependencies
But the other problem was that customer groups tended to have shared dependencies. And when you're load balancing tasks across an entire cluster, coordination can go out the window. Firing two related tasks at the same time means they get load balanced to different servers. Both of them see that a piece of data they need doesn't exist yet, have no way to know that someone else is already working on it, and duplicate the effort. Only when they get a PUT collision storing the result do they find out that the work was for nothing. By shuffling the data, you put time between the first and last use of related data, allowing the first call to settle before additional calls check for its outputs.
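If there is no convenient secondary key to sort on, a plain shuffle spreads related work out just as well. This is nothing more than a standard Fisher-Yates pass over the task list, not anything specific to our system:

function shuffle(tasks) {
  // Fisher-Yates: puts time between related tasks so the first request for a
  // shared dependency can settle before its siblings go looking for it.
  for (let i = tasks.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [tasks[i], tasks[j]] = [tasks[j], tasks[i]];
  }
  return tasks;
}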
When the problem is local, you can sometimes fix this with Promise Caching, but when you're calling a service that does the expensive work, that's not an option. Additionally, in production situations, where per-user auth and telemetry are valid concerns, promises often have AsyncLocalStorage context associated with them. If you're not careful about what objects you store there, and you don't replace the cached promises with their resolved values at the end of the operation, you can leak significant amounts of memory. If reordering the requests fixes more than 90% of the problem, you can avoid those additional moving parts.
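For reference, "Promise Caching" here just means memoizing the in-flight promise rather than the resolved value, so concurrent local callers share one request. A minimal sketch, assuming a hypothetical fetchExpensiveThing helper:

const inFlight = new Map();

function getExpensiveThing(key) {
  if (!inFlight.has(key)) {
    const promise = fetchExpensiveThing(key).finally(() => {
      // Evict once settled so we are not holding old promises (and any
      // AsyncLocalStorage context attached to them) for the life of the process.
      inFlight.delete(key);
    });
    inFlight.set(key, promise);
  }
  return inFlight.get(key);
}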
Managing Complex Call Graphs
Our SaaS system represented a lot of data as a graph. Several graphs, in fact, which were merged so that defaults and customer preferences could override one another. So a lot of the time, one call would result in three similar calls at some level of recursion down. It wasn't a simple list comprehension like the examples I used in Part 2. In these cases you don't know whether a task will make two calls or a dozen, or whether they will happen at depth 1 or depth 4.
If you can avoid these "n+1 problems" you should, but you don't always have that option. And this is not the only situation where you need throttling at the process level instead of the task level. In such cases, you will want to move the p-limit usage into a helper function that is in charge of making requests to a particular service, rather than trying to enumerate every call in your top-level function.
import pLimit from "p-limit";
//...
// Declared once at module scope: every caller in this process shares these queues.
const addressLimit = pLimit(ADDRESS_PARALLELISM);
const userLimit = pLimit(USER_PARALLELISM);

async function fetchUsers(criteria) {
  const response = await userLimit(() => fetch(/*...*/));
  const users = await response.json();
  // Fire all the address lookups; addressLimit caps how many run at once.
  await Promise.all(
    users.map(async (user) => {
      user.addresses = await fetchAddresses(user.id);
      // attach shipping info from service B to user data from service A.
    })
  );
  return users;
}

async function fetchAddresses(userId) {
  const response = await addressLimit(() => fetch(/*...*/));
  //... process the address responses
  return response.json();
}
These limits are declared at file scope, so any number of modules importing this file share a single queue across the entire process, even if each caller instantiates its own instance of the client class to carry context information for requests, telemetry, or logging purposes. We don't want each task to make 10 requests at once; we want 10 requests in flight across all tasks.
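To make the sharing concrete, a caller elsewhere in the process might look something like this (batchOfCriteria is a placeholder); no matter how many tasks run at once, the two limits above cap the requests actually in flight:

// Each task may trigger an unpredictable number of user and address requests,
// but the per-service limits throttle them across all tasks in this process.
const results = await Promise.all(
  batchOfCriteria.map((criteria) => fetchUsers(criteria))
);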
We benefitted a little from this pattern in our batch processing, but most of its value came in situations involving feature toggle code duplication, and in improving local reasoning. Then, when I began watching for opportunities to apply some of my new tricks back to our online processing, we ran into a demo page, used for some sort of benchmarking, that started tripping circuit breakers because it fired too many requests at once. I applied the code pattern above with a fairly generous per-process limit and the problem stopped happening. Those pages just took longer to load, but they did load, and they stopped triggering alerts. So while many of these tricks are not particularly conducive to online request situations, you will occasionally find a use for them, and having them in your toolbag is useful.
Conclusions?
There will eventually be one more section in this series, covering some of the vagaries of async logic and throughput. For now I will call this, if not an end, then a pause until a useful Part 4 gets fleshed out.
Special thanks to Sindre Sorhus, an open source author who demonstrates the Single Responsibility Principle in his Node.js modules with a consistency I have rarely seen in the wild. Please check out his work, particularly p-limit and p-retry.