Frequently, I find myself defining APIs over network boundaries: either service-to-service RPCs or browser-to-service HTTP calls. A question that comes up often is whether or not to design “batch” calls. In this post I aim to outline some of the factors to consider.

What does batching mean?

But first, what do I mean by batching? Take an imaginary WidgetDatabase service, which has a single operation to look up a widget by its ID.

Deciding whether to batch means deciding whether our HTTP endpoint (or gRPC call, or Thrift call, etc.) does a single lookup (e.g. GET /widget/1234) or looks up multiple widgets (e.g. GET /widgets/1234,5678).

Easy for the implementer, easy for the caller

Say we decide to do the simple thing and support single lookups only. The big advantage is that this is easier to implement. No need to loop through each ID, and no need to worry about partial failures. In the single-lookup version, the implementer only needs to handle the case where the ID maps to a widget, and the case where the ID doesn’t map to a widget. In the batch version, the implementer needs to specify and test what happens when some of the IDs map to a widget and some don’t.

That may seem easy, but each of the three common options has a cost.

If the API returns an error whenever any widget is missing, callers won’t be able to behave intelligently when some IDs can’t be resolved.

If the call simply leaves the missing widgets out of the response, the caller is responsible for checking each requested ID against the response. And callers can make mistakes! For example, this code would break in a surprising way if the API returned partial results but the caller assumed the call would fail outright whenever a widget was missing.

let ids : list<ID> = ...
let widgets : map<ID, Widget> = getWidgets(ids)
for (id : ids) {
  print("widget id : " + id + " is " + widgets[id])
}
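To make the failure mode concrete, here is a minimal Python sketch of the same mistake next to a safer version. The `get_widgets` function and its in-memory `STORE` are hypothetical stand-ins for the batched API, and the sketch assumes the API silently omits IDs it cannot resolve.

```python
# Hypothetical stand-in for a batched API that silently drops unknown IDs.
STORE = {"1234": "sprocket", "5678": "flange"}


def get_widgets(ids):
    """Return only the widgets that exist; missing IDs are left out."""
    return {i: STORE[i] for i in ids if i in STORE}


def describe_naive(ids):
    """Assumes every requested ID is present in the response."""
    widgets = get_widgets(ids)
    # Raises KeyError as soon as one requested ID is missing.
    return [f"widget id: {i} is {widgets[i]}" for i in ids]


def describe_checked(ids):
    """Checks each requested ID against the response explicitly."""
    widgets = get_widgets(ids)
    lines = []
    for i in ids:
        if i in widgets:
            lines.append(f"widget id: {i} is {widgets[i]}")
        else:
            lines.append(f"widget id: {i} was not found")
    return lines
```

In Python the naive version fails loudly with a `KeyError`. In languages where a missing map key yields a null value, the same mistake would instead print a bogus line and carry on, which is arguably worse.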

A third option is to change the return format to encode either an error or a result for each widget. The call would look something like:

interface ErrorOrWidget {
  e Error
  w Widget
}
let widgets : map<ID, ErrorOrWidget> = getWidgets(ids)

This is a good option, and leaves a lot of room for extension. For example, you could embed information about the cause of the missing widget in your error. Was it a permissions issue? A database that was down? A widget that simply doesn’t exist?
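As a rough illustration of the per-item result format, here is one way it could look in Python. The `FailureReason` values, the `FORBIDDEN` set, and the in-memory `STORE` are all assumptions made up for this sketch, not part of any real service.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureReason(Enum):
    """Hypothetical causes a caller might want to tell apart."""
    NOT_FOUND = "not_found"
    PERMISSION_DENIED = "permission_denied"
    BACKEND_UNAVAILABLE = "backend_unavailable"


@dataclass(frozen=True)
class ErrorOrWidget:
    widget: Optional[str] = None
    error: Optional[FailureReason] = None

    def __post_init__(self):
        # Exactly one of the two fields must be set.
        if (self.widget is None) == (self.error is None):
            raise ValueError("set exactly one of widget or error")


# Hypothetical data: one readable widget and one the caller may not see.
STORE = {"1234": "sprocket"}
FORBIDDEN = {"5678"}


def get_widgets(ids):
    """Return an entry for every requested ID, either a widget or a reason."""
    results = {}
    for i in ids:
        if i in FORBIDDEN:
            results[i] = ErrorOrWidget(error=FailureReason.PERMISSION_DENIED)
        elif i in STORE:
            results[i] = ErrorOrWidget(widget=STORE[i])
        else:
            results[i] = ErrorOrWidget(error=FailureReason.NOT_FOUND)
    return results
```

Because every requested ID appears in the response, a caller can no longer mistake a missing entry for success.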

But it also asks more of a type system, which is especially relevant if you define the API with an interface description language like Thrift or gRPC/protobuf. Say our widget service starts returning Gizmos and DooDads along with Widgets. We now need three wrapper types: ErrorOrGizmo, ErrorOrDooDad, and ErrorOrWidget.

So the choice is clear — always use unbatched APIs, right?

Not necessarily. Batched APIs can offer plenty of opportunities for improving performance.

Performance

This section covers several things to watch out for. There’s no one-size-fits-all solution, so benchmark whenever you can before deciding.

A common scenario is that each API call is gated behind some authorization — maybe each call needs to load a session out of the database, or maybe each call needs to do some cryptographic verification of an authentication token.

Maybe each Widget lookup requires joining against one of a small number of Gadgets.

With a single-lookup API, we incur that cost once per item looked up, instead of once per call as we would with a batched lookup.
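The difference in fixed per-call cost can be sketched by counting authorization checks. Everything here is hypothetical: `WidgetService`, its token check, and its in-memory store stand in for whatever session load or signature verification a real service performs.

```python
class WidgetService:
    """Toy service that counts how often the per-call auth step runs."""

    def __init__(self, store):
        self.store = store
        self.auth_checks = 0

    def _authorize(self, token):
        # Stand-in for a session load or token verification.
        self.auth_checks += 1
        if token != "valid-token":
            raise PermissionError("bad token")

    def get_widget(self, token, widget_id):
        """Single lookup: pays the auth cost on every call."""
        self._authorize(token)
        return self.store.get(widget_id)

    def get_widgets(self, token, ids):
        """Batched lookup: pays the auth cost once for the whole batch."""
        self._authorize(token)
        return {i: self.store[i] for i in ids if i in self.store}


ids = [str(n) for n in range(100)]
store = {i: f"widget-{i}" for i in ids}

single = WidgetService(store)
for i in ids:
    single.get_widget("valid-token", i)

batched = WidgetService(store)
batched.get_widgets("valid-token", ids)
```

After this runs, `single.auth_checks` is 100 and `batched.auth_checks` is 1. Whether that matters depends on how expensive the check really is, which is what benchmarking would tell you.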

Most importantly, with a single-lookup API it becomes the responsibility of the caller to parallelize calls. Say an API caller wants to look up 100 widgets. The caller should kick off 100 separate calls in parallel and wait for them all to complete, instead of executing one call after the other. Depending on the languages expected to call the API, this can be tough. Languages with library support for promises/futures can express this idiom naturally. Other languages make it harder.

There are also transport-level concerns when using single lookups. Many RPC protocols, like Thrift’s binary protocol and gRPC, support multiplexing: the ability to execute multiple requests over a single network connection simultaneously. One of the big problems HTTP/2 solved was removing HTTP/1.1’s one-request-at-a-time limitation.

Why is this important? Without multiplexing, making 100 requests in parallel takes 100 separate connections. And in many protocols, a connection is expensive. If you’re using TLS over your network boundary (and you should be), each connection incurs the cost of a TLS handshake. Among the other costs of a connection: on a Unix-like system, each one requires a file descriptor, and those are in finite supply.
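In a language with futures, the caller-side fan-out might look like the following Python sketch built on `asyncio.gather`. The `get_widget` coroutine is a hypothetical client call, with a short sleep standing in for the network round trip.

```python
import asyncio


async def get_widget(widget_id):
    """Hypothetical single-lookup client call."""
    await asyncio.sleep(0.01)  # stand-in for a network round trip
    return f"widget-{widget_id}"


async def get_many(ids):
    """Start every lookup at once and wait for all of them to finish."""
    results = await asyncio.gather(*(get_widget(i) for i in ids))
    # gather returns results in the same order as the inputs.
    return dict(zip(ids, results))


widgets = asyncio.run(get_many(list(range(100))))
```

Awaiting each call in a loop instead would pay the round-trip time 100 times over. Note that `gather` raises the first exception it sees by default, so a real caller would still have to decide how to treat partial failures.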

But there are advantages to single lookups too.

Each call naturally maps to a cache entry, so putting something like Varnish in front of the server, or a cache like Caffeine in your client, can give a big performance boost.
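The caching argument can be sketched with Python’s `functools.lru_cache` standing in for a real cache. `fetch_from_backend` is a hypothetical backend call, and the batched variant is cached on its whole tuple of IDs, which is one simple way a cache might key a batch request.

```python
from functools import lru_cache

# Mutable counter so the number of backend hits can be inspected.
backend_calls = {"count": 0}


def fetch_from_backend(widget_id):
    """Hypothetical expensive lookup."""
    backend_calls["count"] += 1
    return f"widget-{widget_id}"


@lru_cache(maxsize=1024)
def get_widget(widget_id):
    """Single lookup: one cache entry per ID."""
    return fetch_from_backend(widget_id)


@lru_cache(maxsize=1024)
def get_widgets(ids):
    """Batched lookup: one cache entry per distinct tuple of IDs."""
    return {i: fetch_from_backend(i) for i in ids}


get_widget("1234")
get_widget("1234")  # served from the cache
single_hits = backend_calls["count"]

get_widgets(("1234", "5678"))
get_widgets(("5678", "9999"))  # shares an ID with the previous call, still a miss
batch_hits = backend_calls["count"] - single_hits
```

Here `single_hits` ends up as 1 and `batch_hits` as 4: the two batches overlap on one ID but share nothing in the cache. A batch-aware cache could split requests per ID, but that is extra machinery the single-lookup design gets for free.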

If your service is behind a proxy which can fan out requests, or has clients which are capable of selecting from one of several nodes of the same highly available service, you are more likely to have even load between all nodes of your service.

Uneven load can be a problem: imagine having one request which is badly behaved (maybe it tries to resolve 1,000,000 widgets) when most calls try to resolve three or four at a time. The big request may slow down other requests. By forcing each lookup to be serviced separately, you can ensure that no single request brings down a server. And because each request uses a predictable amount of resources (CPU, memory, database calls, etc.), you can find the number of concurrent requests your service can handle without slowing down and plug in a rate limiting library easily.
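As a sketch of why predictable per-request cost makes limiting simple, the following uses an `asyncio.Semaphore` to cap in-flight lookups. This is a plain concurrency cap rather than a real rate limiting library, and `LimitedService`, the limit of 8, and the simulated work are all assumptions for illustration.

```python
import asyncio


class LimitedService:
    """Toy service that never runs more than `limit` lookups at once."""

    def __init__(self, limit):
        self._slots = asyncio.Semaphore(limit)
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed

    async def get_widget(self, widget_id):
        # Callers beyond the limit wait here instead of piling onto the backend.
        async with self._slots:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                await asyncio.sleep(0.001)  # stand-in for uniform per-lookup work
                return f"widget-{widget_id}"
            finally:
                self.in_flight -= 1


async def run_demo():
    # The limit would come from benchmarking; 8 is an arbitrary example.
    service = LimitedService(limit=8)
    results = await asyncio.gather(*(service.get_widget(i) for i in range(50)))
    return service, results


service, results = asyncio.run(run_demo())
```

All 50 lookups complete, and `service.peak` never exceeds 8. With batched calls the same cap says much less, because one slot might hold three widgets or a million.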

You might pick a limit on the batch size to avoid this, say 100 widgets per call. This solves the problem of a single call bringing down a service, but comes with all the client-side complexity of single lookups and few of the advantages.
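The client-side complexity shows up as soon as a caller has more IDs than the cap allows. A minimal sketch, assuming a hypothetical `get_widgets` that rejects oversized batches:

```python
MAX_BATCH = 100  # example cap from the text


def get_widgets(ids):
    """Hypothetical batched API that enforces the cap."""
    if len(ids) > MAX_BATCH:
        raise ValueError(f"at most {MAX_BATCH} ids per call")
    return {i: f"widget-{i}" for i in ids}


def chunked(ids, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def get_all(ids):
    """Split the request into capped batches and merge the responses."""
    results = {}
    for batch in chunked(ids, MAX_BATCH):
        results.update(get_widgets(batch))
    return results


widgets = get_all(list(range(250)))
```

This version issues its three batches one after the other. To get the latency back, the caller has to parallelize the batches too, which is the same work a single-lookup API would have asked of them.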

Monitoring

Speaking of performance, which choice is easier to monitor for performance? Single-lookup APIs have a big advantage here, because you can (usually) expect each single lookup to have uniform performance.

In Conclusion

There’s no single correct approach, and the particulars of your service are important. The backward compatibility guarantees you need to commit to and the degree of control you have over clients are large factors in deciding what kind of API to design.