Wednesday, February 10, 2016

Guide to Multi-processing Network Server Models - Kapil Sharma

As someone who’s been writing high performance networking code for a number of years now (my doctoral dissertation was on the topic of a Cache Server for Distributed Applications Adapted to Multicore Systems), I see many tutorials on the subject that completely miss or omit any discussion of the fundamentals of network server models. This article is therefore intended as a hopefully useful overview and comparison of network server models, with the goal being to take some of the mystery out of writing high performance networking code.

Which Network Server Model Should I Choose?

This article is intended for “system programmers”, i.e., back-end developers who will work with the low-level details of their applications, implementing network server code. This will usually be done in C++ or C, though nowadays most modern languages and frameworks offer decent low-level functionality, with various levels of efficiency.
I’ll take it as common knowledge that, since it is easier to scale processing power by adding CPU cores than by making individual cores faster, it’s only natural to adapt software to use these cores as well as it can. Thus, the question becomes how to partition the software among threads (or processes) which can be executed in parallel on multiple CPUs.
I’ll also take for granted that the reader is aware that “concurrency” basically means “multitasking”, i.e. several instances of code (whether the same code or different, it doesn’t matter), which are active at the same time. Concurrency can be achieved on a single CPU, and prior to the modern era, usually was. Specifically, concurrency may be achieved by quickly switching between multiple processes or threads on a single CPU. This is how old, single-CPU systems managed to run many applications at the same time, in a way that the user would perceive as applications being executed simultaneously, although they really weren’t. Parallelism, on the other hand, means specifically that code is being executed at the same time, literally, by multiple CPUs or CPU cores.

Partitioning an Application (into multiple processes or threads)

For the purpose of this discussion, it’s largely irrelevant whether we are talking about threads or full processes. Modern operating systems (with the notable exception of Windows) treat processes almost as lightweight as threads (or, in some cases, vice versa: threads have gained features which make them as heavyweight as processes). Nowadays, the major difference between processes and threads lies in the capabilities for cross-process or cross-thread communication and data sharing. Where the distinction between processes and threads is important, I will make an appropriate note; otherwise, it’s safe to consider the words “thread” and “process” in this section interchangeable.

Common Network Application Tasks and Network Server Models

This article deals specifically with network server code, which necessarily implements the following three tasks:
  • Task #1: Establishing (and tearing down) network connections
  • Task #2: Network communication (IO)
  • Task #3: Useful work, i.e., the payload or the reason why the application exists
There are several general network server models for partitioning these tasks across processes; namely:
  • MP: Multi-Process
  • SPED: Single Process, Event-Driven
  • SEDA: Staged Event-Driven Architecture
  • AMPED: Asymmetric Multi-Process Event-Driven
  • SYMPED: SYmmetric Multi-Process Event-Driven
These are the network server model names used in the academic community, and I remember encountering synonyms for at least some of them “in the wild”. (The names themselves are, of course, less important – the real value is in how to reason about what’s going on in the code.)
Each of these network server models is further described in the sections that follow.

The Multi-Process (MP) Model

The MP network server model is the one everyone used to learn first, especially when learning about multithreading. In the MP model, there is a “master” process which accepts connections (Task #1). Once a connection is established, the master process creates a new process and passes the connection socket to it, so there is one process per connection. This new process then usually works with the connection in a simple, sequential, lock-step way: it reads something from it (Task #2), then does some computation (Task #3), then writes something back to it (Task #2 again).
The MP model is very simple to implement, and actually works extremely well as long as the total number of processes remains fairly low. How low? The answer really depends on what Tasks #2 and #3 entail. As a rule of thumb, let’s say the number of processes or threads should not exceed about twice the number of CPU cores. Once there are too many processes active at the same time, the operating system tends to spend far too much time thrashing (i.e., juggling the processes and threads around on the available CPU cores), and such applications generally end up spending almost all of their CPU time in “sys” (or kernel) code, doing little actually useful work.
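To make the model concrete, here is a minimal fork-per-connection sketch in C. It is illustrative only: error handling is mostly omitted, and the port number and the echo-style “payload” are placeholders standing in for Task #3.
```c
/*
 * Minimal MP-style sketch: one process per connection.
 * Error handling is mostly omitted; port 8080 and the echo "payload"
 * are placeholders for illustration.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    signal(SIGCHLD, SIG_IGN);                 /* let the kernel reap children */

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);

    for (;;) {
        int cfd = accept(lfd, NULL, NULL);    /* Task #1: the master accepts */
        if (cfd < 0)
            continue;
        if (fork() == 0) {                    /* one new process per connection */
            close(lfd);
            char buf[4096];
            ssize_t n;
            while ((n = read(cfd, buf, sizeof(buf))) > 0)   /* Task #2 */
                write(cfd, buf, (size_t)n);   /* Task #3 (trivial) + Task #2 */
            close(cfd);
            _exit(0);
        }
        close(cfd);                           /* master: hand off and continue */
    }
}
```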
Pros: Very simple to implement, and works very well as long as the number of connections is small.
Cons: Tends to overburden the operating system if the number of processes grows too large, and may exhibit latency jitter, since network IO has to wait until the payload (computation) phase is over.

The Single Process Event-Driven (SPED) Model

The SPED network server model was made famous by some relatively recent high-profile network server applications, such as Nginx. Basically, it does all three tasks in the same process, multiplexing between them. To be efficient, it requires some fairly advanced kernel functionality like epoll and kqueue. In this model, the code is driven by incoming connection and data “events”, and implements an “event loop” which looks like this (a minimal code sketch follows the list):
  • Ask the operating system if there are any new network “events” (such as new connections or incoming data)
  • If there are new connections available, establish them (Task #1)
  • If there is data available, read it (Task #2) and act upon it (Task #3)
  • Repeat until the server exits
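As a sketch of what such a loop looks like in code, here is a minimal single-process echo server using Linux’s epoll. The port number and the trivial echo “payload” are placeholders; a real server would parse and answer an actual protocol at the marked spot.
```c
/*
 * Minimal SPED-style event loop sketch using Linux epoll.  A single process
 * accepts connections, reads data, and does the (trivial) payload work.
 * Error handling is mostly omitted; port 8080 is a placeholder.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

    struct epoll_event events[64];
    for (;;) {
        /* Ask the OS whether there are any new "events" (connections or data). */
        int n = epoll_wait(ep, events, 64, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == lfd) {
                /* Task #1: a new connection is available -- establish it. */
                int cfd = accept(lfd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
                epoll_ctl(ep, EPOLL_CTL_ADD, cfd, &cev);
            } else {
                /* Task #2: data is available -- read it and act on it. */
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r <= 0)
                    close(fd);                  /* closing also removes it from epoll */
                else
                    write(fd, buf, (size_t)r);  /* Task #3: trivial "payload" */
            }
        }
    }
}
```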
All of this is done in a single process, and it can be done extremely efficiently because it completely avoids context switching between processes, which is usually what kills performance in the MP model. The only context switches here come from system calls, and those are minimized by acting only on the specific connections which have events attached to them. This model can handle tens of thousands of connections concurrently, as long as the payload work (Task #3) isn’t overly complicated or resource intensive.
There are two major downsides, though, to this approach:
  1. Since all three tasks are done sequentially in a single loop iteration, the payload work (Task #3) is done synchronously with everything else, meaning that if it takes a long time to compute a response to the data received from a client, everything else stops while this is being done, introducing potentially huge fluctuations in latency.
  2. Only a single CPU core is used. This again has the benefit of strictly limiting the number of context switches required from the operating system, which increases overall performance, but it has the significant downside that any other available CPU cores are doing nothing at all.
It is for these reasons that more advanced models are required.
Pros: Can be highly performant and easy on the operating system (i.e., requires minimal OS intervention). Requires only a single CPU core.
Cons: Utilizes only a single CPU (regardless of the number available). If the payload work is not uniform, results in non-uniform latency of responses.


The Staged Event-Driven Architecture (SEDA) Model

The SEDA network server model is a bit intricate. It decomposes a complex, event-driven application into a set of stages connected by queues. If not implemented carefully, though, its performance can suffer from the same issues as the MP model. It works like this (a small two-stage sketch follows the list):
  • The payload work (Task #3) is divided into as many stages, or modules, as possible. Each module implements a single specific function (think “microservices” or “microkernels”) which resides in its own separate process, and these modules communicate with one another via message queues. This architecture can be represented as a graph of nodes, where every node is a process, and the edges are message queues.
  • A single process performs Task #1 (usually following the SPED model), which offloads new connections to specific entry point nodes. Those nodes can be either pure network nodes (Task #2) which pass the data to other nodes for computation, or can implement the payload processing (Task #3) as well. There is usually no “master” process (e.g., one that collects and aggregates responses and sends them back over the connection) since every node can respond by itself.
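As a toy illustration of the stage-and-queue idea (not a full SEDA implementation), the sketch below wires two stages together with a simple bounded queue. For brevity the stages are threads in one process and the “requests” are fabricated integers; in a real deployment each stage would typically be its own process, fed by network-facing entry nodes.
```c
/*
 * Toy SEDA-style sketch: two stages connected by a bounded queue.
 * Stage 1 (the entry node) produces work items; stage 2 ("compute")
 * consumes them.  Threads and integer "requests" keep the sketch short;
 * a real deployment would use separate processes and real messages.
 */
#include <pthread.h>
#include <stdio.h>

#define QSIZE 16

static struct queue {
    int items[QSIZE];
    int head, tail, count;
    pthread_mutex_t mu;
    pthread_cond_t not_empty, not_full;
} q = { .mu = PTHREAD_MUTEX_INITIALIZER,
        .not_empty = PTHREAD_COND_INITIALIZER,
        .not_full  = PTHREAD_COND_INITIALIZER };

static void q_push(int v) {
    pthread_mutex_lock(&q.mu);
    while (q.count == QSIZE)
        pthread_cond_wait(&q.not_full, &q.mu);
    q.items[q.tail] = v;
    q.tail = (q.tail + 1) % QSIZE;
    q.count++;
    pthread_cond_signal(&q.not_empty);
    pthread_mutex_unlock(&q.mu);
}

static int q_pop(void) {
    pthread_mutex_lock(&q.mu);
    while (q.count == 0)
        pthread_cond_wait(&q.not_empty, &q.mu);
    int v = q.items[q.head];
    q.head = (q.head + 1) % QSIZE;
    q.count--;
    pthread_cond_signal(&q.not_full);
    pthread_mutex_unlock(&q.mu);
    return v;
}

/* Stage 2: the "compute" node (Task #3). */
static void *compute_stage(void *arg) {
    (void)arg;
    for (;;) {
        int req = q_pop();
        if (req == 0)                     /* sentinel: shut the stage down */
            break;
        printf("request %d -> response %d\n", req, req * req);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, compute_stage, NULL);

    /* Stage 1: the entry node.  Here it just fabricates requests; in a real
     * server it would accept connections and read requests off the network. */
    for (int i = 1; i <= 10; i++)
        q_push(i);
    q_push(0);                            /* tell stage 2 to finish */

    pthread_join(t, NULL);
    return 0;
}
```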
In theory, this model can be arbitrarily complex, with the node graph possibly having loops, connections to other similar applications, or nodes that actually execute on remote systems. In practice, though, even with well-defined messages and efficient queues, it can become unwieldy to reason about the system’s behavior as a whole. The message-passing overhead can destroy the performance of this model, compared to the SPED model, if the work being done at each node is short. The efficiency of this model is significantly lower than that of the SPED model, so it’s usually employed in situations where the payload work is complex and time-consuming.
Pros: The ultimate software architect’s dream: everything is segregated into neat independent modules.
Cons: Complexity can explode just from the sheer number of modules, and message queuing is still much slower than direct memory sharing.

The Asymmetric Multi-Process Event-Driven (AMPED) Model

The AMPED network server model is a tamer, easier-to-model version of SEDA. There are fewer distinct modules and processes, and fewer message queues. Here’s how it works:
  • Implement Tasks #1 and #2 in a single “master” process, in the SPED style. This is the only process doing network IO.
  • Implement Task #3 in a separate “worker” process (possibly started in multiple instances), connected to the master process with a queue (one queue per process).
  • When data is received in the “master” process, find an under-utilized (or idle) worker process and pass the data to its message queue. The worker notifies the master when a response is ready, at which point the master passes the response back over the connection.
The important thing here is that the payload work is performed in a fixed (usually configurable) number of processes, which is independent of the number of connections. The benefits here are that the payload can be arbitrarily complex, and it won’t affect network IO (which is good for latency). There’s also a possibility for increased security, since only a single process is doing network IO.
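Here is a toy skeleton of the master/worker wiring, assuming a fixed pool of worker processes each connected to the master by a socketpair that plays the role of its message queue. The master’s real job, an epoll loop doing all the network IO, is elided and replaced by synthetic requests.
```c
/*
 * Toy AMPED-style skeleton: a "master" process plus a fixed pool of worker
 * processes, each connected to the master by a socketpair acting as its
 * message queue.  The master's real work -- an epoll loop doing all network
 * IO (Tasks #1 and #2) -- is elided; synthetic integer requests stand in.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 4

int main(void) {
    int chan[NWORKERS];                        /* master's end of each queue */

    for (int i = 0; i < NWORKERS; i++) {
        int sp[2];
        socketpair(AF_UNIX, SOCK_STREAM, 0, sp);
        if (fork() == 0) {                     /* worker: payload only (Task #3) */
            close(sp[0]);
            int req;
            while (read(sp[1], &req, sizeof(req)) == sizeof(req)) {
                int resp = req * req;          /* stand-in for real payload work */
                write(sp[1], &resp, sizeof(resp));
            }
            _exit(0);
        }
        close(sp[1]);
        chan[i] = sp[0];
    }

    /* Master: dispatch requests to workers and relay their responses.  In a
     * real AMPED server this would be driven by an epoll loop over the
     * client connections, picking an idle worker for each piece of work. */
    for (int req = 1; req <= 8; req++) {
        int w = req % NWORKERS;
        int resp;
        write(chan[w], &req, sizeof(req));
        read(chan[w], &resp, sizeof(resp));
        printf("worker %d: request %d -> response %d\n", w, req, resp);
    }

    for (int i = 0; i < NWORKERS; i++)
        close(chan[i]);                        /* workers exit once queues close */
    while (wait(NULL) > 0) { }
    return 0;
}
```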
Pros: Very clean separation of network IO and payload work.
Cons: Utilizes a message queue for passing data back and forth between processes, which, depending on the nature of the protocol, may become a bottleneck.

The SYmmetric Multi-Process Event-Driven (SYMPED) Model

The SYMPED network server model is in many respects the “holy grail” of network server models, because it’s like having multiple instances of independent SPED “worker” processes. It is implemented by having a single process accepting connections in a loop, then passing them on to the worker processes, each of which has a SPED-like event-loop. This has some very favorable consequences:
  • The CPUs are loaded by exactly the number of processes spawned, each of which, at any point in time, is either doing network IO or payload processing; CPU utilization cannot escalate beyond that.
  • If the connections are independent (such as with HTTP), there is no interprocess communication between the worker processes.
This is, in fact, what newer versions of Nginx do; they spawn a small number of worker processes, each of which runs an event loop. To make things even better, most operating systems provide a facility by which multiple processes can listen for incoming connections on the same TCP port independently (for example, the SO_REUSEPORT socket option on Linux), eliminating the need for a process dedicated solely to handling new connections. If the application you are working on can be implemented in this way, I recommend doing so.
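As a sketch of that approach (assuming Linux and its SO_REUSEPORT option, available since kernel 3.9), the following spawns a few workers that each bind the same port; the kernel then spreads incoming connections across them. Each worker’s loop is reduced here to a blocking accept-and-echo; a real SYMPED worker would run a SPED-style event loop like the one shown earlier.
```c
/*
 * SYMPED-style sketch: N worker processes, each with its own listening
 * socket bound to the same port via SO_REUSEPORT (Linux >= 3.9), so the
 * kernel load-balances incoming connections across them.  Port 8080 and
 * the echo "payload" are placeholders; error handling is omitted.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define NWORKERS 4

static void worker(void) {
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);

    for (;;) {
        int cfd = accept(lfd, NULL, NULL);    /* the kernel picks this worker */
        char buf[4096];
        ssize_t n;
        while ((n = read(cfd, buf, sizeof(buf))) > 0)
            write(cfd, buf, (size_t)n);
        close(cfd);
    }
}

int main(void) {
    for (int i = 0; i < NWORKERS; i++)
        if (fork() == 0) {
            worker();                         /* never returns */
            _exit(0);
        }
    pause();                                  /* parent just waits */
    return 0;
}
```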
Pros: Strict upper CPU usage ceiling, with a controllable number of SPED-like loops.
Cons: Since each of the processes has a SPED-like loop, if the payload work is non-uniform, latency can again vary, just like with the normal SPED model.

Some Low-level Tricks

In addition to selecting the best architectural model for your application, there are some low-level tricks that can be used to further increase network code performance. Here’s a brief list of some of the more effective ones:
  1. Avoid dynamic memory allocation. As an explanation, simply look at the code of the popular memory allocators – they use complex data structures and mutexes, and there’s simply a lot of code in them (jemalloc, for example, is around 450 KiB of C code!). Most of the models above can be implemented with completely static (or pre-allocated) network and/or memory buffers which only change ownership between threads where needed (see the buffer pool sketch after this list).
  2. Use the maximum that the OS can provide. Most operating systems allow multiple processes to listen on the same port (or to share a single listening socket), and implement features where a connection will not be accepted until the first byte (or even the first full request!) is received on the socket. Use sendfile() if you can. (A socket-option sketch follows the list.)
  3. Understand the network protocol you are using! For example, it usually makes sense to disable Nagle’s algorithm, and it can make sense to disable lingering if the (re)connection rate is high. Learn about TCP congestion control algorithms and see if it makes sense to try one of the newer ones.
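For the first trick, the idea might look like the following sketch: a fixed pool of per-connection buffers allocated once, up front, and handed out at accept() time. The pool size, buffer size, and the linear free-slot scan are all illustrative choices.
```c
/*
 * Sketch of trick #1: a fixed, statically allocated pool of per-connection
 * buffers, so the hot path never calls malloc()/free().  Sizes are
 * illustrative; a real implementation would likely keep a free list
 * instead of the linear scan used here.
 */
#include <stddef.h>

#define MAX_CONNS 1024
#define BUF_SIZE  4096

struct conn_buf {
    char   data[BUF_SIZE];
    size_t used;
    int    in_use;
};

static struct conn_buf pool[MAX_CONNS];        /* allocated once, up front */

/* Grab a free buffer (e.g., right after accept()); NULL if the pool is full. */
static struct conn_buf *buf_acquire(void) {
    for (int i = 0; i < MAX_CONNS; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            pool[i].used = 0;
            return &pool[i];
        }
    }
    return NULL;
}

/* Return the buffer to the pool when the connection is closed. */
static void buf_release(struct conn_buf *b) {
    b->in_use = 0;
}

int main(void) {
    struct conn_buf *b = buf_acquire();        /* on accept() */
    /* ... read into b->data and track b->used while the connection lives ... */
    buf_release(b);                            /* on close() */
    return 0;
}
```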
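For the second and third tricks, most of the knobs are plain setsockopt() calls. The sketch below is Linux-flavored (TCP_DEFER_ACCEPT, TCP_CONGESTION, and SO_REUSEPORT availability and semantics vary across operating systems), and the specific values are illustrative; tune_connection_socket() would be called on each accepted socket.
```c
/*
 * Sketch of tricks #2 and #3: asking more of the OS and tuning TCP.
 * These options are Linux-flavored; names, availability, and semantics
 * vary between operating systems, and the values are illustrative.
 */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Call on the listening socket, before bind()/listen(). */
void tune_listening_socket(int lfd) {
    int one = 1;

    /* Let several worker processes bind the same port (SYMPED-style). */
    setsockopt(lfd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    /* Don't wake us up on accept() until the client has actually sent data. */
    int defer_secs = 5;
    setsockopt(lfd, IPPROTO_TCP, TCP_DEFER_ACCEPT, &defer_secs, sizeof(defer_secs));
}

/* Call on each accepted connection socket. */
void tune_connection_socket(int cfd) {
    int one = 1;

    /* Disable Nagle's algorithm: push small writes out immediately. */
    setsockopt(cfd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

    /* Skip the lingering close; useful when the (re)connection rate is high. */
    struct linger lg = { .l_onoff = 1, .l_linger = 0 };
    setsockopt(cfd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));

    /* Pick a congestion control algorithm, if the kernel offers it. */
    setsockopt(cfd, IPPROTO_TCP, TCP_CONGESTION, "bbr", strlen("bbr"));
}

int main(void) {
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    tune_listening_socket(lfd);                /* then bind(), listen(), accept() ... */
    return 0;
}
```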
I may talk more about these, as well as additional techniques and tricks to employ, in a future blog post. But for now, this hopefully provides a useful and informative foundation regarding the architectural choices for writing high performance networking code, and their relative advantages and disadvantages.
This post originally appeared on the Toptal Engineering blog