Today (2013), $1200 will buy you a computer with 8 cores, 64 gigabytes of RAM, 10-gbps Ethernet, and a solid state drive. Such systems should be able to handle:
- 10 million concurrent connections
- 10 gigabits/second
- 10 million packets/second
- 10 microsecond latency
- 10 microsecond jitter
- 1 million connections/second
Such systems already exist. We call them "hardware" or "devices". But in recent years, devices have become just software running on general purpose computers. Rip open a network device and you might find a personal computer inside. The limits of scalability aren't our hardware, but the software, specifically, the operating system.
There are three primary problems.
The first problem is that there is no "fast path" for data. It's a conflict between a "multi-user" design and a "single-purpose" design. There is no way for the primary service (such as a web server) to get priority on the system, leaving everything else (like the SSH console) as a secondary priority.
The second problem is multi-core scalability. Old code is written according to "multi-threaded" or "multi-tasking" principles, where multiple tasks must share a single CPU. New code must be written using different principles, where a single task must be split across multiple CPUs. This requires a ground-up rethink, such as synchronization that never causes a thread to stop and wait.
The third problem is memory. In the last 15 years, the size of main memory has gone up a thousand times, where it's now practical to get 128-gigabytes in a cheap desktop computer. Yet, L1 and L2 caches have remained the same size, with L3 (or last level) caches only going up 10 times. This means at scale, every pointer is a cache miss. To fix this, code must pay attention to the issues of caching and paging.
The solution to all these problems is to partition the machine into two parts: the "control" part and the "data" part. The data-partition grabs the network cards it wants and replaces the drivers with ones that bypass the kernel. The data-partition grabs all the CPU cores it wants and removes them from the operating system scheduler, so no other tasks will run on them. The data-partition grabs the bulk of the RAM and manages it itself.
A decade ago, engineers tackled what they called "the c10k problem" of making servers handle 10 thousand simultaneous connections. The conclusion was to fix the kernels, and write applications in an asynchronous manner. Today with C10M, we start with asynchronous code, but instead of better kernels, we move our code completely out of the kernel.