Thursday, February 14, 2013

C10M in the 1990s

I describe C10M in terms of the modern Internet of 2013, where desktops can have 128 gigabytes of RAM and servers must handle millions of TCP connections. But there is nothing terribly new about the fundamental principles. They've been used since at least the 1990s.

Back in April of 1998, I started coding "BlackICE", which was the first "intrusion prevention system" or "IPS". Our first product actually shipped on Windows, though it also worked for Linux.

Network stacks were exceptionally poor back then. A Pentium Pro could handle about 7000 interrupts per second, so any operating system (Linux, Windows, etc.) could be shut down by sending packets at that rate. This locked the CPU in the interrupt handler, so not even the mouse moved. Since then, operating systems have changed their drivers to switch to polling at high packet rates, so this attack no longer works, but back then it was a big concern.

Thus, we were left with no choice but to write our own hardware drivers. Our solution looked essentially like PF_RING does today: configure the network adapter to DMA directly into a user-mode ring-buffer. I don't know enough about PF_RING to be sure, but I think there were some differences.

For example, we made our ring-buffers overly large. The first gigabit NICs from Intel only supported 256 descriptors. The network drivers for Windows and Linux therefore only allocated 256 descriptors in their kernel ring buffers. Our driver allocated 10,000 descriptors. This meant that at 1,488,000 packets/second, our driver wouldn't drop any packets, even if the code fell more than 256 packets behind.
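
To put rough numbers on that, here is a small C sketch of the arithmetic. The packet rate and the two ring sizes come from the text above; the helper name `ring_slack_us` is made up for illustration, and the model assumes packets arrive at a perfectly steady rate.

```c
#include <assert.h>

/* Microseconds the consumer can stall before a ring with `descriptors`
 * slots overflows, given a steady arrival rate of `pps` packets/second.
 * (Hypothetical helper, not taken from any driver.) */
double ring_slack_us(unsigned descriptors, double pps)
{
    assert(pps > 0.0);
    return (double)descriptors / pps * 1e6;
}
```

At 1,488,000 packets/second, `ring_slack_us(256, 1488000.0)` comes to about 172 microseconds, while `ring_slack_us(10000, 1488000.0)` comes to about 6.7 milliseconds.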

Another difference was for transmit. The application was an IPS that transmitted packets. We re-transmitted them directly from the receive queue. In other words, one adapter's receive-buffer was the other adapter's transmit-buffer. I'm not sure if PF_RING can work in that way.
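
The idea of sharing one buffer between receive and transmit can be sketched as a single ring with three cursors. This is a simplified, single-threaded model and not the BlackICE driver: the struct layout, the slot count, and the names `rx_claim`, `ips_pass`, and `tx_take` are all assumptions made for the example, and real hardware would advance the receive and transmit cursors itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 8u  /* power of two so cursors wrap with a mask */
static_assert((SLOTS & (SLOTS - 1)) == 0, "SLOTS must be a power of two");

struct slot { uint16_t len; uint8_t data[1514]; };

struct shared_ring {
    struct slot slots[SLOTS];
    uint32_t rx_next;   /* next slot the receiving adapter fills   */
    uint32_t ips_next;  /* next slot the IPS code inspects         */
    uint32_t tx_next;   /* next slot the sending adapter transmits */
};

/* Receive side: claim the next free slot, or NULL if the ring is full. */
struct slot *rx_claim(struct shared_ring *r)
{
    if (r->rx_next - r->tx_next == SLOTS)
        return NULL;
    return &r->slots[r->rx_next++ & (SLOTS - 1)];
}

/* IPS: hand the next received packet to the inspection code, or NULL. */
struct slot *ips_pass(struct shared_ring *r)
{
    if (r->ips_next == r->rx_next)
        return NULL;
    return &r->slots[r->ips_next++ & (SLOTS - 1)];
}

/* Transmit side: next inspected slot to send, or NULL. Taking it frees
 * the slot for reuse by the receive side. */
struct slot *tx_take(struct shared_ring *r)
{
    if (r->tx_next == r->ips_next)
        return NULL;
    return &r->slots[r->tx_next++ & (SLOTS - 1)];
}
```

The pointer returned by `tx_take` is the same one `rx_claim` handed out earlier, so the packet bytes are never copied between the two adapters.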

Back then, dual-socket Pentium Pro systems were the target. Therefore, we didn't need multi-core scalability since we had only two cores. But we still used the same principles. We marked one core for exclusive use by the "packet processing" thread running the IDS code. On Windows, this was done by marking that thread "real time" priority, meaning it had priority over everything else in the system. Everything else, including the driver thread, ran on the other core.
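
For comparison, reserving a core for one thread on a present-day Linux system might look roughly like the sketch below. This is a Linux analogue and not what BlackICE did on Windows; the function names are invented for the example, and it only sets CPU affinity. Moving the thread to a real-time scheduling class (for example `SCHED_FIFO` via `sched_setscheduler`) usually needs elevated privileges, so that step is left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread may run on, or -1. */
int first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &allowed))
            return cpu;
    return -1;
}

/* Restrict the calling thread to a single CPU. Returns 0 on success. */
int pin_to_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t one;
    CPU_ZERO(&one);
    CPU_SET(cpu, &one);
    return sched_setaffinity(0, sizeof one, &one);  /* pid 0 = caller */
}
```

A packet-processing thread would call `pin_to_cpu` once at startup, with every other thread kept off that CPU by its own affinity mask.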

Because packets arrived through the ring-buffer, we didn't need to use the "lock" prefix for synchronization. Since the Pentium Pros were on a shared bus, the x86 "lock" was a full memory transaction, and thus created a noticeable performance impact on the system.
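
In today's terms, a ring with exactly one producer and one consumer can be written with C11 acquire/release atomics, which on x86 typically compile to plain loads and stores with no "lock" prefix. The sketch below is a generic single-producer/single-consumer queue written for illustration, not the original code; the type and function names are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 1024u
static_assert((RING_SLOTS & (RING_SLOTS - 1)) == 0, "must be a power of two");

struct spsc_ring {
    _Atomic uint32_t head;   /* written only by the producer */
    _Atomic uint32_t tail;   /* written only by the consumer */
    void *slot[RING_SLOTS];
};

/* Producer side: returns false when the ring is full. */
bool ring_push(struct spsc_ring *r, void *pkt)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return false;
    r->slot[head & (RING_SLOTS - 1)] = pkt;
    /* Release store: the slot write becomes visible before the new head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns NULL when the ring is empty. */
void *ring_pop(struct spsc_ring *r)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return NULL;
    void *pkt = r->slot[tail & (RING_SLOTS - 1)];
    /* Release store: the slot read completes before the slot is freed. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return pkt;
}
```

Each cursor has a single writer, so neither side ever needs an atomic read-modify-write operation.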

Back then, incoming packets DMAed by the driver invalidated the cache entries. Thus, reading packets was a cache miss. (These days, they are DMAed directly into the cache.) There were multiple cache misses, once for the header information, and again for the packet data. Therefore, we co-located the info, putting the descriptor information right next to the packet data. I remember spending a day debugging why reading byte 20 of the packet suddenly caused performance to drop off. That's because at byte 20, we reached the next cache line. The 44 bytes next to the packet data contained all the other info from the packet, like packet length, timestamp, and so forth.
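
The byte-20 effect falls out of the layout arithmetic. The sketch below assumes the 44 bytes of descriptor info sit immediately before the packet data and that each slot starts on a cache-line boundary; the field breakdown inside those 44 bytes is invented, and only the total size comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pkt_slot {
    /* 44 bytes of per-packet info (hypothetical field split). */
    uint64_t timestamp;
    uint32_t length;
    uint8_t  other_info[32];
    /* Packet bytes follow immediately. */
    uint8_t  data[1516];
};
static_assert(offsetof(struct pkt_slot, data) == 44, "header must be 44 bytes");

/* Index of the cache line, counted from the start of the slot, that holds
 * byte `pkt_offset` of the packet data. Assumes the slot is line-aligned. */
size_t cache_line_index(size_t pkt_offset, size_t line_size)
{
    return (offsetof(struct pkt_slot, data) + pkt_offset) / line_size;
}
```

Packet byte 19 lands at slot offset 63 and byte 20 at offset 64, so byte 20 is the first one on a new cache line whether lines are 32 or 64 bytes long.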

The Pentium III was a major performance boost for the code. It had better automatic prefetching logic, as well as manual prefetch instructions. We found that the ideal wasn't to prefetch the next packet, but to issue a prefetch two packets ahead. These days, Intel putting the packets directly into the cache for me really annoys me, after all the work I did to figure out how to do this myself.
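
A minimal sketch of prefetching two packets ahead, using the GCC/Clang `__builtin_prefetch` builtin rather than the original hand-written instructions. The `struct pkt` layout and the byte-summing `inspect` function are stand-ins invented for the example; only the distance of two comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_DISTANCE 2  /* "two packets ahead" */

struct pkt { uint32_t length; const uint8_t *data; };

/* Stand-in for real per-packet work: sums the packet bytes. */
uint32_t inspect(const struct pkt *p)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < p->length; i++)
        sum += p->data[i];
    return sum;
}

/* Process packets in order, asking the CPU to start loading the data of
 * the packet two positions ahead while the current one is inspected. */
uint64_t process_batch(const struct pkt *pkts, size_t n)
{
    assert(pkts != NULL || n == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(pkts[i + PREFETCH_DISTANCE].data, 0, 3);
        total += inspect(&pkts[i]);
    }
    return total;
}
```

The prefetch is only a hint, so the result is the same with or without it; the best distance depends on how long each packet takes to process relative to memory latency.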

I had a conversation once with the creator of the PIX firewall. He related going through similar issues. The major difference we had was that BlackICE was built on top of existing operating systems, just reserving the resources for its exclusive use. PIX, on the other hand, used its own custom operating system. The guy tried to convince me of the superiority of having a custom operating system, but since I had 100% exclusive use of the resources, I didn't see the benefit. I suppose if I had only a single CPU, I'd need a custom operating system, because otherwise packet jitter would get real bad, but since my packet processing threads are on their own dedicated CPUs, that's not a problem.