Thursday, February 21, 2013

About the Shmoo posts

I've been writing posts based on my Shmoocon talk. They are under the label "Shmoocon2013". Click on that link to go through the posts. I should have them all posted by March 1, 2013.

I'm also linking all those posts under the "Table of Contents". Click on "Contents" above to see the links in a structured view.

Thursday, February 14, 2013

C10M in the 1990s

I describe C10M in terms of the modern Internet of 2013, where desktops can have 128 gigabytes of RAM, and servers must handle millions of TCP connections. But there is nothing terribly new about the fundamental principles. They've been used since at least the 1990s.

Back in April of 1998, I started coding "BlackICE", which was the first "intrusion prevention system" or "IPS". Our first product actually shipped on Windows, though it also worked for Linux.

Network stacks were exceptionally poor back then. A Pentium Pro could handle about 7000 interrupts per second, so any operating system (Linux, Windows, etc.) could be shut down by sending packets at that rate. This locked the CPU in the interrupt handler, so not even the mouse moved. Since then, operating systems have changed their drivers to switch to polling at high packet rates, so this attack is no longer possible, but back then it was a big concern.

Thus, we were left with no choice but to write our own hardware drivers. Our solution looked essentially like PF_RING does today: configure the network adapter to DMA directly into a user-mode ring-buffer. I don't know enough about PF_RING to be sure, but I think there were some differences.

For example, we made our ring-buffers overly large. The first gigabit NICs from Intel only supported 256 descriptors. The network drivers for Windows and Linux therefore only allocated 256 descriptors in their kernel ring buffers. Our driver allocated 10,000 descriptors. This meant that at 1,488,000 packets/second, our driver wouldn't drop any packets, even if the code was delayed by more than 256 packets.
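To put rough numbers on that, here is a back-of-the-envelope sketch of how long the packet-processing code can stall before a ring of a given size overflows. It assumes worst-case minimum-size Ethernet frames (64 bytes plus 20 bytes of preamble and inter-frame gap on the wire), which is where the 1,488,000 packets/second figure comes from. The function names are made up for illustration.

```python
# Back-of-the-envelope sizing for a receive ring at gigabit line rate.
# Assumption: minimum-size Ethernet frames, 64 bytes plus 20 bytes of
# preamble and inter-frame gap on the wire.

LINK_BITS_PER_SEC = 1_000_000_000
WIRE_BYTES_PER_MIN_FRAME = 64 + 20


def max_packets_per_second(link_bps=LINK_BITS_PER_SEC,
                           wire_bytes=WIRE_BYTES_PER_MIN_FRAME):
    """Worst-case packet rate when every frame is minimum size."""
    return link_bps // (wire_bytes * 8)


def stall_budget_us(descriptors, pps):
    """Microseconds the consumer can stall before the ring overflows."""
    return descriptors / pps * 1_000_000


if __name__ == "__main__":
    pps = max_packets_per_second()
    print(f"line rate: {pps} packets/second")
    for ring in (256, 10_000):
        print(f"{ring:>6} descriptors: {stall_budget_us(ring, pps):8.1f} us")
```

Under these assumptions, a 256-descriptor ring buys well under a millisecond of slack, while 10,000 descriptors buys several milliseconds.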

Another difference was for transmit. The application was an IPS that transmitted packets. We re-transmitted them directly from the receive queue. In other words, one adapter's receive-buffer was the other adapter's transmit-buffer. I'm not sure if PF_RING can work in that way.

Back then, dual-socket Pentium Pro systems were the target. Therefore, we didn't need multi-core scalability since we had only two cores. But we still used the same principles. We marked one core for exclusive use by the "packet processing" thread running the IDS code. On Windows, this was done by marking that thread "real time" priority, meaning it had priority over everything else in the system. Everything else, including the driver thread, ran on the other core.

Because packets arrived through the ring-buffer, we didn't need to use the "lock" prefix for synchronization. Since the Pentium Pros were on a shared bus, the x86 "lock" was a full memory transaction, and thus created a noticeable performance impact on the system.
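The reason a ring-buffer can get away without locked instructions is that it has exactly one producer and one consumer, and each index has exactly one writer. The following is only a sketch of that index discipline, not the BlackICE driver: the class and method names are hypothetical, and a real implementation in C would also have to respect the memory-ordering rules of the CPU and of the DMA hardware.

```python
class SpscRing:
    """Single-producer, single-consumer ring buffer (illustrative sketch).

    Only the producer writes `head` and only the consumer writes `tail`,
    so neither index is ever modified by two parties at once.
    """

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.capacity = capacity
        self.head = 0  # next slot to fill, written by the producer only
        self.tail = 0  # next slot to drain, written by the consumer only

    def push(self, packet):
        """Producer side. Returns False (packet dropped) when the ring is full."""
        if self.head - self.tail == self.capacity:
            return False
        self.slots[self.head % self.capacity] = packet
        self.head += 1  # publish only after the slot has been filled
        return True

    def pop(self):
        """Consumer side. Returns None when the ring is empty."""
        if self.tail == self.head:
            return None
        packet = self.slots[self.tail % self.capacity]
        self.tail += 1  # hand the slot back to the producer
        return packet
```

Each side reads the other side's index but never writes it, so there is no read-modify-write on shared state, which is the operation that would otherwise need the "lock" prefix.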

Back then, incoming packets DMAed by the driver invalidated the cache entries. Thus, reading packets was a cache miss. (These days, they are DMAed directly into the cache.) There were multiple cache misses, once for the header information, and again for the packet data. Therefore, we co-located the info, putting the descriptor information right next to the packet data. I remember spending a day debugging why reading byte 20 of the packet suddenly caused performance to drop off. That's because at byte 20, we reached the next cache line. The 44 bytes before it contained all the other info about the packet, like packet length, timestamp, and so forth.
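The byte-20 boundary falls out of simple arithmetic. The sketch below assumes the 44 bytes of descriptor info sit immediately before the packet data and that each buffer starts on a cache-line boundary; both are assumptions for illustration, as is the helper name. It checks 32-byte and 64-byte cache lines, since 44 + 20 = 64 is a line boundary for either size.

```python
HEADER_BYTES = 44  # descriptor info (length, timestamp, ...) per the post


def cache_line_of(packet_byte, line_size, header_bytes=HEADER_BYTES):
    """Index of the cache line holding a given byte of packet data.

    Assumes the buffer starts on a cache-line boundary and the header
    sits immediately before the packet data.
    """
    return (header_bytes + packet_byte) // line_size


if __name__ == "__main__":
    for line_size in (32, 64):
        before = cache_line_of(19, line_size)
        after = cache_line_of(20, line_size)
        print(f"{line_size}-byte lines: byte 19 in line {before}, "
              f"byte 20 in line {after}")
```

For both line sizes, packet byte 19 and packet byte 20 land in different cache lines, so touching byte 20 costs one more miss.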

The Pentium III was a major performance boost for the code. It had better automatic prefetching logic, as well as manual prefetch instructions. We found that the ideal wasn't to prefetch the next packet, but to issue a prefetch two packets ahead. These days, Intel puts the packets directly into the cache for me, which really annoys me after all the work I did to figure out how to do this myself.

I had a conversation once with the creator of the PIX firewall. He related going through similar issues. The major difference we had was that BlackICE was built on top of existing operating systems, just reserving the resources for its exclusive use. PIX, on the other hand, used its own custom operating system. The guy tried to convince me of the superiority of having a custom operating system, but since I had 100% exclusive use of the resources, I didn't see the benefit. I suppose if I had only a single CPU, I'd need a custom operating system, because otherwise packet jitter would get real bad, but since my packet processing threads are on their own dedicated CPUs, that's not a problem.

Wednesday, February 13, 2013

Wimpy cores and scale

The first question I got about my C10M Shmoocon presentation was: "But doesn't wimpy cores solve scale?". The idea here is that instead of a small number of "beefy" servers, we build a large number of "wimpy" servers using ARM or Atom processors. These are the low-speed/low-power processors found in mobile phones.

There is no power advantage

A common misconception is that ARM processors are more power efficient. They aren't. They consume the same watts per computation as other processors. Instead, they achieve their goal of low-power for mobile devices simply by being slower. When enough wimpy ARM processors are combined to equal the performance of a single beefy Intel x86 processor, the combination uses just as much power.

Or more. All the interconnections with Ethernet, SATA, RAM, drives, and USB consume a lot of power per device. That's what's interesting about the Raspberry Pi hobbyist computers based on ARM. The model "B" that adds a USB port and Ethernet port consumes more than twice the electrical power of the model "A".

Wimpy ARM computers fighting against the dominant Intel x86 sparks our imagination like flying cars. It sounds really cool, but the math just doesn't pan out.

The scalability advantage

While wimpy ARM servers don't have a power advantage, they may have a scalability advantage. As we all know, Apache doesn't scale with a lot of connections to the server. Doubling the number of CPU cores inside a server won't help. But, doubling the number of servers will.

In other words, consider an Apache server that handles 5k connections. Doubling the number of cores within the server will only increase that to 6k connections. However, two separate servers can handle 5k connections each, or 10k total.

Thus, many wimpy ARM servers running Apache are preferable to a single beefy server.

However, the same effect can be achieved with VMs on a beefy server. If you reach Apache's scalability problems, just create multiple VMs on the same server. So this isn't necessarily an inherent advantage of wimpiness.

In any case, with our wimpy Apache array, we still need a load balancer in front of it. That's going to be a beefy computer running something like nginx. It's likely that this server is beefy enough to just run our application in the first place, had we written it in a scalable asynchronous manner to begin with instead of using Apache threads. In other words, the wimpy servers haven't so much solved the scalability problem as moved it around.

Scalability is not just Apache

Lots of things besides Apache have scalability issues, and not all of them can be broken down into multiple servers.

Consider the .com DNS server. There are 100 million domains. Assuming 100 bytes per domain, that means these DNS servers need 10 gigabytes of RAM. That's not possible with today's 32-bit ARM servers. Thus, instead of thousands of ARM servers using anycast, you still need a smaller number of beefier servers.
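The arithmetic is easy to check. This sketch uses the figures from the paragraph above and assumes the limiting factor is the 4 GiB that a 32-bit pointer can address; the constant and function names are made up for illustration.

```python
DOMAINS = 100_000_000          # figure from the post
BYTES_PER_DOMAIN = 100         # figure from the post
ADDRESS_SPACE_32BIT = 2 ** 32  # bytes addressable with a 32-bit pointer


def zone_bytes(domains=DOMAINS, bytes_per_domain=BYTES_PER_DOMAIN):
    """Total RAM needed to hold every domain record in memory."""
    return domains * bytes_per_domain


def fits_in_32bit(total_bytes):
    """True if the data set fits in a 32-bit address space."""
    return total_bytes <= ADDRESS_SPACE_32BIT


if __name__ == "__main__":
    total = zone_bytes()
    print(f"zone data: {total / 10**9:.0f} GB")
    print(f"fits in a 32-bit address space: {fits_in_32bit(total)}")
```

At 10 gigabytes, the zone data is more than twice what a 32-bit address space can hold, before counting the operating system or anything else.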


While scalable servers like nginx are rapidly becoming more popular, Apache is still the dominant web platform. Arrays of wimpy computers (or VMs on beefy computers) will be a way of dealing with Apache. The smarter alternative is to just write better software.