Intel seems stuck at 2P, and HT still has massively lower latency, but 80 GByte/s worth of PCIe lanes is huge. As big as main-memory throughput huge! Hence DDIO, which you reference: it lets I/O write straight to cache and skip the historic data path through main memory. AFAIK AMD doesn't have anything equivalent. And they only have 16 lanes on chip; the rest come out of I/O hubs.
I'd love to see someone actually try and use all that Intel PCIe IO and report on how utilized those pipes can get. Perhaps someone wants to send the PacketShader people a box loaded with GPUs? That'd be great, thanks!
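For anyone wondering where a figure like 80 GByte/s comes from, here's a rough back-of-envelope sketch. The inputs are my assumptions, not something stated above: a 2P box with 40 PCIe 3.0 lanes per socket, 8 GT/s per lane, and 128b/130b line encoding. It ignores packet and protocol overhead, so achievable throughput would be lower.

```python
# Back-of-envelope PCIe bandwidth estimate.
# All constants below are assumptions for illustration, not figures from the thread.
PCIE3_GT_PER_S = 8.0        # PCIe 3.0 raw signalling rate per lane (gigatransfers/s)
ENCODING = 128 / 130        # PCIe 3.0 128b/130b line-encoding efficiency
LANES_PER_SOCKET = 40       # assumed on-chip lanes per socket
SOCKETS = 2                 # "2P" = two-processor system


def pcie_gbytes_per_s(lanes, gt_per_s=PCIE3_GT_PER_S, encoding=ENCODING):
    """Theoretical one-direction bandwidth in GByte/s for `lanes` lanes.

    Each transfer carries one bit per lane, so divide by 8 for bytes.
    Protocol overhead (TLP headers, flow control) is not modelled.
    """
    return lanes * gt_per_s * encoding / 8


per_lane = pcie_gbytes_per_s(1)
total = pcie_gbytes_per_s(LANES_PER_SOCKET * SOCKETS)

if __name__ == "__main__":
    print(f"per lane: {per_lane:.3f} GByte/s per direction")
    print(f"{LANES_PER_SOCKET * SOCKETS} lanes: {total:.1f} GByte/s per direction")
```

Under those assumptions it works out to a bit under 1 GByte/s per lane and roughly 79 GByte/s per direction across 80 lanes, which is in the same ballpark as the number quoted.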
Cool project! I wonder if you'd get similar perf from CPUs if you could use Intel's ISPC compiler[0] with the same GPU algorithms. I've found that GPU algorithms also perform substantially better on plain old CPUs, IMO because they use memory bandwidth more effectively.
I too would like to see how far those PCI Express buses can be pushed. :)
BTW, we're adopting Intel's DPDK[1] approach to get massive packet processing performance on a single machine. So far we're liking it, but we'll see, as it's not in production yet.
http://shader.kaist.edu/packetshader/