Modern high-end CPUs are pretty fast these days: an Intel Xeon E5v3 (Haswell-EP) can pack up to 18 cores and two thirds of a double-precision teraflop of floating-point power, while the 2015 Shenwei Alpha from China, with upwards of 32 vector-assisted cores per die, will crunch even more numbers per second. GPUs, meanwhile, have accelerated their own compute roadmaps, with both Nvidia (NASDAQ: NVDA) and AMD (NYSE: AMD) devices on the 2015 schedule breaking through the 3 teraflop DP ceiling. Of course, both CPUs and GPUs of this generation come with well-tuned, high-bandwidth memory systems too.
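That two-thirds figure checks out on the back of an envelope, assuming the 18-core part’s roughly 2.3 GHz base clock: Haswell’s two AVX2 FMA units deliver 16 double-precision FLOPs per core per cycle, so 18 cores × 2.3 GHz × 16 ≈ 0.66 TFLOPS peak.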
The same of course applies to Intel’s (NASDAQ: INTC) Xeon Phi compute accelerator, with next year’s 3 TFLOPS DP Knights Landing version matching nicely with the next-generation Broadwell-based Xeon E5v4. Knights Landing Xeon Phi, with its 16 GB of 3D-stacked memory on the package, will bring new levels of low-latency, ultra-high-bandwidth in-memory processing capability.
But the problems come when trying to connect these CPUs and GPUs together. The PCI Express link, used now in 99% of the cases, drastically impairs the connection: its achievable net bandwidth tops out around 20 GB/s, and its round-trip latency can reach 1 microsecond, over an order of magnitude worse than what Intel QPI, AMD HyperTransport, IBM’s POWER8 peripheral buses, or Nvidia’s NVLink deliver. For the many short transfers common in HPC, that latency can mean a lot. These other interconnects also enable coherent shared memory between all those CPUs and GPUs, rather than messaging and copying between separate memory spaces.
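To make the difference concrete, here is a minimal CUDA sketch of the two models (the kernel, array size, and scale factor are purely illustrative, not tied to any product above): in the first, every byte is staged across PCIe with explicit copies; in the second, CPU and GPU share one allocation, which a coherent link would provide directly in hardware rather than via driver-managed page migration.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

#define N (1 << 20)  // 1M doubles, an arbitrary example size

__global__ void scale(double *x, double a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) x[i] *= a;
}

int main(void) {
    // Model 1: separate memory spaces. Every transfer is an explicit
    // copy across PCIe, each paying the bus round-trip latency.
    double *h = (double *)malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) h[i] = 1.0;
    double *d;
    cudaMalloc(&d, N * sizeof(double));
    cudaMemcpy(d, h, N * sizeof(double), cudaMemcpyHostToDevice); // copy out
    scale<<<(N + 255) / 256, 256>>>(d, 2.0);
    cudaMemcpy(h, d, N * sizeof(double), cudaMemcpyDeviceToHost); // copy back
    cudaFree(d);
    free(h);

    // Model 2: one shared allocation. CUDA managed memory still migrates
    // pages over PCIe under the hood; a coherent bus like NVLink or QPI
    // would let both sides touch the same memory directly, no staging.
    double *u;
    cudaMallocManaged(&u, N * sizeof(double));
    for (int i = 0; i < N; i++) u[i] = 1.0;
    scale<<<(N + 255) / 256, 256>>>(u, 2.0);
    cudaDeviceSynchronize();
    printf("u[0] = %f\n", u[0]);  // CPU reads the GPU's result in place
    cudaFree(u);
    return 0;
}
```

The second model is where coherent interconnects earn their keep: the fewer staged copies and synchronization points a workload needs, the less often those microsecond round trips stack up.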
So, even though the 2015 Knights Landing will still have to rely on PCIe v3 for connection to its Xeon cousins, the 2016 variety could, hopefully, use the far more efficient QPI. It had better, as by then the Nvidia “Pascal” GPU generation, the one after Maxwell, will team up with IBM POWER8+ and POWER9, using a common NVLink for a tight, low-latency, shared-memory connection between IBM CPUs and Nvidia GPUs in compute environments.
Mind you, that need not apply just to some large supercomputers, but even to your own high-end Linux workstation. If the speculated OpenPOWER expansion to China bears fruit soon, and we see an inexpensive POWER8+ lookalike from there with NVLink on board, then high-speed heterogeneous workstations with shared memory and 20 to 50 TFLOPS of compute will become a reality within a year or so.
However, there’s a company that could have done it all, much earlier. You guessed it: AMD. Remember HyperTransport, the most faithful follow-on to the Alpha EV7 bus, years ahead of QPI and the like? Well, why didn’t AMD put HyperTransport on its Hawaii and later high-end GPUs, and let the GPUs coherently share each other’s memory and that of the matching Opteron CPUs? Even CrossFire setups would operate far faster and more cleanly.
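As a rough illustration of what coherent GPU-to-GPU sharing would buy (CUDA is used here only because its peer-to-peer API makes the point; AMD’s stack is analogous), today’s multi-GPU code must explicitly probe, enable, and route peer traffic over PCIe, whereas a HyperTransport-style coherent fabric would make the other board’s memory just another load/store target:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int n = 0;
    cudaGetDeviceCount(&n);
    if (n < 2) { printf("need two GPUs for this sketch\n"); return 0; }

    // Today: peer access must be explicitly probed and enabled, and the
    // traffic then flows over PCIe with its bandwidth and latency limits.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // GPU 0 may now touch GPU 1's memory
    }

    // Moving data between the boards still stages the transfer across the
    // PCIe switch. On a coherent bus, a kernel on GPU 0 could simply
    // dereference a pointer into GPU 1's memory and the fabric would keep
    // both sides consistent, with no memcpy at all.
    size_t bytes = 1 << 20;  // arbitrary 1 MB example
    double *a, *b;
    cudaSetDevice(0); cudaMalloc(&a, bytes);
    cudaSetDevice(1); cudaMalloc(&b, bytes);
    cudaMemcpyPeer(a, 0, b, 1, bytes);

    cudaFree(b);
    cudaSetDevice(0);
    cudaFree(a);
    return 0;
}
```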
It’s not too late for the company, though. If AMD does decide (hopefully) to produce top-end CPUs again, and connects them via HyperTransport to its own arrays of GPUs, it could be back in business.