At the 2016 GPU Technology Conference, Nvidia finally unveiled the Pascal GPU architecture. Perhaps the most interesting aspect of the GPU is not the set of capabilities Pascal brings, but rather the first non-Intel-driven high-end, high-bandwidth interface since AMD launched HyperTransport in 2001. The NVLink standard debuted in 2014, when IBM announced its tie-up with Nvidia to bring the high-speed interconnect to market.
The goal of NVLink is to free Nvidia's future GPU architectures from dependence on PCI Express and achieve maximum bandwidth. If NVLink were replaced entirely with PCIe lanes, the design simply would not be as efficient in terms of the number of lanes needed, and it would deliver roughly 20% less bandwidth (a 28 GB/s deficit).
While all future Nvidia GPUs will support PCI Express in add-in board form, Tesla P100 is a prototype of how high-end compute products will look in the future, reminiscent of slotted processors such as the original Pentium II, the AMD Athlon and the like. Nvidia's new design calls for packing as much compute power as possible into a small package that delivers both data and power through a dual LGA arrangement.
NVLink Interconnect

NVLink was created out of a need to feed the 15-billion-transistor Pascal processor. PCIe Gen3 x16 offers 32 GB/s of bidirectional bandwidth, which is little compared to the 80 TB/s of aggregate on-chip bandwidth, or to the 720 GB/s of external bandwidth delivered by the 16 GB of HBM2 memory.
Perhaps the most important aspect is that the bandwidth can be aggregated: 'ganging', or teaming, the links delivers 120 GB/s between the GPUs, while 40 GB/s feeds into a PCIe Gen3 switch to connect to the CPU.
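As a back-of-the-envelope sketch of that ganging arithmetic (using the 40 GB/s bidirectional per-link figure Nvidia quotes for NVLink 1.0; the three-plus-one split of the links is our assumption for illustration):

```python
# Back-of-the-envelope NVLink 'ganging' arithmetic (hypothetical 3+1 link split).
NVLINK_BW = 40            # GB/s bidirectional per NVLink 1.0 link
PCIE_GEN3_X16_BW = 32     # GB/s bidirectional for a PCIe Gen3 x16 connection

links_per_gpu = 4                   # each Tesla P100 exposes four NVLink links
links_ganged_to_peers = 3           # assumption: three links teamed towards peer GPUs

gpu_to_gpu_bw = links_ganged_to_peers * NVLINK_BW                     # 3 x 40 = 120 GB/s
cpu_facing_bw = (links_per_gpu - links_ganged_to_peers) * NVLINK_BW   # 1 x 40 = 40 GB/s

print(f"Between GPUs (ganged): {gpu_to_gpu_bw} GB/s")
print(f"Towards the CPU:       {cpu_facing_bw} GB/s, capped at {PCIE_GEN3_X16_BW} GB/s by the PCIe switch")
```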
Nvidia connects up to four GPUs in a single cluster in which each GPU can address the memory inside every other chip; a full system is designed around two such fully connected 'quads'.
The P100 ‘quad’ comes with amazing specifications:
- 14,336 CUDA Cores
- 64 GB HBM2 Memory
- 2.88 TB/s Aggregate Bandwidth
- 640 GB/s Aggregate Link Bandwidth
- 42.4 TFLOPS FP32 Single Precision
- 21.2 TFLOPS FP64 Double Precision
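Those figures are simply four Tesla P100s summed; a quick sanity check against Nvidia's published per-GPU numbers (3,584 CUDA cores, 16 GB HBM2, 720 GB/s memory bandwidth, 160 GB/s aggregate NVLink bandwidth, 10.6/5.3 TFLOPS FP32/FP64) bears this out:

```python
# Sanity check: the 'quad' specifications are four Tesla P100s summed.
p100 = {
    "cuda_cores": 3584,
    "hbm2_gb": 16,
    "mem_bw_gb_s": 720,      # HBM2 bandwidth per GPU
    "nvlink_bw_gb_s": 160,   # aggregate across the four NVLink links of one GPU
    "fp32_tflops": 10.6,
    "fp64_tflops": 5.3,
}

quad = {key: round(value * 4, 2) for key, value in p100.items()}
print(quad)
# {'cuda_cores': 14336, 'hbm2_gb': 64, 'mem_bw_gb_s': 2880,
#  'nvlink_bw_gb_s': 640, 'fp32_tflops': 42.4, 'fp64_tflops': 21.2}
```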
Dual Approach to CPUs: PCIe Switch or Direct Link inside IBM’s POWER
The first iteration of the Tesla P100 architecture will rely on Intel Xeon E5 v4 processors, but that is set to change over the course of 2017. Nvidia's DGX-1 and partner systems will have to rely on PCI Express switches, which take the 40 GB/s NVLink connections and downgrade them to 32 GB/s in order to connect to a traditional Xeon E5 processor architecture.
IBM's OpenPOWER initiative is hosting its annual conference alongside the GPU Technology Conference, where it is discussing a 'switch-less' approach for POWER processors. As it stands today, production OpenPOWER systems should incorporate direct NVLink connections by the end of 2017. We would not be surprised if both Pascal and Volta GPUs achieve a low double-digit performance improvement over Intel-based servers. Removing the PCIe switch will enable direct load/store access to CPU memory. Given that Pascal can address all memory within a 49-bit virtual address field (48 bits covering the CPU's address space, plus one additional bit for the GPUs), we should see load balancing between the four Pascal GPUs and all the resources offered by the CPU, treating system memory as a form of fourth-level cache/memory (L1 + intra-caches + scratch cache / L2 / HBM2 / system RAM).
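For context on that 49-bit figure, a minimal bit of address-space arithmetic (assuming the standard 48-bit x86-64 virtual address space on the CPU side): one extra bit doubles the reachable space, leaving room for the CPU's entire memory map plus the GPUs' own.

```python
# Address-space arithmetic behind the 49-bit unified memory claim.
TIB = 2 ** 40

cpu_va = 2 ** 48   # 48-bit virtual address space of a contemporary x86-64 CPU
gpu_va = 2 ** 49   # Pascal's 49-bit virtual address space

print(cpu_va // TIB)             # 256 TiB reachable by the CPU
print(gpu_va // TIB)             # 512 TiB reachable by the GPU
print((gpu_va - cpu_va) // TIB)  # 256 TiB left for HBM2 and peer-GPU mappings
```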
NVLink represents the final stage of Nvidia growing up, from its humble beginnings on the original PCI, through AGP and PCI Express. In 2016, the company finally launched its own interconnect, enabling performance beyond industry norms.