IBM Nvidia DOE Supercomputer
The Department of Energy announced that it has granted $425 million to build two new supercomputers at Oak Ridge National Laboratory and Lawrence Livermore National Laboratory. The systems are part of the broader CORAL initiative, a collaboration between Oak Ridge, Argonne and Lawrence Livermore. Of that total, $325 million will be spent on building the supercomputers themselves, while an additional $100 million will go to the FastForward2 program, which is designed to encourage and enable hardware vendors to increase performance and efficiency for the next generation.
The first supercomputer, to be known as Summit, will be installed at Oak Ridge National Laboratory and will replace the existing ‘Titan’ supercomputer, which is capable of a peak performance of 27 petaflops (PFLOPS). Summit will be capable of delivering between 150 and 300 peak PFLOPS and will be used for ‘open science’. The Sierra supercomputer, designed to replace the existing Sequoia, will be used for nuclear security simulations and will be capable of speeds in excess of 100 PFLOPS. Both systems will be faster than the world’s current fastest supercomputer, Tianhe-2 in China, which clocks in at 55 PFLOPS of peak performance. Argonne’s hardware win has not yet been announced and will be unveiled at a later date (UPDATE: the contract for Aurora and Theta went to Intel).
To achieve this level of performance, the laboratories participating in the CORAL initiative are harnessing IBM’s (NYSE: IBM) Power 9 architecture CPUs and Nvidia’s (NASDAQ: NVDA) yet-to-be-announced Volta GPUs. Since this machine is expected to come online in 2017, we can very likely expect to see Volta GPUs in 2017 as well. The project will use Mellanox’s interconnect technologies to connect the systems together, but the GPUs will be connected to the CPUs using Nvidia’s own NVLink interconnect. NVLink is Nvidia’s proprietary interconnect, designed specifically to increase the communication speed between GPUs, and Nvidia is working with IBM to get it embedded directly into the IBM Power CPUs that will power these supercomputer designs. Additionally, the Summit supercomputer will use IBM Elastic Storage, built on GPFS technology, and will store 120 petabytes of data.
Summit as a whole will use only 10% more power than Titan but will deliver approximately 5-10 times the performance of Titan, illustrating where supercomputer designs are headed and how the Department of Energy is trying to drive large performance increases while also promoting energy efficiency. The expected performance for Summit has already been stated to be between 150 and 300 petaflops, and that figure comes from more than 3,400 compute nodes, each delivering 40 teraflops of performance on its own. Each node will consist of IBM Power 9 CPU(s) and Nvidia Volta GPU(s). Unfortunately, we do not know whether each node will be a dual-processor node or how many GPUs will fit into each node, but the expectation would be a dual-processor node with at least 2 GPUs per node.
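The arithmetic behind those figures can be sketched in a few lines of Python. This is only a back-of-the-envelope check using the numbers quoted in this article. The function and constant names are illustrative, and the node count and per-node figure are treated as lower bounds because the article gives them as "over" 3,400 nodes.

```python
# Figures quoted in the article. Names are illustrative, not official.
NODES = 3400                    # "over 3400 compute nodes", so a lower bound
TFLOPS_PER_NODE = 40            # per-node figure quoted in the article
SUMMIT_PEAK_PFLOPS = (150, 300) # stated peak range for Summit
POWER_RATIO = 1.10              # Summit power relative to Titan ("10% more")
SPEEDUP_RANGE = (5, 10)         # Summit performance relative to Titan


def aggregate_pflops(nodes, tflops_per_node):
    """Total peak in petaflops (1 PFLOPS = 1000 TFLOPS)."""
    return nodes * tflops_per_node / 1000


def per_node_tflops_needed(target_pflops, nodes):
    """Per-node teraflops required to reach a system-wide target."""
    return target_pflops * 1000 / nodes


def perf_per_watt_gain(speedup, power_ratio):
    """Improvement in performance per watt relative to the older system."""
    return speedup / power_ratio


if __name__ == "__main__":
    print(f"Quoted figures give {aggregate_pflops(NODES, TFLOPS_PER_NODE):.0f} PFLOPS")
    for target in SUMMIT_PEAK_PFLOPS:
        need = per_node_tflops_needed(target, NODES)
        print(f"{target} PFLOPS needs about {need:.1f} TFLOPS per node at {NODES} nodes")
    for speedup in SPEEDUP_RANGE:
        gain = perf_per_watt_gain(speedup, POWER_RATIO)
        print(f"{speedup}x performance at 1.1x power is about {gain:.1f}x performance per watt")
```

Taken at face value, 3,400 nodes at 40 teraflops each come to 136 petaflops, a little under the 150-petaflop floor. That suggests the final node count, the per-node throughput, or both will end up higher than the minimums quoted here.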
This hardware win is a huge one for IBM and Nvidia because it shows that the OpenPOWER partnership between the two companies is working and that it can enable IBM to ship more CPUs. It is also a very big deal because this is the first supercomputer in the US in a long time that will be built without either Intel or AMD CPUs. It also means that Nvidia will finally make use of NVLink, which the company announced will arrive with the Pascal GPU, the predecessor to Volta. Nvidia has already said we can expect to see Pascal in 2016, which means the transition from Pascal to Volta will be a fairly quick one.
The Department of Energy has stated that the whole purpose of these new supercomputer designs is to enable exascale computing. Both CORAL and FastForward2 are supposed to enable hardware manufacturers to help their customers build efficient and powerful supercomputers capable of over 1 exaflop (or 1000 petaflops). If they can get the Summit supercomputer to 300 petaflops, that will be a huge step toward exascale computing and an exaflop supercomputer.
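To put that gap in numbers, here is a minimal sketch using only the figures above (1 exaflop = 1000 petaflops and Summit's stated 150 to 300 petaflop range). The names are illustrative.

```python
EXAFLOP_IN_PFLOPS = 1000  # 1 exaflop = 1000 petaflops, as stated above


def fraction_of_exaflop(pflops):
    """Share of one exaflop that a given peak petaflop figure represents."""
    return pflops / EXAFLOP_IN_PFLOPS


def remaining_factor(pflops):
    """How many times faster a system must get to reach one exaflop."""
    return EXAFLOP_IN_PFLOPS / pflops


if __name__ == "__main__":
    for peak in (150, 300):
        print(f"{peak} PFLOPS is {fraction_of_exaflop(peak):.0%} of an exaflop; "
              f"about {remaining_factor(peak):.1f}x more is needed")
```

Even at the top of its range, Summit would sit at 30% of an exaflop, so a further increase of roughly 3.3 times would still be required.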