At the recently held 2015 Hot Chips conference, Avinash Sodani (KNL Chief Architect, Senior Principal Engineer, Intel) described how Intel plans to expand the Xeon Phi lineup from a server-only PCIe card concept into three different packages that would appeal to workstation and server customers in different fields. At the SC’15 conference in Austin, TX, Intel finally confirmed the strategy: it is coming out with a workstation product that will feature a fully enabled Knights Landing (KNL) many-core processor.
In the first half of 2016, the company will ship an Intel-built, Intel-branded workstation powered by a self-booting Xeon Phi processor. The processor will be able to boot standard operating systems such as Linux distributions or Microsoft Windows. The main purpose of the workstation is to ‘enable researchers to develop and test their code before the deployment inside the supercomputer environment.’
Thus, this is not the discrete PCIe card that most news organizations are showing pictures of. It is a computer equipped with two sockets: one will hold the Xeon Phi chip, while the other will probably hold a single Xeon E3 that provides the display output (KNL does not support display outputs).
As you can see in the picture above, Intel Knights Landing (KNL) brings 72 processing cores split into 36 tiles, with similarities to AMD’s Bulldozer architecture. Each tile packs two cores with their respective L1 caches, 1MB of L2 cache and four VPUs (Vector Processing Units). This will be the largest chip Intel has ever made, finally larger than anything from the dead-end Itanium architecture. The new chip should also be the fastest product Intel has ever made, with 16GB of on-package memory (Intel uses the acronym MCDRAM, i.e. Multi-Channel DRAM) and a memory controller with six 72-bit channels that can support an additional 384GB of DDR4-2400 memory.
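To make the per-tile figures concrete, here is a rough back-of-the-envelope sketch in Python. It totals the cores, VPUs and L2 cache from the per-tile numbers quoted above and estimates the theoretical peak DDR4 bandwidth. The constant and function names are made up for illustration, and the bandwidth estimate assumes that each 72-bit channel carries 64 data bits plus 8 ECC bits, which is the usual ECC DIMM layout but is not stated by Intel here.

```python
# Figures quoted in the article.
TILES = 36
CORES_PER_TILE = 2
VPUS_PER_TILE = 4
L2_MB_PER_TILE = 1
DDR4_CHANNELS = 6
DDR4_MEGATRANSFERS = 2400  # DDR4-2400: 2400 million transfers per second

# Assumption (not from the article): a 72-bit channel is 64 data bits + 8 ECC bits.
DATA_BITS_PER_CHANNEL = 64


def knl_totals():
    """Chip-wide totals derived from the per-tile figures."""
    return {
        "cores": TILES * CORES_PER_TILE,
        "vpus": TILES * VPUS_PER_TILE,
        "l2_mb": TILES * L2_MB_PER_TILE,
    }


def ddr4_peak_gb_per_s(channels=DDR4_CHANNELS,
                       megatransfers=DDR4_MEGATRANSFERS,
                       data_bits=DATA_BITS_PER_CHANNEL):
    """Theoretical peak bandwidth in GB/s (10^9 bytes/s), ignoring all overheads."""
    bytes_per_transfer = data_bits / 8
    return channels * megatransfers * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    print(knl_totals())
    print(f"DDR4 peak: {ddr4_peak_gb_per_s():.1f} GB/s")
```

Under these assumptions the six DDR4-2400 channels top out at roughly 115 GB/s in theory; sustained bandwidth on real workloads would be lower.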
In theory, the performance sounds great: 6 TFLOPS single precision and 3 TFLOPS double precision. However, as always with Intel, the numbers need to be put into perspective. What the company is doing is impressive compared with what it has done so far, but as with its integrated graphics performance, reality is something completely different.
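One way to sanity-check headline numbers like these is to work backwards from the usual peak-FLOPS formula: cores × vector units per core × lanes per unit × FLOPs per lane per cycle × clock. The sketch below does that in Python. The 512-bit vector width and one fused multiply-add (two FLOPs) per lane per cycle are our assumptions rather than figures from Intel’s announcement, and the function names are purely illustrative.

```python
def peak_tflops(cores, vpus_per_core, vector_bits, element_bits,
                clock_ghz, flops_per_lane=2):
    """Theoretical peak in TFLOPS; flops_per_lane=2 models one FMA per cycle."""
    lanes = vector_bits // element_bits
    gflops = cores * vpus_per_core * lanes * flops_per_lane * clock_ghz
    return gflops / 1000.0


def implied_clock_ghz(target_tflops, cores, vpus_per_core, vector_bits,
                      element_bits, flops_per_lane=2):
    """Clock speed (GHz) needed to reach target_tflops under the same model."""
    lanes = vector_bits // element_bits
    flops_per_cycle = cores * vpus_per_core * lanes * flops_per_lane
    return target_tflops * 1000.0 / flops_per_cycle


if __name__ == "__main__":
    # 72 cores and 2 VPUs per core (4 per tile) come from the article;
    # 512-bit vectors and FMA throughput are assumptions.
    dp_clock = implied_clock_ghz(3.0, cores=72, vpus_per_core=2,
                                 vector_bits=512, element_bits=64)
    sp_clock = implied_clock_ghz(6.0, cores=72, vpus_per_core=2,
                                 vector_bits=512, element_bits=32)
    print(f"Clock implied by 3 TFLOPS DP: {dp_clock:.2f} GHz")
    print(f"Clock implied by 6 TFLOPS SP: {sp_clock:.2f} GHz")
```

Under these assumptions both headline figures imply a clock of about 1.3 GHz. That the two agree suggests the quoted numbers are theoretical peaks, not measured results.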
The numbers look great on paper, but we need to look at the competition. AMD’s consumer board, the Radeon R9 Fury X, packs 8.6 TFLOPS of single precision, while the NVIDIA M40 brings 7 TFLOPS of single precision (all numbers are for IEEE 754 arithmetic). Both AMD and NVIDIA limit the double precision performance of their products, and for reasons we do not understand, the M40 and M4 are the first Tesla products where NVIDIA cut DP performance to a consumer level. Thus, Intel has a great opportunity in all the markets that demand double precision. There is also a danger for Xeon Phi: today, you can buy higher-performing parts for a lower price, and as we saw in several data centers, Intel Xeon Phi in its current iteration is far from being an efficient product. However, Xeon Phi is the only one of the three that can boot an OS by itself, while NVIDIA and AMD parts both need a CPU to boot.
2016 will see the arrival of two new architectures from AMD and NVIDIA. AMD will debut the Arctic Islands architecture, with the Greenland GPU at the high end, and NVIDIA will debut Pascal, which will bring a native connection not just to x86 processors like AMD Opteron and Intel Xeon, but to IBM POWER8 and the future POWER9 as well. AMD and Intel will use 14nm processes (AMD will use the GlobalFoundries 14nm FinFET process), while NVIDIA will use TSMC’s 16nm FinFET process.
To put things into perspective, ever since it debuted in 2000, the NVIDIA Quadro family has ruled the roost in the professional graphics accelerator market. This was reinforced by the arrival of the Tesla HPC family in 2007. AMD’s FirePro family, on the other hand, always came a distant second, at its best selling 2.1 graphics cards for every 7.9 Quadros sold. Over the past decade, AMD’s share stabilized at about 10%, and even though AMD sometimes came up with a better feature set or better hardware capabilities, NVIDIA remained ‘king of the hill’.
That might change, though. When Intel envisioned its graphics processing unit, codenamed “Larrabee”, the company wanted to enter the discrete market with a bang. Unfortunately, after numerous years in development and several billion dollars spent, Larrabee was dead on arrival and caused a significant management shakeup in the company. Now, the company has a chance to show all of its muscle and get into the market. This will not happen in 2016, but 2017 and 2018 might see more discrete parts based on future versions of Xeon Phi, which might give the competition a serious run for its money not just in the Tesla/FirePro S/Xeon Phi space, but in the Quadro/FirePro market as well. Remember, all these markets are very low volume, but extremely high margin.
Still, neither NVIDIA nor AMD has delivered ease of access when it comes to programming, which is, as we all know, much more important than sheer performance. In our conversations with the organizations we advise, we discovered that all three vendors are heavily criticized. AMD pushes everything onto OpenCL, which is not ‘mature enough’. NVIDIA CUDA ‘limits you to a single platform’, but also ‘is the most scalable and provides highest efficiency that Amdahl’s law allows’. Intel was criticized because ‘it is not true that (current) Xeon Phi is completely binary compatible, as you have to recompile all the applications,’ followed by ‘the product is too hot and does not scale as (Intel) promised.’ Due to NDA limitations, we cannot disclose the people behind the quotes we provided here, but we can confirm that they come from people who lead supercomputers ranked in the Top 10 and Top 25 of the Top500 list.
Furthermore, the market opportunity lies in that ‘extra step’ Intel is making with KNL. Researchers are crying out for more onboard memory, and neither AMD nor NVIDIA has done a good job of addressing that problem. The problems researchers and developers are trying to solve do not fit in products that were originally developed with consumers in mind (Radeon, GeForce), and the ability to create a completely server-oriented, commercially oriented part could be Intel’s golden ticket. None of the people we work and continuously talk with accept the arguments AMD and NVIDIA make about adding more memory, because at the end of the day, Tesla is a GeForce with unlocked DP performance, a lower clock, (sometimes) more memory and ECC support (which again eats into the available memory space).
There is a fascinating battle developing here, and we can’t wait for more players to join it, such as Qualcomm with its server developments.