Event, IDF 2014, Reviews

Haswell-EP Workstation Preview: Xeon E5 v3 Rocks, But Still More To Go

Today, as Intel (NASDAQ: INTC) launches the third generation of its Xeon E5 dual-CPU platform, many eyes are on the improvements it brings to servers in the datacenter. However, the benefits are just as great, if not greater, on the high-end workstation front.

First of all, the Haswell core speeds up AVX floating point by adding fused multiply-add (FMA) operations, which doubles the theoretical peak FP rate in benchmarks like Linpack. Just as importantly, Haswell’s AVX2 extends integer processing to the wide, parallel AVX engines, essentially offloading anything aside from address calculations to the RISC-like, three-operand AVX instruction format and its wide register set. For workstation apps, once recompiled to take advantage of it, the benefits could be enormous, and it is another gradual move away from the antiquated x86 code base.
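To put a rough number on the FMA doubling, here is a back-of-the-envelope sketch of theoretical peak double-precision throughput. The per-cycle figures (one add plus one multiply per cycle without FMA, two FMA ports with it) and all names are assumptions of this sketch. It also uses the nominal 3.1 GHz clock, while AVX-heavy code typically runs at a lower frequency, so treat the result as an upper bound rather than a measurement.

```python
# Theoretical peak only; sustained Linpack numbers will be lower.
AVX_DP_LANES = 4  # a 256-bit AVX register holds four 64-bit doubles

# Assumed per-core FLOPs per cycle (not taken from the article):
AVX_NO_FMA = AVX_DP_LANES * 2    # one add port + one multiply port
AVX2_FMA = AVX_DP_LANES * 2 * 2  # two FMA ports, each FMA counts as 2 FLOPs


def peak_dp_gflops(cores, ghz, flops_per_cycle):
    """Peak double-precision GFLOPS for one socket."""
    return cores * ghz * flops_per_cycle


# 10 cores at 3.1 GHz, as on the workstation SKU discussed here.
without_fma = peak_dp_gflops(10, 3.1, AVX_NO_FMA)
with_fma = peak_dp_gflops(10, 3.1, AVX2_FMA)
print(f"no FMA: {without_fma:.0f} GFLOPS, FMA: {with_fma:.0f} GFLOPS per socket")
```

Under these assumptions the same 10-core, 3.1 GHz socket goes from 248 to 496 peak GFLOPS, which is where the "theoretical doubling" comes from.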

Then, the wide choice of core counts per SKU, from 8 all the way to 18, lets you pick the right balance of per-core speed (i.e. per-thread performance) and core count, depending on the parallelism of your application. Some apps scale poorly across many cores and thus prefer high per-core speed, while others, like ray tracing, make the most of many cores.
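The trade-off between core count and per-core speed can be sketched with Amdahl's law scaled by clock frequency. This is a simplified model, not a benchmark: the two configurations below (8 cores at 3.2 GHz, 18 cores at 2.3 GHz) are illustrative values chosen for this sketch, and the model ignores memory bandwidth, cache size and turbo behaviour.

```python
def relative_throughput(cores, ghz, parallel_fraction):
    """Amdahl's-law speedup multiplied by clock speed.

    The result is in 'single-core GHz equivalents', useful only for
    comparing configurations against each other.
    """
    serial_fraction = 1.0 - parallel_fraction
    speedup = 1.0 / (serial_fraction + parallel_fraction / cores)
    return ghz * speedup


# Hypothetical configurations for illustration: (cores, GHz)
FEW_FAST = (8, 3.2)
MANY_SLOW = (18, 2.3)

for p in (0.50, 0.90, 0.99):
    few = relative_throughput(*FEW_FAST, p)
    many = relative_throughput(*MANY_SLOW, p)
    winner = "8 fast cores" if few > many else "18 slower cores"
    print(f"parallel fraction {p:.2f}: {few:5.1f} vs {many:5.1f} -> {winner}")
```

In this model the fewer, faster cores win when only half the work is parallel, while the 18-core configuration pulls clearly ahead once the parallel fraction approaches 99%, as with ray tracing.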

The initial workstation SKU in the Xeon E5 v3 range, the E5-2687W v3, is a 3.1 GHz 10-core part that actually uses the 12-core die with 2 cores (and their associated caches) turned off. Its predecessor, the E5-2687W v2 on the Ivy Bridge platform, kept the full L3 cache even with some cores on the die disabled, a benefit that, I guess, we will only see return in Broadwell-EP (E5 v4) SKUs next year.

Then we come to DDR4. Yes, the initial DIMMs aren’t exactly speedy, especially latency-wise, but the lower voltage and added reliability features of DDR4, together with the quick improvements in speed and latency expected over the next few quarters, should give users never-before-seen capacity on a dual-socket workstation, beyond 1.5 TB of RAM, without sacrificing bandwidth in heavily loaded configurations the way DDR3 does.

The improvements in PCIe bandwidth, the integrated voltage regulation, and QPI sped up to 9.6 GT/s round out the key extra benefits.
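For a sense of what 9.6 GT/s means in bytes, the short sketch below converts a QPI transfer rate into per-link bandwidth. It assumes 2 bytes of payload per transfer in each direction, and it assumes 8.0 GT/s as the previous generation's top QPI speed for comparison; neither figure comes from this article.

```python
QPI_BYTES_PER_TRANSFER = 2  # assumption: 16 data bits per transfer, per direction


def qpi_gb_per_s(gt_per_s):
    """Peak bandwidth of one QPI link in one direction, in GB/s."""
    return gt_per_s * QPI_BYTES_PER_TRANSFER


old = qpi_gb_per_s(8.0)  # assumed previous-generation speed
new = qpi_gb_per_s(9.6)  # speed quoted for Xeon E5 v3
print(f"{old:.1f} GB/s -> {new:.1f} GB/s per direction "
      f"({(new / old - 1) * 100:.0f}% more)")
```

Under these assumptions each link gains about a fifth more socket-to-socket bandwidth per direction, which matters most for workloads that frequently touch memory attached to the other CPU.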

Putting it through its paces

Here we look at the initial reference workstation based on this SKU from Intel, packaged by BOXX. The machine itself is compact, using liquid cooling on a Supermicro X10DAi workstation mainboard with three PCIe 3.0 x16 slots. This doesn’t max out the platform’s theoretical full-bandwidth quad-GPU capability, but should be enough for most users. In return, the board has space for 16 DDR4 DIMMs, i.e. a full terabyte of RAM if using the 64 GB modules available early next year. The installed RAM was 128 GB, as eight Samsung 16 GB ECC DDR4-2133 RDIMMs.
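The memory figures are easy to check with a little arithmetic. The sketch below computes capacity from DIMM count and size, and a theoretical peak bandwidth for DDR4-2133. The four memory channels per socket and the 8-byte channel width are assumptions of this sketch rather than figures from the test system's documentation, and real-world bandwidth will come in below the peak.

```python
def capacity_gb(dimms, gb_per_dimm):
    """Total RAM in GB for a given DIMM population."""
    return dimms * gb_per_dimm


def peak_bandwidth_gb_s(mt_per_s, channels_per_socket=4, sockets=2,
                        bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s.

    Defaults are assumptions: quad-channel memory per socket and a
    64-bit (8-byte) data path per channel.
    """
    return mt_per_s * bytes_per_transfer * channels_per_socket * sockets / 1000.0


print("installed:", capacity_gb(8, 16), "GB")       # eight 16 GB RDIMMs
print("board max:", capacity_gb(16, 64), "GB")      # sixteen 64 GB modules
print(f"peak bandwidth: {peak_bandwidth_gb_s(2133):.1f} GB/s")  # DDR4-2133
```

With eight DIMMs installed on a dual-socket board, each of the assumed eight channels holds exactly one module, which is generally the most favourable population for bandwidth.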

The system came with an Nvidia Quadro K2000, which I swapped for an AMD FirePro W9100, arguably the most powerful professional OpenGL card available as of today. With 16 GB of VRAM and six DisplayPort outputs, the card is able to drive even 8K displays like the one from BOE Technology that we mentioned last week. Intel’s 240 GB + 400 GB (SATA + PCIe) SSD combo completed the picture.

The first benchmark we ran on this system was the brand-new SPECwpc, SPEC’s all-encompassing workstation productivity benchmark. The suite, which takes a couple of hours to run, covers everything from processor to graphics (a.k.a. Viewperf) to overall system performance, and seems to do the job with much less trouble than, for instance, BAPCo SYSmark did years ago on PCs.

Here are the first SPECwpc results, on the dual 3.1 GHz E5-2687W v3 system:

Next, we ran Cinebench R15. Note that the system is about twice as fast as an overclocked 4+ GHz Core i7-5960X, the desktop Haswell-E sibling of these Xeons.

In CPU-Z, you can see the data about the CPU.

Then we come to the newest version of SiSoft Sandra. Here is the report on the key performance data.


In our next round, we will focus on how performance changes when swapping and tuning the main memory, and look at the opportunity for even higher CPU speed. In my opinion, the workstation market can easily justify higher-TDP, and maybe even unlocked, Xeons, especially in both the 8-core and 18-core per-socket configurations.