When AMD launched its Fiji-based graphics cards, all eyes were focused on performance in consumer applications such as computer games. And while the first results forced Nvidia to launch a “Titan Lite” in the form of the GeForce GTX 980 Ti, DirectX 12 benchmarks are starting to show a different, brighter outlook for AMD, starting with Ashes of the Singularity.
The focus of this article, however, is the chip’s potential in applications where the Fiji GPU will be branded as FirePro and FirePro S (Server) – where AMD can take the same ASIC and upsell it to commercial clients, with full-speed double-precision floating point enabled. AMD (theoretically) has the highest-performing piece of silicon of all time: 8.9 billion transistors, 8.6 TFLOPS single precision and, once unlocked, a massive 4.3 TFLOPS double precision (versus 535 GFLOPS DP in this consumer-focused version). In a world where fast processing and low latency mean serious money, does AMD have a chance to shine?
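As a sanity check on those headline numbers, here is a back-of-the-envelope sketch of how the peak-FLOPS figures are derived, under our assumptions of 4096 stream processors at a 1050 MHz clock, each retiring one fused multiply-add (two FLOPs) per cycle:

```python
# Back-of-the-envelope check of the headline figures.
# Assumptions (ours): 4096 stream processors at a 1050 MHz boost clock,
# each retiring one fused multiply-add (2 FLOPs) per cycle.
def peak_tflops(shaders, clock_mhz, flops_per_clock=2):
    return shaders * clock_mhz * 1e6 * flops_per_clock / 1e12

sp = peak_tflops(4096, 1050)
print(f"SP peak:        {sp:.2f} TFLOPS")              # ~8.60 TFLOPS
print(f"DP at 1/16:     {sp / 16 * 1000:.0f} GFLOPS")  # ~538, close to the quoted 535
print(f"DP at 1/2 rate: {sp / 2:.2f} TFLOPS")          # ~4.30 TFLOPS if unlocked
```

The 1/16 DP rate is what ships on the consumer card; the 1/2 rate is the professional-grade unlock this article speculates about.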
Three Key Points of AMD Fiji
Ever since getting our hands on AMD’s newest top-of-the-line gaming GPU, there was a feeling that this is something special. Firstly, it is by far the physically smallest high-end graphics card in, say, a decade – thanks to the integrated liquid cooling, with or without the slight noise heard from the pump in Cooler Master’s earlier revisions of the cooler.
Secondly, there is the High Bandwidth Memory (HBM) co-developed by AMD and SK Hynix – yes, in the AMD-Hynix vs. Intel-Micron battle for 3-D stacked memory, AMD was first in the world to deliver a consumer product with next-generation memory. How next-gen? HBM provides a ridiculously wide memory bus – all four kilobits of it – which allows stunning bandwidth even at a low 500 MHz base memory clock, for truly cool and power-saving DRAM. The 4096-bit bus delivers 512 GB/s of memory bandwidth, easily beating practically every GDDR5-based product on the market. If AMD clocks it up to 1 GHz, which I am sure shouldn’t be too difficult, we would have a whopping one terabyte per second of bandwidth for those four gigabytes of memory.
Thirdly, the GPU overclocks nicely – and we don’t mean frames per second in games. I got a decent, stable 9% to 10% overclock that passes all the HPC math benches with flying colours. Right now, AMD does not officially allow overclocking the integrated 3-D HBM memory for some extra bandwidth, which is essential in scientific and technical computation. Still, you can overclock the memory if you know how. In our case, we managed a stable 575 MHz, pushing bandwidth to an impressive 589 GB/s. Depending on the card you have at hand, you could clock the HBM to 675 MHz and achieve a massive 691.2 GB/s – faster than AMD’s dual-GPU card of yesteryear (the R9 295X2 had 640 GB/s) or Nvidia’s GeForce GTX Titan Z (672 GB/s).
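For reference, the bandwidth figures above all fall out of one simple formula – HBM transfers data on both clock edges across a 4096-bit bus. A quick sketch (the function name is ours, for illustration only):

```python
# Reproducing the memory bandwidth figures quoted in the text.
# HBM is double data rate: bytes/s = (bus width / 8) * 2 transfers * clock.
# Function name is ours, for illustration only.
def hbm_bandwidth_gbs(clock_mhz, bus_bits=4096):
    return bus_bits / 8 * 2 * clock_mhz * 1e6 / 1e9

for mhz in (500, 575, 675, 1000):
    print(f"{mhz:4d} MHz -> {hbm_bandwidth_gbs(mhz):6.1f} GB/s")
# 500 MHz -> 512.0 GB/s (stock), 575 MHz -> 588.8 (our overclock),
# 675 MHz -> 691.2, 1000 MHz -> 1024.0 (the hypothetical 1 GHz case)
```

Note that the oft-quoted 691.2 GB/s figure corresponds to a 675 MHz memory clock under this formula.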
The integrated liquid cooling has its pros and cons, of course. The shortened card makes insertion into a system a breeze, even if you want to put two or three of them in. The difference is especially obvious in a cramped, cable-rich system such as our test configuration. We paired Gigabyte’s top-end X99 Gaming G1 mainboard with Intel’s current high-end Core i7-5960X (Haswell-E) processor and a Thermaltake liquid cooler for the CPU, all packed inside an Antec chassis.
The cons? Well, you have to find a spare spot for each integrated radiator and fan. Every Radeon R9 Fury X has its own cooling loop, with a 120 mm radiator and fan at the other end. If you’re looking at a multi-GPU setup, perhaps you should consider a third-party liquid cooling system, such as Thermaltake’s DIY line, Swiftech, Aqua Cooling, or EK Water Blocks. It’s a shame you cannot buy the PCB alone, or the PCB with a DIY waterblock, akin to what EVGA does with NVIDIA cards.
Once AMD goes ahead with professional, HPC server versions of this platform, a passively air-cooled version will be a must – or a flavour with just a local pump module on the card, connected via daisy chain to a rack-wide liquid cooling system.
Knowing this is a consumer version nevertheless, without full OpenGL support or fully enabled hardware double-precision FP, I wanted to test the core number-crunching capability of the chip and its HBM without relying on commercial apps that may prefer double precision, while steering clear of trivial SuperPI-type routines.
Thus, our test suite consists of the freshly minted SiSoft Sandra 2015 SP3 suite, which includes a number of major financial and scientific computation routines in both single and double FP precision, with full OpenCL support on both CPU and GPU. In fact, in some cases it spreads a single benchmark load across both CPU and GPU resources, with interesting (not always higher – darn PCIe overhead; and AMD killed the HyperTransport NUMA GPU back in 2010) results.
Here you can see a summary of single-precision results for GPU-only OpenCL, GPU+CPU OpenCL (where available), CPU-only OpenCL, and CPU-only native code.
Wow… the difference approaches hundreds of times (not percent!) in favour of the GPU when running OpenCL code neck-and-neck. And mind you, this is a 3.6 GHz octa-core i7-5960X with four channels of DDR4-2400 CL14 (76.8 GB/s theoretical) and a souped-up 3.2 GHz uncore! Even when running native code with full AVX2 FMA support, the difference is still around two orders of magnitude. And no, Broadwell-E in 2016 and Skylake-E in 2017 won’t change the picture significantly.
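To put the CPU side in perspective, here is a quick sketch (our arithmetic, not Sandra’s) of the theoretical ceilings behind those numbers: quad-channel DDR4-2400 bandwidth and the Haswell-E AVX2 peak, assuming two 256-bit FMA units per core. Even these best-case figures leave the CPU roughly an order of magnitude behind Fiji’s 8.6 TFLOPS, and measured benchmarks fall further still:

```python
# Theoretical ceilings for the CPU in this comparison (our arithmetic).
# Quad-channel DDR4-2400: 4 channels x 8 bytes/transfer x 2400 MT/s.
mem_bw_gbs = 4 * 8 * 2400e6 / 1e9
print(f"DDR4-2400 quad-channel: {mem_bw_gbs:.1f} GB/s")  # 76.8

# Haswell AVX2 SP peak: 8 cores x 3.6 GHz x 2 FMA units x 8 SP lanes x 2 FLOPs.
sp_peak_gflops = 8 * 3.6e9 * 2 * 8 * 2 / 1e9
print(f"AVX2 FMA SP peak: {sp_peak_gflops:.1f} GFLOPS")  # 921.6
```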
Seeing TFLOPS on one side and GFLOPS on the other is a clear defeat for the CPU strategy, and running CPU-only code is starting to look archaic. After all, even Intel is moving away from the CPU-only approach with its upcoming Xeon Phi. Mark these numbers: in the N-Body Simulation, the Haswell CPU achieves 23.17 GFLOPS while the Fury X achieves 3.43 TFLOPS. The CPU is 148x slower than the GPU.
Same story here – the difference is huge: over 10 million versus less than 150 thousand in the high indexes, while the Monte Carlo Euro Option test shows where high-end calculation has gone. Again, mind you, while the Core i7-5960X will set you back $1,050, it is basically the same CPU as Intel’s Xeon E5 series in the ‘big iron’ workstation and server line-ups – where the company makes most of its profits. Fewer cores, but a faster per-core clock, and basically the same cache and memory system – not to mention heavily tuned app and benchmark code. Yet this $650 GPU runs rings around its ring buses, pun intended. Let’s not get into how many figures you would need to spend to achieve the same performance using CPUs alone – the price difference more than covers telling the engineers to ‘earn their salary’.
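For readers curious what a “Monte Carlo Euro Option” workload actually computes, here is a minimal NumPy sketch of a European call pricer – purely illustrative, not Sandra’s implementation, with made-up parameters and a function name of our choosing:

```python
import numpy as np

# Minimal Monte Carlo pricer for a European call option - the kind of
# workload Sandra's "Monte Carlo Euro Option" test represents. Purely
# illustrative; parameters and function name are ours, not Sandra's.
def mc_euro_call(s0, strike, rate, sigma, t, n_paths, seed=42):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_paths)
    # Terminal prices under geometric Brownian motion
    st = s0 * np.exp((rate - 0.5 * sigma**2) * t + sigma * np.sqrt(t) * z)
    payoff = np.maximum(st - strike, 0.0)
    return float(np.exp(-rate * t) * payoff.mean())

price = mc_euro_call(100.0, 100.0, 0.05, 0.2, 1.0, 1_000_000)
print(round(price, 2))  # close to the Black-Scholes value of about 10.45
```

Each simulated path is independent of the others, so on a GPU this maps naturally to one path per work-item – which is exactly why the OpenCL numbers above run away from the CPU.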
In short, once they release the “Pro” version with DP FP enabled in hardware – in, say, a dual-GPU form with 8 GB of HBM at hopefully higher clocks – this could be a runaway hit in the scientific community, even though 4 GB per GPU is simply too little for today’s workloads.
AMD should launch HBM2 hardware with 8/16/32 GB of memory per chip sooner rather than later, as Nvidia will likely follow the same approach within a year. Till then, “the little number-crunching monster” is a fitting name for the Radeon R9 Fury X. If you are in the numbers business, your app runs on OpenCL (or you are willing to port it, or pay for some coding) and it works in single-precision FP, this petite card can dramatically speed up your work. And most high-end PCs or workstations can easily accept two or three of them without the machine feeling cluttered inside.
More as we expand the app and benchmark suite further.