3D, AMD, Analysis, Companies, Graphics, Hardware, Memory & Storage Space, Reviews, Video Card Reviews, VR World

AMD Radeon Fury X: Potential Supercomputing Monster?

AMD Fiji GPU with four HBM memory modules

When AMD launched its Fiji-based graphics cards, all eyes were focused on performance in consumer applications such as computer games. And while the first results forced Nvidia to launch a “Titan Lite” in the form of the GeForce GTX 980 Ti, DirectX 12 benchmarks are starting to show a different, brighter outlook for AMD, starting with Ashes of the Singularity.

The focus of this article, however, is the chip’s potential in applications where the Fiji GPU will be branded as FirePro and FirePro S (Server) – where AMD can take the same ASIC and upsell it to commercial clients, with double-precision floating point enabled at full speed. On paper, AMD then has the highest-performing piece of silicon of all time: 8.9 billion transistors, 8.6 TFLOPS single precision and, once unlocked, a massive 4.3 TFLOPS double precision (versus 535 GFLOPS DP in this consumer-focused version). In a world where fast processing and low latency mean serious money, does AMD have a chance to shine?
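Those headline numbers follow directly from the shader count and clock. A quick back-of-the-envelope sketch (the 1050 MHz engine clock and 4096 stream processors are the Fury X’s published figures; the small gap versus the quoted 535 GFLOPS DP comes from rounding and clock assumptions):

```python
# Back-of-the-envelope peak throughput for Fiji, using the article's figures.
shaders = 4096      # stream processors on the full Fiji die
clock_ghz = 1.05    # Fury X engine clock (1050 MHz)

# Each shader can issue one fused multiply-add (2 FLOPs) per clock.
sp_tflops = shaders * 2 * clock_ghz / 1000.0
print(f"SP: {sp_tflops:.1f} TFLOPS")                      # ~8.6 TFLOPS

# Consumer Fiji executes double precision at 1/16 rate;
# a full-speed professional part would run it at 1/2 rate.
print(f"DP at 1/16: {sp_tflops / 16 * 1000:.0f} GFLOPS")  # ~538 GFLOPS
print(f"DP at 1/2:  {sp_tflops / 2:.1f} TFLOPS")          # ~4.3 TFLOPS
```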

Three Key Points of AMD Fiji

Ever since getting our hands on AMD’s newest top-of-the-line gaming GPU, we had a feeling that this is something special. Firstly, it is by far the physically smallest high-end graphics card in, say, a decade – thanks to the integrated liquid cooling, with or without the slight noise heard from the pump in Cooler Master’s earlier revisions of the cooler.

AMD Radeon R9 Fury X is an incredibly compact card for a high-end product. Unfortunately, the liquid cooling solution is everything but compact, and putting multiple GPUs in a single computer is a pain.

Secondly, there is the High Bandwidth Memory (HBM) developed by AMD and SK Hynix – yes, in the AMD–Hynix vs. Intel–Micron battle for 3D-stacked memory, AMD ended up first in the world to deliver a consumer product with the next-generation memory. How next-gen? HBM brings a ridiculously wide memory bus – all four kilobits of it – which allows stunning bandwidth even at a low 500 MHz base memory clock, for a truly cool and power-saving DRAM. The 4096-bit bus gives you 512 GB/s of memory bandwidth, easily beating practically every GDDR5-based product on the market. If AMD clocks it up to 1 GHz, which shouldn’t be too difficult, we would have a whopping one terabyte per second of memory bandwidth for those four gigabytes of memory.
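The bandwidth arithmetic is simple enough to verify yourself: bytes per second equals bus width in bytes times the effective transfer rate (HBM1 transfers twice per clock). A minimal sketch:

```python
def hbm_bandwidth_gbs(clock_mhz, bus_bits=4096):
    """Peak bandwidth in GB/s: bus width in bytes x effective transfer rate."""
    return bus_bits / 8 * 2 * clock_mhz / 1000.0  # DDR: 2 transfers per clock

print(hbm_bandwidth_gbs(500))   # 512.0 -- stock Fury X
print(hbm_bandwidth_gbs(1000))  # 1024.0 -- the 1 TB/s scenario
```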

Following a small modification, you can push both GPU and HBM RAM inside AMD Catalyst – and both show great promise.

Thirdly, the GPU overclocks lovely – and we don’t mean frames per second in games. We got a decent, stable 9–10% overclock that passes all the HPC math benchmarks with flying colours. Officially, AMD does not allow overclocking the integrated 3D HBM memory for some extra bandwidth, essential in scientific and technical computation. Still, you can overclock the memory if you know how. In our case, we managed a stable 575 MHz, pushing bandwidth to an impressive 589 GB/s. Depending on the card you have at hand, you could clock the HBM to 675 MHz and achieve a massive 691.2 GB/s – faster than AMD’s dual-GPU card of yesteryear (the R9 295X2, at 640 GB/s) or the Nvidia GeForce GTX Titan Z (672 GB/s).
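Since bandwidth scales linearly with the memory clock, the overclocking headroom is easy to price out from the stock 512 GB/s; note that 691.2 GB/s corresponds to a 675 MHz HBM clock (650 MHz would give 665.6 GB/s):

```python
stock_bw_gbs = 512.0     # 4096-bit bus at the stock 500 MHz (DDR)
stock_clock_mhz = 500.0

# Bandwidth scales linearly with the HBM clock.
for clock_mhz in (575, 650, 675):
    bw = stock_bw_gbs * clock_mhz / stock_clock_mhz
    print(f"{clock_mhz} MHz -> {bw:.1f} GB/s")
# 575 MHz -> 588.8 GB/s
# 650 MHz -> 665.6 GB/s
# 675 MHz -> 691.2 GB/s
```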

The integrated liquid cooling has its pros and cons, of course. The shortened card makes insertion into a system a breeze, even if you want to put in two or three of them. The difference is especially obvious in a cramped, cable-rich system such as our test configuration. We paired Gigabyte’s top-end X99 Gaming G1 motherboard with Intel’s current high-end Core i7-5960X (Haswell-E) processor and Thermaltake liquid cooling for the processor, all packed inside an Antec chassis.

The cons? Well, you have to find another spare position for that integrated radiator and fan to fit in. Every Radeon R9 Fury X has its own cooling loop, with a 120 mm fan at the other end. If you’re looking at a multi-GPU setup, perhaps you should consider a third-party liquid-cooling system, such as Thermaltake’s DIY line, Swiftech, Aqua Cooling, or EK Water Blocks. It’s a shame you cannot buy the PCB alone, or the PCB with a DIY waterblock, akin to what EVGA does with Nvidia cards.

Once AMD goes ahead with its professional HPC server versions of this platform, a passive air-cooled version will be a must – or a flavour with just a local pump module on the card, connected via daisy chain to a rack-wide liquid-cooling system.

Knowing this is nevertheless a consumer version, without full OpenGL or fully enabled hardware double-precision FP, I wanted to test the core number-crunching capability of the chip and its HBM without relying on commercial applications that may prefer double precision, while also steering clear of SuperPI-type toy routines.

AMD Radeon R9 Fury X as seen by SiSoft Sandra 2015 SP3.

Thus, our test suite consists of the freshly minted SiSoft Sandra 2015 SP3 suite, which includes a number of major financial and scientific computation routines in both single- and double-precision FP, with full OpenCL support on both CPU and GPU. In fact, in some cases it spreads a single benchmark load across both CPU and GPU resources, with interesting (not always higher – darn PCIe overhead, and AMD killed the HyperTransport NUMA GPU back in 2010) results.

Scientific Results

SiSoft Sandra 2015 SP3 Scientific Test – showing a clear difference between the world’s most powerful consumer CPU and GPU.

Here you can see a summary of GPU-only OpenCL, GPU+CPU OpenCL (where available), CPU-only OpenCL, and CPU-only native code, all in single precision.

Wow… the difference approaches hundreds of times (not percent!) in favour of the GPU when running OpenCL code neck-and-neck. And mind you, this is a 3.6 GHz octa-core i7-5960X with four channels of DDR4-2400 CL14 (theoretical: 76.8 GB/s) and a souped-up 3.2 GHz uncore! Even when running native code with full AVX2 FMA support, the difference is still around two orders of magnitude (100x). And no, Broadwell-E in 2016 and Skylake-E in 2017 won’t change the picture significantly.

To see TFLOPS on one side and GFLOPS on the other is a clear defeat for the CPU strategy, and running CPU-only code is starting to look archaic. After all, even Intel is moving away from the CPU-only approach with its upcoming Xeon Phi. Mark these numbers: in the N-Body Simulation, the Haswell CPU achieves 23.17 GFLOPS while the Fury X achieves 3.43 TFLOPS. The CPU is 148x slower than the GPU.
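To see why this workload maps so well onto 4096 shaders, here is a hypothetical, minimal Python sketch of the kind of O(N²) single-precision kernel an N-body benchmark runs (the actual Sandra kernel is OpenCL and differs in detail). The key property: every body’s update is independent of the others’, so a GPU can assign one work item per body.

```python
def nbody_step(pos, mass, dt=0.01, eps=1e-3):
    """One naive O(N^2) gravitational step in 2-D, zero initial velocity.

    Each body's update reads all other bodies but writes only its own
    position -- which is exactly why GPUs eat this workload alive.
    """
    new_pos = []
    for i in range(len(pos)):
        ax = ay = 0.0
        for j in range(len(pos)):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            inv_r3 = (dx * dx + dy * dy + eps) ** -1.5  # softened distance
            ax += mass[j] * dx * inv_r3
            ay += mass[j] * dy * inv_r3
        # displace by a*dt^2 (toy integrator, enough to show the data flow)
        new_pos.append((pos[i][0] + ax * dt * dt, pos[i][1] + ay * dt * dt))
    return new_pos

# Two equal masses attract each other symmetrically:
print(nbody_step([(-1.0, 0.0), (1.0, 0.0)], [1.0, 1.0]))
```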

Financial Results

SiSoft Sandra 2015 SP3 financial tests: Black-Scholes, Binomial, and Monte Carlo results.

Same story here – the difference is huge: over 10 million versus less than 150 thousand on the high indices, while the Monte Carlo Euro Option shows where high-end calculation has gone. Again, mind you, while the Core i7-5960X will set you back $1,050, it is basically the same CPU as Intel’s Xeon E5 series in the ‘big iron’ workstation and server line-ups – where the company achieves most of its profits. Fewer cores, but a faster per-core clock, and basically the same cache and memory subsystem. Not to mention heavily tuned application and benchmark code. Yet this $650 GPU runs rings around its ring buses, pun intended. Let’s not get into how many figures you would need to spend in order to achieve the same performance using CPUs alone. The difference in price leaves plenty of room to go and tell the engineers to ‘earn their salary’.
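To make the Monte Carlo Euro Option workload concrete, here is a hypothetical, scalar Python sketch of what each of the millions of simulated paths computes – a discounted payoff over one random price path under geometric Brownian motion (the benchmark runs equivalent math in OpenCL, in single precision, with one work item per path):

```python
import math
import random

def mc_euro_call(s0, strike, r, sigma, t, n_paths, seed=42):
    """Price a European call by Monte Carlo under geometric Brownian motion.

    Every path is independent -- on a GPU, one work item per path.
    """
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma * sigma) * t
    vol = sigma * math.sqrt(t)
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)               # one random draw per path
        s_t = s0 * math.exp(drift + vol * z)  # terminal asset price
        total += max(s_t - strike, 0.0)       # call payoff at expiry
    return math.exp(-r * t) * total / n_paths  # discounted average

# Should land near the Black-Scholes analytic value (~10.45 for these inputs).
print(mc_euro_call(100.0, 100.0, 0.05, 0.2, 1.0, 200_000))
```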

In short, once AMD goes ahead and releases the “Pro” version with DP FP enabled in hardware – say, in a dual-GPU form with 8 GB of HBM and hopefully higher clocks – this could be a runaway hit in the scientific community, even though 4 GB per GPU is simply too little for the workloads of today.

SiSoft Sandra 2015 SP3 memory bandwidth results.

AMD should launch HBM2 hardware with 8/16/32 GB of memory per GPU sooner rather than later, as Nvidia will likely follow the same approach in a year’s time. Till then, “the little number-crunching monster” is a fitting name for the Radeon R9 Fury X. If you are in the numbers business, your app runs on OpenCL (or you are willing to port it, or pay for some coding), and it works in single-precision FP, this petite card can dramatically speed up your work. And most high-end PCs or workstations can easily accept two or three of them without the machine feeling cluttered inside.

More as we expand the app and benchmark suite further.

  • vasras

    The only interesting thing in computing for years has been the AMD advances in OpenCL computing performance (and to a lesser degree nVidia’s comparative).

    CPU performance increases have stalled for almost 6 years now, with modest ~25% gains after overclocking both (excluding AVX-512 performance on Skylake, which has yet to show its strength).

    Once we get HBM2 next year with a tighter node, it’s time to go OpenCL card shopping!

    • John Malone

      Apple had a really interesting idea with the Mac Pro: a dedicated compute card in each machine. The compute power of the FirePro is almost 30 times that of a 4-core Xeon. AMD even got over 50% market share in workstation graphics for a time thanks to the Mac Pro. I personally have not tested how this solution works, but OS X and the dev tools have everything needed for highly parallel computing. The wonders of a Unix-based server OS compared to Windows, which can’t scale.
      Let’s hope the Mac Pro gets bumped with a couple of FirePros. I would love to test out 4K H.265 encoding with accelerated OpenCL encoding.

      • BaronMatrix

        More than likely they will… Apple was one of the OpenCL founders… It’s a POSSIBILITY that with the right manufacturing ability Apple could move fully to AMD…

  • Gnyff

    I do love the “SiSoft Sandra 2015 SP3 Scientific Test” graph. The missing x-axis units and mismatching legend make it, well, not very scientific… 😉

    • Gnyff

      On a more general note: I think a comparison with, for example, some CUDA cards (GTX 980 Ti?) would be more relevant than using standard Intel CPUs? 😉