3D, Business, CPU, Graphics, Hardware

Nvidia discloses its DP performance limitations

When Nvidia launched the GT200 chip, the company claimed around 1 TFLOPS of single-precision computing power and roughly 150 GFLOPS of double-precision performance.
This discrepancy was mostly due to the fact that Nvidia went with dedicated hardware for DP support: every eight-shader cluster had one dedicated double-precision unit, which cost millions of additional transistors and delivered doubtful performance.

Fast forward to January 2009, and we have SP performance at 933 GFLOPS, while achievable DP performance has dipped to 78 GFLOPS. This figure is roughly half of what Nvidia boasted about at launch, and clear evidence that both manufacturers like to overstate the performance of actual parts. What makes things interesting is that the Tesla GPGPU boards aren’t even the most powerful parts in Nvidia’s line-up. That “honor” goes to the newly introduced GTX 285 and GTX 295. In the professional line-up, the Quadro FX 5800 has more “oomph” thanks to its higher shader clock, but even the FX 5800 will remain below 100 GFLOPS in double-precision operations… making this GPU “just” 2.5x faster than a quad-core Xeon processor.
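The 933 and 78 GFLOPS figures fall straight out of unit counts and clocks. Here is a back-of-the-envelope sketch, assuming GTX 280 reference specs the article does not spell out: 240 SP ALUs dual-issuing a MAD plus a MUL (3 flops per cycle), one DP FMA unit (2 flops per cycle) per eight-shader cluster, and a 1296 MHz shader clock:

```python
# Back-of-the-envelope peak FLOPS for a GTX 280 (GT200).
# Assumed specs (not from the article): 1296 MHz shader clock,
# 240 SP ALUs dual-issuing MAD+MUL (3 flops/cycle each),
# and one DP FMA unit (2 flops/cycle) per eight-shader cluster.
SHADER_CLOCK_GHZ = 1.296
SP_ALUS = 240
DP_UNITS = 30  # one per cluster of eight shaders

sp_peak = SP_ALUS * 3 * SHADER_CLOCK_GHZ   # GFLOPS, single precision
dp_peak = DP_UNITS * 2 * SHADER_CLOCK_GHZ  # GFLOPS, double precision

print(f"SP peak: {sp_peak:.0f} GFLOPS")  # SP peak: 933 GFLOPS
print(f"DP peak: {dp_peak:.0f} GFLOPS")  # DP peak: 78 GFLOPS
```

With only one DP unit for every eight SP ALUs, the roughly 12:1 throughput gap is baked into the silicon, which is exactly the limitation the article describes.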

Then again, if you activate parallel execution, the CPU will drop to sub-10 GFLOPS values, while the GPU remains at 78 GFLOPS in double precision and 933 GFLOPS in single precision. At the same time, ATI’s architectural concept of “emulating” DP units by pairing more processing units within a cluster resulted in an actual peak performance of 900 GFLOPS for the 4870 part (claimed performance: 1.2 TFLOPS) and 250 GFLOPS in double precision. This is an impressive difference, showcasing ATI’s lead from an architectural standpoint. Extractable performance is another matter: some ISVs, such as ElcomSoft with its password cracker, managed to extract that performance, while others hit different walls and could not do better.
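The same arithmetic works for ATI’s RV770, where double precision comes from ganging up the single-precision units rather than from dedicated hardware. A sketch assuming reference HD 4870 specs, again not spelled out in the article (800 SP ALUs doing a MAD, 160 five-wide VLIW clusters each issuing one DP MAD per clock, 750 MHz core clock) lands at 240 GFLOPS, in the neighborhood of the ~250 GFLOPS figure above:

```python
# Peak FLOPS sketch for ATI's RV770 (HD 4870), where DP is handled by
# pairing SP units inside each VLIW cluster, not by dedicated DP hardware.
# Assumed reference specs (not from the article): 750 MHz core clock,
# 800 SP ALUs doing MAD (2 flops/cycle), 160 five-wide VLIW clusters
# each issuing one DP MAD (2 flops) per clock.
CORE_CLOCK_GHZ = 0.75
SP_ALUS = 800
VLIW_CLUSTERS = 160

sp_peak = SP_ALUS * 2 * CORE_CLOCK_GHZ        # 1200 GFLOPS (the claimed 1.2 TFLOPS)
dp_peak = VLIW_CLUSTERS * 2 * CORE_CLOCK_GHZ  # 240 GFLOPS double precision

print(sp_peak, dp_peak)  # 1200.0 240.0
```

Note the SP:DP ratio of 5:1, versus roughly 12:1 on GT200 — that gap is the architectural lead the article points to.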

The real question now is what kind of computing performance the upcoming 40nm GPUs will deliver; we will have to wait and see.

  • Gipsel

    You are completely right that Nvidia’s GT200 double-precision performance is lacking compared to ATI’s current offerings. An already quite old HD3850 has about the same double-precision performance as a GTX280. A HD4870 is almost three times faster!

    But depending on the problem, CPUs aren’t as bad as you indicate. It is almost the same as with GPUs: some algorithms fit the architecture and some don’t. Generally speaking, the probability that a given problem tanks on a GPU (obtained performance <10% of peak) is a lot higher than on a CPU.

    Just as an example, I have implemented an embarrassingly parallel (perfectly suited to GPUs) algorithm using double precision, for CPUs as well as for ATI GPUs. It runs at about 40% (Core2, K10) to 60% (Core i7, HT helps a lot here) of theoretical peak performance on CPUs, and at 62% of theoretical peak on ATI GPUs (150 GFlops sustained on a HD4870; peak would be 240 GFlops).
    But because it does not have a 50:50 ratio of multiplications and additions (or MADs for ATI), that is actually quite close to the maximum obtainable performance for this algorithm. There are virtually no stalls on ATI GPUs; it really executes one instruction per unit per clock cycle more than 95% of the time. The Core i7 is actually not far behind. The hyperthreading-like execution of parallel tasks on the GPU really helps avoid stalls and boosts utilization (the same is true for the i7) in such cases.

    And if you wonder what algorithm I am talking about, it’s Milkyway@home.
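Gipsel’s point about the instruction mix can be made concrete. Peak FLOPS assumes every issue slot carries a MAD (2 flops); if only a fraction of an algorithm’s instructions pair into MADs and the rest are single-flop adds or multiplies, there is a hard ceiling below peak even with zero stalls. A small sketch (the 30% mix is a hypothetical number, not taken from the comment):

```python
# Ceiling on achievable FLOPS for a given instruction mix.
# Peak counts every issued instruction as a MAD (2 flops); a stream
# where only `mad_fraction` of ops are MADs (the rest 1-flop ops)
# cannot exceed (1 + mad_fraction) / 2 of peak, even with zero stalls.
def flops_ceiling(mad_fraction: float) -> float:
    """Fraction of theoretical peak reachable with the given MAD ratio."""
    return (1.0 + mad_fraction) / 2.0

# Hypothetical example: if only 30% of issued ops pair into MADs,
# the hard ceiling is 65% of peak -- in the neighborhood of the 62%
# Gipsel reports (150 of 240 GFLOPS sustained on an HD4870).
print(flops_ceiling(0.30))  # 0.65
```

This is why more than 95% issue-slot occupancy can still show up as “only” 62% of peak FLOPS.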