When Nvidia launched the GT200 chip, the company claimed around 1 TFLOPS of single-precision computing power and roughly 150 GFLOPS of double-precision performance.
This discrepancy was mostly due to the fact that Nvidia went with dedicated hardware for DP support. Every eight-shader cluster had one dedicated double-precision unit, which cost millions of additional transistors and delivered doubtful performance.
Fast forward to January 2009, and we have SP performance at 933 GFLOPS, while achievable DP performance has dipped to 78 GFLOPS. That figure is roughly half of what Nvidia boasted about at the time of launch, and clear evidence that both manufacturers like to overstate the performance of actual parts. What makes things interesting is that the Tesla GPGPU boards aren’t even the most powerful parts in the Nvidia line-up. That “honor” goes to the newly introduced GTX 285 and GTX 295. In the professional line-up, the Quadro FX 5800 has more “oomph” thanks to its higher shader clock, but even the FX 5800 will remain below 100 GFLOPS in double-precision operations, making this GPU “just” 2.5x faster than a quad-core Xeon processor.
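For readers who want to see where figures like 933 and 78 GFLOPS come from, here is a rough back-of-the-envelope sketch. It assumes numbers that are not stated in this article: 240 shaders on the Tesla part, a 1.296 GHz shader clock, three single-precision operations per shader per clock (a multiply-add plus a multiply), and two double-precision operations per DP unit per clock (one fused multiply-add). Only the one-DP-unit-per-eight-shaders ratio comes from the text above; the function and constant names are purely illustrative.

```python
# Back-of-the-envelope peak throughput estimate for a GT200-based Tesla board.
# All hardware figures below are assumptions for illustration, except the
# "one DP unit per eight shaders" ratio described in the article.

def peak_gflops(units, flops_per_clock, clock_ghz):
    """Theoretical peak: execution units x operations per clock x clock (GHz)."""
    return units * flops_per_clock * clock_ghz

SHADERS = 240               # assumed shader count
SHADERS_PER_DP_UNIT = 8     # from the article: one DP unit per eight-shader cluster
SHADER_CLOCK_GHZ = 1.296    # assumed shader clock
SP_FLOPS_PER_CLOCK = 3      # assumed: multiply-add (2) + multiply (1)
DP_FLOPS_PER_CLOCK = 2      # assumed: one fused multiply-add

dp_units = SHADERS // SHADERS_PER_DP_UNIT  # 30 dedicated DP units

sp_gflops = peak_gflops(SHADERS, SP_FLOPS_PER_CLOCK, SHADER_CLOCK_GHZ)
dp_gflops = peak_gflops(dp_units, DP_FLOPS_PER_CLOCK, SHADER_CLOCK_GHZ)

print(f"SP peak: {sp_gflops:.0f} GFLOPS")   # about 933
print(f"DP peak: {dp_gflops:.0f} GFLOPS")   # about 78
print(f"SP/DP ratio: {sp_gflops / dp_gflops:.0f}x")
```

Under these assumptions the twelvefold gap between the two figures falls straight out of the design: eight times fewer DP units, each doing two operations per clock instead of three.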
Then again, if you activate parallel execution, the CPU will drop to sub-10 GFLOPS values, while the GPU will remain at 78 GFLOPS in double precision and 933 GFLOPS in single precision. At the same time, ATI’s architectural concept of “emulating” DP units by pairing more processing units in one cluster resulted in an actual peak performance of 900 GFLOPS in single precision for the 4870 part (claimed performance: 1.2 TFLOPS) and 250 GFLOPS in double precision. This is an impressive difference, showcasing ATI’s lead from the architectural standpoint. Extractable performance is a different matter: some ISVs, such as ElcomSoft with its password cracker, managed to extract that performance, while others hit various walls and could not get any closer to the peak.
The real question now is what kind of computing performance the upcoming 40nm GPUs will bring.