
UPDATE #2: nVidia NV100 [Fermi] is less powerful than GeForce GTX 285?


At the beginning of this week, nVidia officially unveiled the third generation of its Tesla GPGPU product line. Judging by the selling prices, this product line looks like yet another cash cow for the company, just like the Quadro series [according to Jen-Hsun Huang, Quadro accounts for less than 25% of all shipments, yet makes up almost 75% of NVDA profits].

NV100-based Tesla board - Fermi cGPU architecture
NV100-based Tesla board – nVidia took a page from AMD’s book and added DVI output. Now you no longer have to purchase an expensive Quadro FX 5800 if you want to build a quad-GPU personal supercomputer

There will be two NV100-based Tesla cards: C2050 and C2070. As usual, the differences between them are clocks and the amount of on-board memory: C2050 comes with 3GB of GDDR5, while C2070 packs a full 6GB of GDDR5, pretty much the maximum amount of memory you can fit on a PCB of relatively compact dimensions. ASP [Average Sales Price] for C2050 is set at $2499, a nice $1000 jump over the existing generation. Tesla C2070 will go for $3,999, but do bear in mind that the cost of these products is significantly higher due to the massive amount of GDDR5 memory. Four-GPU server boxes based on C2050 will go for $12,995, while boxes with four C2070 cards will go for $18,995.

According to specifications based on NV100 A2 silicon [subject to change], C2050 will deliver 520 GFLOPS in IEEE 754-2008 double-precision format and 1.040 TFLOPS in single precision. C2070 fares a bit better: 630 GFLOPS in double precision and 1.26 TFLOPS in single precision. The single-precision numbers are the inconvenient reason why nVidia didn’t mention single-precision performance on any of the Tesla launch slides. The numbers are a pretty big letdown, given that you can buy EVGA’s factory-overclocked GT200-based GTX 285 FTW board, packing 1.063 single-precision TFLOPS. Truth be told, the previous-generation Tesla C1060 only delivered 933 GFLOPS SP and a mere 78 GFLOPS in double precision, so the scientific community can be happy with the double-precision increase.

nVidia NV100/GT300 based cards look near-identical to the mock up card
NV100 boards look near-identical to the mock up card – you can either use 1×8-pin or 2×6-pin PEG connectors

Power demands are really interesting: as we reported months ago, NV100/GT300 boards targeted a 225W TDP and that target has been achieved. The boards pack 8-pin [150W] and 6-pin [75W] PEG [PCI Express Graphics] connectors, but as expected, you should use either a single 8-pin or dual 6-pin connectors to power the card. While this will be true for Quadro and Tesla-based parts, we expect that nVidia’s partners will find a way to use both the 8-pin and 6-pin power connectors to enable overclocking. Given the performance achieved on these OEM-spec-limited parts, we would not be surprised to see significantly overclocked consumer cards that, as usual, will launch prior to the commercial lineup [Quadro / Tesla].
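The power arithmetic behind that paragraph can be sketched in a few lines of Python. This is only a rough check, not nVidia's power design: the connector ratings and the 225W TDP come from the text above, while the 75W slot figure is our assumption, taken from the PCI Express limit for an x16 graphics slot. The names `SLOT_W` and `board_budget` are ours, for illustration.

```python
# Rough power-budget check for the connector options described above.
SLOT_W = 75        # assumption: PCI Express x16 slot limit, not stated in the article
SIX_PIN_W = 75     # 6-pin PEG rating quoted in the article
EIGHT_PIN_W = 150  # 8-pin PEG rating quoted in the article
TDP_W = 225        # board TDP quoted in the article


def board_budget(six_pin=0, eight_pin=0):
    """Total watts available from the slot plus the given PEG connectors."""
    return SLOT_W + six_pin * SIX_PIN_W + eight_pin * EIGHT_PIN_W


if __name__ == "__main__":
    options = {
        "1x 8-pin": board_budget(eight_pin=1),
        "2x 6-pin": board_budget(six_pin=2),
        "8-pin + 6-pin": board_budget(six_pin=1, eight_pin=1),
    }
    for name, watts in options.items():
        print(f"{name}: {watts}W available, {watts - TDP_W}W above TDP")
```

Under that slot assumption, both supported configurations land exactly on the 225W target, while populating both connectors would leave headroom above TDP, which is consistent with the expectation that partners will use both for overclocked boards.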

As far as consumer products go, NV100 will be a tough sell in the numbers game, as the 15-month-old ATI Radeon HD 4870 has the same amount of computational horsepower as the upcoming GeForce GTX 300 series with 1 and 1.5GB of GDDR5 memory. The picture looks worse still when we take a look at the 2.72 TFLOPS delivered by ATI Radeon HD 5870, or at the imminent launch of AMD’s Hemlock, a dual-GPU board which according to Rick Bergman features "5 TFLOPS out of this baby" for $599.

History Revolvo Ipsum – History repeats itself
To us, NV100, i.e. the GT300/Fermi architecture, is looking more like NV30 every day: a revolutionary architecture, but underpowered, just like the first GeForce 256 [NV10] paved the way for GeForce 3 [NV20] and GeForce 4 [NV25/28], and like NV30 paved the way for NV40/45/47/RSX [essentially, GeForce 5800, 6800, 7800, 7900 and RSX are the same]. Do note that NV35 [FX5900] was "an anomaly", with added fixed-function shader hardware and a wider memory controller to make the GeForce 5800 architecture usable. How will NV100 fare? That chapter is still waiting to be written.

Update #1, November 18, 2009 01:14AM GMT – We were contacted by Mr. Andrew Humber, Senior PR Manager for Tesla Business, and we also read a few of the comments. In our story, it was not our intention to compare a consumer card [GeForce GTX 285] with a commercial one [C1060]. The essence of our story was that the NV100/GT300/Fermi board is not offering the same performance jump as was the case in previous generations, while at the same time the price has changed significantly.
This is a small comparison between three generations of Tesla parts:
3Q 2007: C870 1.5GB – $799 – 518 GFLOPS SP / No DP support
2Q 2008: C1060 4GB – $1499 – 933 GFLOPS SP / 78 GFLOPS DP
2Q 2010: C2050 3GB – $2499 – 1040 GFLOPS SP / 520 GFLOPS DP
3Q 2010: C2070 6GB – $3999 – 1260 GFLOPS SP / 630 GFLOPS DP

As you can see, C1060 offered a 78-fold increase in double precision [:-] over the previous generation, which had no DP support at all, and almost doubled single-precision performance. This is not the case with the upcoming generation of NV100-based parts as far as single-precision performance goes [1040 vs. 933 GFLOPS for C2050], even though double-precision performance rises roughly 6.7x on C2050 and 8x on C2070.
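The generation-to-generation ratios can be reproduced from the comparison list above. Here is a minimal sketch that uses only the prices and GFLOPS figures quoted there; the names `TESLA`, `ratio` and `dollars_per_sp_gflops` are ours, for illustration.

```python
# Figures copied from the comparison list: (price in USD, SP GFLOPS, DP GFLOPS).
# DP is None for C870, which has no double-precision support.
TESLA = {
    "C870": (799, 518, None),
    "C1060": (1499, 933, 78),
    "C2050": (2499, 1040, 520),
    "C2070": (3999, 1260, 630),
}

PRICE, SP, DP = 0, 1, 2  # column indices into the tuples above


def ratio(new, old, column):
    """Return the new/old ratio for one column (PRICE, SP or DP)."""
    return TESLA[new][column] / TESLA[old][column]


def dollars_per_sp_gflops(card):
    """Price divided by single-precision GFLOPS for one card."""
    return TESLA[card][PRICE] / TESLA[card][SP]


if __name__ == "__main__":
    for new, old in [("C1060", "C870"), ("C2050", "C1060"), ("C2070", "C1060")]:
        line = (f"{new} vs {old}: SP x{ratio(new, old, SP):.2f}, "
                f"price x{ratio(new, old, PRICE):.2f}")
        if TESLA[old][DP] is not None:  # skip DP when the older card has none
            line += f", DP x{ratio(new, old, DP):.2f}"
        print(line)
    for card in TESLA:
        print(f"{card}: ${dollars_per_sp_gflops(card):.2f} per SP GFLOPS")
```

On these figures, the price per single-precision GFLOPS rises with each generation, which is the core of the complaint; the double-precision column is where the new parts pull ahead.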

Update #2, November 18, 2009 02:17AM GMT – Following our article, we were contacted by Mr. Andy Keane, General Manager of Tesla Business, and Mr. Andrew Humber, Senior PR Manager for Tesla products. In a long conversation, we discussed the topics in this article and the Tesla business in general. First and foremost, Tesla is the slowest-clocked member of the Fermi GPU architecture, as it has to qualify for supercomputers. Winning an HPC contract takes more than the chips themselves.
Bear in mind that Intel lost out at Oak Ridge, despite its otherwise superior Core 2 architecture, after Woodcrest had a reject rate higher than 8% [the results of that Opteron vs. Xeon trial in 2006 are visible today, as Oak Ridge is the site of the AMD-powered Jaguar, the world’s most powerful supercomputer]. In order to satisfy the required multi-year operation under 100% stress, nVidia Tesla C2050/2070 went through the following changes compared to the Quadro card [which, again, will be downclocked from the consumer cards]:

  • Memory vendor is providing specific ECC version of GDDR5 memory: ECC GDDR5 SDRAM
  • ECC is enabled on both the GPU side and the memory side; this carries significant performance penalties, hence the GFLOPS number is significantly lower than on Quadro / GeForce cards.
  • ECC will be disabled on GeForce cards and most likely on Quadro cards
  • The capacitors used are of the highest quality
  • Power regulation is completely different and optimized for usage in rack systems – you can use either a single 8-pin or dual 6-pin connectors
  • Multiple fault-protection
  • DVI was brought in on demand from the customers to reduce costs
  • Larger thermal exhaust than Quadro/GeForce to reduce the thermal load
  • Tesla cGPUs differ from GeForce with activated transistors that significantly increase the sustained performance, rather than burst mode.

Since Andy is the General Manager for the Tesla business, he has no contact with the Quadro or GeForce business units, and thus he was unable to say what the developments are in those segments. We challenged nVidia over several issues and we’ll see how those open matters are resolved. In any case, we’ll keep you informed.