
AMD’s Folding performance explained, future development revealed

Following the article about Top graphics cards for Folding@Home, I managed to open some doors and receive answers from people closely involved with the project.
I had the luck of being contacted by people who were, or still are, involved with the project, and their answers were quite interesting. Names will remain undisclosed, of course. ;-) In order to keep the article clear, I’ve simplified some items that came up in the discussions – I will try to keep it both technical and simple. Impossible task, I know.
Onto the matter then – the root of ATI’s problems lies in the fact that the ATI client has been around for several hardware generations. Going back to the beginning, Dr. Vijay Pande (head of the F@H project) and Mike Houston (GPGPU pioneer, now an AMD employee) demonstrated the Folding@Home client around two years ago, using an ATI Radeon X1900 as the base for the demonstration.
The Problem
And here lies the problem with the current GPU client – ATI’s X1K hardware carries one big flaw: the lack of local memory shared between the shader units. As you probably know, Nvidia designed the G80 and subsequent GPUs with shaders in groups of 8 units, with a cache sitting between them. According to our sources, it is this missing cache that stops ATI from achieving greatness, because we heard claims that their VLIW shader arrangement is otherwise “best in class”.

The reason for GeForce dominance lies in the purple bar - scratch cache

Then again, the gaming problem with the X1K, and later the R600 and RV670, was the relative lack of texture units (TMUs), while the GPGPU problem continued to be the missing local share. You might now be wondering what happens if you don’t put that “scratch cache” in the GPU. What happens is that the CPU gets polled constantly, and this drags performance down into the gutter.
We heard a lot of technical details about that particular issue, and about the difference in scaling between a dual-core CPU and a quad-core one. All in all, quite interesting stuff. But there is one large point to be made: the reason Nvidia is so successful with CUDA is that Nvidia offered what companies needed (scratch cache, CUDA itself, math libraries), while ATI suffers from having picked Brook+ as its bread and butter until OpenCL comes along.
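To illustrate what that scratch cache buys you in practice, here is a minimal CUDA sketch – not Stanford’s actual folding kernel; the kernel name, the TILE size and the toy 3-point stencil are purely illustrative. Each thread stages its element into on-chip __shared__ memory once, and neighboring threads then reuse it from there instead of making repeated trips to slow off-chip memory:

```cpp
// Minimal sketch, not Stanford's folding kernel: a toy 3-point stencil that
// stages data in on-chip __shared__ memory (Nvidia's "scratch cache") so each
// element is fetched from slow off-chip memory only once per thread block.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 128  // threads per block; illustrative choice

__global__ void stencil_shared(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];                 // block's slice plus halo cells
    int gid = blockIdx.x * blockDim.x + threadIdx.x; // global element index
    int lid = threadIdx.x + 1;                       // index inside the tile

    if (gid < n) {
        tile[lid] = in[gid];                         // one off-chip read per thread
        if (threadIdx.x == 0)                        // edge threads fill the halo
            tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
        if (threadIdx.x == blockDim.x - 1 || gid == n - 1)
            tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    }
    __syncthreads();                                 // every thread reaches the barrier

    // Three reads, all served from fast on-chip shared memory.
    if (gid < n)
        out[gid] = tile[lid - 1] + tile[lid] + tile[lid + 1];
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    stencil_shared<<<(n + TILE - 1) / TILE, TILE>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[1] = %.1f (expected 3.0)\n", out[1]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

On pre-RV770 Radeons there is simply no equivalent on-chip pool to stage that tile in, so every one of those reuse reads has to go back out over the memory path – which is exactly the constant polling described above.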
RV770 saves the day…or not?
The RV770 GPU, better known as the Radeon 4800 series, is a vast improvement over previous generations. GPGPU-wise, the most important change is the introduction of local share, since each of the chip’s ten SIMD cores got its own “slice of the pie” of local memory. But GPGPU is a more complex field than just “here is the feature, we can all use it now”.
Our sources repeatedly criticized Brook+, claiming it is not in sync with AMD’s own CAL and Stream SDKs. Brook+ allegedly breaks “with new drivers, with old drivers” – “whatever can go wrong, it can” – and so on.
ATI’s hardware now has local share, but support for it has to be coded into Brook+. AMD recently released the 1.21 Beta Stream SDK featuring local share, but that same support has yet to arrive inside Brook+ itself.
The Solution: Q1’09
So, we have shown you the problem, and now it is time for the solution. ATI can’t fix the performance issue on previous-gen hardware, but it can solve a multitude of issues on Radeon 4800 boards. The team at Stanford is taking the necessary steps to redo the workflow and introduce local memory share. This could take months, so the realistic goal is to have a new client arrive in Q1’09.
Once the Radeon 4870 gets fully utilized, those 800 shaders running at around 70% of their theoretical value (700-800 GFLOPS out of a 1-1.2 TFLOPS peak) should be good enough to reach the level of the GTX 280.
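For reference, those peak numbers follow from simple multiplication, assuming the usual counting of one multiply-add (two FLOPs) per stream processor per clock:

800 SPs × 2 FLOPs/clock × 0.625 GHz = 1.0 TFLOPS (Radeon 4850)
800 SPs × 2 FLOPs/clock × 0.750 GHz = 1.2 TFLOPS (Radeon 4870)

70% of that theoretical peak works out to roughly 700-840 GFLOPS, which is where the 700-800 GFLOPS estimate comes from.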

The next story update will bring some views and opinions from AMD folks.