
ARM Highlights Benefits of Their New, Faster GPGPU


Last week, ARM held their TechCon 2010 technology conference, where they introduced the new Mali-T604 GPU [Graphics Processing Unit] IP and the accompanying CoreLink 400 interconnect IP, which provides high-speed cache sharing. Both are paired with the recently announced 2.5GHz Cortex-A15 MPCore processor design.
System IP for ARM Cortex-A15 processor and Mali-T604 GPU reveals architectural improvements for increased performance while keeping the trademark advantage over competition in power consumption
Tudor Brown, ARM’s COO and former CTO, noted at TechCon 2010 that the company is now 20 years old, and that ARM’s partners have shipped 20 billion of its cores to date. Over the next 10 years, he said, another 100 billion cores will be sold. He gave an overview of the developing markets that ARM is going to address with their new offerings.
Linaro, the Linux tool kit consortium for ARM IP, demoed three mobile devices using its Ver 10.11 open source stack at the ARM Technology Conference. The promise of Linaro is to speed time-to-market for Linux-based products by standardizing common chip-specific software.
Linaro’s focus on upstream development has gotten attention from the Linux community. Linux Foundation executive director Jim Zemlin said "The Linux Foundation welcomes the increase in upstream investment that Linaro has made on behalf of the ARM community. The collaborative engineering work Linaro is doing in the Linux kernel will help accelerate innovation in open source."
During the Linaro presentations, three dual-core ARM Cortex-A9 CPU variations were demoed. Samsung showed its Orion, and Texas Instruments had its OMAP4 processor running Ubuntu. ST-Ericsson demoed its U8500 chip running MeeGo. All three demos used Linaro tools. ARM and five of its SoC [System on Chip] customers formed Linaro in June at Computex in Taiwan. The group now includes 70 full-time engineers and expects to have more than 100 by early next year.
According to ARM’s specifications the CoreLink 400 system IP "enables designers to resolve the critical issues of coherency, virtualization, latency and power management to ensure each processor is able to share memory resources and maximize overall system performance."
Michael Dimelow, ARM’s Director of Processor Marketing, told BSN* that without hardware coherency the flushing and invalidation of data requires many CPU cycles because data is written to main memory [DDR]. This burns power, increases latency and occupies the CPU. Plus, cache maintenance software is notoriously difficult to debug.
BSN* also spoke with Ian Smythe, Director of Marketing, and Jem Davies, ARM Fellow and VP of Technology, in the Media Processing Division. They said the CoreLink 400 system’s AMBA 4 ACE module allows the hardware to manage cache coherency. Caches do not need to be flushed or invalidated, and external memory accesses are reduced, because shared data can now be read directly from the processors’ caches. Hardware coherency simplifies software, and the processor spends less time maintaining caches, which is good for both power and performance. This also improves QoS [quality of service] for communications applications.
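The difference between the two approaches can be illustrated with a toy model. The sketch below is not ARM code and does not model AMBA 4 ACE; the `Memory` and `Cache` classes and both `share_with_*` functions are hypothetical names invented for this illustration. It only counts how many trips to main memory [DDR] are needed to pass one value from a producer to a consumer.

```python
"""Toy model: software cache maintenance vs. hardware coherency.

Hypothetical illustration only. It does not model AMBA 4 ACE or any
real ARM interconnect; it just counts accesses to main memory (DDR).
"""


class Memory:
    """Main memory (DDR) that counts every access made to it."""

    def __init__(self):
        self.data = {}
        self.accesses = 0

    def read(self, addr):
        self.accesses += 1
        return self.data.get(addr, 0)

    def write(self, addr, value):
        self.accesses += 1
        self.data[addr] = value


class Cache:
    """A write-back cache private to one processor."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def write(self, addr, value):
        self.lines[addr] = value  # stays in the cache until flushed

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory.read(addr)  # miss: go to DDR
        return self.lines[addr]

    def flush(self, addr):
        self.memory.write(addr, self.lines[addr])  # write back to DDR

    def invalidate(self, addr):
        self.lines.pop(addr, None)  # drop any stale copy


def share_with_software_maintenance(value):
    """Producer flushes; consumer invalidates and re-reads from DDR."""
    ddr = Memory()
    producer, consumer = Cache(ddr), Cache(ddr)
    producer.write(0x100, value)
    producer.flush(0x100)          # one DDR write
    consumer.invalidate(0x100)
    result = consumer.read(0x100)  # one DDR read
    return result, ddr.accesses


def share_with_hardware_coherency(value):
    """Consumer gets the line cache-to-cache; DDR is never touched."""
    ddr = Memory()
    producer, consumer = Cache(ddr), Cache(ddr)
    producer.write(0x100, value)
    # Modeled snoop: the interconnect copies the line between caches.
    consumer.lines[0x100] = producer.lines[0x100]
    result = consumer.read(0x100)  # cache hit, no DDR access
    return result, ddr.accesses


if __name__ == "__main__":
    print(share_with_software_maintenance(42))  # (42, 2)
    print(share_with_hardware_coherency(42))    # (42, 0)
```

In this simplified model the software-managed path costs two DDR accesses per shared value and the coherent path costs none, which is the effect on power and latency the ARM representatives describe. Real hardware involves many more states and costs than this sketch captures.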
ARM’s Mali-T604 GPU is fourth-generation IP which will work with Cortex-A9. The T604 is claimed to be five times faster than the current ARM GPU. The specific model for comparison was left out. We guess they mean their current multicore Mali-400 MP 2D/3D GPU.
Michael Dimelow told BSN* that their new Mali-T604 supports "General Purpose computing on GPU [GPGPU] applications which gives developers enhanced augmented reality and gesture recognition."
ARM Mali-T604 features
ARM’s new GPU brings a sea of improvements over the current generation Mali-400 MP
The Mali-T604 has new core technology to reduce memory bandwidth usage by up to 30 percent, which in turn delivers benefits in power efficiency. The "green transistors" at work in the Mali-T604 mean less power consumption in mobile phones, tablets, DTVs, and automotive infotainment. EPRI says the world’s energy consumption will increase 57 percent from 2002 to 2025.
Lance Howarth, general manager of the Media Processing Division at ARM, said "Visual computing is driving the next generation of consumer electronics, as consumers and developers demand the highest levels of graphics performance." He explained that the "tri-pipe architecture in the Mali-T604 provides both market leading compute functionality and high-performance graphics without compromise, enabling unequaled user experiences in energy-efficient consumer electronic devices."
ARM claims the Mali-T604’s triple-pipeline design means that the chip can be used for both GPGPU and GPU tasks simultaneously. "GPGPU capabilities will make it possible for procedural content to be computed on the device," said Dr Sébastien Deguy, founder and CEO, Allegorithmic.
High Level Architectural Overview of ARM Mali-T604

ARM has also included API support for Khronos Group’s OpenGL ES and OpenVG, plus Microsoft’s DirectX, including the game-friendly DirectX 11. This means we will be seeing the Windows 7 Embedded OS and the Windows Phone 7 OS running on the Mali-T604 GPGPU with dual and quad-core Cortex-A9 CPUs. Likely candidates are Cortex-A9 based Marvell ARMADA XP and Samsung’s Hummingbird.
At the launch of the Cortex-A15 MPCore processor IP, ARM said it is equipped with an out-of-order superscalar pipeline, along with a tightly-coupled low-latency L2 cache of up to 4MB. ARM says the processor can decode and dispatch up to three instructions per cycle, or three times the rate of an ARM11 processor.
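As a rough back-of-the-envelope check, the dispatch width can be combined with the 2.5GHz clock quoted for the design to get a theoretical ceiling on instruction throughput. The short sketch below is only that arithmetic; the function name is hypothetical, and the result is an upper bound that real workloads would not sustain.

```python
def peak_instructions_per_second(clock_hz, instructions_per_cycle):
    """Theoretical ceiling: every cycle dispatches the maximum width."""
    return clock_hz * instructions_per_cycle


CLOCK_HZ = 2_500_000_000  # 2.5GHz, the clock speed quoted for the Cortex-A15
DISPATCH_WIDTH = 3        # "up to three instructions per cycle"

peak = peak_instructions_per_second(CLOCK_HZ, DISPATCH_WIDTH)
print(f"{peak / 1e9:.1f} billion instructions per second")  # 7.5
```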

The Cortex-A15 specifications say up to eight instructions can be issued per cycle, and that the processor takes less than 10 microseconds to move into standby or wake up. The company says floating point and NEON instruction set performance for signal processing and multimedia have also been improved.
ARM’s roadmap for Cortex MP processors places the next-gen A15 core in late 2012, which is something ARM executives aren’t telling
Compared to the Cortex-A9, the Cortex-A15 is said to add more efficient hardware support for operating system [OS] virtualization, soft-error recovery, larger memory addressability, and system coherency. Those "green transistors" are kept very busy while using small amounts of power.
What are these "green transistors"? ARM has always been able to get more computing power per watt of power consumption than the x86 architecture CPU because the ARM architecture was developed around battery power from the beginning. ARM cores are the number one choice for powering mobile phones and smartphones.

In turn, Intel is famous for their "performance transistors". The x86 architecture was developed by Intel, and its rise was closely tied to Microsoft’s software. However, performance transistors require a large power source, and devices built on them have to be plugged into mains power sooner than green transistor hardware. In the past, ARM’s green transistors have not been able to compete head-to-head with Intel’s x86 performance transistors; see Van Smith’s "The Coming War: ARM versus x86" for a comparison of the two approaches.
The new products announced at TechCon 2010 should move into direct competition with Intel in several markets. ARM is known for leveraging their partners’ abilities. This week, Marvell announced their ARMADA XP, a four-core Cortex-A9 running at 1.6GHz aimed at the server marketplace.
Another step up for the Cortex-A9 is the optimization package targeting Samsung 32nm LP High-K Metal Gate [HKMG] process technology. The presentation showed how the ARM Processor Optimization Pack [POP] provides a highly tuned foundation for implementing Cortex-A9 processors in low power, mobile applications. Based on ARM Artisan optimized logic and memory physical IP, the POP is also supported by implementation knowledge and ARM benchmarking, providing a rich foundation for leading edge SoC designs. The Processor Optimization Pack enables operation over 1GHz, and is available for immediate licensing from ARM.
The importance of having Samsung as the lead fab for 32nm ARM IP is that they are a part of the Common Platform Alliance. The Alliance is moving ahead with offering their customers the same level of wafer quality at all the members’ locations.

Samsung's 32nm ARM Cortex based chip
Kuang-Kuo Lin, director of design enablement in the foundry unit for Samsung Semiconductor, says they are ready to start volume production of 32nm ARM IP products on their main fab lines. Sooner rather than later, Samsung Korea’s new 32nm ARM IP products will go into allocation because of high volume demand from their foundry customers, including Apple, Qualcomm, and Xilinx. BSN* expects that the overflow demand will result in GlobalFoundries’ Dresden and New York fabs picking up new customers.
This brings us back to the green transistor versus performance transistor debate. Intel is finding it more difficult to decrease the power usage of a performance transistor than ARM and its partners are finding it to scale up their green transistors, gradually increasing the number of CPU and GPU instructions per cycle while maintaining low power consumption from batteries.
Since ARM’s Mali is a fourth-generation GPGPU, their potential for a successful product delivering all the claimed features is much higher than when Intel leaped head first into the GPU business with their ill-fated Larrabee project. The Mali-T604 introduction is well matched to ARM customer requirements. BSN* expects it to dominate the smartphone marketplace and be a contender in the tablet market.
Gil Russell contributed his interview information at ARM TechCon 2010 to this article.