Part 2 describes the techniques BDTI used for optimizing DSP algorithms on the Cortex-R4. For more analysis of ARM cores, see "Can the ARM11 Handle DSP?"
In 2004, ARM announced its newest generation of licensable cores, called the "Cortex" family. Cortex cores span a wide range of performance levels, with Cortex M-series cores at the low end, Cortex R-series cores providing mid-range performance, and the Cortex A-series applications processors offering the highest performance. The first Cortex core to be announced was the Cortex-M3, and since then ARM has announced several others, including the Cortex-A8 and A9, the Cortex-M1, and the Cortex-R4.
The Cortex-R4 targets moderately demanding applications such as hard disk drives, inkjet printers, automotive safety systems, and wireless modems. It is marketed as a higher-performance replacement for the older ARM9E core. BDTI recently completed a benchmark analysis of the ARM Cortex-R4 core and is now releasing the first independent signal processing benchmark results for this processor. In this article, we'll take a look at its benchmark results and compare its performance to that of other ARM cores (including the ARM11, another moderate-performance core) and selected competitors.
Table 1 summarizes key attributes of selected ARM processor cores.

Table 1. Characteristics of selected ARM cores.
* Clock speed data provided by ARM, not verified by BDTI. Clock speeds for ARM9E and ARM11 are worst-case speeds in a TSMC CL013G process and ARM Artisan SAGE-X library. Clock speed for Cortex-R4 is worst-case for a 90 nm CLN90G Artisan Advantage implementation. High-end clock speed for Cortex-A8 is based on a custom implementation.
As shown in Table 1, the Cortex-R4 is a superscalar core that can issue and execute up to two instructions per cycle. Like the Cortex-A8, it supports the ARMv7 instruction set architecture and the Thumb-2 compressed instruction set, but the Cortex-R4 does not support the NEON signal processing extensions. As a result, its signal processing capabilities and features are much more limited than those of the Cortex-A8.
The Cortex-R4 as a Signal Processing Engine
The Cortex-R4 targets applications that include moderate signal processing requirements, and the core includes hardware and instructions to help improve its performance on this type of processing. For example, the Cortex-R4 supports SIMD (single instruction, multiple data) instructions that enable it to perform two 16-bit multiply-accumulate operations (MACs) per cycle; MAC operations are heavily used in many common signal processing algorithms, such as filters and FFTs.
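To make the "two 16-bit MACs per cycle" idea concrete, here is a minimal portable C sketch of what a dual 16-bit multiply-accumulate computes. It is only a behavioral model, not BDTI's benchmark code: the function names (`lane16`, `pack16`, `dual_mac16`, `dot16_dual_mac`) are hypothetical, and the assumption that this pattern maps onto an ARM SIMD instruction such as SMLAD is ours rather than something stated in the article.

```c
#include <stddef.h>
#include <stdint.h>

/* Extract a signed 16-bit lane from a 32-bit word (shift is 0 or 16). */
static int32_t lane16(uint32_t w, unsigned shift)
{
    uint32_t v = (w >> shift) & 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Pack two signed 16-bit samples into one 32-bit word. */
static uint32_t pack16(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Model of a dual 16-bit MAC: multiply matching lanes of x and y and add
 * both products to acc. Assumes the result fits in 32 bits; overflow
 * handling is deliberately left out of this sketch. */
static int32_t dual_mac16(uint32_t x, uint32_t y, int32_t acc)
{
    return acc + lane16(x, 0) * lane16(y, 0) + lane16(x, 16) * lane16(y, 16);
}

/* Dot product that consumes two samples per dual-MAC step, as the inner
 * loop of an FIR filter would. A leftover odd sample uses a single MAC. */
static int32_t dot16_dual_mac(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        acc = dual_mac16(pack16(a[i], a[i + 1]), pack16(b[i], b[i + 1]), acc);
    }
    if (n & 1u) {
        acc += (int32_t)a[n - 1] * b[n - 1];
    }
    return acc;
}
```

In this model the loop runs half as many iterations as a one-MAC-per-step loop would, which is where the potential speedup on filter and FFT inner loops comes from. On real hardware the samples would normally already sit packed in memory, so the explicit packing step shown here would not be needed.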
To assess the Cortex-R4's signal processing capabilities and compare its performance to that of other processors, BDTI benchmarked the Cortex-R4 using the BDTI DSP Kernel Benchmarks, a suite of 12 key DSP algorithms such as FIR filters, FFTs, and a Viterbi decoder. These benchmarks are hand-optimized for each processor, typically in assembly language, and verified by BDTI. The BDTI DSP Kernel Benchmarks have been implemented on a wide variety of processor cores and chips, providing a range of comparison data for evaluating new processors.
BDTI uses processors' results on the DSP Kernel Benchmarks to generate an overall signal processing speed metric, the BDTImark2000. (When the benchmark performance is verified using a simulator rather than hardware, this metric is called the BDTIsimMark2000.) The BDTImark2000 metric combines the number of cycles required to execute each benchmark with the processor's instruction cycle rate (i.e., its clock speed) to determine the amount of time the processor requires to execute the benchmarks. For off-the-shelf chips, we use the fastest clock speed at which the chip is currently shipping. For licensable cores, the clock speed depends on how the core is fabricated. To enable apples-to-apples comparisons, BDTI typically uses clock speeds for cores fabricated in a TSMC 130 nm process under worst-case conditions. ARM has not reported this data for all of its cores, so BDTI has used alternate clock speeds in some cases, as noted in Table 1.
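The cycles-to-time step described above can be sketched in a few lines of C. This shows only how a cycle count and a clock speed combine into an execution time, and how two processors' times compare on one kernel. It does not reproduce how the BDTImark2000 aggregates the individual benchmark results, which the article does not describe. The function names are hypothetical.

```c
#include <stdint.h>

/* Execution time in microseconds: cycles divided by clock speed in MHz
 * (1 MHz = 1 cycle per microsecond). */
static double exec_time_us(uint64_t cycles, double clock_mhz)
{
    return (double)cycles / clock_mhz;
}

/* Speed of processor A relative to processor B on the same kernel.
 * A value above 1.0 means A finishes sooner than B. */
static double relative_speed(uint64_t cycles_a, double mhz_a,
                             uint64_t cycles_b, double mhz_b)
{
    return exec_time_us(cycles_b, mhz_b) / exec_time_us(cycles_a, mhz_a);
}
```

For example, with made-up numbers, a kernel that takes 1,000 cycles at 250 MHz runs in 4 microseconds. A processor that needs half as many cycles at the same clock speed, or the same number of cycles at twice the clock speed, comes out twice as fast.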
In Figure 1, we present BDTIsimMark2000 scores for selected ARM cores, alongside BDTImark2000 scores for two off-the-shelf DSP processor chips for comparison.

Figure 1. BDTImark2000 scores for selected cores and chips. The BDTImark2000 is a composite DSP speed metric based on processors' results on the BDTI DSP Kernel Benchmarks. A higher score indicates a faster processor. ARM has not provided clock speeds for the Cortex-R4 and Cortex-A8 that conform to BDTI's uniform conditions for cores; therefore, the results for these two cores should not be compared to results for non-ARM cores.
As shown in Figure 1, the Cortex-R4 and ARM11 have similar signal processing performance. (For a full analysis of the ARM11's signal processing performance, see "Can the ARM11 Handle DSP?") The Cortex-R4 is not intended to replace the ARM11; rather, ARM positions the Cortex-R4 as a higher-performance replacement for the ARM9E. Compared to that processor, the Cortex-R4 is nearly three times as fast. Some of the speed increase is due to the Cortex-R4's more powerful architecture (we'll discuss this more later), and some is due to its faster clock speed.
At the clock speeds shown above, the Cortex-R4's signal processing speed is similar to that of the Texas Instruments TMS320C55x, a widely used, mid-range DSP chip. At this level of performance, the Cortex-R4 may be able to subsume the processing typically allocated to a low-cost DSP processor. At 450 MHz, the Cortex-A8 with NEON signal processing extensions is more than twice as fast as the 375 MHz Cortex-R4. (The 450 MHz clock speed used here to calculate benchmark results for the Cortex-A8 is the estimated speed of the core as fabricated in Texas Instruments' OMAP3410 chip.)
From the data presented in Figure 1, it's clear the clock rate accounts for only part of the signal processing speed differences among processors. The other factor is the processors' architectural "power"—that is, how much work each processor can accomplish in each clock cycle. In the next section, we'll look at some of the architectural differences that contribute to the performance numbers shown above.