This paper describes the floating-point unit (FPU) in the synergistic
processor element of the first-generation multi-core CELL processor. The FPU
supports 4-way SIMD single-precision and integer operations and 2-way
SIMD double-precision operations. The design required high frequency,
low latency, and power and area efficiency, with primary application
to multimedia streaming workloads such as 3D graphics. The FPU has
three different latencies, optimizing the
performance-critical single-precision fused multiply-add (FMA)
operations, which are executed with a 6-cycle latency at an 11 FO4
cycle time. The latency
includes the global forwarding of the result. These challenging
performance, power, and area goals were achieved through the co-design
of architecture and implementation with optimizations at all levels of
the design. This paper focuses on the logical and algorithmic aspects
of the FPU that we developed to achieve these goals.