Superscalar Technology
 
CPU information: A pipeline makes computation fast, but it has a bottleneck, because only one instruction at a time is carried out in the execution stage.

A CPU contains an ALU and an FPU.
These two units can work in parallel, because each one carries out a different instruction.
This gives a higher computing speed.
A superscalar processor (as opposed to a scalar processor, which has 1 layer) contains more than one layer.
All the layers of a unit work in parallel, so the more layers there are, the more instructions are carried out at the same time (number of instructions = number of layers).
There can be layers with ALU, FPU, MMX, 3D units...

Superscalar processor with two layers (two layers is the minimum for a superscalar processor):
 
[Diagram: first, second, third and fourth instructions]

Instructions 1 and 2 start at the same time (in this example) and, one cycle later, since each layer is pipelined,
instructions 3 and 4 can start.
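
The start order described above can be sketched in a few lines of Python. This is only an idealised model, not a description of any real CPU: it assumes no stalls, no dependencies between instructions, and that every layer accepts one new instruction per cycle. The function name issue_schedule and its parameters are made up for this illustration.

```python
def issue_schedule(num_instructions, num_layers):
    """Return the cycle in which each instruction starts.

    Idealised assumptions (not from any real CPU): no stalls, no
    dependencies, and each layer accepts one new instruction per cycle.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    # Instructions are handed out in program order, num_layers per cycle.
    return [i // num_layers for i in range(num_instructions)]


scalar = issue_schedule(4, 1)       # one layer:  [0, 1, 2, 3]
superscalar = issue_schedule(4, 2)  # two layers: [0, 0, 1, 1]

for number, cycle in enumerate(superscalar, start=1):
    print(f"instruction {number} starts in cycle {cycle}")
```

With two layers, instructions 1 and 2 start in cycle 0 and instructions 3 and 4 in cycle 1, whereas a single layer needs four cycles to start the same four instructions.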

A 486 at 100 MHz is approximately half as fast as a Pentium 100,
even though both work at the same frequency. The difference is that the Pentium is superscalar: it has two ALUs (ALU1 and ALU2).
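
The factor of two can be checked with a small back-of-the-envelope calculation. The sketch below assumes an ideal case in which every layer completes one instruction per cycle, which real programs rarely reach; the function name peak_instructions_per_second is invented for this illustration.

```python
def peak_instructions_per_second(clock_hz, num_layers):
    """Ideal peak rate: one instruction per layer per clock cycle.

    This is an upper bound under the stated assumption, not a benchmark.
    """
    return clock_hz * num_layers


CLOCK_HZ = 100_000_000  # 100 MHz, the same for both processors

i486_peak = peak_instructions_per_second(CLOCK_HZ, 1)     # one ALU
pentium_peak = peak_instructions_per_second(CLOCK_HZ, 2)  # ALU1 and ALU2

print(pentium_peak / i486_peak)  # 2.0
```

The ratio depends only on the number of layers, which is why the frequency alone does not tell how fast a processor is.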

Intel information:
Processor      ALU         FPU
486 SX         1           0
486 DX         1           1
Pentium        2           1
Pentium Pro    2?          1? or 2?
Pentium II     2? or 4?    1? or 2?
 

What is envisaged for future processors:

The new processors will be pipelined and superscalar (2 or 4 ALUs, 2 FPUs, 2 MMX units, 1 or 2 3D units).
The 3D unit is likely to be used heavily by accelerator cards, so it will very quickly become superscalar (with 2 or 4 layers).

Future processors:
Units:     ALU    FPU       MMX    3D
K6 3D      2      1 or 2    2      1
Cayenne    2      2?        1      1
The 3D unit of the new 6x86MX (Cayenne) is named MMXFP
because the 3D instructions will be part of an enhanced MMX unit.

A superscalar unit cannot avoid waiting, and certain stages (A and C especially) take more than one cycle.
Moreover, one layer (Layer1) sometimes has to wait for another layer (Layer2) when it needs the result of Layer2.
The CPU therefore needs a unit that deals with these constraints, so that as many independent instructions as possible get through.
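
What such a unit has to decide can be sketched as follows. This is a simplified, hypothetical model and not the logic of any actual processor: instructions are issued in program order, every instruction takes one cycle, and an instruction is held back only if it reads a register written by an instruction started in the same cycle. Other conflicts (for example two instructions writing the same register) are ignored. The names Instr and schedule and the register names are invented for the example.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    name: str
    dest: str     # register written by the instruction
    srcs: tuple   # registers read by the instruction


def schedule(program, num_layers=2):
    """Group instructions into cycles, in program order.

    An instruction joins the current cycle only if a layer is free and it
    does not read a register written earlier in that same cycle.
    """
    cycles = []
    i = 0
    while i < len(program):
        group = [program[i]]
        written = {program[i].dest}
        i += 1
        while i < len(program) and len(group) < num_layers:
            candidate = program[i]
            if written & set(candidate.srcs):
                break  # needs a result from another layer: wait one cycle
            group.append(candidate)
            written.add(candidate.dest)
            i += 1
        cycles.append([instr.name for instr in group])
    return cycles


program = [
    Instr("i1", "r1", ("r2", "r3")),
    Instr("i2", "r4", ("r1", "r5")),  # reads r1, which i1 writes
    Instr("i3", "r6", ("r7", "r8")),
    Instr("i4", "r9", ("r2", "r3")),
]

print(schedule(program))  # [['i1'], ['i2', 'i3'], ['i4']]
```

Because i2 needs the result of i1, the second layer stays empty in the first cycle, and the four instructions take three cycles instead of two.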
 

If you want more information:

NACSE Superscalar Simulator
http://www.nacse.pdx.edu/Simulator/