PENTIUM/486 Optimization

PENTIUM/486 Optimization
------------------------

By Michael Kunstelj.

G'day all...

In case any one is interested, here is some information on optimising
assembly code on the Pent / 486. If you don't know/aren't interested in
ASSEMBLY code, this will all seem about as interesting as ending titles in
a bad X-Rated movie. Still, hopefully someone out there will find this
useful. If so, enjoy... and....

Brief Intro
----------------

Now a brief intro on the innards of the 486 and Pent586.
Most processors these days have within in them a system termed a
"pipeline".
The 486 / 586 are certainly within this category. However, most people
aren't quite sure what exactly a pipeline is, here's a rough analogy:

Imagine a processing plant, with a line of employees each performing a
certain task. On one end of the line, the first employee receives a piece,
which he works apon. After completing his task, he passes it to the next
employee, who in turn works apon the item. The process continues until
the item is at the end of the line, apon which time it is considered
completed. However, with this simplistic model, does it not seem strange
that for each item, all employees must wait for the first item to
manufactured before the next one can be considered. Yup, it is, and stupid
to boot.
It makes far more sense for each employee to work on something in the
interim. Thus, employee 1 in the above example finishes his work on the
item, passes it to employee 2, and then picks up a new item and starts
working on it.

That's pretty well the same system in a Microprocessors pipeline.
Numerous blocks perform in parallel, passing their information on when
the stage in front of it is free.

Because of pipelining, times taken in terms of instruction cycles become
blurred. An instruction that takes 7 clock cycles to implement, for
example, will be behind in the pipeline from the previous instruction. After
the previous instruction is completed, the final stage in the pipeline will be
dedicated to the next instruction. This final stage may take perhaps 1 or 2
clock cycles. The net effect? The instruction will effectively appear to
have taken 1 or 2 clock cycles to execute.

The 586 Pent is a superscalar system using multiple pipeline to process
data. You get the idea from the example above sort of how it would work.

Anyway, here are some ways to speed up 486/586 code:

SPEEDING UP 486/586 CODE.
--------------------------------------------------

AGI
------
An Address Generation Interlock is occurs when a register which is
currently being used as the base or an index was the destination component
of a previous instruction.

For example,:

add edx, 4
// AGI, ONE CLOCK STALL
mov esi, [edx]

For the 486, AGI's can only occur on adjacent instructions. On the 586
Pent, having the term superscalar in its design specification to amaze
people with its high degree of parallel execution, instructions up to 3
locations away can cause an AGI.

The code: (worst case scenario)

add esi, 4
pop ebx
dec ebx
mov edx, [esi]

Takes 4 clocks on a 486. On a 586, the move command must wait for the
add command, thus AGI stalling for one clock. The code above then
would run in three clock cycles on a 586.

Some Instructions that read or write implicitly to registers cause AGI stalls.
eg. mov esp, ebp [agi] pop ebp.

When decoding an instruction with either an index or an immediate-
displacement combination, 486's have a one clock penalty. (586's do not)
Example:

mov spam, 248
mov dword ptr [esp+4], 1

486's have a one clock penalty when modifying the lower word of a
DWORD register. 586's do not.

Example:
mov al,0
mov [ebp], eax
(a one clock penalty on 486, no penalty on a 586)

CODE ALIGNMENT
-------------------------------

To speed up the 486 instructions, alignment of code should be on the
beginning of cache boundaries.
(32 bytes on 586, 16 bytes on 486)
Thus, start your routine on the start of the cache page. This will speed up
the processing of all the instructions within that page. The effect of this
speed up on the 586 is not a dramatic as is it is on the 486.

DATA ALIGNMENT
------------------------------

Misaligned access in the data cache causes an extra 3 cycles on both the
486 and 586.
Ways to speed up data:
For DWORD data, alignment should be on a 4-byte boundary.
For WORD data, alignment should be on a 2 byte boundary for the 486,
and simply within the 4-byte page for the 586.
For 8 byte data (64 bits), data should be aligned on a 8-byte boundary.
And yes, on many applications with tight inner loops, these things do
accumulate to make a noticeable speed-up.

SPEEDING UP REGISTER AND OPERAND USAGE
-------------------------------------------------------------------------------

Use the EAX as much as possible. Many instructions are 1 byte shorter
when using EAX. That's less code you have to move around and slightly
less internal processing.
Use the DS register as much as possible for roughly the same reason,
In the case of several references being made to a variable addressed with a
displacement, load the displacement into a register.
Try to use the ESP register to reference the stack in lower levels of your
subroutines.

HANDY INFO ON SPEEDING UP INTEGER INSTRUCTIONS
---------------------------------------------------------------------------------------------

Spaced out yet? Feel like optimising code now is a job for anal-retentive
nasal-spray toting coders locked in small dark rooms somewhere? Well,
here are some tips to make you swear-off from using nasal spray ever
again. Well, not actually that bad, but hey.... Here are some
optimisations one can perform on the Integer side of the 486/586. I never
use the floating point processor, and for that matter don't like it a whole
heap. Most people used fixed point routines for fast graphics etc anyway.

1. Avoid using complex instructions like LEAVE, ENTER, LOOP, string
instructions etc Simpler instructions will get the job done faster. Simpler
instructions have been getting faster with every new processor that has
come along.

2. With 8-bit operands, do your best to use the byte opcodes, rather than
using the 32 bit instructions with sign and zero extended modes. Internal
sign extension is expensive, and speed increases can be found by
simplifying the process as much as possible.

3. LEA's generally increase the chance of AGI's. However, LEA's are
advantageous when:
-> In many cases an LEA instruction may be used to replace constant
multiply instructions. (a sequence of LEA, add and shift for example)
-> LEA may be used as a three/four operand addition instruction.
LEA ECX, [EAX+EBX*4+ARRAY_NAME]
-> Can be advantageous to avoid copying a register when both operands to
an ADD are being used after the ADD as LEA need not overwrite its
operands.

4. Zero-Extension with short ints terror tales
The MOVZX takes four cycles to execute due to due zero-extension
wobblies. A better way to load a byte into a register is by:
xor eax,eax
mov al,memory

As the xor just clears the top parts of EAX, the xor may be placed on the
outside of a loop that uses just byte values. The 586 shows greater
response to such actions.
It is recommended that 16 bit data be accessed with the MOVSX and
MOVZX if you cannot place the XOR on the outside of the loop.

5. When comparing a value in a register with 0, use the TEST command.
TEST operates by ANDing the operands together without spending any
internal time worrying about a destination register. (Compared with just
ANDing a register with itself to test 0 condition)
Use test when comparing the result of a boolean AND command with an
immediate constant for equality or inequality if the register is EAX. (IF
(VAR & 4) {...} for example)

6. Address Calculations

Pull address calculations into load and store instructions. Memory
reference instructions have 4 operands: a relocatable time segment base, a
base register, a scaled index register and a displacement. Often several
integer instructions can be eliminated by fully using the operands of
memory addresses. Often, most people don't worry about it and turn to
drinking coffee instead. I'm not sure, care less. Its an options to speed
things up if you want it.

7. INTEGER DIVIDE
In most cases, an Integer Divide is preceded by a CDQ instruction.
This is as divide instructions use EDX:EAX as the dividend and CDQ sets
up EDX.
It is better to copy EAX into EDX, then right shift EDX 31 places to sign
extend.
The copy/shift instructions take the same number of clocks as CDQ,
however, on 586's allows two other instructions to execute at the same
time. If you know the value is a positive, use XOR EDX, EDX.

8. INTEGER MULTIPLY
The integer multiply by an immediate can usually be replaced with a faster
and simpler series of shifts, subs, adds and lea's.
As a rule of thumb when 6 or fewer bits are set in the binary representation
of the constant, it is better to look at other ways of multiplying and not use
INTEGER MULTIPLY. (the thumb value is 8 on a 586)
A simple way to do it is to shift and add for each bit set.

9. Clearing Registers
Using XOR reg, reg is fast but sets up conditions codes. A slightly slower
way to do it of course is to mov reg, 0 which preserves condition codes.

10. Avoid ENTER, instead, try something like:
PUSH EBP
mov ebp, esp
sub esp, BYTE_COUNT

11. JUMP INSTRUCTIONS
Jump instruction come in two forms, one real near that jumps between -127
and 128 of the current location, and a 32 bit version that does a full jump.
The short form is quite a bit faster, however unfortunately many compilers
put long jumps where short jumps would suffice.
To ensure that short jumps can be used (and you know it is possible),
explicitly specify the destination as being byte length.

12. Task Switching
Task Switching is evil. It is slow, real slow. Avoid task switching too
often, as more and more of your time will be spent in processing the task
switch.
For faster task switching, perform you task switching in software. This
allows a smaller processor state to be saved and restored. Anything that
shaves time off 75+ clock cycles isn't all that bad in my book.

13. Minimise segment register loads and the use of far pointers as dearly
much as you can. If you are doing a lot of processing on a small piece of
data far away, it might be faster just to copy the data to nearby and work
on it there.

...and....

14. THINK ABOUT WHAT YOU WANT TO DO
All the other optimisations of Unrolling Loops, moving the invariant data
etc still apply. That, however, is more an algorithmic problem for you, but
still keep it firmly in mind.

586 Specific Optimisations
--------------------------

The 586Pent has a five stage pipeline structure. (You remember, five
employees sitting in a row)
However, the pipeline is split into two different pipelines, known
as Pipeline U and Pipeline V. This allows two instructions to be
executed in parallel and thus presents the opportunity of
executing two/instuctions per clock.
The U pipeline is vwery much like the 486 pipeline, in that it can handle
the entire set of instructions. Pipeline V on the otherhand can
handle only simple instructions.
The fast parts of the U and V pipelines are possible as much of the
functions are hardwired not microcoded.

Anyway, I've blurted on about that for a bit, but what you want to know
is "How to I get two instructions running in one clock/cycle?"
Lament no further, here are the criteria:

1. Both instructions must be simple (see below)
2. There must not be a read-after-write or write-after-read
register/flag dependencies between them.
3. Neither can have a displacement and an immediate
4. Instructions with prefixes can only occurr in the U-pipe
(except for Jcc)

The following are considered "simple" instructions,
the ALU means any ALU command (ADD etc):

1. Mov reg, reg/mem/immed
2. mov mem, reg/imm
3. ALU reg, reg/mem/immed
4. ALU mem, reg/immed
5. inc reg/mem
6. dec reg/mem
7. push reg/mem
8. pop reg
9. nop
10. lea reg/mem
11. Jmp/ Call / Jcc near

Note, however, than while both pipelines can perform ALU
instructions concurrently, this is not the case when
one ALU instruction requires the output of the
other pipeline ALU instruction. E.g. ALU instruction in pipe U and
performing ADC in pipe V.
There are two exceptions to this rule:
1) In the case of a compare/branch combination.
2) With push/pop combinations.

Branch Possibility Thingy
-------------------------

Another nice optimisation with the 586 hardware is that whenever there is
a possibility of a jump, for example, Jg, the 586 will automatically
start reading code from the jump location just in case the the
Jump is accepted. Remember the CODE alignment thing above? Well,
it couldn't hurt your grandmother to align the labels on the code pages.
(So a jump will go to the start of a code page)

Amazing new 586 Optimisations and speedups
------------------------------------------

The 586 has a lot of really nice optimisations and kick-ass
features in addition to what I've mentioned above, unfortunately,
this information is proprietry and will only be given
to people who sign non-disclosure agreements and shell out a
$$ or two... Bummer...
(Meethinks, 586 clone-makers won't be too happy about that... :-) )
Compiler/OS makers will probably be the purchasers.
Quite a few of these optimisations won't work on anything less than
a 586, so don't sweat too much about it. Hey, just write
for 386/486/586 and you'll be ok.

I hope this information has been helpful for you.
Now make some coffee, brush your teeth and phone up your girlfriend and
go and see a movie. This document will be here when you get back, and I
imagine there is only so much of this you can take in one day.... :-) :-) :-)

Live long and code well...

Regards,

Michael Kunstelj.

{ > I MURDER AND EAT SENSITIVE PEOPLE < :-)
{ Contact at : virteam@smoke.canberra.edu.au or
{ mick@cairo.anu.edu.au
{
{ Disclaimer: I don't represent anyone, and no-one would be foolish
enough to represent {me. I take no responsibility of the information
presented here. If you cat explodes {when you hit the enter key, I don't
want to know. (except if you have photos...) :-) :-)
{ Plus every other disclaimer that everyone has ever thought of. So there...
;-)