PENTIUM/486 Optimization
                        ------------------------

                          By Michael Kunstelj.


G'day all...

In case any one is interested, here is some information on optimising 
assembly code on the Pent / 486.  If you don't know/aren't interested in 
ASSEMBLY code, this will all seem about as interesting as ending titles in 
a bad X-Rated movie.  Still, hopefully someone out there will find this 
useful.  If so, enjoy... and....


Brief Intro
----------------

Now a brief intro on the innards of the 486 and Pent586.
Most processors these days have within in them a system termed a 
"pipeline".
The 486 / 586 are certainly within this category.  However, most people 
aren't quite sure what exactly a pipeline is, here's a rough analogy:

Imagine a processing plant, with a line of employees each performing a 
certain task.  On one end of the line, the first employee receives a piece, 
which he works apon.  After completing his task, he passes it to the next 
employee, who in turn works apon the item.  The process continues until 
the item is at the end of the line, apon which time it is considered 
completed.  However, with this simplistic model, does it not seem strange 
that for each item, all employees must wait for the first item to 
manufactured  before the next one can be considered.  Yup, it is, and stupid 
to boot.
It makes far more sense for each employee to work on something in the 
interim.  Thus, employee 1 in the above example finishes his work on the 
item, passes it to employee 2, and then picks up a new item and starts 
working on it.

That's pretty well the same system in a Microprocessors pipeline.  
Numerous blocks perform in parallel, passing their information on when 
the stage in front of it is free.

Because of pipelining, times taken in terms of instruction cycles become 
blurred.  An instruction that takes 7 clock cycles to implement, for 
example, will be behind in the pipeline from the previous instruction.  After 
the previous instruction is completed, the final stage in the pipeline will be 
dedicated to the next instruction.  This final stage may take perhaps 1 or 2 
clock cycles.  The net effect?   The instruction will effectively appear to 
have taken 1 or 2 clock cycles to execute.

The 586 Pent is a superscalar system using multiple pipeline to process 
data.  You get the idea from the example above sort of how it would work.

Anyway, here are some ways to speed up 486/586 code:


SPEEDING UP 486/586 CODE.
--------------------------------------------------

AGI
------
An Address Generation Interlock is occurs when a register which is 
currently being used as the base or an index was the destination component 
of a previous instruction.

For example,:

add edx, 4
// AGI, ONE CLOCK STALL
mov esi, [edx]

For the 486, AGI's can only occur on adjacent instructions.  On the 586 
Pent, having the term superscalar in its design specification to amaze 
people with its high degree of parallel execution, instructions up to 3 
locations away can cause an AGI.

The code:  (worst case scenario)

add esi, 4
pop ebx
dec ebx
mov edx, [esi]

Takes 4 clocks on a 486.  On a 586, the move command must wait for the 
add command, thus AGI stalling for one clock.  The code above then 
would run in three clock cycles on a 586.

Some Instructions that read or write implicitly to registers cause AGI stalls.  
eg. mov esp, ebp  [agi] pop ebp.


When decoding an instruction with either an index or an immediate-
displacement combination, 486's have a one clock penalty. (586's do not)  
Example:

mov  spam, 248
mov dword ptr [esp+4], 1

486's have a one clock penalty when modifying the lower word of a 
DWORD register. 586's do not.

Example:
mov al,0
mov [ebp], eax
(a one clock penalty on 486, no penalty on a 586)


CODE ALIGNMENT
-------------------------------

To speed up the 486 instructions, alignment of code should be on the 
beginning of cache boundaries.
(32 bytes on 586, 16 bytes on 486)
Thus,  start your routine on the start of the cache page.  This will speed up 
the processing  of all the instructions within that page.  The effect of this 
speed up on the 586 is not a dramatic as is it is on the 486.

DATA ALIGNMENT
------------------------------

Misaligned access in the data cache causes an extra 3 cycles on both the 
486 and 586.
Ways to speed up data:
For DWORD data, alignment should be on a 4-byte boundary.
For WORD data, alignment should be on a 2 byte boundary for the 486, 
and simply within the 4-byte page for the 586.
For 8 byte data (64 bits), data should be aligned on a 8-byte boundary. 
And yes, on many applications with tight inner loops, these things do 
accumulate to make a noticeable speed-up.

SPEEDING UP REGISTER AND OPERAND USAGE
-------------------------------------------------------------------------------

Use the EAX as much as possible.  Many instructions are 1 byte shorter 
when using EAX.  That's less code you have to move around and slightly 
less internal processing.
Use the DS register as much as possible for roughly the same reason,
In the case of several references being made to a variable addressed with a 
displacement, load the displacement into a register.
Try to use the ESP register to reference the stack in lower levels of your 
subroutines.



HANDY INFO ON SPEEDING UP INTEGER INSTRUCTIONS
---------------------------------------------------------------------------------------------

Spaced out yet?  Feel like optimising code now is a job for anal-retentive 
nasal-spray toting coders locked in small dark rooms somewhere?   Well, 
here are some tips to make you swear-off from using nasal spray ever 
again.  Well, not actually that bad, but hey....    Here are some 
optimisations one can perform on the Integer side of the 486/586.   I never 
use the floating point processor, and for that matter don't like it a whole 
heap.  Most people used fixed point routines for fast graphics etc anyway.

1. Avoid using complex instructions like LEAVE, ENTER, LOOP, string 
instructions etc  Simpler instructions will get the job done faster.   Simpler 
instructions have been getting faster with every new processor that has 
come along.

2. With 8-bit operands, do your best to use the byte opcodes, rather than 
using the 32 bit instructions  with sign and zero extended modes.   Internal 
sign extension is expensive, and speed increases can be found by 
simplifying the process as much as possible.

3.  LEA's generally increase the chance of AGI's.  However, LEA's are 
advantageous  when:
-> In many cases an LEA instruction may be used to replace constant 
multiply instructions. (a sequence of LEA, add and shift for example)
-> LEA may be used as a three/four operand addition instruction.
  LEA ECX, [EAX+EBX*4+ARRAY_NAME]
-> Can be advantageous to avoid copying a register when both operands to 
an ADD are being used after the ADD as LEA need not overwrite its 
operands.

4. Zero-Extension with short ints terror tales
The MOVZX  takes four cycles to execute due to due zero-extension 
wobblies. A better way to load a byte into a register is by:
  xor eax,eax
  mov al,memory
 
As the xor just clears the top parts of EAX, the xor may be placed on the 
outside of a loop that uses just byte values.   The 586 shows greater 
response to such actions.
It is recommended that 16 bit data be accessed with the MOVSX and 
MOVZX if you cannot place the XOR on the outside of the loop.

5. When comparing a value in a register with 0, use the TEST command.  
TEST operates by ANDing the operands together without spending any 
internal time worrying about a destination register.  (Compared with just 
ANDing a register with itself to test 0 condition)
Use test when comparing the result of a boolean AND command with an 
immediate constant for equality or inequality if the register is EAX.  (IF 
(VAR & 4) {...} for example)

6. Address Calculations

Pull address calculations into load and store instructions.  Memory 
reference instructions have 4 operands: a relocatable time segment base, a 
base register, a scaled index register and a displacement.  Often several 
integer instructions can be eliminated by fully using the operands of 
memory addresses.  Often, most people don't worry about it and turn to 
drinking coffee instead.  I'm not sure, care less.  Its an options to speed 
things up if you want it.

7.  INTEGER DIVIDE
In most cases, an Integer Divide is preceded by a CDQ instruction.
This is as  divide instructions use EDX:EAX as  the dividend and CDQ sets 
up EDX.
It is better to copy EAX into EDX, then right shift EDX 31 places to sign 
extend.
The copy/shift instructions take the same number of clocks as CDQ, 
however, on 586's allows two other instructions to execute at the same 
time.  If you know the value is a positive, use XOR EDX, EDX.

8. INTEGER MULTIPLY
The integer multiply by an immediate can usually be replaced with a faster 
and simpler series of shifts, subs, adds and lea's.  
As a rule of thumb when 6 or fewer bits are set in the binary representation 
of the constant, it is better to look at other ways of multiplying and not use 
INTEGER MULTIPLY. (the thumb value is 8 on a 586)  
A simple way to do it is to shift and add for each bit set.

9. Clearing Registers
Using XOR reg, reg is fast but sets up conditions codes.  A slightly slower 
way to do it of course is to mov reg, 0 which preserves condition codes.

10. Avoid ENTER, instead, try something like:
PUSH EBP
mov ebp, esp
sub esp, BYTE_COUNT

11. JUMP INSTRUCTIONS
Jump instruction come in two forms, one real near that jumps between -127 
and 128 of the current location, and a 32 bit version that does a full jump.  
The short form is quite a bit faster, however unfortunately many compilers 
put long jumps where short jumps would suffice. 
To ensure that short jumps can be used (and you know it is possible), 
explicitly specify the destination as being byte length.

12. Task Switching
Task Switching is evil.  It is slow,  real slow.  Avoid task switching too 
often, as more and more of your time will be spent in processing the task 
switch.
For faster task switching, perform you task switching in software.  This 
allows a smaller processor state to be saved and restored.  Anything that 
shaves time off  75+ clock cycles isn't all that bad in my book.   

13.  Minimise segment register loads and the use of far pointers as dearly 
much as you can.  If you are doing a lot of processing on a small piece of 
data far away, it might be faster just to copy the data to nearby and work 
on it there.  


...and....

14. THINK ABOUT WHAT YOU WANT TO DO
All the other optimisations of Unrolling Loops, moving the invariant data 
etc still apply.  That, however, is more an algorithmic problem for you, but 
still keep it firmly in mind.


586 Specific Optimisations
--------------------------

The 586Pent has a five stage pipeline structure. (You remember, five
 employees sitting in a row)
However, the pipeline is split into two different pipelines, known 
as Pipeline U and Pipeline V.   This allows two instructions to be
executed in parallel and thus presents the opportunity of 
executing two/instuctions per clock.
The U pipeline is vwery much like the 486 pipeline, in that it can handle 
the entire set of instructions.  Pipeline V on the otherhand can 
handle only simple instructions.
The fast parts of the U and V pipelines are possible as much of the 
functions are hardwired not microcoded.

Anyway, I've blurted on about that for a bit, but what you want to know
is "How to I get two instructions running in one clock/cycle?"
Lament no further, here are the criteria:

  1. Both instructions must be simple (see below)
  2. There must not be a read-after-write or write-after-read
      register/flag dependencies between them.
  3. Neither can have a displacement and an immediate
  4. Instructions with prefixes can only occurr in the U-pipe
      (except for Jcc)

The following are considered "simple" instructions,
the ALU means any ALU command (ADD etc):

 1. Mov reg, reg/mem/immed
 2. mov mem, reg/imm
 3. ALU reg, reg/mem/immed
 4. ALU mem, reg/immed
 5. inc reg/mem
 6. dec reg/mem
 7. push reg/mem
 8. pop reg
 9. nop
 10. lea reg/mem
 11. Jmp/ Call / Jcc near

Note, however, than while both pipelines can perform ALU
instructions concurrently, this is not the case when
one ALU instruction requires the output of the
other pipeline ALU instruction.  E.g.  ALU instruction in pipe U and 
performing ADC in pipe V.
There are two exceptions to this rule:
1) In the case of a compare/branch combination.
2) With push/pop combinations.

Branch Possibility Thingy
-------------------------

Another nice optimisation with the 586 hardware is that whenever there is 
a possibility of a jump, for example, Jg, the 586 will automatically 
start reading code from the jump location just in case the the 
Jump is accepted.  Remember the CODE alignment thing above?  Well,
it couldn't hurt your grandmother to align the labels on the code pages.
(So a jump will go to the start of a code page)


Amazing new 586 Optimisations and speedups
------------------------------------------

The 586 has a lot of really nice optimisations and kick-ass
features in addition to what I've mentioned above, unfortunately,
this information is proprietry and will only be given
to people who sign non-disclosure agreements and shell out a
 $$ or two...  Bummer...
(Meethinks, 586 clone-makers won't be too happy about that... :-) )
Compiler/OS makers will probably be the purchasers.
Quite a few of these optimisations won't work on anything less than
a 586, so don't sweat too much about it.  Hey, just write
for 386/486/586 and you'll be ok.


I hope this information has been helpful for you.
Now make some coffee, brush your teeth and phone up your girlfriend and 
go and see a movie.   This document will be here when you get back, and I 
imagine there is only so much of this you can take in one day.... :-)  :-)  :-)

Live long and code well...


Regards,

   Michael Kunstelj.


{   > I MURDER AND EAT SENSITIVE PEOPLE <  :-)
{ Contact at :  virteam@smoke.canberra.edu.au or
{		mick@cairo.anu.edu.au 
{
{ Disclaimer:  I don't represent anyone, and no-one would be foolish 
enough to represent {me.  I take no responsibility of the information 
presented here.  If you cat explodes {when you hit the enter key,  I don't 
want to know. (except if you have photos...)  :-) :-)
{ Plus every other disclaimer that everyone has ever thought of. So there... 
;-)