PENTIUM/486 Optimization
------------------------
By Michael Kunstelj.

G'day all... In case anyone is interested, here is some information on optimising assembly code on the Pentium / 486. If you don't know / aren't interested in ASSEMBLY code, this will all seem about as interesting as the ending titles in a bad X-rated movie. Still, hopefully someone out there will find this useful. If so, enjoy... and....

Brief Intro
-----------
Now a brief intro on the innards of the 486 and the 586 Pentium. Most processors these days have within them a system termed a "pipeline", and the 486 / 586 are certainly within this category. However, most people aren't quite sure what exactly a pipeline is, so here's a rough analogy:

Imagine a processing plant, with a line of employees each performing a certain task. At one end of the line, the first employee receives a piece, which he works upon. After completing his task, he passes it to the next employee, who in turn works upon the item. The process continues until the item is at the end of the line, at which time it is considered completed.

However, with this simplistic model, does it not seem strange that every employee must wait for the first item to be fully manufactured before the next one can even be started? Yup, it is, and stupid to boot. It makes far more sense for each employee to work on something in the interim. Thus, employee 1 in the above example finishes his work on the item, passes it to employee 2, and then picks up a new item and starts working on it.

That's pretty well the same system as in a microprocessor's pipeline. Numerous stages work in parallel, each passing its results on when the stage in front of it is free. Because of pipelining, times taken in terms of instruction cycles become blurred. An instruction that takes 7 clock cycles to execute, for example, will already be part-way through the pipeline behind the previous instruction.
After the previous instruction is completed, the final stage in the pipeline is dedicated to the next instruction. This final stage may take perhaps 1 or 2 clock cycles. The net effect? The instruction will effectively appear to have taken 1 or 2 clock cycles to execute.

The 586 Pentium is a superscalar design, using multiple pipelines to process data. From the example above you can sort of get the idea of how that would work. Anyway, here are some ways to speed up 486/586 code:

SPEEDING UP 486/586 CODE
------------------------

AGI
---
An Address Generation Interlock (AGI) occurs when a register which is currently being used as a base or an index was the destination of a previous instruction. For example:

    add edx, 4        ; AGI, ONE CLOCK STALL
    mov esi, [edx]

On the 486, AGIs can only occur between adjacent instructions. On the 586, which has the term superscalar in its design specification to amaze people with its high degree of parallel execution, instructions up to 3 locations away can cause an AGI. The code (worst case scenario):

    add esi, 4
    pop ebx
    dec ebx
    mov edx, [esi]

takes 4 clocks on a 486. On a 586, the mov must wait for the add, thus stalling for one clock, so the code above runs in three clock cycles on a 586.

Some instructions that read or write registers implicitly can also cause AGI stalls, e.g.:

    mov esp, ebp      ; [AGI]
    pop ebp

When decoding an instruction with either an index or an immediate-displacement combination, 486s take a one clock penalty (586s do not). Example:

    mov spam, 248
    mov dword ptr [esp+4], 1

486s also take a one clock penalty when modifying the lower part of a DWORD register and then using the full register (586s do not). Example:

    mov al, 0
    mov [ebp], eax    ; one clock penalty on a 486, none on a 586

CODE ALIGNMENT
--------------
To speed up 486 code, align routines to the beginning of cache-line boundaries.
(32 bytes on the 586, 16 bytes on the 486.) Thus, start your routine at the start of a cache line; this will speed up the processing of all the instructions within that line. The effect of this speed-up is not as dramatic on the 586 as it is on the 486.

DATA ALIGNMENT
--------------
A misaligned access in the data cache costs an extra 3 cycles on both the 486 and the 586. Ways to align your data:

  -> DWORD data should be aligned on a 4-byte boundary.
  -> WORD data should be aligned on a 2-byte boundary on the 486, and simply kept within a 4-byte boundary on the 586.
  -> 8-byte (64-bit) data should be aligned on an 8-byte boundary.

And yes, in many applications with tight inner loops these things do accumulate into a noticeable speed-up.

SPEEDING UP REGISTER AND OPERAND USAGE
--------------------------------------
Use EAX as much as possible: many instructions are 1 byte shorter when using EAX. That's less code you have to move around and slightly less internal processing. Use the DS register as much as possible, for roughly the same reason. If several references are made to a variable addressed with a displacement, load the displacement into a register. Try to use the ESP register to reference the stack in the lower levels of your subroutines.

HANDY INFO ON SPEEDING UP INTEGER INSTRUCTIONS
----------------------------------------------
Spaced out yet? Feel like optimising code is a job for anal-retentive, nasal-spray toting coders locked in small dark rooms somewhere? Well, here are some tips to make you swear off nasal spray forever. Well, not actually that bad, but hey.... Here are some optimisations one can perform on the integer side of the 486/586. I never use the floating point processor, and for that matter don't like it a whole heap. Most people use fixed point routines for fast graphics etc. anyway.

1.
Avoid complex instructions like LEAVE, ENTER, LOOP, the string instructions, etc. Simpler instructions will get the job done faster, and simple instructions have been getting faster with every new processor that has come along.

2. With 8-bit operands, do your best to use the byte opcodes rather than the 32-bit instructions with sign- or zero-extended modes. Internal sign extension is expensive, and speed increases can be found by simplifying the process as much as possible.

3. LEAs generally increase the chance of AGIs. However, LEAs are advantageous when:

  -> In many cases an LEA instruction may replace a constant multiply (a sequence of LEA, add and shift, for example).
  -> LEA may be used as a three/four operand addition instruction:
        LEA ECX, [EAX+EBX*4+ARRAY_NAME]
  -> LEA can avoid copying a register when both operands of an ADD are still needed after the ADD, as LEA need not overwrite its operands.

4. Zero-extension with short ints: terror tales. MOVZX takes four cycles to execute due to zero-extension wobblies. A better way to load a byte into a register is:

    xor eax, eax
    mov al, memory

As the xor just clears the top part of EAX, the xor may be hoisted outside a loop that deals only in byte values. The 586 responds even better to this. If you cannot place the XOR outside the loop, it is recommended that 16-bit data be accessed with MOVSX and MOVZX.

5. When comparing a value in a register with 0, use the TEST command. TEST operates by ANDing the operands together without spending any internal time worrying about a destination register. (Compare this with ANDing a register with itself to test for the zero condition.) Also use TEST when comparing the result of a boolean AND with an immediate constant for equality or inequality (IF (VAR & 4) {...}, for example), especially if the register is EAX.

6. Address calculations. Pull address calculations into load and store instructions.
Memory reference instructions take up to 4 address components: a segment base, a base register, a scaled index register and a displacement. Often several integer instructions can be eliminated by making full use of these components. Then again, most people don't worry about it and turn to drinking coffee instead. I'm not too fussed either way; it's an option to speed things up if you want it.

7. INTEGER DIVIDE. In most cases, an integer divide is preceded by a CDQ instruction, because divide instructions use EDX:EAX as the dividend and CDQ sets up EDX. It is better to copy EAX into EDX, then right shift EDX 31 places to sign extend. The copy/shift pair takes the same number of clocks as CDQ, but on a 586 it allows two other instructions to execute at the same time. If you know the value is positive, just use XOR EDX, EDX.

8. INTEGER MULTIPLY. An integer multiply by an immediate can usually be replaced by a faster and simpler series of shifts, subs, adds and LEAs. As a rule of thumb, when 6 or fewer bits are set in the binary representation of the constant, it is better to avoid INTEGER MULTIPLY and look at other ways of multiplying (the rule-of-thumb value is 8 on a 586). A simple way to do it is to shift and add for each bit set.

9. Clearing registers. XOR reg, reg is fast but clobbers the condition codes. A slightly slower way that preserves the condition codes is, of course, mov reg, 0.

10. Avoid ENTER. Instead, try something like:

    push ebp
    mov  ebp, esp
    sub  esp, BYTE_COUNT

11. JUMP INSTRUCTIONS. Jump instructions come in two forms: a short form that can jump within -128 to +127 bytes of the current location, and a 32-bit version that does a full jump. The short form is quite a bit faster, but unfortunately many compilers put long jumps where short jumps would suffice. To ensure that short jumps are used (where you know it is possible), explicitly specify the destination as being byte length.

12. TASK SWITCHING. Task switching is evil.
It is slow, real slow. Avoid task switching too often, or more and more of your time will be spent processing the task switches. For faster task switching, perform your task switching in software; this allows a smaller processor state to be saved and restored. Anything that shaves time off 75+ clock cycles isn't all that bad in my book.

13. Minimise segment register loads and the use of far pointers as dearly much as you can. If you are doing a lot of processing on a small piece of far-away data, it might be faster just to copy the data somewhere near and work on it there.

...and....

14. THINK ABOUT WHAT YOU WANT TO DO. All the other optimisations (unrolling loops, moving loop-invariant work out of loops, etc.) still apply. That, however, is more an algorithmic problem for you, but still keep it firmly in mind.

586 Specific Optimisations
--------------------------
The 586 Pentium has a five-stage pipeline. (You remember: five employees sitting in a row.) However, it is split into two separate pipelines, known as the U pipeline and the V pipeline. This allows two instructions to be executed in parallel, and thus presents the opportunity of executing two instructions per clock. The U pipeline is very much like the 486 pipeline, in that it can handle the entire instruction set. The V pipeline, on the other hand, can handle only simple instructions. The fast paths of the U and V pipelines are possible because much of their function is hardwired, not microcoded.

Anyway, I've blurted on about that for a bit, but what you want to know is "How do I get two instructions running in one clock cycle?" Lament no further; here are the criteria:

1. Both instructions must be simple (see below).
2. There must be no read-after-write or write-after-read register/flag dependency between them.
3. Neither can have both a displacement and an immediate.
4.
Instructions with prefixes can only run in the U-pipe (except for Jcc).

The following are considered "simple" instructions (where ALU means any ALU command, ADD etc.):

1.  mov reg, reg/mem/immed
2.  mov mem, reg/immed
3.  ALU reg, reg/mem/immed
4.  ALU mem, reg/immed
5.  inc reg/mem
6.  dec reg/mem
7.  push reg/mem
8.  pop reg
9.  nop
10. lea reg/mem
11. jmp / call / Jcc near

Note, however, that while both pipelines can perform ALU instructions concurrently, this is not the case when one ALU instruction requires the output of the ALU instruction in the other pipeline, e.g. an ALU instruction in pipe U feeding an ADC in pipe V. There are two exceptions to this rule:

1) A compare/branch combination.
2) Push/pop combinations.

Branch Possibility Thingy
-------------------------
Another nice feature of the 586 hardware is that whenever there is a possibility of a jump (for example, a Jg), the 586 will automatically start reading code from the jump location, just in case the jump is taken. Remember the CODE alignment thing above? Well, it couldn't hurt your grandmother to align the labels too, so a jump will land at the start of a cache line.

Amazing new 586 Optimisations and speedups
------------------------------------------
The 586 has a lot of really nice optimisations and kick-ass features in addition to what I've mentioned above. Unfortunately, this information is proprietary and will only be given to people who sign non-disclosure agreements and shell out a $$ or two... Bummer... (Methinks the 586 clone-makers won't be too happy about that... :-) ) Compiler/OS makers will probably be the purchasers. Quite a few of these optimisations won't work on anything less than a 586, so don't sweat too much about it. Hey, just write for the 386/486/586 and you'll be OK.

I hope this information has been helpful for you. Now make some coffee, brush your teeth, phone up your girlfriend and go and see a movie.
This document will be here when you get back, and I imagine there is only so much of this you can take in one day.... :-) :-) :-)

Live long and code well...

Regards,
Michael Kunstelj.

{ > I MURDER AND EAT SENSITIVE PEOPLE <  :-)
{ Contact at : virteam@smoke.canberra.edu.au or
{              mick@cairo.anu.edu.au
{
{ Disclaimer: I don't represent anyone, and no-one would be foolish enough to
{ represent me. I take no responsibility for the information presented here.
{ If your cat explodes when you hit the enter key, I don't want to know.
{ (Except if you have photos...) :-) :-)
{ Plus every other disclaimer that everyone has ever thought of. So there... ;-)