Code Gems Part 6

This text comes from IMPHOBIA Issue XII - July 1996


Let's jump right in:

* Filling a register with the Carry Flag *

Sometimes we need to fill a register with the carry flag: put 0 into the desired reg if Carry=0, or ffffffff if Carry=1. Probably the fastest way to do it (for example, with EDX):
        SBB     EDX,EDX
In practice it can be used to prepare EDX before a division (before dividing a signed number by an unsigned number). By the way, the Carry won't change after this operation. I mulled it over and concluded that SBB EDX,EDX is faster and in many cases more appropriate than CDQ. What's more, there's an undocumented 1-byte instruction called SETALC which fills AL with the Carry; its code is 0d6h. (Hi Rod!)
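The same trick shines after a compare, where SBB turns the Carry into a full 32-bit mask. A small sketch (just an illustration, not from the tricks above): a branchless unsigned minimum of EAX and EBX:
        SUB     EAX,EBX         ; Carry set if EAX < EBX (unsigned)
        SBB     EDX,EDX         ; EDX = ffffffff if EAX < EBX, else 0
        AND     EAX,EDX         ; keep the difference only if EAX was smaller
        ADD     EAX,EBX         ; EAX = min(EAX,EBX)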

* BSWAP with 16-bit register *

Yeah, BSWAP with a word register operand is a valid instruction, at least on genuine Intel processors. Some specifications recommend against using it for better compatibility with Intel clones. Anyway, if it works, it byte-swaps the upper word of the doubleword register down into the lower word without affecting the upper 16 bits. For example, if EAX=ab120000 then
        BSWAP   AX
results in EAX=ab1212ab. Flags won't change, of course.

* Operand sizes *

For instructions where the immediate operand is small (like 'ADD EBX,12'), the machine code is shorter, since the operand is stored in 8 bits and gets sign-extended to 32 bits when the instruction actually runs. This means that all operand values in the [-128..127] range (hexadecimally [ffffff80..0000007f]) save 3 bytes per instruction. There's just one trick to it. When using 'ADD EBX,80h', the operand takes 4 bytes, since 00000080h falls out of the range. But the
        SUB EBX,0FFFFFF80h
will do the same as 'ADD EBX,80h', only three bytes shorter. The same goes for 'SUB EBX,80h' / 'ADD EBX,0ffffff80h' too.

Note: EAX's immediate operand is always 32-bit; it can't be shrunk. This comes from the 8086 era, when the 16-bit registers ruled. In 16-bit code most instructions which affect AX are shorter anyway, therefore there was no need for such sign-extended operands for AX; and since the 32-bit opcode map comes directly from the 16-bit one, there are no short sign-extending opcodes for EAX. Except if you use the good old LEA ;-) So 'LEA EAX,[EAX+12]' costs three bytes while 'ADD EAX,12' eats five. Of course, only in 32-bit code! In real mode 'LEA EAX,[EAX+12]' needs TWO prefix bytes (slowing the 1-clock LEA back to THREE clocks).

I've got one more addressing-mode trick. Instead of 'MOV EAX,[EBX*2]', 'MOV EAX,[EBX+EBX]' is better: the first one's code is seven bytes, the second's is three ;-) Generally, every scaled address like [EBX*2] has a 32-bit displacement, so [EBX*2] is in fact [EBX*2+00000000h]. But if another register is present, the opcode can be shorter; for example, 'MOV EAX,[ESI+EBX*4+10h]' is a 4-byte instruction. Nevertheless, on the 486 every instruction which uses two registers in addressing, or one register scaled, takes one extra cycle! So 'MOV EAX,[EBX+EBX]' takes two cycles, and so do 'MOV EAX,[EBX*2]' and 'LEA EAX,[EBP+EDX*8+1234h]'.
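For the curious, here's how those encodings come out (byte counts as an assembler listing shows them):
        ADD     EBX,12          ; 83 C3 0C             (3 bytes, sign-extended imm8)
        ADD     EBX,80h         ; 81 C3 80 00 00 00    (6 bytes, imm32)
        SUB     EBX,0FFFFFF80h  ; 83 EB 80             (3 bytes, same effect)
        LEA     EAX,[EAX+12]    ; 8D 40 0C             (3 bytes)
        ADD     EAX,12          ; 05 0C 00 00 00       (5 bytes)
        MOV     EAX,[EBX+EBX]   ; 8B 04 1B             (3 bytes)
        MOV     EAX,[EBX*2]     ; 8B 04 5D 00 00 00 00 (7 bytes, hidden disp32)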

* Rounding *

Let's say we want to divide a number by a power of two (2, 4, 8, 16, etc.), rounding the result to the nearest integer (halves go upwards). In this case
        SAR EAX,4
        ADC EAX,0
will perfectly work. The credit for this one goes to Good Old Dave :-)
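A quick trace, say with EAX=25 and a divisor of 16 (25/16 = 1.5625, so we expect 2):
        MOV     EAX,25
        SAR     EAX,4           ; EAX = 1, CF = bit 3 of 25 = 1
        ADC     EAX,0           ; EAX = 2, the rounded quotient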

* Penalties on the 486 *

In most cases the 486 is free from flow-dependence penalties, which means that an instruction using the result of the previous instruction will not cause a slowdown:
        ADD EAX,EBX
        ADD ECX,EAX
takes two cycles. On a Pentium it takes two cycles as well, but the
        ADD EAX,EBX
        ADD ECX,EBX
takes one cycle because the second instruction doesn't use the result of the first, so they can be 'pair'-ed. These situations are quite well described in the opt32.doc application note released by Intel; I just want to point out one interesting thing. (By the way, there's a new version out with Pentium Pro optimization tricks. Check ftp.intel.com!) Generally, the 486 has two types of flow-dependence penalties:

1) immediately using a register after its 8-bit subregister was modified (so this applies to (e)ax, (e)bx, (e)cx, (e)dx after al, bh, etc. has been changed);

2) using a register in addressing immediately after it was modified. (This is valid for all registers, but beware: LEA is an addressing instruction, so try to avoid using it if its operand was modified by the previous instruction.) For example, how many cycles does the following code sequence eat (in protected mode, assuming 100% cache hit)?
        ADD ECX,EBP
        ADC BL,DL
        MOV AL,[EBX]
On a 486 the ADD is one cycle, the ADC is another one, but the MOV takes three even if the operand is already in the cache! Why? There is a double penalty: one clock for using a register in addressing immediately after it was modified (Address Generation Interlock, AGI); and another cycle for using a register immediately after its subregister was modified (Flow Break). So this innocent MOV instruction costs THREE cycles... Hey, I'm a smart coder, I'm gonna put an instruction between the ADC and the MOV, and the problem is solved! Really? The
        ADD ECX,EBP
        ADC BL,DL
        SUB ESI,EBP
        MOV AL,[EBX]
sequence takes 5 clocks: the ADD, ADC and SUB take three, but the MOV takes two, because ONE cycle inserted BETWEEN the ADC and the MOV can save only ONE penalty, not TWO. So for a perfect one-clock-per-instruction ratio at least TWO instructions have to be inserted. Or one two-cycle instruction like SHR, or even a prefixed one like ADD AX,BX in 32-bit code.
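So a sequence that really reaches the one-instruction-per-clock ideal could look like this (the second filler is of course just a placeholder for something useful):
        ADD     ECX,EBP
        ADC     BL,DL
        SUB     ESI,EBP         ; filler #1
        SUB     EDI,EBP         ; filler #2
        MOV     AL,[EBX]        ; both penalties hidden: one cycle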

* Aligning inner loops *

Aligning the first instruction of an 'inner' loop to a 16-byte boundary may increase performance: the 486's and the Pentium Pro's instruction prefetcher (or whatever) just loves aligned addresses. But the straightforward solution:
        JMP _INNERLOOP
        ALIGN 16
        _INNERLOOP:
sounds quite dirty from many points of view: it may waste both space and speed. Certainly a more elegant method is needed, one which won't emit a JMP when the _INNERLOOP label is already on a paragraph boundary, and which puts a few NOPs instead of a JMP when only 1-3 bytes remain before the next aligned position:
CODEALIGN MACRO
LOCAL _NOW,_THEN
_NOW:
ALIGN        16
_THEN:
IF (_THEN-_NOW)              ; Already aligned?
 ORG _NOW                    ; Set position back
 IF (_THEN-_NOW) LE 3        ; <=3 bytes remain?
  DB (_THEN-_NOW) DUP (90h)  ; put NOPs up to the boundary
 ELSE
  JMP _THEN                  ; Jump to the boundary
  ALIGN        16            ; Align again
 ENDIF
ENDIF
ENDM
Simply put 'codealign' before the top of the inner loop:
(...Inner loop preparation code...)
codealign
_INNERLOOP:
(...Inner loop...)
This one is not fully optimized from a speed point of view: instead of two NOPs (when two bytes remain) a single 'MOV ESP,ESP' would be faster, and instead of three NOPs an 'ADD ESP,0'.
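A sketch of that refinement (assuming your assembler knows ELSEIF, as MASM and TASM do; otherwise nest the IFs):
CODEALIGN MACRO
LOCAL _NOW,_THEN
_NOW:
ALIGN        16
_THEN:
IF (_THEN-_NOW)              ; Already aligned?
 ORG _NOW                    ; Set position back
 IF (_THEN-_NOW) EQ 1
  NOP                        ; one byte: 90h
 ELSEIF (_THEN-_NOW) EQ 2
  MOV ESP,ESP                ; two bytes, one cycle
 ELSEIF (_THEN-_NOW) EQ 3
  ADD ESP,0                  ; three bytes, one cycle
 ELSE
  JMP _THEN                  ; Jump to the boundary
  ALIGN        16            ; Align again
 ENDIF
ENDIF
ENDM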

* Aligning a memory pointer *

After allocating a few doublewords it's not certain that the array's starting address falls on a dword boundary, so we have to adjust it:
        ADD     EBX,3
        AND     EBX,0fffffffch
With this technique it's possible to align to any power of two. This idea came from Technomancer's PM kernel.
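For example, aligning the same pointer to a 16-byte (paragraph) boundary only changes the constants; in general, to align to 2^n, add (2^n)-1 and mask with its complement:
        ADD     EBX,15
        AND     EBX,0fffffff0h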

Another (space-saving) trick is when flags need to be stored together with pointers; for example, in 3D code one pointer belongs to every face (triangle). At runtime we need one bit to indicate whether the face is visible or not. It would be too expensive to allocate another byte for every triangle. But where can that one bit be stored? Right in the pointer. This requires that the faces' structures always start on even addresses; in that case the lowest bit of their addresses is always zero, so the lowest bit of the pointers is always zero. This lowest bit can be used as a flag: set by the appropriate routine (e.g. 'int face_visible(face *)') and checked by other routines. Only take care to clear it before accessing the face's structure.
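A minimal sketch of such flag handling (the routine names in the comments are invented, the instructions are the point):
        OR      EBX,1           ; 'set_visible': set the flag bit
        TEST    EBX,1           ; 'is_visible': ZF=0 if the face is visible
        AND     EBX,0fffffffeh  ; strip the flag before dereferencing
        MOV     EAX,[EBX]       ; now it's safe to access the structure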

* Kicking out conditional jumps, Part II *

When optimizing for the Pentium (or for the Pentium Pro with an eye on compatibility) it might save some cycles to change conditional jumps into non-branching code. For example, the next code sequence implements this IF construction:
IF (eax = ebx) THEN ecx=x;
               ELSE ecx=y;
In asm:
        SUB     EAX,EBX
        CMP     EAX,1
        SBB     ECX,ECX
        AND     ECX,x-y
        ADD     ECX,y
It also reveals how to copy the Zero flag to the Carry in one cycle ('CMP EAX,1' ;-) On the Pentium Pro, however, the next version should be preferred (note that CMOV takes no immediate operand, so y goes through a register first):
        MOV     ECX,x
        MOV     EDX,y
        CMP     EAX,EBX
        CMOVNE  ECX,EDX
Yeah, the Pro has been blessed with conditional move instructions: CMOVS, CMOVNE, CMOVG and all the rest. As always, the optimization techniques of the Pro are completely different from those of the other members of the Intel family. For example, using a 32-bit register after its subregister was modified may cause a 6-7 clock penalty! Can you believe it? Linear texturemapping inner loops using the classic 8-bit U/V coordinates suck... Some other notes: aligning loops to a paragraph boundary now boosts the performance again, since the Pro always loads 16 bytes of code in one step. The formerly banned instructions CDQ and MOVZX are also fast again, beating the MOV/shift substitutes crafted for the Pentium.
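For instance (a sketch), on the Pro the plain
        MOVZX   EAX,BL          ; fast again on the Pro
wins over the Pentium-friendly pair it used to be replaced with:
        XOR     EAX,EAX
        MOV     AL,BL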

* Do you have any 'Code Gems' ? *

but you don't feel enough inspiration to write a whole article? Do you want to share your knowledge with Imphobia readers?

Right then, please send your 'gems' to me! I will publish them in the next issue.
           Ervin / AbaddoN
              Ervin Toth
       48/a, Kiss Janos street
            1126 Budapest
               HUNGARY
            +36-1-201-9563
           ervin@sch.bme.hu