Code Gems Part 6
This text comes from IMPHOBIA Issue XII - July 1996
Let's jump right in:
* Filling a register with the Carry Flag *
Sometimes we need to fill a register with the carry flag: put 0 into the
desired reg if Carry=0, or ffffffffh if Carry=1. Probably the fastest way
to do it (for example, with EDX):
SBB EDX,EDX
In practice it can be used to prepare EDX before a division (before
dividing a signed number by an unsigned one, for instance). By the way, the
Carry won't change after this operation. I pondered it for a while and
concluded that SBB EDX,EDX is faster and in many cases more
appropriate than CDQ. What's more, there's an undocumented 1-byte
instruction called SALC (also known as SETALC) which fills AL with the
Carry; its opcode is 0d6h. (Hi Rod!)
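A classic use of the trick (my own sketch, not from the original article) is
a branchless unsigned saturating subtraction, clamping the result to zero
on underflow:
SUB EAX,EBX    ; Carry set if EAX < EBX (unsigned borrow)
SBB EDX,EDX    ; EDX = ffffffffh on borrow, 0 otherwise
NOT EDX        ; invert: 0 on borrow, ffffffffh otherwise
AND EAX,EDX    ; underflowed results are forced to zero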
* BSWAP with 16-bit register *
Yeah, BSWAP with a word register operand is a valid instruction, at
least on genuine Intel processors. Some specifications recommend against
using it for better compatibility with the Intel clones. Anyway, if it works,
it brings the upper word of the doubleword register down into the lower word
(byte-swapped) without affecting the upper 16 bits. For example, if
EAX=ab120000h then
BSWAP AX
results in EAX=ab1212abh. Flags won't change, of course.
* Operand sizes *
For instructions where the immediate operand is small (like 'ADD EBX,12'),
the machine code is shorter, since the operand is stored in 8 bits and gets
sign-extended to 32 bits when the instruction actually runs. This means
that all operand values in the [-128..127] range (hexadecimally
[ffffff80..0000007f]) save 3 bytes per instruction. There's one trick
concerning it. With 'ADD EBX,80h' the operand takes 4
bytes, since 00000080h falls outside that range. But the
SUB EBX,0FFFFFF80h
will do the same as 'ADD EBX,80h' but three bytes shorter. The same goes for
'SUB EBX,80h' / 'ADD EBX,0ffffff80h' too. Note: assemblers typically encode
EAX's operand as a full 32 bits, picking the special short accumulator opcode
instead of the sign-extended form. This comes from the 8086 era, when the
16-bit registers ruled. In 16-bit code most instructions which affect AX
already had dedicated short forms, therefore there was little need for
sign-extended operands on AX; and since the 32-bit encodings come directly
from the 16-bit ones, assemblers keep choosing the accumulator opcode, with
its full 32-bit immediate, for EAX. Except if you use the
good old LEA ;-) So 'LEA EAX,[EAX+12]' costs three bytes while 'ADD EAX,12'
eats five. Of course, only in 32-bit code! In real mode 'LEA EAX,[EAX+12]'
takes TWO prefix bytes (slowing the 1-clock LEA back to THREE clocks).
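To make the byte counts concrete, here are the usual encodings (my own
annotation, not from the original article):
ADD EBX,12            ; 83 C3 0C           - 3 bytes (sign-extended imm8)
ADD EBX,80h           ; 81 C3 80 00 00 00  - 6 bytes (full imm32)
SUB EBX,0FFFFFF80h    ; 83 EB 80           - 3 bytes, same effect
ADD EAX,12            ; 05 0C 00 00 00     - 5 bytes (accumulator form)
LEA EAX,[EAX+12]      ; 8D 40 0C           - 3 bytes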
I've got one more addressing-mode trick. Instead of 'MOV EAX,[EBX*2]',
the 'MOV EAX,[EBX+EBX]' is better. The first one's code is seven bytes, the
second's is three ;-) Generally, every scaled address like [EBX*2] carries a
32-bit displacement, so [EBX*2] is in fact [EBX*2+00000000h]. But if
another register is present, the opcode can be shorter; for example,
'MOV EAX,[ESI+EBX*4+10h]' is a 4-byte instruction. Nevertheless, on the 486
every instruction which uses two registers in its addressing, or one
register scaled, takes one extra cycle! So 'MOV EAX,[EBX+EBX]' takes two
cycles, and so do 'MOV EAX,[EBX*2]' and 'LEA EAX,[EBP+EDX*8+1234h]'.
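As an aside, scaled addressing is also the classic way to multiply by small
constants without MUL; a quick sketch (my example, not from the original):
LEA EAX,[EAX+EAX*2]   ; EAX := EAX*3
LEA EAX,[EAX+EAX*4]   ; EAX := EAX*5 (chain them for *15 and friends)
Keep the 486's extra cycle for scaled addressing in mind, though.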
* Rounding *
Let's say we want to divide a number by a power of two (2, 4, 8, 16, etc.),
rounding the result to the nearest integer (halves go upwards). In this case
SAR EAX,4
ADC EAX,0
will perfectly work. The credit for this one goes to Good Old Dave :-)
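A quick sanity check with division by 16 (my own worked values): the Carry
after SAR holds the last bit shifted out, i.e. the 'half' bit.
; EAX=23 (17h): SAR -> 1,  CF=0 (bit 3 of 23 is 0), ADC -> 1  (23/16=1.4375 -> 1)
; EAX=24 (18h): SAR -> 1,  CF=1 (bit 3 of 24 is 1), ADC -> 2  (24/16=1.5    -> 2)
; EAX=-1:       SAR -> -1, CF=1,                    ADC -> 0  (-1/16=-0.0625 -> 0)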
* Penalties on the 486 *
In most cases, the 486 is free from flow-dependence penalties, which means
that an instruction using the result of the previous instruction
will not cause a slowdown:
ADD EAX,EBX
ADD ECX,EAX
takes two cycles. On a Pentium it takes two cycles too, but the
ADD EAX,EBX
ADD ECX,EBX
takes one cycle, because the second instruction doesn't use the result of the
first, so the two can be 'pair'-ed. These situations are quite well described
in the opt32.doc application note released by Intel; I just want to
point out one interesting thing. (By the way, there's a new version out
with Pentium Pro optimization tricks. Check ftp.intel.com!)
Generally, the 486 has two types of flow-dependence penalties:
1) immediately using a register after its 8-bit subregister was modified
(so this applies to (e)ax, (e)bx, (e)cx, (e)dx after al, bh, etc. has
been changed);
2) using a register in addressing immediately after it was modified.
(This is valid for all registers, but beware: LEA is an addressing
instruction, so try to avoid using it if its operand was modified by the
previous instruction.) For example, how many cycles does the following
code sequence eat (in protected mode, assuming 100% cache hits):
ADD ECX,EBP
ADC BL,DL
MOV AL,[EBX]
On a 486 the add is one, the adc is another one, but the mov takes three
even if the operand is already in the cache! Why? There is a double penalty:
one clock for using a register in addressing immediately after it was
modified (Address Generation Interlock, AGI); and another cycle for
using a register immediately after its subregister was modified (Flow Break).
So this innocent MOV instruction costs THREE cycles... Hey, I'm a smart
coder, I'm gonna put an instruction between the ADC and the MOV, and the
problem is solved! Really? The
ADD ECX,EBP
ADC BL,DL
SUB ESI,EBP
MOV AL,[EBX]
sequence takes 5 clocks: the add, adc and sub take three, but the mov still
takes two, because ONE cycle inserted BETWEEN the ADC and the MOV can save
only ONE penalty, not TWO. So for a perfect one-clock-per-instruction ratio,
at least TWO instructions have to be inserted; or one two-cycle instruction
like SHR, or even a prefixed one like ADD AX,BX in 32-bit code.
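For completeness, here's a fully de-penalized version (my own choice of
fillers; assume ESI and EDI hold independent work):
ADD ECX,EBP
ADC BL,DL
SUB ESI,EBP    ; filler, independent of BL/EBX
ADD EDI,EBP    ; second filler, ditto
MOV AL,[EBX]   ; both penalties hidden: five instructions, five clocks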
* Aligning inner loops *
Aligning the first instruction of an 'inner' loop to a 16-byte boundary may
increase performance: the 486's and the Pentium Pro's instruction
prefetcher (or whatever it is) just loves aligned addresses. But the
straightforward solution:
JMP _INNERLOOP
ALIGN 16
_INNERLOOP:
looks quite dirty from many points of view: it may waste both space and
speed. Certainly a more elegant method is needed, one which won't put in a
JMP when the _INNERLOOP label is already on a paragraph boundary; and when
only a few (up to three) bytes remain before the next aligned position, it
will put some NOPs there instead of a JMP:
CODEALIGN MACRO
LOCAL _NOW,_THEN
_NOW:
ALIGN 16
_THEN:
IF (_THEN-_NOW) ; Skip if already aligned
IF (_THEN-_NOW) LE 3 ; <=3 bytes remained?
ORG _NOW ; Set position back
DB (_THEN-_NOW) DUP (90h) ; Overwrite the padding with NOPs
ELSE
ORG _NOW ; Set position back
JMP _THEN ; Jump to the boundary
ALIGN 16 ; Align again after the JMP
ENDIF
ENDIF
ENDM
Simply put 'codealign' before the top of the inner loop:
(...Inner loop preparation code...)
codealign
_INNERLOOP:
(...Inner loop...)
This one is not fully optimized from a speed point of view: instead of
two NOPs (when two bytes remain) a single 'MOV ESP,ESP' would be faster, and
instead of three NOPs an 'ADD ESP,0'.
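For reference (my own byte counts): 'MOV ESP,ESP' encodes in two bytes
(89 E4) and 'ADD ESP,0' in three (83 C4 00), so each replaces its run of
NOPs exactly while executing as a single instruction.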
* Aligning a memory pointer *
After allocating a few doublewords it's not certain that the array's
starting address falls on a dword boundary, so we have to adjust it:
ADD EBX,3
AND EBX,0fffffffch
With this technique it's possible to align to any power of two. This idea
came from Technomancer's PM kernel.
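In general (my own spelling-out of the rule): to align up to a 2^n boundary,
add 2^n-1 and mask with NOT (2^n-1). For a 16-byte boundary:
ADD EBX,15 ; 2^4 - 1
AND EBX,0fffffff0h ; NOT (2^4 - 1)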
Another (space-saving) trick applies when flags need to be stored with
pointers; for example, in 3D code one pointer belongs to every face
(triangle). At runtime we need one bit to indicate whether the face is
visible or not. It would be too expensive to allocate another byte for every
triangle. But where can that one bit be stored? Right in the pointer. This
requires that the faces' structures always start on even addresses; in
this case their addresses' lowest bit is always zero, so the pointers'
lowest bit is always zero. This lowest bit can be used as a flag: set by the
appropriate routine (e.g. 'int face_visible(face *)') and
checked by other routines. Only take care to clear it before accessing the
face's structure.
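A minimal sketch of the bookkeeping (my example, assuming the face pointer
sits in EBX):
OR EBX,1 ; mark the face visible
TEST EBX,1 ; visible?
JZ _SKIPFACE
AND EBX,0fffffffeh ; clear the flag before touching the structure
MOV EAX,[EBX] ; now safe to access the face
_SKIPFACE: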
* Kicking out conditional jumps, Part II *
When optimizing for the Pentium (or for the Pentium Pro, with an eye on
compatibility) it might save some cycles to change the conditional jumps
to non-branching code. For example, the next code sequence implements this
IF construction:
IF (eax = ebx) THEN ecx=x;
ELSE ecx=y;
In asm:
SUB EAX,EBX
CMP EAX,1
SBB ECX,ECX
AND ECX,x-y
ADD ECX,y
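To see why it works, a quick trace of both cases (my annotation):
; EAX=EBX:  SUB -> EAX=0,   CMP EAX,1 -> CF=1, SBB -> ECX=ffffffffh,
;           AND -> ECX=x-y, ADD -> ECX=x
; EAX<>EBX: SUB -> EAX<>0,  CMP EAX,1 -> CF=0, SBB -> ECX=0,
;           AND -> ECX=0,   ADD -> ECX=y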
It also reveals how to copy the Zero flag to the Carry in one cycle
('CMP EAX,1' ;-) On the Pentium Pro, however, the next version should be
preferred (note that CMOV takes no immediate operand, so y has to sit in a
register or in memory first):
MOV ECX,x
MOV EDX,y
CMP EAX,EBX
CMOVNE ECX,EDX
Yeah, the Pro has been blessed with conditional move instructions:
cmovs, cmovne, cmovg, and all the rest. As always, the optimization
techniques of the Pro are completely different from those of the other
members of the Intel family. For example, using a 32-bit register after its
subregister was modified may cause a 6-7 clock penalty! Can you believe it?
Linear texture-mapping inner loops using the classic 8-bit U/V coordinates
suck... Some other notes: aligning loops to a paragraph boundary now boosts
performance again, since the Pro always loads 16 bytes of code in one step.
The formerly banned instructions CDQ and MOVZX are also faster there than
the MOV/shift-crafted substitutes recommended for the Pentium.
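To illustrate the partial-register issue (my own sketch, based only on the
penalty just described): on the Pro,
MOV BL,AL ; writes an 8-bit subregister...
MOV ECX,[EBX] ; ...then reading full EBX stalls for several clocks
whereas
MOVZX EBX,AL ; full 32-bit write, no partial-register stall
MOV ECX,[EBX]
runs without the penalty.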
* Do you have any 'Code Gems' ? *
...but you don't feel inspired enough to write a whole article? Do you want
to share your knowledge with Imphobia readers?
Right then, please send your gems to me! I will publish them in the next issue.
Ervin / AbaddoN
Ervin Toth
48/a, Kiss Janos street
1126 Budapest
HUNGARY
+36-1-201-9563
ervin@sch.bme.hu