Code Gems Part 1
This text comes from IMPHOBIA Issue IX - February 1995
Welcome to the first part of Code Gems! Here are some nice tricks for
the Intel processor series. The aim of this article is to introduce those
few-byte-long jewels of assembly coding which undoubtedly prove:
"There's always a better way."
A respectable part of these was found by debugging various products, while
others were worked out experimentally by me or by one of my friends, and I'm
pretty sure that you already know many of them. So there are no unambiguous
credits for these.
* ECX-Loop in 16-bit code *
By default, TASM doesn't support real mode LOOPs using ECX. So we have to
write our little macro:
ELOOP   macro _label
        db 67h          ; address-size prefix: LOOP counts with ECX
        loop _label
        endm
The same prefix gives a CX-LOOP in 32-bit code. Similar macros can be written
for LOOPE and LOOPNE. The macro works even when JUMPS is active, but in my
opinion JUMPS is no good for optimizing; it serves convenience rather than
efficiency.
* Rejecting JUMPS *
It's nice that the 386 has long conditional jumps, but there is no long form
of LOOP. When a long LOOP is needed (and JUMPS is on), TASM compiles this:
        loop cycle_temp
        jmp cycle_end
cycle_temp:
        jmp cycle_start
cycle_end:
From the optimization point of view (both size & speed) this isn't good.
What I do is keep JUMPS on until the final version, then compile without
JUMPS and fix the remaining LOOPs by hand. Be careful with JUMPS in
286-compatible code too. With a little brainwork another dozen bytes
can be saved. Take a look at this piece of initializing code:
test_config:
        [VGA checking code]
        jne bad_config
        [286 checking code]
        jne bad_config
        [mouse checking code]
        jne bad_config
        [soundcard checking code]
        jne bad_config
        ...
If bad_config is too far from this code, every conditional jump will be
expanded into two instructions. So if we put a
bad_config_collector:
        jmp bad_config
instruction close enough to TEST_CONFIG and replace every JNE BAD_CONFIG
with JNE BAD_CONFIG_COLLECTOR, we save another few bytes. Of course, this
only pays off when BAD_CONFIG can't be brought any closer.
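The saving can be estimated with a little arithmetic. The sketch below is only a model, assuming the two-instruction expansion described above (an inverted short Jcc over a near JMP) and the usual 16-bit encodings: 2 bytes for a short Jcc and 3 bytes for a near JMP. The function names are made up for this illustration.

```c
#include <assert.h>

/* Instruction sizes in 16-bit code: Jcc rel8 = 2 bytes, JMP rel16 = 3 bytes. */
enum { JCC_SHORT = 2, JMP_NEAR = 3 };

/* Every far conditional jump becomes "inverted Jcc short" + "JMP near". */
int bytes_expanded(int jumps) {
    assert(jumps >= 0);
    return jumps * (JCC_SHORT + JMP_NEAR);
}

/* Every conditional jump stays short; one shared JMP near is added
   at the collector label. */
int bytes_collector(int jumps) {
    assert(jumps >= 0);
    return jumps * JCC_SHORT + JMP_NEAR;
}
```

For the four checks in TEST_CONFIG this model gives 20 bytes against 11, and the gap grows by 3 bytes with every further check.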
* Nested Loops *
Sometimes there's a need for little nested loops. One solution:
        mov cl,outer_cycle_num
outer_cycle:
        [outer cycle code]
        mov ch,inner_cycle_num
inner_cycle:
        [inner cycle code]
        dec ch
        jne inner_cycle
        loop outer_cycle        ; CH=0 here, so CX=CL
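This is two bytes shorter than the DEC CL/JNE combination. It works because
CH is always zero when the LOOP executes, so counting CX is the same as
counting CL. It was invented by TomCat / AbaddoN while developing a
bootsector intro.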
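To see that the shared CX really behaves like two separate counters, here is a small C model of the loop above. It is only a sketch: the function name is hypothetical, the bodies are replaced by a counter, and both counts are assumed to be nonzero.

```c
#include <assert.h>
#include <stdint.h>

/* Models the CL/CH nested loop; returns how often the inner body runs. */
unsigned nested_cl_ch(uint8_t outer_num, uint8_t inner_num) {
    assert(outer_num != 0 && inner_num != 0);
    uint16_t cx = outer_num;                    /* mov cl,outer_cycle_num */
    unsigned inner_runs = 0;
    do {
        /* mov ch,inner_cycle_num */
        cx = (uint16_t)((cx & 0x00FFu) | ((unsigned)inner_num << 8));
        do {
            inner_runs++;                       /* [inner cycle code] */
            cx = (uint16_t)(cx - 0x0100u);      /* dec ch */
        } while ((cx >> 8) != 0);               /* jne inner_cycle */
        cx = (uint16_t)(cx - 1u);               /* loop: decrement CX ... */
    } while (cx != 0);                          /* ... and jump if CX != 0 */
    return inner_runs;
}
```

With outer_num=3 and inner_num=5 the inner body runs 15 times, exactly as with two independent counters.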
* Optimizing with ESP *
Using ESP as a general-purpose register is uncommon, because interrupts
have to be disabled while it is used that way. But check this out: in real
mode (and sometimes in protected mode) the stack operations ignore the upper
word of ESP. (Except when a protected mode program forgot to reload the
segment rights & limits. This is why I recommend restoring the default real
mode settings.) So, when SP=0000, the first word pushed is placed at
SS:FFFE. Ergo, if we initialize ESP to 00010000h, we have a wonderful 32-bit
number, stack operations refer to the top of the stack segment, and
interrupts can stay enabled. For example, let's assume we want a big nested
loop (ECX is the only free register, and ESP=00010000h):
        mov ecx,(outer_num)*10000h
outer_cycle:
        [outer cycle code]
        mov cx,inner_num
inner_cycle:
        [inner cycle code]
        loop inner_cycle        ; counts CX only, leaves CX=0
        sub ecx,esp             ; ESP=00010000h: decrements the upper word
        jne outer_cycle
This can be combined with the other nested loop method. It is also a
possible technique for using the upper words of the 32-bit registers without
a couple of SHRs. The disadvantage is that CX must be zero when the
SUB ECX,ESP occurs, because the JNE tests all 32 bits. But if we restrict the
usage of the upper word to 15 bits and test the sign instead, the lower word
can be anything.
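The SUB ECX,ESP loop above can be checked with a small C model. This is a sketch under the stated assumption ESP=00010000h; the function name is hypothetical, the loop bodies are reduced to a counter, and both counts are assumed to be nonzero.

```c
#include <assert.h>
#include <stdint.h>

/* Models the ECX nested loop; returns how often the inner body runs. */
uint32_t nested_esp(uint16_t outer_num, uint16_t inner_num) {
    assert(outer_num != 0 && inner_num != 0);
    const uint32_t esp = 0x00010000u;           /* assumed value of ESP */
    uint32_t ecx = (uint32_t)outer_num << 16;   /* mov ecx,(outer_num)*10000h */
    uint32_t inner_runs = 0;
    do {
        ecx = (ecx & 0xFFFF0000u) | inner_num;  /* mov cx,inner_num */
        do {
            inner_runs++;                       /* [inner cycle code] */
            /* loop in 16-bit code decrements CX only, not the upper word */
            ecx = (ecx & 0xFFFF0000u) | ((ecx - 1u) & 0x0000FFFFu);
        } while ((ecx & 0x0000FFFFu) != 0);
        ecx -= esp;                             /* sub ecx,esp */
    } while (ecx != 0);                         /* jne outer_cycle */
    return inner_runs;
}
```

The outer test only works because the inner LOOP has left CX at zero, so ECX reaches exactly 0 after the last SUB.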
Example:
        mov ebx,(cyclenum-1)*10000h
cycle:
        [cycle code]
        sub ebx,00010000h
        jns cycle
BX can hold any value, and it is not touched. Cyclenum-1 can be at most 8000h.
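Both claims, the untouched BX and the 8000h limit, can be tried out in a C model of the loop. Again this is only a sketch with made-up names; the loop body is reduced to an iteration counter.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t iterations;   /* how often [cycle code] ran */
    uint16_t bx;           /* low word of EBX after the loop */
} LoopResult;

/* Models the SUB EBX,00010000h / JNS loop with an arbitrary BX. */
LoopResult high_word_loop(uint32_t cyclenum, uint16_t bx) {
    assert(cyclenum >= 1u && cyclenum <= 0x8001u);
    /* mov ebx,(cyclenum-1)*10000h, with BX holding unrelated data */
    uint32_t ebx = ((cyclenum - 1u) << 16) | bx;
    LoopResult r = {0u, 0u};
    do {
        r.iterations++;                      /* [cycle code] */
        ebx -= 0x00010000u;                  /* sub ebx,00010000h */
    } while ((ebx & 0x80000000u) == 0u);     /* jns cycle */
    r.bx = (uint16_t)ebx;
    return r;
}
```

The subtraction never borrows from the low word, so BX survives, and the loop ends as soon as the upper word wraps from 0000 to FFFF.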
Another small thing concerning the stack: on 386+, after any instruction
that modifies SS, interrupts are disabled for the next instruction.
So we can save the CLI/STI pair usually put around an SS/SP reload.
* REP zeroes CX *
The usual problem: a mem->mem copy. DS:SI, ES:DI and CX are prepared, but
        rep movsb
is slow... And
        shr cx,1
        jnb copyeven
        movsb
copyeven:
        je copyready
        rep movsw
copyready:
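is also slow...
Then comes the light:
        shr cx,1        ; CX = word count, CF = the odd byte
        rep movsw       ; leaves CX=0 and the flags untouched
        adc cx,cx       ; CX = 0+0+CF
        rep movsb       ; copies the odd byte, if there is one
sounds good.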
I found it in the SSI Spring '94 Software demo by Future Crew. Yes, I
debugged! And it was worth it... Remember, LOOP also zeroes (E)CX.
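The branch-free copy can be followed step by step in a C model. This is a sketch, not the demo's code: the function name is hypothetical, the carry flag is an ordinary variable, and the direction flag is assumed to be clear (forward copy).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models SHR CX,1 / REP MOVSW / ADC CX,CX / REP MOVSB.
   Copies cx bytes and returns the final CX. */
uint16_t copy_words_then_byte(uint8_t *dst, const uint8_t *src, uint16_t cx) {
    assert(dst != NULL && src != NULL);
    unsigned cf = cx & 1u;              /* shr cx,1: CF = bit shifted out */
    cx >>= 1;                           /*           CX = number of words */
    while (cx != 0) {                   /* rep movsw: runs until CX=0 */
        memcpy(dst, src, 2);
        dst += 2;
        src += 2;
        cx--;
    }
    cx = (uint16_t)(cx + cx + cf);      /* adc cx,cx: 0+0+CF */
    while (cx != 0) {                   /* rep movsb: 0 or 1 byte */
        *dst++ = *src++;
        cx--;
    }
    return cx;
}
```

The trick depends on two facts the model makes visible: REP MOVSW always leaves CX at zero, and MOVS does not change the carry flag, so CF from the SHR is still valid at the ADC.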
* Puzzle *
Let's assume that EAX is zero except for its lowest 8 bits (AL).
How many instructions are needed to fill the upper 3 bytes of EAX with AL
(without any precalculated tables)?
E.g. if EAX=000000e3, it should be transformed to e3e3e3e3.
If you wish to guess it yourself, think of it a little before you read further.
Note: The solution to this will be in Code Gems 2!