Code Gems Part 1

This text comes from IMPHOBIA Issue IX - February 1995


Welcome to the first part of Code Gems! Here follow some nice tricks for the Intel processor series. The aim of this article is to introduce those few-byte-long jewels of assembly coding which undoubtedly prove:

"There's always a better way."

A respectable part of these was debugged out of various products, while others were worked out experimentally by me or by one of my friends, and I'm pretty sure that you already know many of them. So there are no unambiguous credits for them.

* ECX-Loop in 16-bit code *

By default, TASM doesn't support real mode LOOPs using ECX. So we have to write our own little macro:
ELOOP   macro   _label
        db      67h             ; address-size prefix: LOOP counts with ECX
        loop    _label
endm
The same macro gives a CX-LOOP in 32-bit code too, since the 67h prefix toggles the address size that selects the counter. Similar macros can be written for LOOPE and LOOPNE. It works even if JUMPS is activated; but in my opinion JUMPS isn't so good for optimizing, it rather serves convenience.

* Rejecting JUMPS *

It's nice that the 386 knows long conditional jumps, but there is no long form of LOOP. When a long LOOP is needed (and JUMPS is on), TASM compiles this:
        loop    cycle_temp
        jmp     cycle_end
cycle_temp:
        jmp     cycle_start
cycle_end:
From the optimization point of view (both size & speed) this isn't good. What I do is keep JUMPS on until the final version, then compile without JUMPS and fix the remaining LOOPs by hand. Be careful with JUMPS in 286-compatible code too. With a little brainwork another dozen bytes can be saved. Take a look at this piece of initializing code:
test_config:
        [VGA checking code]
        jne     bad_config
        [286 checking code]
        jne     bad_config
        [mouse checking code]
        jne     bad_config
        [soundcard checking code]
        jne     bad_config
        ...
If BAD_CONFIG is too far from this code, every conditional jump will be expanded into two instructions. So if we put a
bad_config_collector:
        jmp     bad_config
instruction close enough to TEST_CONFIG and replace all JNE BAD_CONFIG with JNE BAD_CONFIG_COLLECTOR, then we save another few bytes. Of course this only pays off when BAD_CONFIG can't be brought any closer.

* Nested Loops *

Sometimes there's a need for little nested loops. One solution:
        mov     cl,outer_cycle_num
outer_cycle:
        [outer cycle code]
        mov     ch,inner_cycle_num
inner_cycle:
        [inner cycle code]
        dec     ch
        jne     inner_cycle
        loop    outer_cycle
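This is two bytes shorter than the DEC CL/JNE combination. It works because CH is zero when the inner loop ends, so LOOP, which decrements the whole CX, effectively counts down CL. It was invented by TomCat / AbaddoN while developing a bootsector intro.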

* Optimizing with ESP *

Using ESP as a general-purpose register isn't common, because interrupts normally have to be disabled while it holds something else. But check this out: in real mode (and sometimes in protected mode) the stack operations ignore the upper word of ESP. (Except if a protected mode program forgot to reload the segment rights & limits. This is why I recommend restoring the default real mode settings.) So when SP=0000, the first word pushed will be placed at SS:FFFE. Ergo, if we initialize ESP to 00010000h, we have a wonderful 32-bit constant, stack operations still refer to the top of the stack segment, and interrupts can stay enabled. For example, let's assume we want a big nested loop (ECX is the only free register, and ESP=00010000h):
        mov     ecx,(outer_num)*10000h
outer_cycle:
        [outer cycle code]
        mov     cx,inner_num
inner_cycle:
        [inner cycle code]
        loop    inner_cycle
        sub     ecx,esp
        jne     outer_cycle
The ESP trick can be combined with the other nested-loop method, and it is a possible technique for using the upper words of the 32-bit registers without a couple of SHRs. The disadvantage is that CX must be zero when the SUB ECX,ESP executes. But if we restrict the upper word to 15 bits, the lower word can hold anything.
Example:
        mov     ebx,(cyclenum-1)*10000h
cycle:
        [cycle code]
        sub     ebx,00010000h
        jns     cycle
BX can contain any value and won't be touched. Cyclenum-1 can be at most 8000h.

Another small thing concerning the stack: on the 386+, after any instruction that loads SS (MOV SS or POP SS), interrupts are held off until the next instruction has executed. So when SP is loaded right after SS, we can save that CLI/STI pair.

* REP zeroes CX *

Usual problem: mem->mem copy. DS:SI, ES:DI and CX are prepared, but
        rep     movsb
is slow... And
        shr     cx,1
        jnb     copyeven
        movsb
copyeven:
        je      copyready
        rep     movsw
copyready:
is also slow...
Then comes the light:
        shr     cx,1
        rep     movsw
        adc     cx,cx
        rep     movsb
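sounds good. REP MOVSW leaves CX at zero and doesn't touch the flags, so ADC CX,CX turns the carry from the SHR (the odd byte) into a count of 0 or 1 for the final REP MOVSB.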

I found it in the SSI Spring '94 Software demo by Future Crew. Yes, I debugged! And it was worth it... Remember, LOOP also leaves (E)CX at zero.

* Puzzle *

Let's assume that EAX contains 0, except for the lowest 8 bits (AL).
How many instructions are needed to fill the upper 3 bytes of EAX with AL
(without any pre-calculated tables)?
E.g. if EAX=000000e3, it should be transformed to e3e3e3e3.
If you wish to work it out yourself, think about it a little before you read further.
Note: The solution to this will be in Code Gems 2!