Code Gems Part 3

This text comes from IMPHOBIA Issue X - June 1995


Welcome and greetings! Prepare for another bunch of coding tips... By the way, if You have some nice tricks but don't feel inspired enough to write Code Gems part 5 Yourself, please send 'em to me. I'm running out of ideas, but I'm sure there are a couple of tricks left :-) I'd like to say a big, big thanks to my enthusiastic friends who helped me finish this article: Perla, Nicke, Deity, George, Stinyo, Rodrigo, G.O.D., #coders and everyone I forgot...

* Correction for the previous part *

In the last issue I wrote that TASM doesn't support the LOOP instruction with ECX instead of CX as the counter in 16-bit code. Well, I was wrong, sorry about that. Of course it can do that. (I've managed to glance at an original TASM book :-) There are two instruction aliases called LOOPW and LOOPD. The first one always uses CX as the counter (regardless of the size of the current code segment), the second one always uses ECX. The conditional forms work the same way: LOOPWE, LOOPDNE, and so on. JECXZ is also available.

* Calculating the absolute value of AX *

This wonderful 'gem' was developed by Laertis / Nemesis.
        cwd
        xor     ax,dx
        sub     ax,dx
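
For those who'd like to check the trick before trusting it, here is a small C sketch that models the three instructions on 16-bit values. The function name abs_ax is made up for this illustration; it is only a model of the register arithmetic, not part of the original routine.

```c
#include <stdint.h>

/* Model of CWD / XOR AX,DX / SUB AX,DX on 16-bit registers.
   abs_ax is a hypothetical name used only for this sketch. */
uint16_t abs_ax(uint16_t ax)
{
    /* cwd: DX becomes 0FFFFh if AX is negative, 0 otherwise */
    uint16_t dx = (ax & 0x8000u) ? 0xFFFFu : 0x0000u;

    ax = (uint16_t)(ax ^ dx);   /* xor ax,dx: one's complement if negative */
    ax = (uint16_t)(ax - dx);   /* sub ax,dx: subtracting -1 adds the missing 1 */
    return ax;
}
```

Note that 8000h stays 8000h, since +32768 doesn't fit in a signed word.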
* Short Compare, part II *

Checking if a register contains 8000h (or 80h or 80000000h):
        neg     register
        jo      it_was_8000
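If the register contained 8000h, its content doesn't change (the negation of 8000h is 8000h again, which is exactly why the overflow flag gets set). Any other value is negated, so apply NEG once more if You still need the original :-)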
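
A C sketch of why this works, modelling NEG as the subtraction 0 - x on a 16-bit register. The name neg16_overflow is hypothetical; the overflow rule used is the usual one for SUB (overflow when the operands have different signs and the result's sign differs from the first operand).

```c
#include <stdint.h>

/* Model of NEG on a 16-bit register: stores the negated value back
   and returns the overflow flag. Hypothetical helper for illustration. */
int neg16_overflow(uint16_t *reg)
{
    uint16_t x = *reg;
    uint16_t r = (uint16_t)(0u - x);

    *reg = r;
    /* For 0 - x, OF is set only when x and the result are both negative,
       which can only happen for x = 8000h. */
    return (x & r & 0x8000u) != 0;
}
```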

* Multi-Segment STOS/MOVS *

In flat real mode it's possible to do block moves that span several segments: the 67h address-size prefix makes the string instruction use ECX for counting and ESI / EDI for addressing.
ESTOSD  macro
        db      67h
        stosd
endm
For example, this code clears four megabytes of memory:
        xor     eax,eax
        mov     ecx,100000h
        mov     edi,200000h     ;STOSD writes to ES:[EDI]
        rep     estosd
* Pixel drawing in protected mode *

Here comes a 'routine' which sets a pixel to the given value in 256-color mode:
(parameters: EAX=X coordinate,EBX=Y coordinate, CL=color)

        add     eax,table[ebx*4]
        mov     [eax],cl
The only difference from the real mode method is that the TABLE doesn't contain the values 0, 320, 640, etc. It contains (0a0000h - base of DS), (0a0000h - base of DS + 320), ... so the segment base is already compensated for. There's another version which doesn't change EAX:
        mov     edx,table[ebx*4]
        mov     [eax+edx],cl
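
Building the table could look like the following C sketch. The names build_table, pixel_offset and linear_address, and the ds_base parameter, are assumptions made for this illustration; how You get the base of DS depends on Your extender or DPMI host.

```c
#include <stdint.h>

#define SCREEN_W     320
#define SCREEN_H     200
#define VIDEO_LINEAR 0xA0000u   /* linear address of mode 13h video memory */

static uint32_t table[SCREEN_H];

/* ds_base: linear base address of the data segment (hypothetical input) */
void build_table(uint32_t ds_base)
{
    for (uint32_t y = 0; y < SCREEN_H; y++)
        table[y] = VIDEO_LINEAR - ds_base + y * SCREEN_W;
}

/* The DS-relative offset that "add eax,table[ebx*4]" produces */
uint32_t pixel_offset(uint32_t x, uint32_t y)
{
    return x + table[y];
}

/* The linear address the CPU finally accesses: segment base + offset */
uint32_t linear_address(uint32_t ds_base, uint32_t offset)
{
    return ds_base + offset;
}
```

The subtraction may wrap around when the base of DS is above 0a0000h, but that's harmless: the 32-bit addition done by the CPU wraps the same way.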
* Simple recursive calls *

Sometimes we have to call one subroutine many times like this:
        mov     cx,4
        call    waitraster
        loop    $-3
But this requires a register as loop counter ;-) There's another way:
        call    waitraster4
        ...
waitraster4:
        call    waitraster2
waitraster2:
        call    waitraster
waitraster:
        mov     dx,03dah
        ...
        ret
Well, this is not really interesting. It just works :-) Every CALL returns right into the label that follows it, so the code below each CALL runs twice. Now a more useful example: loading instrument data to the AdLib card.
;Load AdLib instrument. Inputs:
;ds:si: register values (5 words;
;       lower byte: data for operator
;       1, higher byte: data for
;       operator 2)
;al:    adlib port (0,1,2,8,9,0ah,
;       10h,11h,12h)
loadinstr:
        mov     dx,388h
        add     al,0e0h
        call    double_load

        sub     al,0c6h
        call    double_double
        add     al,1ah
double_double:
        call    double_load
        add     al,1ah
double_load:
        call    final_load
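final_load:
        mov     ah,[si]         ;next data byte
        inc     si
        out     dx,al           ;register index -> 388h
        call    adlib_address_delay
        xchg    al,ah
        inc     dx
        out     dx,al           ;data -> 389h
        dec     dx
        call    adlib_data_delay
        mov     al,ah
        add     al,3            ;operator 2 is 3 registers higher
        ret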
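
To see which registers get written and in what order, here is a C sketch that unrolls the fall-through structure into ordinary nested calls and records the value of AL at every write. The function loadinstr_order and the written[] array are hypothetical names for this illustration; no port access is modelled.

```c
#include <stdint.h>

uint8_t al;            /* models AL: the current register index */
uint8_t written[16];   /* register indices in the order they are written */
int     count;

/* One register write, then step to the same register of operator 2 */
void final_load(void)    { written[count++] = al; al += 3; }

/* "call final_load" followed by falling into final_load: runs it twice */
void double_load(void)   { final_load(); final_load(); }

/* "call double_load / add al,1ah" followed by falling into double_load */
void double_double(void) { double_load(); al += 0x1a; double_load(); }

int loadinstr_order(uint8_t offset)
{
    count = 0;
    al = (uint8_t)(offset + 0xe0);
    double_load();          /* E0h and E3h */
    al -= 0xc6;
    double_double();        /* 20h, 23h, 40h, 43h */
    al += 0x1a;
    double_double();        /* 60h, 63h, 80h, 83h (the fall-through) */
    return count;
}
```

For offset 0 this gives ten writes, two per register group, which matches the five words the routine expects at ds:si.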
* Hardware scroll with one page *

First a few words about vertical hardware scrolling. The 'standard' scroll requires at least two pages. In the beginning the first page is visible, and it's black. Then the screen moves up one row and the first row of the second page appears at the bottom. Now this row is copied to the 1st row of the 1st page (which is invisible by now). This process continues until the 2nd page is entirely visible. At this point the two pages are identical, so the 1st page is displayed again and the whole process starts from the beginning. The problem is the memory requirement, which is too big. With this method it's impossible to make a 640*480 scroll, since one page occupies more than 128k of video memory.

But why do we need two pages? Because the video memory is not 'circular'. I mean, if the memory were circular and we scrolled the screen up by one pixel, the 1st row of the video memory, which was at the top of the screen, would now be at the bottom. With this kind of video memory we could do a smooth vertical scroll with a single page: in the beginning, the screen is black. Now wait for a vertical retrace, change the 1st row, and shift the screen up by one row so that the freshly modified row appears at the bottom. Perfect, eh? The question is how we can make 'circular' memory...

It's a well-known fact that there's a certain problem with hardware scrolling on TSENG cards: every second page contains some 'noise' instead of the scroll we expect to see... The cause is the 'memory display start' register (3d4/0c,0d), which works a bit differently than on other cards. On other cards only the first 256k of the video memory is ever displayed, even if the memory display start register (MDSR) is set close to the end of the 256k. These cards handle this 256k as a circular buffer, but the TSENG boards don't:
.----------. <- screen -> .----------.
| video    |<-   MDSR   ->|video     |
| memory   |              |memory    |
|          |              |          |
|     3ffff|              |    3ffff |
|----------|              |----------|
|00000     |     TSENG -> |40000     |
| wraps    |              |continues |
|          | <- VGA       |          |
`----------´              `----------´
So what we can do is 'emulate' the standard VGA circular buffer with the LINE COMPARE REGISTER (LCR, 3d4/18h). The function of this register is pretty simple: when the scanline counter reaches this value, the display address wraps to 0, the beginning of the video memory:
            .---------.
    MDSR -> |video    |
            |memory   |
            |         |
line        |    ?????|
compare  -> |---------|
register    |00000    |
            |wraps    |
            |         |
            `---------´
The *big* advantage is that it's possible to emulate a circular video memory that is shorter than 256k! It should work on all VGA cards. The most elegant way is to add some LCR-changing code to the MDSR modifier routine. With this, existing 'standard' scrollers can be fixed for TSENG cards too. Remember, the line compare register is 10 bits wide: the lower eight bits are in 3d4/18h, bit 8 is bit 4 of 3d4/07h and bit 9 is bit 6 of 3d4/09h.
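
Splitting a 10-bit line compare value across the three CRTC registers can be sketched in C like this. The struct lcr_bits and the helpers split_line_compare and set_bit are hypothetical names for this illustration; the actual port reads and writes are left out, only the bit shuffling is shown.

```c
#include <stdint.h>

struct lcr_bits {
    uint8_t low8;       /* goes to CRTC index 18h: bits 0-7 */
    uint8_t overflow4;  /* value for bit 4 of CRTC index 07h: bit 8 */
    uint8_t maxscan6;   /* value for bit 6 of CRTC index 09h: bit 9 */
};

struct lcr_bits split_line_compare(uint16_t line)
{
    struct lcr_bits b;
    b.low8      = (uint8_t)(line & 0xFF);
    b.overflow4 = (uint8_t)((line >> 8) & 1);
    b.maxscan6  = (uint8_t)((line >> 9) & 1);
    return b;
}

/* Merge a single bit into a value read back from a register,
   leaving the other bits (which have other jobs) untouched. */
uint8_t set_bit(uint8_t old, unsigned bit, unsigned value)
{
    return (uint8_t)((old & ~(1u << bit)) | ((value & 1u) << bit));
}
```

Registers 07h and 09h hold other fields too, so they should be read, modified with something like set_bit, and written back rather than overwritten.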

* Gouraud shading - 2 instructions/pixel *

The main goal of this example is not really to show G-shading with two instructions ;-) It's rather an example of 'how to pull down the upper words of 32-bit registers without shifting'. There's often a need to calculate with fixed-point numbers: a doubleword's upper word is the whole part, the lower word is the fractional part. The problem is that the upper words of the 32-bit registers are hard to reach. For example, after ADD EAX,EBX how do we get EAX's upper word? There's no (quick) way :-( The idea behind the trick is to swap the upper & lower words and use ADC instead of ADD:
; EAX & EBX are fixed-point numbers
        ror     eax,16
        ror     ebx,16
cycle:
        ...
        adc     eax,ebx
        stosw
        ...
        loop    cycle
Now the whole part of the fixed-point numbers is in the lower words :-) The carry that falls out of the fractional part is picked up by the next ADC, so it's very important to preserve the Carry flag between two ADCs to get the right result. Now the Gouraud shading. The following piece of code is only a horizontal shaded line drawer routine, not the whole poly-filler. Colors are expected to be fixed-point numbers stored as doublewords, with an 8-bit whole part in the highest byte (this value will appear on the screen) and a 24-bit fractional part. (24 bits may seem to be a lot, but it's surely more accurate than 8 bits ;-)
;In:    eax:    end color
;       ebx:    start color
;       ecx:    line length
;       es:edi: destination
;!!!    32-bit PM version   !!!

gou_line:
        sub     eax,ebx

;Fill edx with the carry flag
        rcr     eax,1
        cdq
        rcl     eax,1

        idiv    ecx
;Pull down the upper parts of dwords
        rol     eax,8
        rol     ebx,8
        xchg    ebx,eax

;Calculate the address of the entry
;point in the linearized code
        neg     ecx
        lea     ecx,[ecx*2+ecx+320*3+offset gou_linearized]
        clc                     ;NEG left CF set, keep it out of the first ADC
        jmp     ecx

gou_linearized:
rept    320
        stosb
        adc     eax,ebx
endm
        ret
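
The arithmetic of gou_line can be followed in this C sketch, which models the rotated registers and the carry flag explicitly. The names rol32 and gou_line_model are made up for the illustration. It assumes that the carry flag is clear when the unrolled code is entered and that the quotient of the division fits in 32 bits (otherwise the real IDIV would fault).

```c
#include <stdint.h>

static uint32_t rol32(uint32_t v, unsigned n)
{
    return (v << n) | (v >> (32 - n));      /* n must be 1..31 */
}

/* start, end: colors in 8.24 fixed point; len: 1..320 pixels;
   out receives len color bytes. Hypothetical model, not the real routine. */
void gou_line_model(uint32_t start, uint32_t end, uint32_t len, uint8_t *out)
{
    /* sub eax,ebx + rcr/cdq/rcl: the difference as a signed 33-bit number */
    int64_t diff  = (int64_t)end - (int64_t)start;
    int64_t delta = diff / (int64_t)len;    /* idiv ecx, truncates toward zero */

    uint32_t eax = rol32(start, 8);         /* whole part is now in AL */
    uint32_t ebx = rol32((uint32_t)delta, 8);
    unsigned cf  = 0;                       /* assumption: carry clear on entry */

    for (uint32_t i = 0; i < len; i++) {
        out[i] = (uint8_t)eax;                      /* stosb */
        uint64_t sum = (uint64_t)eax + ebx + cf;    /* adc eax,ebx */
        eax = (uint32_t)sum;
        cf  = (unsigned)(sum >> 32);                /* carry out of the fraction */
    }
}
```

Because the carry out of the fraction only reaches the whole part one ADC later, a pixel can be one step below the exactly computed color. On screen this makes no visible difference.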
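Variations: If You want to use it in real mode, then You have to modify the linearized-code entry point calculation, because there the length of a stosb/adc pair is four bytes (ADC EAX,EBX needs a 66h operand-size prefix in 16-bit code):
        neg     cx
        shl     cx,2
        add     cx,320*4+offset gou_linearized
        clc                     ;ADD changed CF, keep it out of the first ADC
        jmp     cx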
486-optimization fans may think of some indexed linearized code instead of stosb :-) In this case take care to set up the linearized code correctly, because 'mov [edi+0],al', 'mov [edi+1],al' and 'mov [edi+200h],al' assemble to different lengths, so a rept won't give equal-length instructions.
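
One way around this is to compute the entry points from the real instruction lengths, as in the following C sketch. The names mov_edi_disp_len and entry_offset are hypothetical. It assumes 32-bit code, an assembler that always picks the shortest encoding, and that every store is followed by a two-byte ADC EAX,EBX.

```c
#include <stdint.h>

/* Encoded length in bytes of "mov [edi+disp],al" in 32-bit code,
   assuming the assembler chooses the shortest form. */
unsigned mov_edi_disp_len(int32_t disp)
{
    if (disp == 0)
        return 2;                   /* 88 07: no displacement */
    if (disp >= -128 && disp <= 127)
        return 3;                   /* 88 47 xx: 8-bit displacement */
    return 6;                       /* 88 87 xx xx xx xx: 32-bit displacement */
}

/* Byte offset of the code for pixel n inside the linearized block,
   where pixel i is "mov [edi+i],al" followed by "adc eax,ebx". */
unsigned entry_offset(unsigned n)
{
    unsigned off = 0;
    for (unsigned i = 0; i < n; i++)
        off += mov_edi_disp_len((int32_t)i) + 2;
    return off;
}
```

The offsets could be put into a small table at startup and used instead of the LEA calculation. Another option is to force the long displacement form everywhere so that all pairs have equal length again.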