FASTCLR.BA?,.ASM                                   ((C) 2005 by YSS)
----------------

Test program that fast-clears the VGA mode $13 screen with a 16000 cycle byte
loop, through the use of VGA's internal 32bit video bus.  Read the notes for
technical details.  The notes may be somewhat scattered, but all the info
needed is there.  They were put together from observations made while
programming the VGA registers directly and from reading tech info on VGA, so
that's just what they are: notes.

This file lists one example in both BASIC and 80x86 ASSEMBLY versions.
Both listings are further down, under WORKING EXAMPLES.


Premise: It's possible to clear the VGA mode 13 bitmap at 4X speed, i.e.,
         through a byte loop that repeats 16000 times instead of 64000 times.



INTRO
-----

A long time ago, 2600 taught me -and some friends as well- that during game
programming every cycle counts (if you want to make sharp smooth games
anyway).  Clearing the bitmap screen is no exception, and you'll want to
do it as fast as possible.  Instead of using the 64000 byte clear loop (or
its 32000 16bit clr version), why not use VGA's internal 32bit bus to do the
same thing at 4x normal speed? That's what FASTCLR.BA?,.ASM do:  They
program VGA's sequencer for this purpose.  The details are down below in
the notes.

FASTCLR.BA?,.ASM work in 7 easy steps:

1- Set hires mode 13
2- Fill the screen with a test pattern (anything)
3- Wait for a key before clearing.  If key=esc -> end
4- Tell VGA's sequencer NOT TO CHAIN its 4 bit planes during CPU access.
5- Wipe the ENTIRE video screen (320x200, 64000 bytes) thru just 16000 bytes.
6- Wait for a key again.  If key=esc -> end.
7- If not, CHAIN back the way it was and repeat from step 2.

** end means set screen back to normal and exit.

Good luck.
pixelrat@hotmail.com
///////////////////////////////////



Notes: 
                                                            (Mar 30, 2005)
03C4 #4 BT3 controls the "chaining" of bit planes in VGA.

('03C4 #4' is to be read 'Port 03C4 index 4' meaning write 04 to port 03C4
                                             and read or write from/to
                                             port 03C5 to read/alter index 4
                                             of 03C4)

[Incidentally, port 03C4 on ibm/at style machines with VGA is mapped to
VGA's "Sequencer Registers", which has 5 registers.  Register 4 (index 4)
is called the "Memory Mode Register" and that's the one in question
at this time.]
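
To make the 'Port 03C4 index N' notation concrete, here is a small Python toy
model of an index/data port pair.  It does not touch any hardware; the class
name IndexedPorts and its method names are made up for this sketch, and the
values written are examples only.

```python
# Toy model of an index/data port pair such as 03C4 (index) / 03C5 (data).
# Hypothetical names, no real port I/O happens here.

class IndexedPorts:
    def __init__(self, count):
        self.regs = [0] * count   # the registers hiding behind the data port
        self.index = 0            # last value written to the index port

    def out_index(self, value):   # plays the role of OUT 03C4,value
        self.index = value

    def out_data(self, value):    # plays the role of OUT 03C5,value
        self.regs[self.index] = value & 0xFF

    def in_data(self):            # plays the role of IN 03C5
        return self.regs[self.index]

seq = IndexedPorts(5)             # the sequencer has 5 registers (0 to 4)
seq.out_index(4)                  # select index 4, the Memory Mode Register
seq.out_data(0x0E)                # example value only
seq.out_index(2)                  # select index 2, the Map Mask Register
seq.out_data(0x0F)                # example value only
seq.out_index(4)                  # back to index 4 ...
memory_mode = seq.in_data()       # ... and read it through the data port
print(hex(memory_mode))
```

The point is only that the data port always talks to whichever register the
index port selected last.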


[Remember how VGA cards use memory banks called "display memory bit planes".
There are 4 planes which are numbered 0 to 3.  Those bit planes are where
video image data comes from in VGA.]


When planes are "chained", CPU accesses to them are individual, that is,
sequential. 
                                                                   
Example:   For addresses incrementing one byte at a time, the CPU accesses
one plane first, then the next one in chain, then a third and then a fourth.
Afterwards it accesses the first one it first accessed again but one
sequential byte after (sequential in the plane), and so on..

[sequential in the plane means that each plane has 64K bytes, numbered from
0 to 65535.  That sequence is not necessarily followed by the CPU
address decoding, which is what this is all about.]

[** CHAINING seems to affect CPU accesses only.  Display output doesn't seem
    to be affected.]

[ When the bit planes are chained, when CPU writes to VGA, it writes to one
single plane at a time.  When planes are "not chained", "loose", "unchained"
or however you'd call it, when CPU writes to VGA it can write to ALL planes
AT ONCE.]



The chaining is actually controlled by A1,A0, the two least significant
CPU address bits into VGA memory.  That means that accesses A1-A0=10 are
done upon VGA memory bit plane 2 ('10' binary).  That also means the
following:

          Every contiguous byte in each plane is accessed every 4th CPU
          access, since the last three were done upon the other planes.
          I know it may be kind of hard to picture, but it works.

          Ex.:  In chained mode,

   OFFSET ADDR = 0000 (A1-A0=00) accesses byte 0 of plane 0
                 0001        01     ''     ''  0 ''   ''  1
                 0002        10     ''     ''  0 ''   ''  2
                 0003        11     ''     ''  0 ''   ''  3
                 0004        00     ''     ''  1 ''   ''  0
                 .
                 .
                 and so on..

                 In other words, if we do MOV AX,A000
                                          MOV ES,AX
                                          ES: MOV AL,[0000]

                                          then AL= byte 0 of plane 0

                                          Now, when we do

                                          ES: MOV AL,[0003]

                                          AL = byte 0 of plane 3

                                          now, when we do

                                          ES: MOV AL,[0004]

                                          AL = byte 1 of plane 0, and so on.
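
The chained-mode table above can be written down as a tiny Python function.
This is only a sketch of the decoding as these notes describe it (A1,A0 pick
the plane, the remaining bits pick the byte inside the plane); the function
name chain4_decode is made up for the example.

```python
def chain4_decode(offset):
    """Return (plane, byte_in_plane) for a CPU offset, following the table."""
    plane = offset & 3            # A1,A0: the two lowest address bits
    byte_in_plane = offset >> 2   # every 4th CPU address lands in one plane
    return plane, byte_in_plane

# Walk the first few offsets, like the table does.
for off in range(6):
    print(off, chain4_decode(off))

last_pixel = chain4_decode(63999)  # last byte of the 320x200 screen
print(last_pixel)
```

Under this description the last screen byte (offset 63999) lands on byte
15999 of plane 3, which is where the 16000 figure comes from.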



          Now, in un-chained mode (NOT CHAIN 4),

   OFFSET ADDR = 0000 accesses byte 0 of planes 0,1,2 and 3 ALL AT ONCE.

   it does this because VGA has an internal 32 bit bus, so for every CPU
   8 bit access VGA performs (or can perform?) one 32bit access.

   So by the same token,

   OFFSET ADDR = 0001 accesses byte 1 of planes 0,1,2 and 3 ALL at once.
                 0002    ''     ''  2 '    '    .....
                 .
                 .
                 and so on.

Now the trick.  If a CPU 8 bit write is performed when C4=0 (the chain 4
                bit, 03C4 #04 bt3) and the right conditions are met, the 8
                bits from the CPU are copied into all four planes at once,
                so with one single byte write 4 actual bytes are written
                into VGA memory.
                (Those right conditions are set when video mode 13 is set
                from BIOS by the famous 'MOV AX,13/INT 10' ML lines.)


Since 64000 bytes are used by mode 13 that means that somehow
16000 bytes from each bit plane are used.  The rest of them.. who cares
about the rest.. right?

So to clear THE ENTIRE 320X200 mode 13 screen all that's needed is to clear
the 1st 16000 bytes from each plane, so our loop goes from 0 to 15999.
Since 4 bytes at a time are written, our 16000 cycle actually clears 64000
bytes.  Neat, huh?  Speed gain of 4x.
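
The 16000 versus 64000 count can be checked with a small Python simulation.
It models the four planes as plain byte arrays and follows the decoding
described in these notes; it is a model for counting writes, not VGA code,
and the helper names chained_write and unchained_write are made up here.

```python
PLANES = 4
SCREEN_BYTES = 320 * 200                      # 64000 bytes in mode 13

# Only the first 16000 bytes of each plane matter for this model.
planes = [bytearray(SCREEN_BYTES // PLANES) for _ in range(PLANES)]

def chained_write(offset, value):
    # CHAIN 4: one CPU write reaches one plane only.
    planes[offset & 3][offset >> 2] = value

def unchained_write(offset, value):
    # NOT CHAIN 4, all planes enabled: one CPU write reaches all 4 planes.
    for plane in planes:
        plane[offset] = value

# Step 2 of the program: fill the screen with a nonzero test pattern.
for off in range(SCREEN_BYTES):
    chained_write(off, (off % 255) + 1)

# Step 5: wipe it with a 16000 cycle loop.
cpu_writes = 0
for off in range(SCREEN_BYTES // PLANES):
    unchained_write(off, 0)
    cpu_writes += 1

all_clear = all(not any(plane) for plane in planes)
print(cpu_writes, all_clear)
```

In the model the screen ends up all zero after 16000 CPU writes, against the
64000 it took to fill it in chained mode.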

To clear chain4:

   MOV DX,03C4
   MOV AL,04
   OUT DX,AL      ;03C4 = 04
   INC DX
   IN AL,DX       ;
   AND AL,F7      ;03C5 = PEEK(03C5) AND $F7
   OUT DX,AL      ;
                  ;ALL THIS AMOUNTING TO:  CLEAR BT3 OF INDEX 04 IN 03C4
                  ;

then to clear 320x200 -> AT SEGMENT A000:
                         FOR T=0 TO 15999:POKE T,0:NEXT
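
The bit arithmetic in the 'clear chain4' lines above, sketched in Python.
The starting value 0E is an assumption for the example (it is the value
commonly quoted for index 4 after a BIOS mode 13 set); the real code reads
whatever is in the register.  The function names are made up.

```python
CHAIN4_BIT = 0x08                 # bit 3 of 03C4 index 4

def clear_chain4(memory_mode):
    # Same as AND AL,F7: drop bit 3, keep the other bits as they were.
    return memory_mode & 0xF7

def set_chain4(memory_mode):
    # Put bit 3 back to go to chained mode again.
    return memory_mode | CHAIN4_BIT

before = 0x0E                     # assumed register content, example only
after = clear_chain4(before)
restored = set_chain4(after)
basic_style = before & ~8         # the BASIC listing's 'B AND NOT 8'
print(hex(before), hex(after), hex(restored), hex(basic_style))
```

Only bit 3 changes, which is why the listings read the register first
instead of just writing a fixed value.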




An analysis of 'CHAIN 4':

[chaining seems to work on [and can be analyzed as working on] two levels:
1- on CPU address decoding to/from VGA
2- and on how many planes the CPU affects at once during writes.

From the decoding point of view, NOT CHAIN 4 accesses planes sequentially
in the sequential byte order of the bit planes.  Which plane to access
is controlled by the MAP MASK REGISTER (03C4 #02, BTS3-0, one bit
                                            each plane, plane 0 bit 0,
                                            plane 1 bt1, and so on..)
Read address offset 0 reads byte 0 from plane ????
                     1            1            ???? ... and so on..

Still from the decoding point of view, CHAIN 4 accesses planes not
sequentially (in the sequential byte order of the bit planes).  Instead,
it accesses byte 0 from plane 0 from offset address 0 (A000:0000).
Byte 0 from plane 1 from offset address 1 (A000:0001).  Byte 0 from plane 2
from offset address 2.  Byte 0 from plane 3 from offset address 3.
Byte 1 from plane 0 from offset address 4.   Another way of looking at it is
that it accesses byte 0 from plane 0 from offset address 0.  Byte 1 from
plane 0 from offset address 4.  Byte 2 from plane 0 from offset address 8,
and so on..

From the point of view of how many bit planes the CPU affects at once during
writes, when CHAIN 4 (C4 bit=1, that is 03C4 #04 bt3=1), CPU affects one
plane at a time ONLY [MAP MASK REGISTER seems to be of no use].

When NOT CHAIN 4, instead, if MAP MASK REGISTER has bts3-0=1111 (meaning
"all planes active for CPU access"), then all planes are affected by the
CPU during writes.

]
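
A Python sketch of the MAP MASK idea for NOT CHAIN 4 writes, as described in
the analysis above: one bit per plane, and a CPU write reaches only the
planes whose bit is 1.  This is a model with made-up names, not register
level code.

```python
# Four tiny planes are enough to show the effect.
planes = [bytearray(4) for _ in range(4)]

def unchained_write(offset, value, map_mask):
    # map_mask bit 0 = plane 0, bit 1 = plane 1, and so on.
    for plane_no in range(4):
        if map_mask & (1 << plane_no):
            planes[plane_no][offset] = value

unchained_write(0, 0x11, 0b1111)   # all planes active: 4 bytes written
unchained_write(1, 0x22, 0b0101)   # only planes 0 and 2 active

row0 = [plane[0] for plane in planes]
row1 = [plane[1] for plane in planes]
print(row0, row1)
```

With bts3-0=1111 every plane gets the byte, which is the case the fast clear
relies on.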

/////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////////////





WORKING EXAMPLES:
----------------


Ok enough talk.  Let's rock a bit:

[*** COPY THE EXAMPLES AND SAVE THEM TO AN ASCII FILE.
     THE BASIC EXAMPLE SAVE AS FASTCLR.BAS AND THE
     ASSEMBLY EXAMPLE AS FASTCLR.ASM.

     >>>  FOR THE BASIC EXAMPLE, USE GWBASIC LIKE THIS:
          FROM THE DOS PROMPT RUN GWBASIC:  GWBASIC.EXE
          FROM INSIDE GWBASIC DO THIS:
               LOAD "FASTCLR"
               GOSUB 700
               RUN


     >>>  FOR THE ASSEMBLY LANGUAGE EXAMPLE, USE MASM.EXE AND LINK.EXE
          (DIFFERENT VERSIONS OF MASM.EXE AND LINK.EXE SHOULD WORK).
          EXE2BIN.EXE MAY NEED SETVER.EXE TO WORK (IT MAY BE NAMED EXE2BIN.COM).
          SO, USE MASM.EXE, LINK.EXE, EXE2BIN.EXE LIKE THIS:
               FROM DOS PROMPT:
                                 MASM FASTCLR;
                                 LINK FASTCLR;
                                 EXE2BIN FASTCLR.EXE FASTCLR.COM
                                 DEL FASTCLR.EXE
                                 FASTCLR
                                            ]

Here's the actual working BASIC program:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
<<<<<FILE BEGINS RIGHT BELOW THIS LINE:  FILE:  FASTCLR.BAS>>>>>>>

1 'PROG: FASTCLR.BA?
4 KEY OFF:'ANNOYING KEYS OFF
10 SHELL "VGA13":'VGA13.COM = 7 BYTE ROUTINE TO SET MODE 13:
12              :'ENTER THESE INTO 0100:  B8 13 00 CD 10 CD 20
13              :'OR USE 'GOSUB 700' RIGHT HERE TO MAKE VGA13.COM, THEN RUN
14 OUT &H3C4,2:A=INP(&H3C5):OUT &H3C4,4:B=INP(&H3C5):'A=MAP MASKS  B=CHAIN4 (BT3)
16 DEF SEG=&HA000
20 FOR T=0 TO 63999!:POKE T,C:C=C+1:IF C>255 THEN C=0
24 NEXT
30 A$=INKEY$:IF A$="" THEN 30
40 IF A$=CHR$(27) THEN 96
44 OUT &H3C4,4:OUT &H3C5,B AND NOT 8:'BT3=0 -> NO CHAIN (NO CHAIN 4 SO
45                                  :'         1 CPU WRITE YIELDS 4 PLANE WRITES)
50 FOR T=0 TO 15999
60 POKE T,0:NEXT
64 OUT &H3C4,4:OUT &H3C5,B:' BACK TO NORM
70 A$=INKEY$:IF A$="" THEN 70
80 IF A$=CHR$(27) THEN 96
90 GOTO 14
96 SCREEN 9:SCREEN 0:END
697 REM
698 REM MAKE VGA13.COM
699 REM
700 OPEN "VGA13.COM" AS #1 LEN=1
704 FIELD #1,1 AS FA$
710 A$="B81300CD10CD20"
730 FOR T=0 TO 6:B$="&H"+MID$(A$,1,2):A$=MID$(A$,3):B=VAL(B$):B$=CHR$(B)
740 LSET FA$=B$:PUT #1,1+T:NEXT T:CLOSE #1:RETURN

<<<<<<<<<FILE ENDS RIGHT ABOVE THIS LINE.  FILE:  FASTCLR.BAS>>>>>>>
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

(This program needs and uses a short 7 byte machine language routine called
VGA13.COM to set mode 13, since GWBASIC doesn't set mode 13 on its own.
Before running this program, do GOSUB 700, which creates VGA13.COM.
Afterwards, RUN the program.
If this program is to be compiled before running, then do this:
add line 5:
                 5 GOSUB 700:END
                 compile and execute it, then REM out that line,
                 or delete it, and then compile and execute it again
                 to see it working.)
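
What lines 700-740 of the BASIC listing do with the hex string can be
sketched in Python like this.  The sketch only decodes the string and does
not write VGA13.COM to disk; the variable names are made up for the example.

```python
hex_text = "B81300CD10CD20"        # same string as line 710 of the listing

# Short way: let the library decode the hex digits.
code = bytes.fromhex(hex_text)

# Long way, like the BASIC loop: take two digits at a time.
pairs = [hex_text[i:i + 2] for i in range(0, len(hex_text), 2)]
by_hand = bytes(int(pair, 16) for pair in pairs)

# B8 13 00 = MOV AX,0013   CD 10 = INT 10   CD 20 = INT 20
print(len(code), code == by_hand, pairs)
```

Seven bytes come out, matching the 7 byte routine the BASIC program
shells out to.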








And here's an assembly language demonstration of it.  It does the same thing
as the BASIC version but at machine speed.


<<<<<FILE BEGINS RIGHT BELOW THIS LINE:  FILE: FASTCLR.ASM>>>>

;YSS89, cont, ycont,Yss 1989-, sdD+J.
;------------------------------------
PAGE ,132
TITLE FASTCLR

;NO DS, NO SS, JUST CS
;DSEG SEGMENT PARA PUBLIC 'DATA'
;DSEG ENDS
;
;TO ASSEMBLE:  NAME THIS SOURCE FASTCLR.ASM
;        MASM FASTCLR;
;        LINK FASTCLR;
;        EXE2BIN FASTCLR.EXE FASTCLR.COM
;        DEL FASTCLR.EXE
;
;        ** AT THIS POINT YOU SHOULD HAVE FASTCLR.COM READY TO USE
;   FROM PROMPT.
;

CSEG SEGMENT PARA PUBLIC 'CODE'
ORG 100H
ASSUME CS:CSEG,DS:CSEG

START PROC FAR

MOV AX,13H  ;
INT 10H     ;MODE 13

;------------------------
;------------------------ PROGRAM STARTS HERE:
;------------------------
;
; WE'LL DO THESE STEPS IN
; SAME ORDER:
;
; 1- SET SCREEN TO MODE13
; 2- FILL SCREEN WITH SOMETHING
; 3- WAIT FOR A KEY.  ESC -> END
; 4- NO CHAIN 4: 03C4 #04 BT3=0
; 5- CLEAR 16000 BYTES (FROM CPU'S POINT OF VIEW, ACTUALLY 64000 BYTES CLEARED)
; 6- WAIT FOR A KEY. IF KEY = ESC -> END
; 7- IF NOT, RESTORE CHAIN4 AND GO BACK TO STEP 2
;
;
P02:
;------------------------
;------------------------ FILL THE VGA MODE 13 SCREEN WITH SOMETHING
;------------------------
MOV AX,0A000H
MOV ES,AX
MOV AX,0000
MOV DI,AX
ADD AX,0100H
MOV CX,32768
CLD
P00:
STOSW
ADD AX,071FH
LOOP P00


;------------------------
;------------------------ WAIT FOR A KEY.  ESC-> END
;------------------------
MOV AH,8
INT 21H
CMP AL,27
JE XIT


;------------------------
;------------------------ NO CHAIN 4:  03C4 #04 BT3=0
;------------------------
MOV DX,03C4H
MOV AL,4
OUT DX,AL
INC DX
IN AL,DX     ;AL=[#04]
PUSH AX      ;USE LATER
AND AL,0F7H
OUT DX,AL    ;#04=[#04] AND F7


;------------------------
;------------------------ CLR VGA SCREEN
;------------------------
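MOV AX,0000  ;VALUE TO WRITE: 0
MOV DI,AX    ;START AT A000:0000 (ES STILL = A000)
MOV CX,8000  ;8000 WORDS = 16000 CPU BYTES
CLD
P01:
STOSW        ;EACH BYTE WRITTEN LANDS IN ALL 4 PLANES
LOOP P01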


;------------------------
;------------------------ WAIT FOR A KEY.  ESC -> END, OTHERWISE,
;------------------------
MOV AH,8
INT 21H
CMP AL,27
POP AX
JE XIT


;------------------------
;------------------------ RESTORE CHAIN 4 (03C4 #04 BT3=1) AND BACK AGAIN
;------------------------
MOV AH,AL
MOV DX,03C4H
MOV AL,4
OUT DX,AX
JMP P02


XIT:
MOV AX,3    ;BACK TO 'NORMAL'
INT 10H
INT 20H

START ENDP
CSEG ENDS


     END START


<<<<<FILE ENDS RIGHT ABOVE THIS LINE.  FILE:  FASTCLR.ASM>>>>>