LZSSFAQ

FAQ for the LZSS Compression

Nemesis - Terminus Traduction - 2003

Translated to English by DaMarsMan (Kojiro Translations 2005)

�Table of Contents :

�What is it used with?� (why you are reading this FAQ?) ��

�Page 2

�How does this compression work? (in general)��

Page 2

�Details on it's coding

1.�� Binary coding��

�� Page 4

2.�� Coding of the length��

�� Page 4

3.�� Coding fo the position��

Page 4Coding of this type (easy to understand)� ��

Page 4

�Practice methods (because practice is good)

1.�� Finding an LZSS scheme��

�� Page 5

2.�� Understanding the operations (where?s the ?decode? button ?)��

Page 5

3.�� How about compressing ?��

Page 5Finsishing the program.��

Page 5

�The Program ! Give us the program!

1.�� Decompression��

�� Page 6Compression��

�� Page 6

�Final Thanks (they deserve it)��

Page 7

�What is it used with ? (why are you reading this FAQ?)

�First of all, this is the type of compression used for text. It can be used in other cases but is more practical for ��

�compressing text. Of course you are going to learn why.

�Let?s use an example of a sentence.�

-�� The dog sees the cat across the street.

-�� The dog crosses the street.

-�� The dog gets flattened.

-�� The cat benefits.

�The isn?t very in depth example but it is, however, useful? One might observe that the "cat", "dog" and "street" are �

�repeated several times. This is very repetitious. LZSS works well if you want to replace them with a word like "it".

�This is how it is when it comes to data processing. You can compress everything or individual parts. Single �

�characters can be compressed or even entire sentences.

�How does this compression work ? (in general)

�We will start with looking at decompression in its simplest form and that will help us to get an idea of� of why the �

�final texts looks the way it does.

��

�Look at the text above. It?s far from readable. Actually its the credits from Final Fantasy V for the SNES. This is �

�before using the table file.� You do know what tables are, right ? I mean, you are looking at a document about LZSS, �

�I would think you have worked with at least one table before. After the table is applied the result looks like below.

��

�It looks better but it is still not looking its best ; ). Look at the first character (It's made of 8 bits....remember this, it's

�important).� Look at the pattern. You will see 8 readable characters and then an odd character. Then another 8

�readable and so on....

�Take a look at the second line where it says YOSHIT?? AMANO . If you know your Squaresoft then you know that �

�this is talking about� Yoshitaka Amano. Now you may notice that it seems to have actually lost compression here. �

�The compression added an extra character which seems useless.

�Let's find an explanation for the extra character. Look at the characters. There is FF, FF and then FD.

�In binary this is : �� 11111111 then � 11111101 (8 bits for 8 characters or character �

�strings).

�When the byte is flipped you get this :�� 11111111 and�� 10111111

�So, as you can see, a 1 corresponds to a character which is alone, which has been the case thus far. A 0 �

�corresponds to a chain already met. Do you remember the sentences from the beginning. This is the missing AKA

�from SAKAGUCHI.

�The binary codes, therefore, determine which bytes correspond to chains and which bytes correspond to single �

�characters.

�With the characters, you also find "beacons" which won allow data fields of variable size. In general, there is one

�byte per character and two per coded chain.

�How are the famous chains we say coded? There could be a number of ways but it usually is coded in 2 bytes.

�Those would be the posistion of the chain and its length. It's good to have a diagram to look at for this.� Do you

�understand this stuff yet?

�Let's suppose we have compressed the beginning of the text already up until "YOSHI". Let's start at "TAKA �

�AMANO".

1)       We would probably look first at the TAK that we have seen before. It would be pointless to look for less the 3

�characters considering it takes 2 bytes to code the chain in return.� Also, you may look at previous data now, but at �

�the time of decompression you won't even know what it is.

�<DIRECTOR<HIRONOBU SAKAGUCHI><IMAGE DESIGN<YOSHITAKA AMANO>

�2) Since we had no luck with TAK, that means we consider T as a normal character and move on.

�<DIRECTOR<HIRONOBU SAKAGUCHI><IMAGE DESIGN<YOSHITAKA AMANO>

2)       As you see AKA will work. We must move on to test the forth character. Searching for AKAG bring no luck. Three �

�characters will have to do.

�<DIRECTOR<HIRONOBU SAKAGUCHI><IMAGE DESIGN<YOSHITAKA AMANO>

3)       If you had found a continuation, it could go up to 5,6, or even more characters. Then you could even have Mr �

�ORIAKA AMANA� which could be inserted HIRONUBU and YOSHITAKA. (If there is already a clone earlier then

�why not?) Then, if you stopped on SAKAGUCHI you wouldn't lose characters badly.

�<DIRECTOR<HIRONOBU SAKAGUCHI><ORIAKA AMANA<YOSHITAKA AMANO>

�5) This isn't the case here, but the only way of testing is to go through all the text up to that position and keep in �

�memory the best found matches.

�6) Finally, if you took a close look� you should have found in position 21, 3 identical characters. You then replace the

�new AKA found by the couple (21,3) coded out of 2 bytes. After that, you move on to "AMANO..."

�In a less complex way, you break up and check in the following manner:

To be compressed

Compressed text

ABBABBBABABBB

BBABBBABABBB

BABBBABABBB

ABBBABABBB

ABB

BABABBB

ABB(1,3)

ABBB

ABB(1,3)(3,3)

ABB(1,3)(3,3)(4,4)

13 octets

9 octets

�Beware! Text like the chain BBBB can be coded efficiently by B (1,3). It is risky for the program like this but it �

�actually works well when used. In that case you have to retrieve the characters one by one. That case is seldom �

�taken advantage of.

�That's it for the general information. That concludes the FAQ ; ). Now we will look at how the texts are coded with �

�computers and some alternative methods and characteristics.

�Details on it's coding :

1.�� Binary Coding :

�First, let's look at where the bits are in the coding of the couple (posistion, length). As you may suspect, you are far

�more likely to find short similar chains everywhere in the text instead of huge blocks of text all in the same place.

�This is why you need to look at the 16 bits available (2 bytes).� Common LZSS uses 4 bits for length of chain and 12

�bits for its posistion. 4 bits for length + 12 bits position = 2 bytes. Some other common methods include 5/11 or

�3/13.

�Pay attention to the way to bits are arranged. Some times they can be pretty weird and variable as you see in the

�examples. In the example, blue is the bits for length and red is the bits for posistion.

�Wing Commander (SNES) :

�Final Fantasy V (SNES) :

�In these cases, its best to have a look at the assembly while the game is running. A little common sense and some

�testing should help too. Don't continue working until you get it right. ; ). Let's go on about the coding of the position

�and length.

2.�� Coding of the length :

�We should start with the easiest, the length of the coded chains. This is usually limited. You can tell with what we �

�looked at above that its best to have a length greater then 3 to achieve maximum compression. However, some

�LZSS compression compress 2 bytes. I can't understand why and I think this would cause some slow downs.

�Taking into account that we are limiting are chain length to at least 2. You then can remove that minimum from the

�coded value saving some bits. I pointed out before that the number of bits is the limiting factor here.

�Another reason for limiting the length is for cost of research and inevibility of the number of allocated bits. LZSS can

�be just as slow compressing as it is fast while decompressing. This is why the less complex the chains are that you

�are looking for the quicker the compression is. Although, if you are going to encode something permanently (video

�game) then the slow compression time is a good trade off for a better compression ratio. There is no limit to the

�length which can be used but no one usually rewrites the same exact sentence twice in text.

�Thus for FF5, you go from 3 to 35 characters (2^5+3) coded between 00000 and 11111. And for Wing, it's from 2 to

�17 characters (2^4+2).

3.�� Coding of the position:

�Here is where it starts getting complicated. ; (� Prepare to be a little lost.� : )� Once you find out how the bits are

�arranged for the posistion, it's already not that bad. Once the arrangement is found it should be checked by finding

�the actual position in the file.

�It's neccessary to make sure you don't search too far. Here is where you may run in to some confusing findings.

�One small addition may throw it off. In most cases, there is a more or less logical offset. (There is always a function.

�For example, 221+35%256=1 from what I remember) Even when you have the decompressed text to look at, the

�harder part is the amount of time finding the chains and where to find the peices when decompressing. There is

�usually a spot where the characters are in a block of 4096 (2^12) if the compression was coded with 12 bits for the

�posistion. This means you will only have to look for the chains in this window. So if you are starting at 0, only search �

�for the chains up to 4096. This position is relative to the starting location. There are other options but I don't think

�they exceed this stage.

4096

9192

Pseudo absolute

4096

9192

Relative

4.�� Coding of this type (easy to understand):

�Let me take a paragraph to say that the bits of data are sometimes reversed compared to the order we have seen

�them. In this same manner, a

�0 could mean a single character and a 1 could mean chain. But that's easy to find if you using the practice methods.
�Practice methods (because practice is good) :

1.�� To find a LZSS scheme:

�Theres almost nothing easier or more obvious than this� : )� . Like you have seen above, the LZSS is a byte of

�information that follows data. However, there are 8 bytes of data if nothing can be coded. It doesn't have to have it in

�front. Thus, if you know the characters of the text, you can find the chain (you can even use a relative search). If you

�come upon a byte equal to 00 or FF you know that the next byte has our information.

�If you choose another method, like looking through the file until you find the text, it still should be recognized

�quickly. Either way you will see the FF or a different byte and you should be able to read at least some of it.

�If you got this far you have past the first step, finding the text. Theres still more to do though ; ).

2.�� Understanding the operations (Where is the "decode" button?) :

�First off you got to look at the structure of the byte heading. First determine if it's the 0 or the 1 that encodes the

�characters. Then you got to look at the 8 readable characters (9 bytes) until finding the first strange character. In

�theory, you should quickly determine the order of the bits (reversed or not). Next you have to look for the number that

�corresponds to the chain. You have to find the order of the bits in the byte.

�Thus, when you are writing code that will return the chain you have to know which bits are doing what.� To practice,

�take several and compare them. This is tedious but the only way to find how the bits are arranged. (this is for when

�you are not disassembling the code while running and viewing the code. It is easier to get the arrangement then.)

�And, at this stage, in my humble opinion, it is better to start with the programming. Trying to decompress an LZSS

�without writing a program is insane! Constant testing is the best way to find the solution. The length is represented

�by the last bits. For the position, don't get to proud and think that 0 in the beginning� points to the position (you can

�have very well 111111011111 with an offset of 33 and the point on the position 0 but when you add one you get�

�(1)000000000000).

�Once you have figured out the bits, all that is left is finding the offset. (if its coded that simple). To find the offset you

�have to do an upward search to find the chain. Pay attention because sometimes you will find it in the first search.

�Sometimes it will be the last one you find. You may not find it at all. In any case, its best to have the location of the

�text. Write down where its at.

�Theres not really any more to finding the coding besides that. You can't continue on really if you can't find it in the 2^

�(number of bits of the position) window you are searching in. Make sure you search absolute and relative when you

�are looking.

3.�� How about compressing ?

�Ah, well it's same thing but in reverse. It's pointless to get in to details cause there is only one way to do it. It's not

�very complicated once you know your limits in the preceding paragraph. Of course if you want to compress text

�directly without following the rules of an existing decompression than the only limits are that of the computer you are

�using. On a side note, the more complex your program is the quicker you will lose memory and speed. The

�differences become very minimal after a certain point.

4.�� Finishing the program?

�Now that we have everything figured out we should double check to make sure our offsets and everything is right. At

�this point its neccessary to right a compression/decompression routine. I don't suggest a program that does

�everything. Firstly, I write these types of programs exclusively in C. I'm afriad if I do that it will be bad for certain

�purist because my code can be confusing or I forget to write some comments. Secondly, everyone does not use C

�and if this FAQ survives at least a few years and is even lost of people's hard drives (including mine) other languages

�will replace this one when writing these types of programs (which is already happening).

�This is why on the other hand, I'm gonna go ahead and write this code in abstracted algorithmic language (thats my

�primitive way of saying English ; ) Nothing I wrote is unchangeable. I complicated some things and didn't optimize

�others. My purpose was to get the point across in the least confusing way. So optimizations that hide certain details

�will be avoided.

�The Program! Give use the Program!

1.�� Decompression :

�To find out certain constants of the routine:

-�� Number of bits of the position (NBPOS)

-�� Which bits do what when linking to the chain (with the limit in a table)

-�� the offset for the position (OFFP) and the cut (OFFC)

�To prepare the matching variables:

-�� a zone of memory to store the text in, at least 2^NBPOS (BUFFER)

-�� a table or a byte for the bits indicating the type of coding (TYPES)

-�� a position of reading and writing to be initialized (POSR, POSW)

-�� size and position of the chain that will be returned (POS,SIZE)

�Make a function that will convert a byte to bits in your desired way (CONVERT)

�(I wont deal with that as you should be able to do that yourself)

�DO WHILE (As long as there is compressed text. That is for you to determine.) MAKE

�� Look at the coding of the byte in POSL and apply CONVERT to get it to the table in TYPES

�� FOR I from 1 to 8� (for the eight elements of data) MAKE

�� IF (the type of coding is that of a character) THEN

�� Copy the character (byte)

�� Increment POSR and POSW

�� IF NOT

�� Recover the size and position of the 2 bytes in POSW

�� Add OFFP to POS and make the window 2^NBPOS (the number of bits must agree)��

�� Add OFFC to SIZE

�� Copy SIZE characters starting from POS in POSR (and in BUFFER)

�� Add 2 to POSW and SIZE with INSTALLATION

�� END IF

�� IF (If it comes to end of BUFFER, POSL%(2^NBPOS)=0) THEN

�� Then go back to the beginning of BUFFER

�� END IF

�� END FOR

�END DO WHILE

2.�� Compression :

�(Very similar at the beginning except for slight differences)

�Constants :

-�� number of bits length (NBLONG)

�New variables :

-�� Area of memory to store compressed data (you have to wait for all 8 bits to write anything because the byte heading is continually obtained ) (COMP)

-�� Variables for research: long length gotten, position of this one, position of the current research, ? (MLONG, MPOS, LONG, POS)

�This is a function that will convert 8 bits to a byte in the way you specify (RECONVERT)

�DO WHILE (There is text to be compressed) MAKE

�� FOR I FROM 1 to 8 MAKE

�� IF (End of text found)

�� You fill TYPES with the code for the characters and COMP with a null character

�� IF NOT (POSL exceeded OFFT) THEN

�� You make research by initializing POS at a good spot in the 2^NBPOS

�� Create MLONG and set to 0

�� DO WHILE (POS<POSE, Like I stated at the beginning of the FAQ, at the end of the first ��

�part

ET and MLONG should be less than NBLONG ET MLONG<(2^NBLONG+OFFT)) MAKE

�� Look for the longest chain possible that matches the current position

�� In order to do that, increment LONG and POs where neccessary

�� IF (LONG>MLONG) THEN

�� Store the information of POS and LONG in MPOS and MLONG

�� END IF

�� END DO WHILE

�� IF (MLONG>2, you found a useful chain)

�� Remove OFFP with MPOS and OFFT with MLONG

�� Copy data from MLONG and MPOS into COMP (one byte)

�� Put TYPES[i] for a chain

�� IF NOT

�� Put in the corresponding info for the characters

�� Put MLONG at 1 and all the cases

�� END IF

�� Increment POSL of MLONG

�� IF NOT

�� Put in TYPES[i] the code for a character

Recopy the character into COMP

Increment POSL

�� END IF

�� END FOR

�� Recopy TYPES after converting it into a byte

�� Recopy the bytes of COMP at the end

�� Increment POSE continually : 1+2*(Number of coded chains)+(Number of coded characters)

�END DO WHILE

Final Thanks (they deserve it) :

Thanks to Copernic who has accepted me at Terminus and pushed me a little with hacking (finally at a decent level), NeoMithrandil and Manz who often gave me information, and i can't forget Ryusan who straight up demanded that I make this FAQ ^^ (who is not even considering the questions I put aside)

I also thank many authors who have written documents on various compressions, lzss especially, and finally the people to spend their lives playing video games because without them I would never have wanted to do this.

Considering that this is the first FAQ that I have finished, I would like to greet everyone who wrote the hacking documents that I read before writing this. These people include Jay, Copernic, Skeud, Rysley, Moogle and many others.

And of course I have to thank you since you have read up to this far ; ).

If you are looking for other information, then I can't really help you with that. Documents dealing with LZSS and compression are very abundant but those that are speciafically for the rom hacking aspect of that are less common. Nevertheless, if you want information on comrpessions, then Jay's "Compression Tutorial" may be helpful for you.

You can always reach me at tt_nemesis@yahoo.fr <mailto:tt_nemesis@yahoo.fr> (send before the address is useless considering what i've just done. Perhaps there will be others of them too :)

Nemesis 26/03/03 (after about four hours of drafting and work ;)