  { Fib(1) = 1
  { Fib(n) = Fib(n-2) + Fib(n-1)

Fib(0)=0, Fib(1)=1, Fib(2)=1, Fib(3)=2, Fib(4)=3, Fib(5)=5, Fib(6)=8,
Fib(7)=13, etc. But 1, 1, 2, 3, 5, 8 are exactly the occurrence counts of
our list! We can actually demonstrate that to obtain a fully unbalanced
tree, we have to take a list whose occurrence counts follow the Fibonacci
series (these values are the minimal ones). If the data to compress contain
m different bytes, then when the tree is unbalanced the longest code needs
m-1 bits. In our little previous example where m=5, the longest codes are
associated with the bytes 100 and 115, coded 0001b and 0000b respectively.
We can also say that to obtain an unbalanced tree we must have at least
5+3+2+1+1 = 12 = Fib(7)-1 input bytes. To conclude: with a coder that uses
at most m-1 bits per code, you must never give it an input stream larger
than Fib(m+2)-1 bytes, otherwise the output stream could be corrupted. Of
course, with my codecs there will never be such a bug because I can deal
with binary code sizes of 1 to 256 bits. An encoder could use this bound
with m=31, Fib(31+2)-1 = 3,524,577, and with m=32, Fib(32+2)-1 = 5,702,886.
And an encoder that uses unsigned 32-bit integers shouldn't have any
problem until about 4 Gb...

+==========================================================+
|                     The LZW encoding                     |
+==========================================================+

The LZW scheme is due to three researchers: Abraham Lempel and Jacob Ziv
worked on it in 1977, and Terry Welch completed the scheme in 1984.
LZW is patented in the USA. This patent, number 4,558,302, is held by
Unisys Corporation and CompuServe. IBM seems to have discovered the same
scheme and patented it too. (Who's right???)

You may get a limited licence by writing to:

        Welch Licensing Department
        Office of the General Counsel
        M/S C1SW19
        Unisys Corporation
        Blue Bell
        Pennsylvania, 19424 (USA)

If you come from a Western culture, you are surely using something like an
LZW encoding every time you speak, and especially every time you use a
dictionary. Let's consider, for example, the word "Cirrus". As you read a
dictionary, you begin with "A", "Aa", and so on. But a computer has no such
experience, so it must build up the words it knows as it goes: with
"Cirrus", it assumes that "C", "Ci", "Cir", "Cirr", "Cirru", and "Cirrus"
all exist. Of course, since this is a computer, all these words are encoded
as index numbers. Every time you move forward, you add a new number
associated with the new word. And since a computer is byte-based and not
alphabet-based, the initial dictionary contains 256 letters instead of our
26 letters ('A' to 'Z').

Example: let's code "XYXYZ". First step, "X" is found in the initial
dictionary of 256 letters (its ASCII code is 88). Second step, "Y" is read.
Does "XY" exist? No, so "XY" is stored as word 256 and you write the ASCII
code of "X", i.e. 88, to the output stream. Now "YX" is tested and is not
referenced in the current dictionary: it is stored as word 257 and you
write 89 (ASCII of "Y") to the output stream. "XY" is then met, and this
time "XY" is known as reference 256. Since "XY" exists, you test the
sequence with one more letter, i.e. "XYZ". This last word is not referenced
in the current dictionary, so you write the value 256 and you add "XYZ" as
reference 258. Finally, you reach the last letter ("Z"); there is nothing
left to extend it with, so you just write the value 90 (ASCII of "Z").
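To make the walkthrough above concrete, here is a minimal LZW encoder
sketch in C. It is only an illustration, not the codec described in this
document: the names (lzw_encode, find, struct entry, MAX_DICT) are invented
for the example, the dictionary is stored as (prefix code, appended byte)
pairs and searched linearly for clarity rather than speed, the codes are
written as plain integers instead of being packed into variable-width bit
codes (more on that below), and new words are simply numbered from 256,
without the reinitialisation and end-of-information codes discussed later.

#include <stdio.h>
#include <string.h>

#define MAX_DICT 4096           /* same limit as the GIF and TIFF formats */

/* One dictionary word: a previously known word ("prefix") plus one byte. */
struct entry { int prefix; unsigned char c; };

/* Returns the code of the word (prefix + c) if it is already known,      */
/* or -1 if it is not.  Linear search: simple, but slow.                  */
static int find(const struct entry *dict, int size, int prefix, unsigned char c)
{
    int i;
    for (i = 256; i < size; i++)
        if (dict[i].prefix == prefix && dict[i].c == c)
            return i;
    return -1;
}

/* Encodes 'len' bytes of 'in', stores the codes in 'out' and returns     */
/* how many codes were produced.                                          */
static int lzw_encode(const unsigned char *in, int len, int *out)
{
    struct entry dict[MAX_DICT];
    int size = 256;             /* codes 0..255 are the 256 initial letters */
    int ncodes = 0, prefix, code, i;

    if (len == 0) return 0;
    prefix = in[0];             /* the current word = the first letter      */
    for (i = 1; i < len; i++) {
        code = find(dict, size, prefix, in[i]);
        if (code >= 0) {
            prefix = code;      /* word + letter already exists: extend it  */
        } else {
            out[ncodes++] = prefix;          /* write the known word        */
            if (size < MAX_DICT) {           /* and learn word + letter     */
                dict[size].prefix = prefix;
                dict[size].c = in[i];
                size++;
            }
            prefix = in[i];     /* start again from the letter just read    */
        }
    }
    out[ncodes++] = prefix;     /* flush the final word                     */
    return ncodes;
}

int main(void)
{
    const char *s = "XYXYZ";
    int codes[16], n, i;

    n = lzw_encode((const unsigned char *)s, (int)strlen(s), codes);
    for (i = 0; i < n; i++)
        printf("%d ", codes[i]);             /* prints: 88 89 256 90        */
    printf("\n");
    return 0;
}

Fed with "ABADABCCCABCEABCECCA", the same sketch prints 65 66 65 68 256 67
261 260 69 263 262, which is exactly the output column of the table below.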
Another encoding sample with the string "ABADABCCCABCEABCECCA":

+----+-----+-----------------+------+----------+-------------------------+------+
|Step|Input|Dictionary test  |Prefix|New symbol|Dictionary               |Output|
|    |     |                 |      |          |D0=ASCII with 256 letters|      |
+----+-----+-----------------+------+----------+-------------------------+------+
| 1  | "A" |"A" in D0        | "A"  | "B"      | D1=D0                   | 65   |
|    | "B" |"AB" not in D0   |      |          | and "AB"=256            |      |
+----+-----+-----------------+------+----------+-------------------------+------+
| 2  | "A" |"B" in D1        | "B"  | "A"      | D2=D1                   | 66   |
|    |     |"BA" not in D1   |      |          | and "BA"=257            |      |
+----+-----+-----------------+------+----------+-------------------------+------+
| 3  | "D" |"A" in D2        | "A"  | "D"      | D3=D2                   | 65   |
|    |     |"AD" not in D2   |      |          | and "AD"=258            |      |
+----+-----+-----------------+------+----------+-------------------------+------+
| 4  | "A" |"D" in D3        | "D"  | "A"      | D4=D3                   | 68   |
|    |     |"DA" not in D3   |      |          | and "DA"=259            |      |
+----+-----+-----------------+------+----------+-------------------------+------+
| 5  | "B" |"A" in D4        | "AB" | "C"      | D5=D4                   | 256  |
|    | "C" |"AB" in D4       |      |          | and "ABC"=260           |      |
|    |     |"ABC" not in D4  |      |          |                         |      |
+----+-----+-----------------+------+----------+-------------------------+------+
| 6  | "C" |"C" in D5        | "C"  | "C"      | D6=D5                   | 67   |
|    |     |"CC" not in D5   |      |          | and "CC"=261            |      |
+----+-----+-----------------+------+----------+-------------------------+------+
| 7  | "C" |"C" in D6        | "CC" | "A"      | D7=D6                   | 261  |
|    | "A" |"CC" in D6       |      |          | and "CCA"=262           |      |
|    |     |"CCA" not in D6  |      |          |                         |      |
+----+-----+-----------------+------+----------+-------------------------+------+
| 8  | "B" |"A" in D7        | "ABC"| "E"      | D8=D7                   | 260  |
|    | "C" |"AB" in D7       |      |          | and "ABCE"=263          |      |
|    | "E" |"ABC" in D7      |      |          |                         |      |
|    |     |"ABCE" not in D7 |      |          |                         |      |
+----+-----+-----------------+------+----------+-------------------------+------+
| 9  | "A" |"E" in D8        | "E"  | "A"      | D9=D8                   | 69   |
|    |     |"EA" not in D8   |      |          | and "EA"=264            |      |
+----+-----+-----------------+------+----------+-------------------------+------+
| 10 | "B" |"A" in D9        |"ABCE"| "C"      | D10=D9                  | 263  |
|    | "C" |"AB" in D9       |      |          | and "ABCEC"=265         |      |
|    | "E" |"ABC" in D9      |      |          |                         |      |
|    | "C" |"ABCE" in D9     |      |          |                         |      |
|    |     |"ABCEC" not in D9|      |          |                         |      |
+----+-----+-----------------+------+----------+-------------------------+------+
| 11 | "C" |"C" in D10       | "CCA"|          |                         | 262  |
|    | "A" |"CC" in D10      |      |          |                         |      |
|    |     |"CCA" in D10     |      |          |                         |      |
|    |     |(end of data)    |      |          |                         |      |
+----+-----+-----------------+------+----------+-------------------------+------+

You will notice a problem with the above output: how do you write a code of
256 (for example) on 8 bits? The problem is simple to solve. You just say
that the encoding starts with 9 bits and, as you reach the 512th word, you
switch to a 10-bit encoding. With 1024 words you use 11 bits, with 2048
words 12 bits, and so on at every power of two. To keep the coder and the
decoder synchronized, most implementations also use two additional
references. The word 256 is a reinitialisation code (the codec must
completely reset the current dictionary to its 256 initial letters) and the
word 257 is an end-of-information code (no more data to read). Of course,
the first new word then gets code number 258.
You can also do as in the GIF file format and start with an initial
dictionary of 18 words to code an input stream whose letters fit on 4 bits
(you then start with codes of 5 bits in the output stream!). The 18 initial
words are: 0 to 15 (the initial letters), 16 (reinitialize the dictionary),
and 17 (end of information). The first new word has code 18, the second
code 19, and so on.
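To illustrate the growing code width, here is a toy bit packer in C. It
only illustrates the idea and is not the exact layout of GIF, TIFF or of
the codecs described here: the names (struct packer, put_code) are invented
for the example, codes are written least-significant-bit first (as GIF
does; TIFF packs them the other way round), and the width is capped at 12
bits. Real formats have precise rules about exactly when the width changes
and how the reinitialisation code interacts with it.

#include <stdio.h>
#include <string.h>

/* A toy variable-width code packer.  Codes are packed LSB-first.       */
struct packer {
    unsigned char buf[4096];    /* packed output                        */
    int bytepos, bitpos;        /* write position: byte and bit (0..7)  */
    int nbits;                  /* current code width                   */
};

static void packer_init(struct packer *p, int start_bits)
{
    memset(p, 0, sizeof *p);
    p->nbits = start_bits;      /* 9 bits for a 256-letter dictionary   */
}

/* Writes 'code' with the current width.  'dict_size' is the number of  */
/* words currently in the dictionary, so that the width can grow as the */
/* dictionary reaches 512, 1024, 2048... words, as described above.     */
static void put_code(struct packer *p, int code, int dict_size)
{
    int i;
    while (dict_size >= (1 << p->nbits) && p->nbits < 12)
        p->nbits++;
    for (i = 0; i < p->nbits; i++) {            /* one bit at a time    */
        if ((code >> i) & 1)
            p->buf[p->bytepos] |= (unsigned char)(1 << p->bitpos);
        if (++p->bitpos == 8) { p->bitpos = 0; p->bytepos++; }
    }
}

int main(void)
{
    struct packer p;
    packer_init(&p, 9);
    put_code(&p, 65, 259);      /* small dictionary: written on 9 bits  */
    put_code(&p, 256, 512);     /* 512 words reached: 10-bit codes now  */
    printf("next free position: byte %d, bit %d; width: %d bits\n",
           p.bytepos, p.bitpos, p.nbits);       /* byte 2, bit 3; 10    */
    return 0;
}

The matching unpacker must of course apply exactly the same width-change
rule while it rebuilds its own dictionary, otherwise the coder and the
decoder lose synchronization.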
Important: you can consider that your dictionary is limited to 4096
different words (as in the GIF and TIFF file formats). But when your
dictionary is full, you may decide to keep sending old codes *without*
reinitializing the dictionary, and all decoders must be able to cope with
this. It lets you decide that reinitializing the full dictionary is not
worth it: instead of rebuilding it, you leave the dictionary unchanged and
you send (or receive, depending on whether you are the coder or the
decoder) existing codes from the full dictionary.
My codecs can deal both with various initial dictionary sizes and with a
full dictionary.

Let's see how to decode an LZW encoding. We saw that the true dynamic
Huffman scheme needed a header in front of the encoded data; no header at
all is needed with the LZW scheme. When two successive codes are read, the
first one must already exist in the dictionary, so it can be decoded
immediately and written to the output stream. If the second code also
exists in the current dictionary, it is decoded exactly like the first one.
If, on the other hand, the second code is equal to the first entry that
does not exist yet (the next free entry), you must write the word of the
first code followed by the first letter of that same word. In both cases a
new word appears in the dictionary: it is composed of all the letters of
the word associated with the first code plus the first letter of the word
associated with the second code. You then continue the processing with the
second and third codes read from the input stream of codes, and so on...

Example: let's decode the encoding produced a bit earlier.

+------+-------+----------------+----------+------------------+--------+
| Step | Input | Code to decode | New code | Dictionary       | Output |
+------+-------+----------------+----------+------------------+--------+
| 1    | 65    | 65             | 66       | 65,66=256        | "A"    |
|      | 66    |                |          |                  |        |
+------+-------+----------------+----------+------------------+--------+
| 2    | 65    | 66             | 65       | 66,65=257        | "B"    |
+------+-------+----------------+----------+------------------+--------+
| 3    | 68    | 65             | 68       | 65,68=258        | "A"    |
+------+-------+----------------+----------+------------------+--------+
| 4    | 256   | 68             | 256      | 68,65=259        | "D"    |
+------+-------+----------------+----------+------------------+--------+
| 5    | 67    | 256            | 67       | 65,66,67=260     | "AB"   |
+------+-------+----------------+----------+------------------+--------+
| 6    | 261   | 67             | 261      | 67,67=261        | "C"    |
+------+-------+----------------+----------+------------------+--------+
| 7    | 260   | 261            | 260      | 67,67,65=262     | "CC"   |
+------+-------+----------------+----------+------------------+--------+
| 8    | 69    | 260            | 69       | 65,66,67,69=263  | "ABC"  |
+------+-------+----------------+----------+------------------+--------+
| 9    | 263   | 69             | 263      | 69,65=264        | "E"    |
+------+-------+----------------+----------+------------------+--------+
| 10   | 262   | 263            | 262      |65,66,67,69,67=265| "ABCE" |
+------+-------+----------------+----------+------------------+--------+
| 11   |       | 262            |          |                  | "CCA"  |
+------+-------+----------------+----------+------------------+--------+

Summary: step 4 is an explicit example. The code to decode is 68 ("D" in
ASCII) and the new code is 256. The new word to add to the dictionary is
made of the letters of the first word plus the first letter of the word of
the second code (code 256 is "AB", whose first letter is 65, "A"). So the
new word has the letters 68 and 65 ("DA") and is stored as code 259.
Step 6 is quite special. The first code to decode (67) is referenced, but
the new code (261) is not referenced yet, because at this point the
dictionary only contains words up to 260. We have to handle it as the
second case described above: you take the word to decode plus its own first
letter, i.e. "C"+"C"="CC".
Be careful: if any encountered code is *greater* than the current
dictionary size plus one, it means you have a problem in your data and/or
your codecs are... bad!
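Here is a matching decoder sketch in C, with the same simplifications as
the encoder sketch above (invented names, a linear (prefix, byte)
dictionary, plain integer codes instead of a variable-width bit stream, no
reinitialisation or end-of-information codes). The lines to look at are the
ones handling a code equal to the next free entry, which is exactly the
step 6 situation, and the sanity check that rejects anything beyond it.

#include <stdio.h>
#include <string.h>

#define MAX_DICT 4096

struct entry { int prefix; unsigned char c; };

/* Writes the word associated with 'code' into 'out' and returns its    */
/* length.  The letters are collected backwards (a word only knows its  */
/* last letter and its prefix), then reversed into reading order.       */
static int expand(const struct entry *dict, int code, unsigned char *out)
{
    unsigned char tmp[MAX_DICT];
    int len = 0, i;
    while (code >= 256) {               /* walk back through the prefixes */
        tmp[len++] = dict[code].c;
        code = dict[code].prefix;
    }
    tmp[len++] = (unsigned char)code;   /* the initial single letter      */
    for (i = 0; i < len; i++)
        out[i] = tmp[len - 1 - i];
    return len;
}

/* Decodes 'ncodes' codes; returns the output length, or -1 on error.   */
static int lzw_decode(const int *in, int ncodes, unsigned char *out)
{
    struct entry dict[MAX_DICT];
    unsigned char word[MAX_DICT];
    int size = 256, outlen = 0, wlen, i, code, prev;

    if (ncodes == 0) return 0;
    prev = in[0];                            /* the first code is a raw byte */
    out[outlen++] = (unsigned char)prev;
    for (i = 1; i < ncodes; i++) {
        code = in[i];
        if (code > size)                     /* beyond the next free entry   */
            return -1;                       /* corrupted data or bad codec  */
        if (code == size) {                  /* special case (step 6): the   */
            wlen = expand(dict, prev, word); /* word is the previous word    */
            word[wlen++] = word[0];          /* plus its own first letter    */
        } else {
            wlen = expand(dict, code, word);
        }
        memcpy(out + outlen, word, (size_t)wlen);
        outlen += wlen;
        if (size < MAX_DICT) {               /* new word: previous word +    */
            dict[size].prefix = prev;        /* first letter of the word     */
            dict[size].c = word[0];          /* just written                 */
            size++;
        }
        prev = code;
    }
    return outlen;
}

int main(void)
{
    int codes[] = { 65, 66, 65, 68, 256, 67, 261, 260, 69, 263, 262 };
    unsigned char out[256];
    int n = lzw_decode(codes, (int)(sizeof codes / sizeof codes[0]), out);
    if (n > 0)
        printf("%.*s\n", n, (char *)out);    /* ABADABCCCABCEABCECCA         */
    return 0;
}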
Tricks to improve the LZW encoding (but it becomes a non-standard
encoding):
- To limit the dictionary to a high number of words (a 4096-word maximum
  lets you encode a stream of up to 7,370,880 letters with the same
  dictionary).
- To use a dictionary of fewer than 258 words when possible (for example,
  with 16-color pictures you start with a dictionary of 18 words; see the
  sketch after this list).
- To not reinitialize the dictionary when it is full.
- To reinitialize the dictionary with the most frequent words of the
  previous dictionary.
- To use the codes from (current dictionary size + 1) to (maximum
  dictionary size), because these codes are not used in the standard LZW
  scheme.
Such a compression scheme has been used (successfully) by Robin Watts
<ct93008@ox.ac.uk>.
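As a small illustration of the second trick in the list above (starting
from a dictionary of fewer than 258 letters), here is a sketch that
computes, for a given number of initial letters, where the reinitialisation
code, the end-of-information code and the first free word land, and how
wide the first output codes must be. The names (struct lzw_layout,
layout_for) are invented for the example, and the layout shown (letters,
then reinitialisation, then end of information) is the GIF-style convention
described earlier; other conventions are possible.

#include <stdio.h>

/* For an initial alphabet of 'nletters' letters, lay the dictionary    */
/* out GIF-style: the letters first, then a reinitialisation code, then */
/* an end-of-information code, then the first word learnt from the data.*/
struct lzw_layout {
    int clear_code;     /* reinitialize the dictionary                  */
    int eoi_code;       /* end of information                           */
    int first_free;     /* code of the first word learnt from the data  */
    int start_bits;     /* width of the first codes in the output       */
};

static struct lzw_layout layout_for(int nletters)
{
    struct lzw_layout l;
    l.clear_code = nletters;
    l.eoi_code   = nletters + 1;
    l.first_free = nletters + 2;
    l.start_bits = 1;
    while ((1 << l.start_bits) < l.first_free)  /* smallest width that  */
        l.start_bits++;                         /* holds every code     */
    return l;                                   /* present at the start */
}

int main(void)
{
    struct lzw_layout bytes  = layout_for(256); /* byte-based input     */
    struct lzw_layout pixels = layout_for(16);  /* 16-color pictures    */

    printf("256 letters: clear=%d eoi=%d first=%d -> %d-bit codes\n",
           bytes.clear_code, bytes.eoi_code, bytes.first_free,
           bytes.start_bits);                   /* 256 257 258 -> 9     */
    printf("16 letters:  clear=%d eoi=%d first=%d -> %d-bit codes\n",
           pixels.clear_code, pixels.eoi_code, pixels.first_free,
           pixels.start_bits);                  /* 16 17 18 -> 5        */
    return 0;
}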
+==========================================================+
|                          Summary                         |
+==========================================================+

-------------------------------------------------
RLE type 1:
Fastest compression. Good ratio for general purposes.
Does not need to read the data twice.
Fast decoding.
-------------------------------------------------
RLE type 2:
Fast compression. Very good ratio in general (even for general purposes).
Needs to read the data twice.
Fast decoding.
-------------------------------------------------
RLE type 3:
Slowest compression. Good ratio on image files, rather average for general
purposes.
Needs to read the data twice.
Change the line:
#define MAX_RASTER_SIZE 256
into:
#define MAX_RASTER_SIZE 16
to speed up the encoding (but the compression ratio decreases). If you
compress with memory buffers, do not modify this line...
Fast decoding.
-------------------------------------------------
RLE type 4:
Slow compression. Good ratio on image files, average for general purposes.
Change the line:
#define MAX_RASTER_SIZE 66
into:
#define MAX_RASTER_SIZE 16
to speed up the encoding (but the compression ratio decreases). If you
compress with memory buffers, do not modify this line...
Fast decoding.
-------------------------------------------------
Huffman:
Fast compression. Good ratio on text files and the like, average for
general purposes. An interesting method for compressing a buffer already
compressed by the RLE type 1 or 2 methods...
Fast decoding.
-------------------------------------------------
LZW:
Quite fast compression. Good, even very good, ratio for general purposes.
The bigger the data, the better the compression ratio.
Quite fast decoding.
-------------------------------------------------

The source codes work on all kinds of computers with a C compiler.
Tell the compiler to optimize for speed rather than for space. On UNIX
systems, it is better to compile them with the -O option.
If you do not use a GNU compiler, the source file MUST NOT be larger than
4 Gb for RLE 2, RLE 3, and Huffman, because I count the number of
occurrences of the bytes. With GNU compilers, 'unsigned long int' is
8 bytes instead of 4 bytes (unlike normal UNIX C compilers and PC
compilers such as Microsoft C++ and Borland C++), which pushes that limit
much further. Actually:
* Normal UNIX compilers, Microsoft C++
  and Borland C++ for PCs       => 4 Gb              (unsigned long int = 4 bytes)
* GNU UNIX compilers            => 17,179,869,184 Gb (unsigned long int = 8 bytes)

+==========================================================+
|                            END                           |
+==========================================================+