📄 compress.txt
字号:
"D" "C"\ / \ "E"You see here that the coder shouldn't put the "C" at a node...As you see, the largest binary code are those with the longest distancefrom the top of the tree. Finally, the more frequent bytes will be the highestin the tree in order you have the shortest encoding and the less frequent byteswill be the lowest in the tree.From an algorithmic point of view, you make a list of each byte you encounteredin the stream to compress. This list will always be sorted. The zero-occurrencebytes are removed from this list. You take the two bytes with the smallestoccurrences in the list. Whenever two bytes have the same "weight", you take twoof them regardless to their ASCII value. You join them in a node. This node willhave a fictive byte value (256 will be a good one!) and its weight will bethe sum of the two joined bytes. You replace then the two joined bytes withthe fictive byte. And you continue so until you have one byte (fictive or not)in the list. Of course, this process will produce the shortest codes if the listremains sorted. I will not explain with arcana hard maths why the resultis a set of the shortest bytes...Important: I use as convention that the right sub-trees have a weight greateror equal to the weight of the left sub-trees.Example: Let's take a file to compress where we notice the followingoccurrences:Listed bytes | Frequences (Weight)---------------------------------- 0 | 338 255 | 300 31 | 280 77 | 24 115 | 21 83 | 20 222 | 5We will begin by joining the bytes 83 and 222. This will produce a fictive node1 with a weight of 20+5=25.(Fictive 1,25) /\ / \(222,5) (83,20)Listed bytes | Frequences (Weight)---------------------------------- 0 | 338 255 | 300 31 | 280 Fictive 1 | 25 77 | 24 115 | 21Note that the list is sorted... The smallest values in the frequences are 21 and24. That is why we will take the bytes 77 and 115 to build the fictive node 2.(Fictive 2,45) /\ / \(115,21) (77,25)Listed bytes | Frequences (Weight)---------------------------------- 0 | 338 255 | 300 31 | 280 Fictive 2 | 45 Fictive 1 | 25The nodes with smallest weights are the fictive 1 and 2 nodes. These are joinedto build the fictive node 3 whose weight is 40+25=70. (Fictive 3,70) / \ / \ / \ /\ / \ / \ / \(222,5) (83,20) (115,21) (77,25)Listed bytes | Frequences (Weight)---------------------------------- 0 | 338 255 | 300 31 | 280 Fictive 3 | 70The fictive node 3 is linked to the byte 31. Total weight: 280+70=350. (Fictive 4,350) / \ / \ / \ / \ (31,280) / \ / \ /\ / \ / \ / \(222,5) (83,20) (115,21) (77,25)Listed bytes | Frequences (Weight)---------------------------------- Fictive 4 | 350 0 | 338 255 | 300As you see, being that we sort the list, the fictive node 4 has become the firstof the list. We join the bytes 0 and 255 in a same fictive node, the number 5whose weight is 338+300=638.(Fictive 5,638) /\ / \(255,300) (0,338)Listed bytes | Frequences (Weight)---------------------------------- Fictive 5 | 638 Fictive 4 | 350The fictive nodes 4 and 5 are finally joined. Final weight: 638+350=998 bytes.It is actually the total byte number in the initial file: 338+300+24+21+20+5. (Tree,998) 1 / \ 0 <- / \ -> / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ (31,280) (255,300) (0,338) / \ / \ /\ / \ / \ / \(222,5) (83,20) (115,21) (77,25)Bytes | Huffman codes | Frequences | Binary length*Frequence------------------------------------------------------------ 0 | 00b | 338 | 676 255 | 01b | 300 | 600 31 | 10b | 280 | 560 77 | 1101b | 24 | 96 115 | 1100b | 21 | 84 83 | 1110b | 20 | 80 222 | 1111b | 5 | 20Results: Original file size: (338+300+280+24+21+20+5)*8=7904 bits (=998 bytes)versus 676+600+560+96+84+80+20=2116 bits, i.e. 2116/8=265 bytes.Now you know how to code an input stream. The last problem is to decode all thisstuff. Actually, when you meet a binary sequence you can't say whether it comesfrom such byte list or such other one. Furthermore, if you change the occurrenceof one or two bytes, you won't obtain the same resulting binary tree. Try forexample to encode the previous list but with the following occurrences:Listed bytes | Frequences (Weight)---------------------------------- 255 | 418 0 | 300 31 | 100 77 | 24 115 | 21 83 | 20 222 | 5As you can observe it, the resulting binary tree is quite different, we had yetthe same initial bytes. To not be in such a situation we will put an headerin front of all data. I can't comment longly this header but I can sayI minimize it as much as I could. The header is divided into two parts.The first part of this header looks closely to a boolean table (coded more orless in binary to save space) and the second part provide to the decoderthe binary code associated to each byte encountered in the original inputstream.Here is a summary of the header:First part---------- First bit / \ 1 / \ 0 / \ 256 bits set to 0 or 1 5 bits for the number n (minus 1) depending whether the of bytes encountered corresponding byte was in the file to compres in the file to compress | (=> n bits set to 1, \ / n>32) n values of 8-bits (n<=32) \ / \ / \ /Second part |----------- | | +------------->|(n+1) times | |(n bytes of | First bit?the values | / \encountered | 1 / \ 0in the | / \ source file | 8 bits of 5 bits of the + the code | the length length (-1)of a | (-1) of the of the followingfictive | following binarybyte | binary code codeto stop the | (length>32) (length<=32)decoding. | \ /The fictive | \ /is set to | \ /256 in the | |Huffman | binary code-tree of | |encoding) +--------------| | Binary encoding of the source file | Code of end of encoding |With my codecs I can handle binary sequences with a length of 256 bits.This correspond to encode all the input stream from one byte to infinite length.In fact if a byte had a range from 0 to 257 instead of 0 to 255, I would have abug with my codecs with an input stream of at least 370,959,230,771,131,880,927,453,318,055,001,997,489,772,178,180,790,105 bytes !!!Where come this explosive number? In fact, to have a severe bug, I must havea completely unbalanced tree: Tree /\ \ /\ \ /\ \ ... /\ \ /\Let's take the following example:Listed bytes | Frequences (Weight)---------------------------------- 32 | 5 101 | 3 97 | 2 100 | 1 115 | 1This produces the following unbalanced tree: Tree /\(32,5) \ /\ (101,3) \ /\ (97,2) \ /\ (115,1) (100,1)Let's speak about a mathematical series: The Fibonacci series. It is defined asfollowing:{ Fib(0)=0
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -