algorithm.doc
The compressor achieves an average compression rate of 60% of the original size, which is on par with "gzip". It seems that one cannot do much better for compressing compiled binaries. This means that the break-even point for using compressed images is reached once the uncompressed size approaches 1.5kB. We can stuff more than 12kB into an 8kB EPROM and more than 25kB into a 16kB EPROM. As there is only 32kB of RAM for both the uncompressed image and its BSS area, 32kB EPROMs will hardly ever be required.

The compression algorithm uses a 4kB ring buffer for buffering the uncompressed data. Before compression starts, the ring buffer is filled with spaces (ASCII character 0x20). The algorithm tries to find repeated input sequences of a maximum length of 60 bytes. All 256 different input bytes plus the 58 (60 minus a threshold of 2) possible repeat lengths form a set of 314 symbols. These symbols are adaptively Huffman encoded. The algorithm starts out with a Huffman tree that assigns equal code lengths to each of the 314 symbols (slightly favoring the repeat symbols over symbols for regular input characters), but the tree is updated whenever the frequency of any of the symbols changes. Frequency counts are kept in 16-bit words until the total number of compressed codes reaches 2^15; then all frequency counts are halved (rounding to the bigger number). For unrepeated characters (symbols 0..255) the Huffman code is written to the output stream. For repeated sequences, the Huffman code denoting the length of the repeated sequence is written out, and then the index into the ring buffer is computed. From this index, the algorithm computes the offset relative to the current index into the ring buffer. Thus, for typical input data, one would expect that short to medium range offsets are more frequent than extremely short or medium to long range offsets.
Thus the 12-bit (for a 4kB buffer) offset value is statically Huffman encoded using a precomputed Huffman tree that favors those offset values that are deemed to be more frequent. The Huffman-encoded offset is written to the output data stream, directly following the code that determines the length of the repeated characters.

This algorithm, as implemented in the C example code, looks very good, and its operating parameters are already well optimized. This also explains why it achieves compression ratios comparable with "gzip". Depending on the input data, it sometimes excels considerably beyond what "gzip -9" does, but this phenomenon does not appear to be typical. There are some flaws with the algorithm, such as the limited buffer sizes, the adaptive Huffman tree, which takes very long to adjust if the input characters experience a sudden change in distribution, and the static Huffman tree for encoding offsets into the buffer. The slow changes of the adaptive Huffman tree are partially counteracted by artificially keeping 16-bit precision for the frequency counts, but this does not come into play until 32kB of compressed data has been output, so it does not have any impact on our use for "etherboot", because the boot PROM does not support uncompressed data of more than 32kB (c.f. doc/spec.doc).

Nonetheless, these problems do not seem to affect compression of compiled programs very much. Mixing object code with English text would not work too well, though, and the algorithm should be reset in between. Actually, we might gain a little improvement if text and data segments were compressed individually, but I have not experimented with this option yet.