The presented module [word_huffman] uses a fixed number of bytes
to encode each word in the symbol table. This number of bytes can
be selected to be 1, 2, or 3 (thereby limiting the table to a
generous 2 million entries). The module also separates the
generation of a symbol table from the actual
compression/decompression. The module can be used in a context
where various documents get encoded using the same symbol
table--the table presumably generated based on a set of canonical
documents. In this situation, the computational cost of
symbol-table generation is incurred just once, and the symbol
table itself need not be transmitted along with each compressed
document. Of course, nothing prevents you from treating the
document currently being processed as said canonical statistical
word source (thereby somewhat improving compression).
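For instance, the usage pattern just described might look
something like the following sketch, which relies on the functions
defined in the word_huffman.py listing below (the corpus and
document filenames are hypothetical):

from word_huffman import normalize_text, build_histogram, optimal_Nbyte
from word_huffman import tables_from_words, word_Huffman_compress
# Pay the cost of symbol table generation only once, against the
# canonical documents...
hist = {}
for fname in ('canon1.txt', 'canon2.txt'):        # hypothetical corpus
    hist = build_histogram(normalize_text(open(fname).read()), hist)
comp_table, exp_table, entrylen = tables_from_words(optimal_Nbyte(hist, 2))
# ...then compress any number of documents against the same table;
# only the compressed documents need to be stored or transmitted
for fname in ('report1.txt', 'report2.txt'):      # hypothetical documents
    # 'wb' mode because the output contains high-bit bytes
    open(fname+'.wh', 'wb').write(
        word_Huffman_compress(open(fname).read(), comp_table))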
In the scheme [word_huffman] uses, only high-bit bytes appear in
the symbol table. The lower 128 ASCII
characters represent themselves as literals. Any ASCII
character sequence that is not in the symbol table is
represented as itself--including any short words that would not
benefit from encoding. Any high-bit characters that occur in the
original source text are escaped by being preceded by
a 0xFF byte. As a result, high-bit characters are encoded
using two bytes; this technique is clearly only useful for
encoding (mostly) textual files, not binary files. Moreover,
only character values 0x80-0xFE are used by the symbol table
(0xFF -always- signals a literal high-bit character in the
encoding).
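A quick round trip makes these byte conventions concrete. The
sketch below assumes the word_huffman.py listing that follows has
been saved as an importable module; the toy one-word symbol table
and the sample string are purely illustrative:

from word_huffman import tables_from_words, word_Huffman_compress
from word_huffman import word_Huffman_expand
comp_table, exp_table, entrylen = tables_from_words(['compression'])
sample = 'na\xefve compression wins'      # 0xEF is a high-bit byte
packed = word_Huffman_compress(sample, comp_table)
# 'compression' shrinks to a single byte in the 0x80-0xFE range; the
# 0xEF byte grows to the pair 0xFF 0xEF; all other bytes pass through
assert '\xff\xef' in packed
assert word_Huffman_expand(packed, exp_table, entrylen) == sample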
The [word_huffman] algorithm is not entirely stateless in the
sense that not every subsequence in a compressed text can be
expanded without additional context. But very little context is
required. Any low-bit character always literally represents
itself. A high-bit character, however, might be either an escaped
literal, a first byte of a symbol table entry, or a non-first
byte of a symbol table entry. In the worst case, where a 3-byte
symbol table is used, it is necessary to look back two bytes from
an arbitrary position in the text to determine the full context.
Normally, only one byte lookback is necessary. In any case, words
in the symbol table are separated from each other in the
uncompressed text by nonalpha low-bit characters (usually
whitespace), so parsing compressed entries is straightforward.
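As a concrete illustration of that lookback rule, a small helper
along the following lines (hypothetical, not part of the module)
could expand a compressed text starting from an arbitrary byte
offset, using word_Huffman_expand from the listing below:

from word_huffman import word_Huffman_expand

def expand_from(comp_text, offset, exp_table, entrylen):
    "Expand compressed text from an arbitrary offset (hypothetical helper)"
    start = offset
    # Back up over at most two high-bit bytes (the worst case described
    # above) so that expansion begins on a symbol or escape-pair
    # boundary rather than in the middle of one
    while start > max(0, offset-2) and ord(comp_text[start-1]) >= 128:
        start -= 1
    return word_Huffman_expand(comp_text[start:], exp_table, entrylen)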
#---------- word_huffman.py ----------#
wordchars = '-_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

def normalize_text(txt):
    "Convert non-word characters to spaces"
    trans = [' '] * 256
    for c in wordchars: trans[ord(c)] = c
    return txt.translate(''.join(trans))

def build_histogram(txt, hist={}):
    "Incrementally build a histogram table from text source(s)"
    for word in txt.split():
        hist[word] = hist.get(word, 0)+1
    return hist

def optimal_Nbyte(hist, entrylen=2):
    "Build optimal word list for nominal symbol table byte-length"
    slots = 127**entrylen
    words = []
    for word, count in hist.items():
        gain = count * (len(word)-entrylen)
        if gain > 0: words.append((gain, word))
    words.sort()
    words.reverse()
    return [w[1] for w in words[:slots]]

def tables_from_words(words):
    "Create symbol tables for compression and expansion"
    # Determine ACTUAL best symbol table byte length
    if len(words) < 128: entrylen = 1
    elif len(words) <= 16129: entrylen = 2
    else: entrylen = 3          # assume < ~2M distinct words
    comp_table = {}
    # Escape hibit characters
    for hibit_char in map(chr, range(128, 256)):
        comp_table[hibit_char] = chr(255)+hibit_char
    # Literal low-bit characters
    for lowbit_char in map(chr, range(128)):
        comp_table[lowbit_char] = lowbit_char
    # Add word entries
    for word, index in zip(words, range(len(words))):
        comp_table[word] = symbol(index, entrylen)
    # Reverse dictionary for expansion table
    exp_table = {}
    for key, val in comp_table.items():
        exp_table[val] = key
    return (comp_table, exp_table, entrylen)

def symbol(index, entrylen):
    "Determine actual symbol from word sequence and symbol length"
    # Every byte of a symbol stays in the range 0x80-0xFE (base-127
    # digits), so 0xFF remains reserved for escaping hibit literals
    if entrylen == 1:
        return chr(128+index)
    if entrylen == 2:
        byte1, byte2 = divmod(index, 127)
        return chr(128+byte1)+chr(128+byte2)
    if entrylen == 3:
        byte1, rem = divmod(index, 16129)
        byte2, byte3 = divmod(rem, 127)
        return chr(128+byte1)+chr(128+byte2)+chr(128+byte3)
    raise ValueError, "symbol byte len must be 1 <= S <= 3: "+`entrylen`

def word_Huffman_compress(text, comp_table):
    "Compress text based on word-to-symbol table"
    comp_text = []
    maybe_entry = []
    for c in text+chr(0):       # force flush of final word
        if c in wordchars:
            maybe_entry.append(c)
        else:
            word = ''.join(maybe_entry)
            comp_text.append(comp_table.get(word, word))
            maybe_entry = []
            comp_text.append(comp_table[c])
    return ''.join(comp_text[:-1])

def word_Huffman_expand(text, exp_table, entrylen):
    "Expand text based on symbol-to-word table"
    exp_text = []
    offset = 0
    end = len(text)
    while offset < end:
        c = text[offset]
        if ord(c) == 255:       # escaped highbit character
            exp_text.append(text[offset+1])
            offset += 2
        elif ord(c) >= 128:     # symbol table entry
            symbol = text[offset:offset+entrylen]
            exp_text.append(exp_table[symbol])
            offset += entrylen
        else:
            exp_text.append(c)
            offset += 1
    return ''.join(exp_text)

def Huffman_find(pat, comp_text, comp_table):
    "Find a (plaintext) substring in compressed text"
    comp_pat = word_Huffman_compress(pat, comp_table)
    return comp_text.find(comp_pat)

if __name__ == '__main__':
    import sys, glob
    big_text = []
    for fpat in sys.argv[1:]:
        for fname in glob.glob(fpat):
            big_text.append(open(fname).read())
    big_text = ''.join(big_text)
    hist = build_histogram(normalize_text(big_text))
    for entrylen in (1, 2, 3):
        comp_words = optimal_Nbyte(hist, entrylen)
        comp_table, exp_table, entrylen_ = tables_from_words(comp_words)
        comp_text = word_Huffman_compress(big_text, comp_table)
        exp_text = word_Huffman_expand(comp_text, exp_table, entrylen_)
        print "Nominal/actual symbol length (entries): %i/%i (%i)" % \
              (entrylen, entrylen_, len(comp_words))
        print "Compression ratio: %i%%" % \
              ((100*len(comp_text))/len(big_text))
        if big_text == exp_text:
            print "*** Compression/expansion cycle successful!\n"
        else:
            print "*** Failure in compression/expansion cycle!\n"
    # Just for fun, here's a search against compressed text
    pos = Huffman_find('Foobar', comp_text, comp_table)
The [word_huffman] module, while simple and fairly short, is
still likely to be useful--and it lays the basis for a
fleshed-out variant. The compression obtained by the algorithm
above is comparatively modest: in informal tests, the compressed
output runs 50-60 percent of the size of the original
text. But given that locality of
decompression of subsegments is both possible and cheap, there is
nearly no disadvantage to this transformation for stored
documents. Word searches become quicker basically in direct
proportion to the length reduction.
One likely improvement would be to add run-length compression of
whitespace (or generally of nonalpha characters); doing so would
lose none of the direct searchability that this algorithm is
designed around, and in typical electronic natural-language texts
would result in significant additional compression. Moreover, a
pleasant side effect of the [word_huffman] transformation is that
transformed documents become -more- compressible under
Lempel-Ziv-based techniques (i.e., cumulatively). In other words,
there is benefit in precompressing documents with [word_huffman]
if you intend to later compress them with 'gzip', 'zip', or
similar tools.
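A sketch of such a run-length pass follows. The choice of byte
0x01 as the run marker is an assumption of my own (the module
itself reserves no low-bit codes); the pass is meant to be applied
to the output of word_Huffman_compress and undone before
word_Huffman_expand:

RUN_MARKER = chr(1)       # assumed absent from natural-language text

def rle_spaces(text):
    "Replace each run of three or more spaces with RUN_MARKER + count byte"
    out, i, end = [], 0, len(text)
    while i < end:
        if text[i] == ' ':
            j = i
            while j < end and text[j] == ' ' and j-i < 255:
                j += 1
            if j-i >= 3:
                out.append(RUN_MARKER + chr(j-i))
            else:
                out.append(text[i:j])
            i = j
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)

def unrle_spaces(text):
    "Inverse of rle_spaces"
    out, i, end = [], 0, len(text)
    while i < end:
        if text[i] == RUN_MARKER:
            out.append(' ' * ord(text[i+1]))
            i += 2
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)

Because table words never contain spaces, Huffman_find() on
individual words keeps working against the run-length-encoded
output; only search patterns that themselves contain long runs of
whitespace would need the same transformation.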
More aggressive improvements might be obtained by allowing
variable byte-length symbol table entries and/or by claiming some
additional low-bit control codes for the symbol table (and
escaping literal occurrences of those codes in the original
text). You can experiment with
such variations, and your results might vary somewhat depending
upon the details of application-specific canonical texts.
Search capabilities might also be generalized--but this would
require considerably greater effort. In the research article
referenced below, the authors show how to generalize to direct
regular-expression searching against word-based Huffman-encoded
texts. The [word_huffman] implementation allows certain
straightforward transformations of regular expressions (where
literal words occur within them) for searching against compressed
documents, but a number of caveats and restrictions apply.
Overcoming most such limitations would involve digging into
Python's underlying regular expression engine, but it is possible
in principle.
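For a flavor of what such a transformation involves, the rough
sketch below substitutes whole literal words inside a plain
regular expression with their compressed symbols before searching.
The helper name and the crude word-spotting pattern are my own,
and the sketch deliberately ignores the caveats just mentioned
(words inside character classes, words that occur only embedded in
longer words, and so on):

import re

def compress_pattern(pattern, comp_table):
    "Swap literal words in a regex for their symbols (rough sketch only)"
    def repl(match):
        word = match.group(0)
        # re.escape keeps the substituted symbol bytes literal in the regex
        return re.escape(comp_table.get(word, word))
    return re.sub(r'[A-Za-z_-]{2,}', repl, pattern)

# E.g., with comp_table from tables_from_words():
#   regex = re.compile(compress_pattern('foo(bar|baz)', comp_table))
#   match = regex.search(comp_text)

A transformed pattern matches only where a word was encoded as a
whole table entry in the compressed text, which is one of the
restrictions alluded to above.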
SECTION -- References
-------------------------------------------------------------------
A good place to turn for additional theoretical and practical
information on compression is the '<comp.compression>' FAQ:
<http://www.cis.ohio-state.edu/hypertext/faq/usenet/compression-faq/>.
A research article on word-based Huffman encoding inspired my
simple example of word-based compression. The article "Fast
and Flexible Word Searching on Compressed Text," by Edleno
Silva de Moura, Gonzalo Navarro, Nivio Ziviani, and Ricardo
Baeza-Yates, can be found at:
<http://citeseer.nj.nec.com/silvademoura00fast.html>.