772 7628 --> 1 1 010 1 00010 010 00011
The nibble encoding would take 28 bits to represent a phone
number; in this particular case, our encoding takes 19 bits. I
introduced spaces into the example above for clarity; you can see
that they are not necessary to unpack the encoding, since the
encoding table will determine whether we have reached the end of
an encoded symbol (but you have to keep track of your place in
the bits).
Huffman encoding is still fairly cheap to decode, cycle-wise. But
since it requires a table lookup, it cannot be quite as cheap as
RLE. The encoding side of Huffman is fairly expensive,
though; the whole data set has to be scanned and a frequency
table built up. In some cases a "shortcut" is appropriate with
Huffman coding. Standard Huffman coding applies to a particular
data set being encoded, with the set-specific symbol table
prepended to the output data stream. However, if the whole type
of data encoded--not just the single data set--has the same
regularities, we can opt for a global Huffman table. If we have
such a global Huffman table, we can hard-code the lookups into
our executables, which makes both compression and decompression
quite a bit cheaper (except for the initial global sampling and
hard-coding). For example, if we know our data set will be
English-language prose, letter-frequency tables are well known
and quite consistent across data sets.
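As a sketch of the idea, a hard-coded decode table for the
telephone-digit example might look like the following. Only the
digit codes that appear in the worked example above are filled
in; a real global table would cover the whole symbol set, and the
function name here is just illustrative.
#*--------- Hard-coded Huffman decoding (sketch) ---------#
# Partial table taken from the worked example above; the remaining
# digits would need codes of their own in a real global table.
DECODE = {'1': '7', '010': '2', '00010': '6', '00011': '8'}
def huffman_decode(bits, table=DECODE):
    "Emit a symbol whenever the accumulated bits form a complete code."
    symbols, pending = [], ''
    for bit in bits:
        pending += bit
        if pending in table:   # prefix property: no code starts another
            symbols.append(table[pending])
            pending = ''
    return ''.join(symbols)
# huffman_decode('1101010001001000011') --> '7727628'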
SECTION -- Lempel-Ziv Compression
-------------------------------------------------------------------
Probably the most significant lossless-compression technique is
Lempel-Ziv. What is explained here is "LZ78," but LZ77 and other
variants work in a similar fashion. The idea in LZ78 is to encode
a streaming byte sequence using a dynamic table. At the start of
compressing a bit stream, the LZ table is filled with the actual
symbol set, along with some blank slots. Various size tables are
used, but for our (whitespace-compressed) telephone number
example above, let's suppose that we use a 32-entry table (this
should be OK for our example, although much too small for most
other types of data). First, we fill the first ten slots with our
alphabet (digits). As new bytes come in, we output the entry for
the longest sequence already in the table that matches the input,
then fill the next available slot with that sequence extended by
one byte (the N+1 length sequence). In the worst case, we use 5
bits instead of 4 bits for a single symbol, but in many cases a
single 5-bit code will cover multiple symbols. For example, the
machine might do this
(a table slot is noted with square brackets):
#*----------- LZ78 Algorithm --------------#
7 --> Lookup: 7 found --> nothing to add --> keep looking
7 --> Lookup: 77 not found --> add '77' to [11] --> output [7]=00111
2 --> Lookup: 72 not found --> add '72' to [12] --> output [7]=00111
7 --> Lookup: 27 not found --> add '27' to [13] --> output [2]=00010
6 --> Lookup: 76 not found --> add '76' to [14] --> output [7]=00111
2 --> Lookup: 62 not found --> add '62' to [15] --> output [6]=00110
8 --> Lookup: 28 not found --> add '28' to [16] --> output [2]=00010
So far, we've got nothing out of it, but let's continue with
the next phone number:
#*----------- LZ78 Algorithm (cont) -------#
7 --> Lookup: 87 not found --> add '87' to [17] --> output [8]=01000
7 --> Lookup: 77 found --> nothing to add --> keep looking
2 --> Lookup: 772 not found --> add '772' to [18] --> output [11]=01011
8 --> Lookup: 28 found --> nothing to add --> keep looking
6 --> Lookup: 286 not found --> add '286' to [19] --> output [16]=10000
...
The steps should suffice to see the pattern. We have not achieved
any net compression yet, but notice that we've already managed to
use slot 11 and slot 16, thereby getting two symbols with one
output in each case. We've also accumulated the byte sequence
'772' in slot 18, which would prove useful later in the stream.
What LZ78 does is fill up one symbol table with (hopefully)
helpful entries, then write it, clear it, and start a new one. In
this regard, 32 entries is still probably too small a symbol
table, since that will get cleared before a lot of reuse of '772'
and the like is achieved. But the small symbol table is easy to
illustrate.
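A rough sketch of that compression loop, in Python, might look
like the following; the function name and table handling are
illustrative only, and the table-flushing step is omitted.
#*--------------- LZ78 compression loop (sketch) ---------------#
def lz78_compress(digits, table_size=32):
    "Return the list of table indexes emitted for a digit string."
    # Slots 0-9 hold the bare digits; slot 10 is left unused here
    # only so that new entries land in [11], [12], ... as in the
    # trace above.
    table = [str(d) for d in range(10)] + [None]
    output, current = [], ''
    for ch in digits:
        if current + ch in table:            # keep extending the match
            current += ch
            continue
        output.append(table.index(current))  # emit longest match found
        if len(table) < table_size:          # grow table while room remains
            table.append(current + ch)
        current = ch
    if current:
        output.append(table.index(current))  # flush whatever is left
    return output, table
# lz78_compress('772762877286')[0] --> [7, 7, 2, 7, 6, 2, 8, 11, 16, 6]
# (each index would then be written out as a 5-bit code)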
In typical data sets, Lempel-Ziv variants achieve much better
compression rates than Huffman or RLE. On the other hand,
Lempel-Ziv variants are very pricey cycle-wise and can use large
tables in memory. Most real-life compression tools and libraries
use a combination of Lempel-Ziv and Huffman techniques.
SECTION -- Solving the Right Problem
-------------------------------------------------------------------
Just as choosing the right algorithm can often create
orders-of-magnitude improvements over even heavily optimized
wrong algorithms, choosing the right data representation is often
even more important than compression methods (which are always a
sort of post hoc optimization of desired features). The simple
data set example used in this appendix is a perfect case where
reconceptualizing the problem would actually be a much better
approach than using -any- of the compression techniques
illustrated.
Think again about what our data represents. It is not a very
general collection of data, and the rigid a priori constraints
allow us to reformulate our whole problem. What we have is a
maximum of 30,000 telephone numbers (7720000 through 7749999),
some of which are active, and others of which are not. We do
not have a "duty," as it were, to produce a full representation
of each telephone number that is active, but simply to indicate
the binary fact that it -is- active. Thinking of the problem
this way, we can simply allocate 30,000 bits of memory and
storage, and have each bit say "yes" or "no" to the presence of
one telephone number. The ordering of the bits in the
bit-array can be simple ascending order from the lowest to the
highest telephone number in the range.
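A minimal sketch of that representation, using a Python
`bytearray` for the bit-array (the constants and helper names are
for illustration only):
#*--------- Telephone numbers as a bit-array (sketch) ---------#
LOW, HIGH = 7720000, 7749999
NUM_SLOTS = HIGH - LOW + 1          # 30,000 possible numbers
bits = bytearray(NUM_SLOTS // 8)    # 3750 bytes, all initially inactive
def set_active(number):
    "Mark one telephone number as active."
    slot = number - LOW
    bits[slot // 8] |= 1 << (slot % 8)
def is_active(number):
    "Ask whether a telephone number is active."
    slot = number - LOW
    return bool(bits[slot // 8] & (1 << (slot % 8)))
# set_active(7727628); is_active(7727628) --> True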
This bit-array solution is the best in almost every respect.
It allocates exactly 3750 bytes to represent the data set; the
various compression techniques will use a varying amount of
storage depending both on the number of telephone numbers in
the set and the efficiency of the compression. But if 10,000
of the 30,000 possible telephone numbers are active, and even a
very efficient compression technique requires several bytes per
telephone number, then the bit-array is an order of magnitude
better. In terms of CPU demands, the bit-array is not only
better than any of the discussed compression methods, it is
also quite likely to be better than the naive noncompression
method of listing all the numbers as strings. Stepping through
a bit-array and incrementing a "current-telephone-number"
counter can be done quite efficiently and mostly within the
on-chip cache of a modern CPU.
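Continuing the sketch above, listing the active numbers is
exactly that sort of scan:
#*--------- Stepping through the bit-array (sketch) ---------#
def active_numbers(bits, low=7720000):
    "Yield each active telephone number in ascending order."
    number = low
    for byte in bits:                # bytearray iteration yields ints
        for position in range(8):
            if byte & (1 << position):
                yield number
            number += 1
# list(active_numbers(bits)) --> [7727628], given the set_active()
# call in the sketch above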
The lesson to be learned from this very simple example is
certainly not that every problem has some magic shortcut (like
this one does). A lot of problems genuinely require significant
memory, bandwidth, storage, and CPU resources, and in many of
those cases compression techniques can help ease--or shift--those
burdens. But a more moderate lesson could be suggested: Before
compression techniques are employed, it is a good idea to make
sure that one's starting conceptualization of the data
representation is a good one.
SECTION -- A Custom Text Compressor
-------------------------------------------------------------------
Most styles of compression require a decompression pass before
one is able to do something useful with a source document. Many
(de)compressors can operate as a stream, producing only the
needed bytes of a compressed or decompressed stream in sequence.
In some cases, formats even insert recovery or bookkeeping bytes
that allow streams to begin within documents (rather than from
the very beginning). Programmatic wrappers can make compressed
documents or strings look like plaintext ones at the appropriate
API layer. Nonetheless, even streaming decompressors require a
computational overhead to get at the plaintext content of a
compressed document.
An excellent example of a streaming (de)compressor with an API
wrapper is `gzip.GzipFile()`. Although the wrapper is not
entirely transparent, it lets you compress and decompress
documents without any explicit call to a (de)compression function.
`gzip.GzipFile()` provides a file-like interface, but it is also
easy to operate on a purely in-memory file using the support of
`cStringIO.StringIO()`. For example:
>>> from gzip import GzipFile
>>> from cStringIO import StringIO
>>> sio = StringIO()
>>> writer = GzipFile(None, 'wb', 9, sio)
>>> writer.write('Mary had a little lamb\n')
>>> writer.write('its fleece as white as snow\n')
>>> writer.close()
>>> sio.getvalue()[:20]
'\x1f\x8b\x08\x00k\xc1\x9c<\x02\xff'
>>> reader = GzipFile(None, 'rb', 9, StringIO(sio.getvalue()))
>>> reader.read()[:20]
'Mary had a little la'
>>> reader.seek(30)
>>> reader.read()
'ece as white as snow\n'
One thing this example shows is that the underlying compressed
string is more or less gibberish. Although the file-like API
hides the details from an application programmer, the
decompression process is also stateful in its dependence on a
symbol table built from the byte sequence in the compressed text.
You cannot expect to make sense of a few bytes in the middle of
the compressed text without a knowledge of the prior context.
A different approach to compression can have significant
advantages in operating on natural-language textual sources. A
group of researchers in Brazil and Chile have examined techniques
for "word-based Huffman compression." The general strategy of
these researchers is to treat whole words as the symbol set for a
Huffman table, rather than merely naive byte values. In natural
languages, a limited number of (variable-length, multibyte) words
occur with a high frequency, and savings result if such words are
represented with shorter byte sequences. In general, such reduced
representation is common to all compression techniques, but
word-based Huffman takes the additional step of retaining byte
boundaries (and uses fixed symbol mapping, as with other Huffman
variants).
A special quality of word-based Huffman compressed text is that
it need not undergo decompression to be searched. This quality
makes it convenient to store textual documents in compressed
form, without needing to decompress them before
they are useful. Instead, if one is searching for words directly
contained in the symbol table, one can merely precompress the
search terms, then use standard searching algorithms. Such a
search can be either against an in-memory string or against a
file-like source; in general a search against a precompressed
target will be -faster- than one against an uncompressed text. In
code, one would use snippets similar to:
#*------------- Word-based Huffman compression ----------#
small_text = word_Huffman_compress(big_text)
search_term = "Foobar"
# The search term must be coded with the same symbol table used
# for the document itself
coded_term = word_Huffman_compress(search_term)
offset = small_text.find(coded_term)
# Slice out some compressed context around the match, then expand
# only that slice
coded_context = small_text[offset-10:offset+10+len(coded_term)]
plain_context = word_Huffman_expand(coded_context)
A sophisticated implementation of word-based Huffman compression
can obtain better compression sizes than does `zlib`. For
simplicity, the module below sacrifices optimal compression to
the goal of clarity and brevity of code. A fleshed-out
implementation could add a number of features.
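What follows is a compact sketch of the core idea rather than a
full module: instead of building a true Huffman code, it simply
maps the most frequent words onto single codes above the ASCII
range, which is enough to show the byte-boundary and
direct-search properties. The names, the 127-word limit, and the
code layout are illustrative choices only.
#*------ Word-based compression in the Huffman spirit (sketch) ------#
import re
from collections import Counter
def build_table(text, nwords=127):
    "Map the most frequent words onto single codes 0x80-0xFE."
    # assumes plain ASCII source text
    words = re.findall(r"[A-Za-z]+", text)
    ranked = [w for w, _ in Counter(words).most_common(nwords)]
    return dict((w, chr(0x80 + i)) for i, w in enumerate(ranked))
def word_compress(text, table):
    "Substitute codes for table words; other text passes through."
    return re.sub(r"[A-Za-z]+",
                  lambda m: table.get(m.group(0), m.group(0)), text)
def word_expand(coded, table):
    "Reverse the substitution, one character at a time."
    rev = dict((code, w) for w, code in table.items())
    return "".join(rev.get(ch, ch) for ch in coded)
# Usage paralleling the earlier snippet (one shared table for the
# document and for any search terms):
#   table = build_table(big_text)
#   small_text = word_compress(big_text, table)
#   offset = small_text.find(word_compress("Foobar", table))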