APPENDIX -- A DATA COMPRESSION PRIMER
-------------------------------------------------------------------
SECTION -- Introduction
-------------------------------------------------------------------
See Section 2.2.5 for details on compression capabilities
included in the Python standard library. This appendix is
intended to provide readers who are unfamiliar with data
compression with a basic background on its techniques and
theory. The
final section of this appendix provides a practical
example--accompanied by some demonstration code--of a
Huffman-inspired custom encoding.
Data compression is widely used in a variety of programming
contexts. All popular operating systems and programming
languages have numerous tools and libraries for dealing with
data compression of various sorts. The right choice of
compression tools and libraries for a particular application
depends on the characteristics of the data and application in
question: streaming versus file; expected patterns and
regularities in the data; relative importance of CPU usage,
memory usage, channel demands, and storage requirements; and
other factors.
Just what is data compression, anyway? The short answer is
that data compression removes -redundancy- from data; in
information-theoretic terms, compression increases the
-entropy- of the compressed text. But those statements are
essentially just true by definition. Redundancy can come in a
lot of different forms. Repeated bit sequences ('11111111')
are one type. Repeated byte sequences are another
('XXXXXXXX'). But more often redundancies tend to come on a
larger scale, either regularities of the data set taken as a
whole, or sequences of varying lengths that are relatively
common. Basically, what data compression aims at is finding
algorithmic transformations of data representations that will
produce more compact representations given "typical" data sets.
If this description seems a bit complex to unpack, read on to
find some more practical illustrations.
SECTION -- Lossless and Lossy Compression
-------------------------------------------------------------------
There are actually two fundamentally different "styles" of data
compression: lossless and lossy. This appendix is generally about
lossless compression techniques, but the reader would be served
to understand the distinction first. Lossless compression
involves a transformation of the representation of a data set
such that it is possible to reproduce -exactly- the original data
set by performing a decompression transformation. Lossy
compression, by contrast, transforms a data set so that you can
reproduce only something "pretty much like" the original. As a
plus for
the lossy techniques, they can frequently produce far more
compact data representations than lossless compression techniques
can. Most often lossy compression techniques are used for images,
sound files, and video. Lossy compression may be appropriate in
these areas insofar as human observers do not perceive the
literal bit-pattern of a digital image/sound, but rather more
general "gestalt" features of the underlying image/sound.
From the point of view of "normal" data, lossy compression is
not an option. We do not want a program that does "about the
same" thing as the one we wrote. We do not want a database
that contains "about the same" kind of information as what we
put into it. At least not for most purposes (and the writer
knows of few practical uses of lossy compression outside of
what are already approximate mimetic representations of the
real world, like images and sounds).
SECTION -- A Data Set Example
-------------------------------------------------------------------
For purposes of this appendix, let us start with a specific
hypothetical data representation. Here is an
easy-to-understand example. In the town of Greenfield, MA, the
telephone prefixes are '772-', '773-', and '774-'. (For non-USA
readers: In the USA, local telephone numbers are seven digits
and are conventionally represented in the form ###-####; prefixes
are assigned in geographic blocks.) Suppose also that the
first prefix is the most widely assigned of the three. The
suffix portions might be any other digits, in fairly equal
distribution. The data set we are interested in is "the list
of all the telephone numbers currently in active use." One can
imagine various reasons why this might be interesting for
programmatic purposes, but we need not specify that herein.
Initially, the data set we are interested in comes in a
particular data representation: a multicolumn report (perhaps
generated as output of some query or compilation process). The
first few lines of this report might look like:
#*----------- Telephone Number Report ----------------------#
=============================================================
772-7628 772-8601 772-0113 773-3429 774-9833
773-4319 774-3920 772-0893 772-9934 773-8923
773-1134 772-4930 772-9390 774-9992 772-2314
[...]
SECTION -- Whitespace Compression
-------------------------------------------------------------------
Whitespace compression can be characterized most generally as
"removing what we are not interested in." Even though it is
technically a lossy technique, it is
still useful for many types of data representations we find in
the real world. For example, even though HTML is far more
readable in a text editor if indentation and vertical spacing
is added, none of this "whitespace" should make any difference
to how the HTML document is rendered by a Web browser. If you
happen to know that an HTML document is destined only for a
Web browser (or for a robot/spider) then it might be a good
idea to take out all the whitespace to make it transmit faster
and occupy less space in storage. What we remove in whitespace
compression never really had any functional purpose to start
with.
In the case of our example in this appendix, it is possible to
remove quite a bit from the described report. The row of "="
across the top adds nothing functional, nor do the "-" within
numbers, nor the spaces between them. These are all useful for a
person reading the original report, but do not matter once we
think of it as data. What we remove is not precisely whitespace
in traditional terms, but the intent is the same.
Whitespace compression is extremely "cheap" to perform. It is
just a matter of reading a stream of data and excluding a few
specific values from the output stream. In many cases, no
"decompression" step is involved at all. But even where we
would wish to re-create something close to the original
somewhere down the data stream, it should require little in
terms of CPU or memory. What we reproduce may or may not be
exactly what we started with, depending on just what rules and
constraints were involved in the original. An HTML page typed
by a human in a text editor will probably have spacing that is
idiosyncratic. Then again, automated tools often produce
"reasonable" indentation and spacing of HTML. In the case of
the rigid report format in our example, there is no reason that
the original representation could not be precisely produced by
a "decompressing formatter" down the data stream.
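To make this concrete, here is a minimal Python sketch of both
directions for the rigid report format (the function names and
sample text are illustrative, not from any real report):

```python
def compress_report(text):
    """Drop the '=' ruler, hyphens, spaces, and newlines, keeping
    only the digits that carry actual data."""
    return ''.join(c for c in text if c not in '=- \n')

def decompress_report(digits, per_row=3):
    """A 'decompressing formatter': re-chunk the bare digit stream
    into 7-digit numbers and re-insert hyphens and spacing."""
    nums = [digits[i:i + 7] for i in range(0, len(digits), 7)]
    rows = (' '.join('%s-%s' % (n[:3], n[3:]) for n in nums[i:i + per_row])
            for i in range(0, len(nums), per_row))
    return '\n'.join(rows)

packed = compress_report("772-7628 772-8601 772-0113\n")
# 21 digits instead of 27 characters; the rigid format lets the
# formatter rebuild the row exactly:
restored = decompress_report(packed)
```

Because every number is exactly seven digits, no information is
needed beyond the digit stream itself to restore the layout.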
SECTION -- Run-Length Encoding
-------------------------------------------------------------------
Run-length encoding (RLE) is the simplest widely used
lossless-compression technique. Like whitespace compression, it
is "cheap"--especially to decode. The idea behind it is that many
data representations consist largely of strings of repeated
bytes. Our example report is one such data representation. It
begins with a string of repeated "=", and has strings of spaces
scattered through it. Rather than represent each character with
its own byte, RLE will (sometimes or always) have an iteration
count followed by the character to be repeated.
If repeated bytes are predominant within the expected data
representation, it might be adequate and efficient to always have
the algorithm specify one or more bytes of iteration count,
followed by one character. However, if runs of length one
occur, each such run will require two (or more) bytes to encode;
that is, '00000001 01011000' might be the output bit stream
required for just one ASCII "X" in the input stream.
Then again, a hundred "X" in a row would be output as '01100100
01011000', which is quite good.
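A sketch of this always-count variant (the function names are
mine, not a standard API) might look like:

```python
def rle_encode_naive(data: bytes) -> bytes:
    """Always emit a one-byte count (1-255) followed by the byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1
        out += bytes([j - i, data[i]])
        i = j
    return bytes(out)

def rle_decode_naive(data: bytes) -> bytes:
    """Invert the encoding: each (count, byte) pair expands in place."""
    out = bytearray()
    for k in range(0, len(data), 2):
        out += data[k + 1:k + 2] * data[k]
    return bytes(out)

rle_encode_naive(b"X")        # b'\x01X' -- two bytes for one character
rle_encode_naive(b"X" * 100)  # b'dX'   -- 0x64 == 100, quite good
```

A lone "X" costs two bytes, matching the '00000001 01011000'
example, while a hundred in a row collapse to two bytes as well.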
What is frequently done in RLE variants is to selectively use
bytes to indicate iterator counts and otherwise just have bytes
represent themselves. At least one byte-value has to be reserved
to do this, but that can be escaped in the output, if needed. For
example, in our example telephone-number report, we know that
everything in the input stream is plain ASCII characters.
Specifically, they all have the high (first) bit of their byte
set to 0. We could set this high bit to indicate that an iterator
count was being represented rather than a regular
character. The next seven bits of the iterator byte could be used
for the iterator count, and the next byte could represent the
character to be repeated. So, for example, we could represent the
string "YXXXXXXXX" as:
#*----- Run length encoding example -----#
"Y" Iter(8) "X"
01011001 10001000 01011000
This example does not show how to escape iterator byte-values,
nor does it allow iteration of more than 127 occurrences of a
character. Variations on RLE deal with issues such as these,
if needed.
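Here is a sketch of the scheme just described, assuming pure
7-bit ASCII input so the high bit is free to mark counts (the
names are illustrative; there is no escaping, and runs are capped
at 127):

```python
def rle_encode(text: str) -> bytes:
    """Runs of 2-127 identical characters become two bytes:
    (0x80 | count) followed by the character; lone characters
    pass through as themselves."""
    data = text.encode('ascii')   # guarantees every high bit is 0
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 127:
            j += 1
        count = j - i
        if count > 1:
            out += bytes([0x80 | count, data[i]])
        else:
            out.append(data[i])
        i = j
    return bytes(out)

def rle_decode(data: bytes) -> str:
    """A set high bit marks an iteration count for the next byte."""
    out = bytearray()
    it = iter(data)
    for b in it:
        if b & 0x80:
            out += bytes([next(it)]) * (b & 0x7F)
        else:
            out.append(b)
    return out.decode('ascii')

rle_encode("YXXXXXXXX")   # b'Y\x88X' -- 01011001 10001000 01011000
```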
SECTION -- Huffman Encoding
-------------------------------------------------------------------
Huffman encoding looks at the symbol table of a whole data set.
The compression is achieved by finding the "weights" of each
symbol in the data set. Some symbols occur more frequently than
others, so Huffman encoding suggests that the frequent symbols
need not be encoded using as many bits as the less-frequent
symbols. There are variations on Huffman-style encoding, but the
original (and most common) variation involves finding the most
frequent symbol and encoding it using just one bit, say 1. If a
decoder encounters a 0 instead, it knows it is reading a longer,
variable-length symbol.
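The classic construction repeatedly merges the two lowest-weight
subtrees. A Python sketch using the standard library's heapq
follows; the example weights are invented for illustration, not
measured from any real data set:

```python
import heapq

def huffman_codes(weights):
    """Build a prefix-free code from a {symbol: weight} mapping by
    repeatedly merging the two lowest-weight subtrees."""
    # Each heap entry: (total weight, tiebreaker, {symbol: code so far}).
    heap = [(w, n, {sym: ''}) for n, (sym, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, n1, left = heapq.heappop(heap)
        w2, n2, right = heapq.heappop(heap)
        # Prefix 0 onto one subtree's codes, 1 onto the other's.
        merged = {s: '0' + c for s, c in left.items()}
        merged.update({s: '1' + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, min(n1, n2), merged))
    return heap[0][2]

huffman_codes({'7': 40, '2': 15, '3': 10, '4': 5})
# The heaviest symbol, '7', ends up with the shortest code.
```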
Let's imagine we apply Huffman encoding to our local phone-book
example (assume we have already whitespace-compressed the
report). We might get:
#*----- Huffman encoding example -----#
Encoding Symbol
1 7
010 2
011 3
00000 4
00001 5
00010 6
00011 8
00100 9
00101 0
00111 1
Our initial symbol set of digits could already be
straightforwardly encoded (with no compression) as 4-bit
sequences (nibbles). The Huffman encoding given will use up to
5 bits for the worst-case symbols, which is obviously worse
than the nibble encoding. However, our best case will use only
-1- bit, and we know that our best case is also the most
frequent case, by having scanned the data set. So we might
encode a particular phone number like:
#*----- Huffman translation example -----#
      7727628  -->  1 1 010 1 00010 010 00011
This encoding occupies 19 bits, against the 28 bits that a plain
nibble encoding of the seven digits would require.
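The code table is easy to apply programmatically. Here is a
sketch (helper names are illustrative) of encoding, and of the
greedy decoding that the prefix-free property makes unambiguous:

```python
# The code table from the Huffman encoding example above.
CODES = {'7': '1',     '2': '010',   '3': '011',
         '4': '00000', '5': '00001', '6': '00010',
         '8': '00011', '9': '00100', '0': '00101',
         '1': '00111'}

def encode(digits: str) -> str:
    return ''.join(CODES[d] for d in digits)

def decode(bits: str) -> str:
    """Because no code is a prefix of another, reading left to
    right and emitting at the first match is unambiguous."""
    decoding = {v: k for k, v in CODES.items()}
    out, buf = [], ''
    for bit in bits:
        buf += bit
        if buf in decoding:
            out.append(decoding[buf])
            buf = ''
    return ''.join(out)

bits = encode('7727628')          # 19 bits, versus 28 as nibbles
assert decode(bits) == '7727628'
```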