APPENDIX -- A DATA COMPRESSION PRIMER
-------------------------------------------------------------------

SECTION -- Introduction
-------------------------------------------------------------------

  See Section 2.2.5 for details on compression capabilities
  included in the Python standard library. This appendix is
  intended to provide readers who are unfamiliar with data
  compression a basic background on its techniques and theory. The
  final section of this appendix provides a practical
  example--accompanied by some demonstration code--of a
  Huffman-inspired custom encoding.

  Data compression is widely used in a variety of programming
  contexts.  All popular operating systems and programming
  languages have numerous tools and libraries for dealing with
  data compression of various sorts.  The right choice of
  compression tools and libraries for a particular application
  depends on the characteristics of the data and application in
  question:  streaming versus file; expected patterns and
  regularities in the data; relative importance of CPU usage,
  memory usage, channel demands, and storage requirements; and
  other factors.

  Just what is data compression, anyway?  The short answer is
  that data compression removes -redundancy- from data; in
  information-theoretic terms, compression increases the
  -entropy- of the compressed text.  But those statements are
  essentially just true by definition.  Redundancy can come in a
  lot of different forms.  Repeated bit sequences ('11111111')
  are one type.  Repeated byte sequences are another
  ('XXXXXXXX').  But more often redundancies tend to come on a
  larger scale, either regularities of the data set taken as a
  whole, or sequences of varying lengths that are relatively
  common.  Basically, what data compression aims at is finding
  algorithmic transformations of data representations that will
  produce more compact representations given "typical" data sets.
  If this description seems a bit complex to unpack, read on to
  find some more practical illustrations.

SECTION -- Lossless and Lossy Compression
-------------------------------------------------------------------

  There are actually two fundamentally different "styles" of data
  compression: lossless and lossy. This appendix is generally about
  lossless compression techniques, but the reader would be well
  served to understand the distinction first. Lossless compression
  involves a transformation of the representation of a data set
  such that it is possible to reproduce -exactly- the original data
  set by performing a decompression transformation. Lossy
  compression, in contrast, transforms a data set into a
  representation from which one can reproduce only something
  "pretty much like" the original data set. As a plus for
  the lossy techniques, they can frequently produce far more
  compact data representations than lossless compression techniques
  can. Most often lossy compression techniques are used for images,
  sound files, and video. Lossy compression may be appropriate in
  these areas insofar as human observers do not perceive the
  literal bit-pattern of a digital image/sound, but rather more
  general "gestalt" features of the underlying image/sound.

  From the point of view of "normal" data, lossy compression is
  not an option.  We do not want a program that does "about the
  same" thing as the one we wrote.  We do not want a database
  that contains "about the same" kind of information as what we
  put into it.  At least not for most purposes (and the writer
  knows of few practical uses of lossy compression outside of
  what are already approximate mimetic representations of the
  real world, like images and sounds).

SECTION -- A Data Set Example
-------------------------------------------------------------------

  For purposes of this appendix, let us start with a specific
  hypothetical data representation.  Here is an
  easy-to-understand example.  In the town of Greenfield, MA, the
  telephone prefixes are '772-', '773-', and '774-'.  (For non-USA
  readers:  In the USA, local telephone numbers are seven digits
  and are conventionally represented in the form ###-####; prefixes
  are assigned in geographic blocks.)  Suppose also that the
  first prefix is the most widely assigned of the three.  The
  suffix portions might be any other digits, in fairly equal
  distribution.  The data set we are interested in is "the list
  of all the telephone numbers currently in active use."  One can
  imagine various reasons why this might be interesting for
  programmatic purposes, but we need not specify that herein.

  Initially, the data set we are interested in comes in a
  particular data representation:  a multicolumn report (perhaps
  generated as output of some query or compilation process).  The
  first few lines of this report might look like:

      #*----------- Telephone Number Report ----------------------#
      =============================================================
      772-7628     772-8601     772-0113     773-3429     774-9833
      773-4319     774-3920     772-0893     772-9934     773-8923
      773-1134     772-4930     772-9390     774-9992     772-2314
      [...]

SECTION -- Whitespace Compression
-------------------------------------------------------------------

  Whitespace compression can be characterized most generally as
  "removing what we are not interested in."  Even though this
  technique is technically a lossy-compression technique, it is
  still useful for many types of data representations we find in
  the real world.  For example, even though HTML is far more
  readable in a text editor if indentation and vertical spacing
  are added, none of this "whitespace" should make any difference
  to how the HTML document is rendered by a Web browser.  If you
  happen to know that an HTML document is destined only for a
  Web browser (or for a robot/spider) then it might be a good
  idea to take out all the whitespace to make it transmit faster
  and occupy less space in storage.  What we remove in whitespace
  compression never really had any functional purpose to start
  with.

  In the case of our example in this appendix, it is possible to
  remove quite a bit from the described report. The row of "="
  across the top adds nothing functional, nor do the "-" within
  numbers, nor the spaces between them. These are all useful for a
  person reading the original report, but do not matter once we
  think of it as data. What we remove is not precisely whitespace
  in traditional terms, but the intent is the same.
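
  As a minimal sketch of this idea, a stream filter might simply
  keep the digits and discard everything else (the keep-only-digits
  rule is an assumption that fits the sample report above):

      #*----- whitespace compression sketch -----#
      # Keep only the digits from the report, discarding the "="
      # rule, the "-" within numbers, and all spacing, none of
      # which is data.
      import sys

      def compress_whitespace(stream):
          for line in stream:
              for c in line:
                  if c.isdigit():
                      yield c

      if __name__ == '__main__':
          sys.stdout.write(''.join(compress_whitespace(sys.stdin)))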

  Whitespace compression is extremely "cheap" to perform.  It is
  just a matter of reading a stream of data and excluding a few
  specific values from the output stream.  In many cases, no
  "decompression" step is involved at all.  But even where we
  would wish to re-create something close to the original
  somewhere down the data stream, it should require little in
  terms of CPU or memory.  What we reproduce may or may not be
  exactly what we started with, depending on just what rules and
  constraints were involved in the original.  An HTML page typed
  by a human in a text editor will probably have spacing that is
  idiosyncratic.  Then again, automated tools often produce
  "reasonable" indentation and spacing of HTML.  In the case of
  the rigid report format in our example, there is no reason that
  the original representation could not be precisely produced by
  a "decompressing formatter" down the data stream.

SECTION -- Run-Length Encoding
-------------------------------------------------------------------

  Run-length encoding (RLE) is the simplest widely used
  lossless-compression technique. Like whitespace compression, it
  is "cheap"--especially to decode. The idea behind it is that many
  data representations consist largely of strings of repeated
  bytes. Our example report is one such data representation. It
  begins with a string of repeated "=", and has strings of spaces
  scattered through it. Rather than represent each character with
  its own byte, RLE will (sometimes or always) have an iteration
  count followed by the character to be repeated.

  If repeated bytes are predominant within the expected data
  representation, it might be adequate and efficient to always have
  the algorithm specify one or more bytes of iteration count,
  followed by one character. However, runs of length one then
  require two (or more) bytes to encode; that is,
  '00000001 01011000' might be the output bit stream required for
  just one ASCII "X" of the input stream.
  Then again, a hundred "X" in a row would be output as '01100100
  01011000', which is quite good.
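
  A minimal sketch of this always-count variant (capping runs at
  255 so the count fits in a single byte):

      #*----- naive RLE sketch -----#
      # Always emit a one-byte count followed by the byte itself.
      def rle_encode(data):
          out, i = bytearray(), 0
          while i < len(data):
              run = 1
              while (i + run < len(data) and
                     data[i + run] == data[i] and run < 255):
                  run += 1
              out.append(run)        # iteration count (1..255)
              out.append(data[i])    # the repeated byte
              i += run
          return bytes(out)

      def rle_decode(data):
          out = bytearray()
          for count, byte in zip(data[::2], data[1::2]):
              out.extend([byte] * count)
          return bytes(out)

      assert rle_decode(rle_encode(b'X' * 100)) == b'X' * 100
      print(rle_encode(b'X'))        # b'\x01X': two bytes for one "X"
      print(rle_encode(b'X' * 100))  # b'dX': 100 is 0x64, ASCII "d"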

  What is frequently done in RLE variants is to selectively use
  bytes to indicate iterator counts and otherwise just have bytes
  represent themselves. At least one byte-value has to be reserved
  to do this, but that can be escaped in the output, if needed. For
  example, in our example telephone-number report, we know that
  everything in the input stream is plain ASCII characters.
  Specifically, the high-order bit of every byte is 0. We could use
  this high bit to indicate that a byte holds an iterator count
  rather than a regular character. The remaining seven bits of the
  iterator byte hold the count itself, and the following byte gives
  the character to be repeated. So, for example, we could represent the
  string "YXXXXXXXX" as:

      #*----- Run length encoding example -----#
      "Y"      Iter(8)  "X"
      01011001 10001000 01011000

  This example does not show how to escape iterator byte-values,
  nor does it allow iteration of more than 127 occurrences of a
  character.  Variations on RLE deal with issues such as these,
  if needed.
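
  A sketch of this selective scheme for 7-bit ASCII input, with
  the same limitations just noted (no escaping, runs capped at
  127):

      #*----- high-bit RLE variant sketch -----#
      # Bytes with the high bit set are iteration counts (low seven
      # bits) applying to the byte that follows; all other bytes
      # stand for themselves.
      def rle7_encode(text):
          out, i = bytearray(), 0
          while i < len(text):
              run = 1
              while (i + run < len(text) and
                     text[i + run] == text[i] and run < 127):
                  run += 1
              if run > 1:
                  out.append(0x80 | run)  # high bit flags a count
              out.append(ord(text[i]))
              i += run
          return bytes(out)

      def rle7_decode(data):
          out, count = [], 1
          for byte in data:
              if byte & 0x80:             # iterator byte
                  count = byte & 0x7F
              else:                       # literal byte
                  out.append(chr(byte) * count)
                  count = 1
          return ''.join(out)

      encoded = rle7_encode('YXXXXXXXX')
      print(' '.join(format(b, '08b') for b in encoded))
      # -> 01011001 10001000 01011000  ("Y", Iter(8), "X")
      assert rle7_decode(encoded) == 'YXXXXXXXX'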

SECTION -- Huffman Encoding
-------------------------------------------------------------------

  Huffman encoding looks at the symbol table of a whole data set.
  The compression is achieved by finding the "weights" of each
  symbol in the data set. Some symbols occur more frequently than
  others, so Huffman encoding suggests that the frequent symbols
  need not be encoded using as many bits as the less-frequent
  symbols. There are variations on Huffman-style encoding, but in
  the original (and most frequent) variation, the most common
  symbol is encoded using just one bit, say 1; a leading 0 then
  signals the start of a longer, variable-length code.
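
  The underlying construction repeatedly merges the two least
  weighted symbols (or subtrees) until a single tree remains. Here
  is a minimal sketch using made-up digit frequencies in which "7"
  dominates; the exact bit assignments depend on tie-breaking, but
  a symbol holding more than half the total weight always receives
  a one-bit code:

      #*----- Huffman code construction sketch -----#
      import heapq
      from itertools import count

      def huffman_codes(freqs):
          tiebreak = count()       # keeps heap comparisons on ints
          heap = [(w, next(tiebreak), s) for s, w in freqs.items()]
          heapq.heapify(heap)
          while len(heap) > 1:
              w1, _, a = heapq.heappop(heap)
              w2, _, b = heapq.heappop(heap)
              heapq.heappush(heap, (w1 + w2, next(tiebreak), (a, b)))
          codes = {}
          def walk(node, prefix):
              if isinstance(node, tuple):     # internal node
                  walk(node[0], prefix + '0')
                  walk(node[1], prefix + '1')
              else:                           # leaf symbol
                  codes[node] = prefix or '0'
          walk(heap[0][2], '')
          return codes

      # Hypothetical frequencies with "7" heavily dominant:
      print(huffman_codes({'7': 60, '2': 10, '3': 9, '4': 4,
                           '5': 4, '6': 4, '8': 3, '9': 3,
                           '0': 2, '1': 1}))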

  Let's imagine we apply Huffman encoding to our local phone-book
  example (assume we have already whitespace-compressed the
  report). We might get:

      #*----- Huffman encoding example -----#
      Encoding   Symbol
       1           7
       010         2
       011         3
       00000       4
       00001       5
       00010       6
       00011       8
       00100       9
       00101       0
       00111       1

  Our initial symbol set of digits could already be
  straightforwardly encoded (with no compression) as 4-bit
  sequences (nibbles).  The Huffman encoding given will use up to
  5 bits for the worst-case symbols, which is obviously worse
  than the nibble encoding.  However, our best case will use only
  -1- bit, and we know that our best case is also the most
  frequent case, by having scanned the data set.  So we might
  encode a particular phone number like:

      #*----- Huffman translation example -----#
      772 7628 --> 1 1 010 1 00010 010 00011

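  A short sketch that applies the table above and verifies the
  round trip is shown below; the code table is copied directly
  from the listing, and the greedy decode works because the codes
  are prefix-free:

      #*----- Huffman table application sketch -----#
      # Encode/decode phone-number digits using the table above.
      CODES = {'7': '1',     '2': '010',   '3': '011',
               '4': '00000', '5': '00001', '6': '00010',
               '8': '00011', '9': '00100', '0': '00101',
               '1': '00111'}

      def encode(digits):
          return ''.join(CODES[d] for d in digits)

      def decode(bits):
          reverse = {v: k for k, v in CODES.items()}
          out, buf = [], ''
          for bit in bits:
              buf += bit
              if buf in reverse:    # prefix-free: first match is right
                  out.append(reverse[buf])
                  buf = ''
          return ''.join(out)

      bits = encode('7727628')
      print(bits, '--', len(bits), 'bits vs', 4 * 7, 'as nibbles')
      assert decode(bits) == '7727628'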