772 7628 --> 1 1 010 1 00010 010 00011
The nibble encoding would take 28 bits to represent a phone
number; in this particular case, our encoding takes 19 bits. I
introduced spaces into the example above for clarity; you can see
that they are not necessary to unpack the encoding, since the
encoding table will determine whether we have reached the end of
an encoded symbol (but you have to keep track of your place in
the bits).
Huffman encoding is still fairly cheap to decode, cycle-wise. But
since it requires a table lookup, it cannot be quite as cheap as
RLE. The encoding side of Huffman is fairly expensive,
though; the whole data set has to be scanned and a frequency
table built up. In some cases a "shortcut" is appropriate with
Huffman coding. Standard Huffman coding applies to a particular
data set being encoded, with the set-specific symbol table
prepended to the output data stream. However, if the whole type
of data encoded--not just the single data set--has the same
regularities, we can opt for a global Huffman table. If we have
such a global Huffman table, we can hard-code the lookups into
our executables, which makes both compression and decompression
quite a bit cheaper (except for the initial global sampling and
hard-coding). For example, if we know our data set will be
English-language prose, letter-frequency tables are well known
and quite consistent across data sets.
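As a sketch of the idea, a hard-coded decode table for the
telephone-digit example might look like the following. Only the
digit codes that appear in the worked example above are filled
in; a real global table would cover the whole symbol set, and the
function name here is just illustrative.
#*--------- Hard-coded Huffman decoding (sketch) ---------#
# Partial table taken from the worked example above; the remaining
# digits would need codes of their own in a real global table.
DECODE = {'1': '7', '010': '2', '00010': '6', '00011': '8'}
def huffman_decode(bits, table=DECODE):
    "Emit a symbol whenever the accumulated bits form a complete code."
    symbols, pending = [], ''
    for bit in bits:
        pending += bit
        if pending in table:   # prefix property: no code starts another
            symbols.append(table[pending])
            pending = ''
    return ''.join(symbols)
# huffman_decode('1101010001001000011') --> '7727628'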
SECTION -- Lempel-Ziv Compression
-------------------------------------------------------------------
Probably the most significant lossless-compression technique is
Lempel-Ziv. What is explained here is "LZ78," but LZ77 and other
variants work in a similar fashion. The idea in LZ78 is to encode
a streaming byte sequence using a dynamic table. At the start of
compressing a bit stream, the LZ table is filled with the actual
symbol set, along with some blank slots. Various size tables are
used, but for our (whitespace-compressed) telephone number
example above, let's suppose that we use a 32-entry table (this
should be OK for our example, although much too small for most
other types of data). First, we fill the first ten slots with our
alphabet (digits). As new bytes come in, we output the entry for
the longest sequence already in the table that matches the input,
then fill the next available slot with that sequence extended by
one byte (the N+1 length sequence). In the worst case, we use 5
bits instead of 4 bits for a single symbol, but in many cases a
single 5-bit code will cover multiple symbols. For example, the
machine might do this
(a table slot is noted with square brackets):
#*----------- LZ78 Algorithm --------------#
7 --> Lookup: 7 found --> nothing to add --> keep looking
7 --> Lookup: 77 not found --> add '77' to [11] --> output [7]=00111
2 --> Lookup: 72 not found --> add '72' to [12] --> output [7]=00111
7 --> Lookup: 27 not found --> add '27' to [13] --> output [2]=00010
6 --> Lookup: 76 not found --> add '76' to [14] --> output [7]=00111
2 --> Lookup: 62 not found --> add '62' to [15] --> output [6]=00110
8 --> Lookup: 28 not found --> add '28' to [16] --> output [2]=00010
So far, we've got nothing out of it, but let's continue with
the next phone number:
#*----------- LZ78 Algorithm (cont) -------#
7 --> Lookup: 87 not found --> add '87' to [17] --> output [8]=01000
7 --> Lookup: 77 found --> nothing to add --> keep looking
2 --> Lookup: 772 not found --> add '772' to [18] --> output [11]=01011
8 --> Lookup: 28 found --> nothing to add --> keep looking
6 --> Lookup: 286 not found --> add '286' to [19] --> output [16]=10000
...
The steps should suffice to see the pattern. We have not achieved
any net compression yet, but notice that we've already managed to
use slot 11 and slot 16, thereby getting two symbols with one
output in each case. We've also accumulated the byte sequence
'772' in slot 18, which would prove useful later in the stream.
What LZ78 does is fill up one symbol table with (hopefully)
helpful entries, then write it, clear it, and start a new one. In
this regard, 32 entries is still probably too small a symbol
table, since that will get cleared before a lot of reuse of '772'
and the like is achieved. But the small symbol table is easy to
illustrate.
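A rough sketch of that compression loop, in Python, might look
like the following; the function name and table handling are
illustrative only, and the table-flushing step is omitted.
#*--------------- LZ78 compression loop (sketch) ---------------#
def lz78_compress(digits, table_size=32):
    "Return the list of table indexes emitted for a digit string."
    # Slots 0-9 hold the bare digits; slot 10 is left unused here
    # only so that new entries land in [11], [12], ... as in the
    # trace above.
    table = [str(d) for d in range(10)] + [None]
    output, current = [], ''
    for ch in digits:
        if current + ch in table:            # keep extending the match
            current += ch
            continue
        output.append(table.index(current))  # emit longest match found
        if len(table) < table_size:          # grow table while room remains
            table.append(current + ch)
        current = ch
    if current:
        output.append(table.index(current))  # flush whatever is left
    return output, table
# lz78_compress('772762877286')[0] --> [7, 7, 2, 7, 6, 2, 8, 11, 16, 6]
# (each index would then be written out as a 5-bit code)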
In typical data sets, Lempel-Ziv variants achieve much better
compression rates than Huffman or RLE. On the other hand,
Lempel-Ziv variants are very pricey cycle-wise and can use large
tables in memory. Most real-life compression tools and libraries
use a combination of Lempel-Ziv and Huffman techniques.
SECTION -- Solving the Right Problem
-------------------------------------------------------------------
Just as choosing the right algorithm can often create
orders-of-magnitude improvements over even heavily optimized
wrong algorithms, choosing the right data representation is often
even more important than compression methods (which are always a
sort of post hoc optimization of desired features). The simple
data set example used in this appendix is a perfect case where
reconceptualizing the problem would actually be a much better
approach than using -any- of the compression techniques
illustrated.
Think again about what our data represents. It is not a very
general collection of data, and the rigid a priori constraints
allow us to reformulate our whole problem. What we have is a
maximum of 30,000 telephone numbers (7720000 through 7749999),
some of which are active, and others of which are not. We do
not have a "duty," as it were, to produce a full representation
of each telephone number that is active, but simply to indicate
the binary fact that it -is- active. Thinking of the problem
this way, we can simply allocate 30,000 bits of memory and
storage, and have each bit say "yes" or "no" to the presence of
one telephone number. The ordering of the bits in the
bit-array can be simple ascending order from the lowest to the
highest telephone number in the range.
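A minimal sketch of that representation, using a Python
`bytearray` for the bit-array (the constants and helper names are
for illustration only):
#*--------- Telephone numbers as a bit-array (sketch) ---------#
LOW, HIGH = 7720000, 7749999
NUM_SLOTS = HIGH - LOW + 1          # 30,000 possible numbers
bits = bytearray(NUM_SLOTS // 8)    # 3750 bytes, all initially inactive
def set_active(number):
    "Mark one telephone number as active."
    slot = number - LOW
    bits[slot // 8] |= 1 << (slot % 8)
def is_active(number):
    "Ask whether a telephone number is active."
    slot = number - LOW
    return bool(bits[slot // 8] & (1 << (slot % 8)))
# set_active(7727628); is_active(7727628) --> True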
This bit-array solution is the best in almost every respect.
It allocates exactly 3750 bytes to represent the data set; the
various compression techniques will use a varying amount of
storage depending both on the number of telephone numbers in
the set and the efficiency of the compression. But if 10,000
of the 30,000 possible telephone numbers are active, and even a
very efficient compression technique requires several bytes per
telephone number, then the bit-array is an order of magnitude
better. In terms of CPU demands, the bit-array is not only
better than any of the discussed compression methods, it is
also quite likely to be better than the naive noncompression
method of listing all the numbers as strings. Stepping through
a bit-array and incrementing a "current-telephone-number"
counter can be done quite efficiently and mostly within the
on-chip cache of a modern CPU.
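Continuing the sketch above, listing the active numbers is
exactly that sort of scan:
#*--------- Stepping through the bit-array (sketch) ---------#
def active_numbers(bits, low=7720000):
    "Yield each active telephone number in ascending order."
    number = low
    for byte in bits:                # bytearray iteration yields ints
        for position in range(8):
            if byte & (1 << position):
                yield number
            number += 1
# list(active_numbers(bits)) --> [7727628], given the set_active()
# call in the sketch above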
The lesson to be learned from this very simple example is
certainly not that every problem has some magic shortcut (like
this one does). A lot of problems genuinely require significant
memory, bandwidth, storage, and CPU resources, and in many of
those cases compression techniques can help ease--or shift--those
burdens. But a more moderate lesson could be suggested: Before
compression techniques are employed, it is a good idea to make
sure that one's starting conceptualization of the data
representation is a good one.
SECTION -- A Custom Text Compressor
-------------------------------------------------------------------
Most styles of compression require a decompression pass before
one is able to do something useful with a source document. Many
(de)compressors can operate as a stream, producing only the
needed bytes of a compressed or decompressed stream in sequence.
In some cases, formats even insert recovery or bookkeeping bytes
that allow streams to begin within documents (rather than from
the very beginning). Programmatic wrappers can make compressed
documents or strings look like plaintext ones at the appropriate
API layer. Nonetheless, even streaming decompressors require a
computational overhead to get at the plaintext content of a
compressed document.
An excellent example of a streaming (de)compressor with an API
wrapper is `gzip.GzipFile()`. Although the wrapper is not
entirely transparent, it lets you compress and decompress
documents without any explicit call to a (de)compression function.
`gzip.GzipFile()` provides a file-like interface, but it is also
easy to operate on a purely in-memory file using the support of
`cStringIO.StringIO()`. For example:
>>> from gzip import GzipFile
>>> from cStringIO import StringIO
>>> sio = StringIO()
>>> writer = GzipFile(None, 'wb', 9, sio)
>>> writer.write('Mary had a little lamb\n')
>>> writer.write('its fleece as white as snow\n')
>>> writer.close()
>>> sio.getvalue()[:20]
'\x1f\x8b\x08\x00k\xc1\x9c<\x02\xff'
>>> reader = GzipFile(None, 'rb', 9, StringIO(sio.getvalue()))
>>> reader.read()[:20]
'Mary had a little la'
>>> reader.seek(30)
>>> reader.read()
'ece as white as snow\n'
One thing this example shows is that the underlying compressed
string is more or less gibberish. Although the file-like API
hides the details from an application programmer, the
decompression process is also stateful in its dependence on a
symbol table built from the byte sequence in the compressed text.
You cannot expect to make sense of a few bytes in the middle of
the compressed text without a knowledge of the prior context.
A different approach to compression can have significant
advantages in operating on natural-language textual sources. A
group of researchers in Brazil and Chile have examined techniques
for "word-based Huffman compression." The general strategy of
these researchers is to treat whole words as the symbol set for a
Huffman table, rather than merely naive byte values. In natural
languages, a limited number of (variable-length, multibyte) words
occur with a high frequency, and savings result if such words are
represented with shorter byte sequences. In general, such reduced
representation is common to all compression techniques, but
word-based Huffman takes the additional step of retaining byte
boundaries (and uses fixed symbol mapping, as with other Huffman
variants).
A special quality of word-based Huffman compressed text is that
it need not undergo decompression to be searched. This quality
makes it convenient to store textual documents in compressed
form, without needing to decompress them before
they are useful. Instead, if one is searching for words directly
contained in the symbol table, one can merely precompress the
search terms, then use standard searching algorithms. Such a
search can be either against an in-memory string or against a
file-like source; in general a search against a precompressed
target will be -faster- than one against an uncompressed text. In
code, one would use snippets similar to:
#*------------- Word-based Huffman compression ----------#
small_text = word_Huffman_compress(big_text)
search_term = "Foobar"
# The search term must be coded with the same symbol table used
# for the document itself
coded_term = word_Huffman_compress(search_term)
offset = small_text.find(coded_term)
# Slice out some compressed context around the match, then expand
# only that slice
coded_context = small_text[offset-10:offset+10+len(coded_term)]
plain_context = word_Huffman_expand(coded_context)
A sophisticated implementation of word-based Huffman compression
can obtain better compression sizes than does `zlib`. For
simplicity, the module below sacrifices optimal compression to
the goal of clarity and brevity of code. A fleshed-out
implementation could add a number of features.
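What follows is a compact sketch of the core idea rather than a
full module: instead of building a true Huffman code, it simply
maps the most frequent words onto single codes above the ASCII
range, which is enough to show the byte-boundary and
direct-search properties. The names, the 127-word limit, and the
code layout are illustrative choices only.
#*------ Word-based compression in the Huffman spirit (sketch) ------#
import re
from collections import Counter
def build_table(text, nwords=127):
    "Map the most frequent words onto single codes 0x80-0xFE."
    # assumes plain ASCII source text
    words = re.findall(r"[A-Za-z]+", text)
    ranked = [w for w, _ in Counter(words).most_common(nwords)]
    return dict((w, chr(0x80 + i)) for i, w in enumerate(ranked))
def word_compress(text, table):
    "Substitute codes for table words; other text passes through."
    return re.sub(r"[A-Za-z]+",
                  lambda m: table.get(m.group(0), m.group(0)), text)
def word_expand(coded, table):
    "Reverse the substitution, one character at a time."
    rev = dict((code, w) for w, code in table.items())
    return "".join(rev.get(ch, ch) for ch in coded)
# Usage paralleling the earlier snippet (one shared table for the
# document and for any search terms):
#   table = build_table(big_text)
#   small_text = word_compress(big_text, table)
#   offset = small_text.find(word_compress("Foobar", table))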