
  The presented module [word_huffman] uses a fixed number of bytes
  to encode each word in the symbol table. This number of bytes can
  be selected to be 1, 2, or 3 (thereby limiting the table to a
  generous 2 million entries). The module also separates the
  generation of a symbol table from the actual
  compression/decompression. The module can be used in a context
  where various documents get encoded using the same symbol
  table--the table presumably generated based on a set of canonical
  documents. In this situation, the computational cost of symbol
  table generation is incurred just once, and the symbol table itself
  need not be transmitted along with each compressed document. Of
  course, nothing prevents you from treating the document currently
  being processed as said canonical statistical word source (thereby
  somewhat improving compression).
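
  As a minimal sketch of that usage (my own example, assuming the
  functions from the word_huffman.py listing below and hypothetical
  filenames), one symbol table built from canonical documents can be
  reused for any number of other documents:

      # Build the symbol table once, from canonical document(s)
      canonical = open('canonical.txt').read()
      hist = build_histogram(normalize_text(canonical))
      words = optimal_Nbyte(hist, entrylen=2)
      comp_table, exp_table, entrylen = tables_from_words(words)
      # Reuse the same table for each document; the table itself need
      # not accompany the compressed output ('.wh' suffix is arbitrary)
      for fname in ('doc1.txt', 'doc2.txt'):
          comp = word_Huffman_compress(open(fname).read(), comp_table)
          open(fname+'.wh', 'wb').write(comp)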

  In the algorithm used by [word_huffman], only high-bit
  bytes are utilized in the symbol table.  The lower 128 ASCII
  characters represent themselves as literals.  Any ASCII
  character sequence that is not in the symbol table is
  represented as itself--including any short words that would not
  benefit from encoding.  Any high-bit characters that occur in the
  original source text are escaped by being preceded by
  an 0xFF byte.  As a result, high-bit characters are encoded
  using two bytes; this technique is clearly only useful for
  encoding (mostly) textual files, not binary files.  Moreover,
  only character values 0x80-0xFE are used by the symbol table
  (0xFF -always- signals a literal high-bit character in the
  encoding).
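
  For instance (a small, contrived illustration using the functions
  from the listing below, with a table containing only the word
  'the'):

      comp_table, exp_table, entrylen = tables_from_words(['the'])
      print repr(word_Huffman_compress('the cat', comp_table))
      # -> '\x80 cat'      ('the' becomes a single high-bit byte)
      print repr(word_Huffman_compress('na\xefve', comp_table))
      # -> 'na\xff\xefve'  (the high-bit \xef is escaped by \xff)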

  The [word_huffman] algorithm is not entirely stateless in the
  sense that not every subsequence in a compressed text can be
  expanded without additional context. But very little context is
  required. Any low-bit character always literally represents
  itself. A high-bit character, however, might be either an escaped
  literal, a first byte of a symbol table entry, or a non-first
  byte of a symbol table entry. In the worst case, where a 3-byte
  symbol table is used, it is necessary to look back two bytes from
  an arbitrary position in the text to determine the full context.
  Normally, only one byte lookback is necessary. In any case, words
  in the symbol table are separated from each other in the
  uncompressed text by nonalpha low-bit characters (usually
  whitespace), so parsing compressed entries is straightforward.

      #---------- word_huffman.py ----------#
      wordchars = '-_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

      def normalize_text(txt):
          "Convert non-word characters to spaces"
          trans = [' '] * 256
          for c in wordchars: trans[ord(c)] = c
          return txt.translate(''.join(trans))

      def build_histogram(txt, hist={}):
          "Incrementally build a histogram table from text source(s)"
          for word in txt.split():
              hist[word] = hist.get(word, 0)+1
          return hist

      def optimal_Nbyte(hist, entrylen=2):
          "Build optimal word list for nominal symbol table byte-length"
          slots = 127**entrylen
          words = []
          for word, count in hist.items():
              gain = count * (len(word)-entrylen)
              if gain > 0: words.append((gain, word))
          words.sort()
          words.reverse()
          return [w[1] for w in words[:slots]]

      def tables_from_words(words):
          "Create symbol tables for compression and expansion"
          # Determine ACTUAL best symbol table byte length
          if len(words) < 128: entrylen = 1
          elif len(words) <= 16129: entrylen = 2
          else: entrylen = 3 # assume < ~2M distinct words
          comp_table = {}
          # Escape hibit characters
          for hibit_char in map(chr, range(128,256)):
              comp_table[hibit_char] = chr(255)+hibit_char
          # Literal low-bit characters
          for lowbit_char in map(chr, range(128)):
              comp_table[lowbit_char] = lowbit_char
          # Add word entries
          for word, index in zip(words, range(len(words))):
              comp_table[word] = symbol(index, entrylen)
          # Reverse dictionary for expansion table
          exp_table = {}
          for key, val in comp_table.items():
              exp_table[val] = key
          return (comp_table, exp_table, entrylen)

      def symbol(index, entrylen):
          "Determine actual symbol from word index and symbol byte-length"
          # Only byte values 0x80-0xFE appear in symbols; each byte is a
          # base-127 digit, so 0xFF is reserved for the escape marker
          if entrylen == 1:
              return chr(128+index)
          if entrylen == 2:
              byte1, byte2 = divmod(index, 127)
              return chr(128+byte1)+chr(128+byte2)
          if entrylen == 3:
              byte1, rem = divmod(index, 16129)   # 16129 == 127**2
              byte2, byte3 = divmod(rem, 127)
              return chr(128+byte1)+chr(128+byte2)+chr(128+byte3)
          raise ValueError("symbol byte len must be 1 <= S <= 3: %r" % entrylen)

      def word_Huffman_compress(text, comp_table):
          "Compress text based on word-to-symbol table"
          comp_text = []
          maybe_entry = []
          for c in text+chr(0):   # force flush of final word
              if c in wordchars:
                  maybe_entry.append(c)
              else:
                  word = ''.join(maybe_entry)
                  comp_text.append(comp_table.get(word, word))
                  maybe_entry = []
                  comp_text.append(comp_table[c])
          return ''.join(comp_text[:-1])

      def word_Huffman_expand(text, exp_table, entrylen):
          "Expand text based on symbol-to-word table"
          exp_text = []
          offset = 0
          end = len(text)
          while offset < end:
              c = text[offset]
              if ord(c) == 255:   # escaped highbit character
                  exp_text.append(text[offset+1])
                  offset += 2
              elif ord(c) >= 128: # symbol table entry
                  symbol = text[offset:offset+entrylen]
                  exp_text.append(exp_table[symbol])
                  offset += entrylen
              else:
                  exp_text.append(c)
                  offset += 1
          return ''.join(exp_text)

      def Huffman_find(pat, comp_text, comp_table):
          "Find a (plaintext) substring in compressed text"
          comp_pat = word_Huffman_compress(pat, comp_table)
          return comp_text.find(comp_pat)

      if __name__=='__main__':
          import sys, glob
          big_text = []
          for fpat in sys.argv[1:]:
              for fname in glob.glob(fpat):
                  big_text.append(open(fname).read())
          big_text = ''.join(big_text)
          hist = build_histogram(normalize_text(big_text))
          for entrylen in (1, 2, 3):
              comp_words = optimal_Nbyte(hist, entrylen)
              comp_table, exp_table, entrylen_ = tables_from_words(comp_words)
              comp_text = word_Huffman_compress(big_text, comp_table)
              exp_text = word_Huffman_expand(comp_text, exp_table, entrylen_)
              print "Nominal/actual symbol length (entries): %i/%i (%i)" % \
                    (entrylen, entrylen_, len(comp_words))
              print "Compression ratio: %i%%" % \
                    ((100*len(comp_text))/len(big_text))
              if big_text == exp_text:
                  print "*** Compression/expansion cycle successful!\n"
              else:
                  print "*** Failure in compression/expansion cycle!\n"
              # Just for fun, here's a search against compressed text
              pos = Huffman_find('Foobar', comp_text, comp_table)

  The [word_huffman] module, while simple and fairly short, is
  still likely to be useful--and it lays the basis for a
  fleshed-out variant. The compression obtained by the algorithm
  above is a comparatively modest 50-60 percent of the size of the
  original text (in informal tests). But given that locality of
  decompression of subsegments is both possible and cheap, there is
  nearly no disadvantage to this transformation for stored
  documents. Word searches become quicker basically in direct
  proportion to the length reduction.
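
  A minimal sketch of that locality (my own helper, layered on
  'word_Huffman_expand()' above): because multibyte units consist
  entirely of high-bit bytes, any low-bit byte in the compressed
  text is a safe synchronization point, so a subsegment can be
  expanded by backing up to the nearest low-bit byte:

      def expand_subsegment(comp_text, offset, length,
                            exp_table, entrylen):
          "Expand roughly 'length' compressed bytes around 'offset'"
          start = offset
          while start > 0 and ord(comp_text[start]) >= 128:
              start -= 1          # back up to a low-bit (literal) byte
          end = min(offset+length, len(comp_text))
          while end < len(comp_text) and ord(comp_text[end]) >= 128:
              end += 1            # likewise, end on a unit boundary
          return word_Huffman_expand(comp_text[start:end],
                                     exp_table, entrylen)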

  One likely improvement would be to add run-length compression of
  whitespace (or generally of nonalpha characters); doing so would
  lose none of the direct searchability that this algorithm is
  designed around, and in typical electronic natural-language texts
  would result in significant additional compression. Moreover, a
  pleasant side effect of the [word_huffman] transformation is that
  transformed documents become -more- compressible under
  Lempel-Ziv-based techniques (i.e., cumulatively). In other words,
  there is benefit in precompressing documents with [word_huffman]
  if you intend to later compress them with 'gzip', 'zip', or
  similar tools.
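
  One possible shape for the run-length layer (my own sketch, not
  part of word_huffman.py) is a pre-pass that collapses runs of
  spaces into a claimed low-bit control code, chr(1), plus a count
  byte, with the inverse applied after expansion:

      import re

      def rle_encode_spaces(text):
          "Collapse runs of 2-255 spaces into chr(1) plus a count byte"
          # A literal chr(1) in the source is written as chr(1)+chr(0)
          text = text.replace(chr(1), chr(1)+chr(0))
          return re.sub(' {2,255}',
                        lambda m: chr(1)+chr(len(m.group(0))), text)

      def rle_decode_spaces(text):
          "Inverse of rle_encode_spaces()"
          out, i = [], 0
          while i < len(text):
              if text[i] == chr(1):
                  n = ord(text[i+1])
                  if n == 0: out.append(chr(1))
                  else:      out.append(' '*n)
                  i += 2
              else:
                  out.append(text[i])
                  i += 1
          return ''.join(out)

  The pre-pass would be applied before 'word_Huffman_compress()' and
  the inverse after 'word_Huffman_expand()'.  Because single spaces
  pass through untouched, ordinary word searches against the
  compressed text are unaffected.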

  More aggressive improvements might be obtained by allowing
  variable byte-length symbol table entries and/or by claiming some
  additional low-bit control codes for the symbol table (and
  escaping literals in the original text). You can experiment with
  such variations, and your results might vary somewhat depending
  upon the details of application-specific canonical texts.

  Search capabilities might also be generalized--but this would
  require considerably greater effort. In the referenced research
  article below, the authors show how to generalize to direct
  regular-expression searching against word-based Huffman encoded
  texts. The [word_huffman] implementation allows certain
  straightforward transformations of regular expressions (where
  literal words occur within them) for searching against compressed
  documents, but a number of caveats and restrictions apply.
  Overcoming most such limitations would involve digging into
  Python's underlying regular expression engine, but it is possible
  in principle.
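
  As a very restricted illustration of such a transformation (my own
  sketch, with the caveats just mentioned very much in force),
  literal words supplied by the caller can simply be replaced by
  their encoded forms within a pattern:

      import re

      def compress_pattern_words(pat, literal_words, comp_table):
          "Substitute encoded forms of known literal words into a regex"
          # Caveat: plain textual substitution -- the caller must ensure
          # the listed words do not collide with regex syntax in 'pat'
          for word in literal_words:
              pat = pat.replace(word, re.escape(comp_table.get(word, word)))
          return pat

      # A search like re.findall(r'nothing (prevents|stops)', text)
      # becomes, against the compressed text:
      #   re.findall(compress_pattern_words(r'nothing (prevents|stops)',
      #                  ['nothing', 'prevents', 'stops'], comp_table),
      #              comp_text)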


SECTION -- References
-------------------------------------------------------------------

  A good place to turn for additional theoretical and practical
  information on compression is at the '<comp.compression>' FAQ:

    <http://www.cis.ohio-state.edu/hypertext/faq/usenet/compression-faq/>.

  A research article on word-based Huffman encoding inspired my
  simple example of word-based compression.  The article "Fast
  and Flexible Word Searching on Compressed Text," by Edleno
  Silva de Moura, Gonzalo Navarro, Nivio Ziviani, and Ricardo
  Baeza-Yates, can be found at:

    <http://citeseer.nj.nec.com/silvademoura00fast.html>.
