          elif type == UU:     encode = binascii.b2a_uu
          elif type == BINHEX: encode = binascii.b2a_hqx
          else: raise ValueError, "Encoding must be in UU, BASE64, BINHEX"
          # Second, compress the source if specified
          if compress: s = zlib.compress(s)
          # Third, encode the string, block-by-block
          offset = 0
          blocks = []
          while 1:
              blocks.append(encode(s[offset:offset+type]))
              offset += type
              if offset >= len(s):   # '>=' avoids an empty final block
                  break
          # Fourth, return the concatenated blocks
          return ''.join(blocks)

      def ASCIIdecode(s='', type=BASE64, compress=1):
          """Decode ASCII to a binary string"""
          # First, decide the encoding style
          if type == BASE64:   s = binascii.a2b_base64(s)
          elif type == BINHEX: s = binascii.a2b_hqx(s)
          elif type == UU:
              s = ''.join([binascii.a2b_uu(line) for line in s.split('\n')])
          # Second, decompress the source if specified
          if compress: s = zlib.decompress(s)
          # Third, return the decoded binary string
          return s

      # Encode/decode STDIN for self-test
      if __name__ == '__main__':
          decode, TYPE = 0, BASE64
          for arg in sys.argv:
              if   arg.lower()=='-d': decode = 1
              elif arg.upper()=='UU': TYPE=UU
              elif arg.upper()=='BINHEX': TYPE=BINHEX
              elif arg.upper()=='BASE64': TYPE=BASE64
          if decode:
              print ASCIIdecode(sys.stdin.read(),type=TYPE)
          else:
              print ASCIIencode(sys.stdin.read(),type=TYPE)

  The example above does not attach any headers or delimit the
  encoded block (by design); for that, a wrapper like [uu],
  [mimify], or [MimeWriter] is a better choice, as is a custom
  wrapper around 'encode_binary.py'.
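
  As a hedged illustration of the custom-wrapper option, a
  minimal sketch follows.  It is written in Python 3 syntax
  rather than the Python 2 idiom of the listing above and covers
  only the base64 case.  The marker lines and the names 'wrap()'
  and 'unwrap()' are hypothetical; they are not part of
  'encode_binary.py'.

```python
import binascii
import zlib

# Hypothetical delimiter lines chosen for this sketch
BEGIN = "-----BEGIN ENCODED BLOCK-----"
END = "-----END ENCODED BLOCK-----"

def wrap(data, blocksize=57):
    """Compress bytes, base64-encode them in blocks, and add delimiters."""
    payload = zlib.compress(data)
    lines = [binascii.b2a_base64(payload[i:i + blocksize]).decode("ascii")
             for i in range(0, len(payload), blocksize)]
    # Each encoded block already ends with a newline
    return BEGIN + "\n" + "".join(lines) + END + "\n"

def unwrap(text):
    """Extract the delimited block and recover the original bytes."""
    body = text.split(BEGIN + "\n", 1)[1].split(END, 1)[0]
    return zlib.decompress(binascii.a2b_base64(body))
```

  Because the block is delimited, 'unwrap()' can pull it out of
  surrounding text, which the undelimited output of the listing
  above does not allow.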


  PROBLEM: Creating word or letter histograms
  --------------------------------------------------------------------

  A histogram is an analysis of the relative occurrence frequency
  of each of a number of possible values.  In text processing,
  the occurrences in question are almost always either words or
  byte values.  Creating histograms is quite simple using Python
  dictionaries, but the technique is not always immediately
  obvious to people thinking about the problem.  The example
  below is fairly general, provides several utility functions
  associated with histograms, and can also be run from the
  command line.

      #---------- histogram.py ----------#
      # Create occurrence counts of words or characters
      # A few utility functions for presenting results
      # Avoids requirement of recent Python features

      from string import split, maketrans, translate, punctuation, digits
      import sys
      from types import StringType, UnicodeType

      def word_histogram(source):
          """Create histogram of normalized words (no punct or digits)"""
          hist = {}
          trans = maketrans('','')
          if type(source) in (StringType,UnicodeType):  # String-like src
              for word in split(source):
                  word = translate(word, trans, punctuation+digits)
                  if len(word) > 0:
                      hist[word] = hist.get(word,0) + 1
          elif hasattr(source,'read'):                  # File-like src
              try:
                  from xreadlines import xreadlines     # Check for module
                  for line in xreadlines(source):
                      for word in split(line):
                          word = translate(word, trans, punctuation+digits)
                          if len(word) > 0:
                              hist[word] = hist.get(word,0) + 1
              except ImportError:                       # Older Python ver
                  line = source.readline()          # Slow but mem-friendly
                  while line:
                      for word in split(line):
                          word = translate(word, trans, punctuation+digits)
                          if len(word) > 0:
                              hist[word] = hist.get(word,0) + 1
                      line = source.readline()
          else:
              raise TypeError, \
                    "source must be a string-like or file-like object"
          return hist

      def char_histogram(source, sizehint=1024*1024):
          """Create histogram of characters (files are read in chunks)"""
          hist = {}
          if type(source) in (StringType,UnicodeType):  # String-like src
              for char in source:
                  hist[char] = hist.get(char,0) + 1
          elif hasattr(source,'read'):                  # File-like src
              chunk = source.read(sizehint)
              while chunk:
                  for char in chunk:
                      hist[char] = hist.get(char,0) + 1
                  chunk = source.read(sizehint)
          else:
              raise TypeError, \
                    "source must be a string-like or file-like object"
          return hist

      def most_common(hist, num=1):
          """Return the 'num' most frequent items as (count, item) pairs"""
          pairs = []
          for pair in hist.items():
              pairs.append((pair[1],pair[0]))
          pairs.sort()
          pairs.reverse()
          return pairs[:num]

      def first_things(hist, num=1):
          """Return first 'num' (item, count) pairs in sorted key order"""
          pairs = []
          things = hist.keys()
          things.sort()
          for thing in things:     # keys are sorted, so pairs are too
              pairs.append((thing,hist[thing]))
          return pairs[:num]

      if __name__ == '__main__':
          if len(sys.argv) > 1:
              hist = word_histogram(open(sys.argv[1]))
          else:
              hist = word_histogram(sys.stdin)

          print "Ten most common words:"
          for pair in most_common(hist, 10):
              print '\t', pair[1], pair[0]

          print "First ten words alphabetically:"
          for pair in first_things(hist, 10):
              print '\t', pair[0], pair[1]

          # a more practical command-line version might use:
          # for pair in most_common(hist,len(hist)):
          #     print pair[1],'\t',pair[0]

  Several of the design choices are somewhat arbitrary.  Words
  have all their punctuation and digits stripped to identify
  "real" words.  On the other hand, words are still
  case-sensitive, which may not be what is desired.  The sorting
  functions 'first_things()' and 'most_common()' return only an
  initial sublist; it might be better to return the whole list
  and let the user slice the result.  It is simple to customize
  around these sorts of issues, though.
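
  A minimal sketch of two such customizations follows, written
  in Python 3 syntax rather than the Python 2 idiom of
  'histogram.py'.  The names 'word_histogram_ci()' and
  'by_frequency()' are hypothetical: the first folds case before
  counting, and the second returns the whole sorted list so that
  the caller can slice it.

```python
import string

def word_histogram_ci(text):
    """Count words case-insensitively, ignoring punctuation and digits."""
    strip = str.maketrans("", "", string.punctuation + string.digits)
    hist = {}
    for word in text.split():
        word = word.translate(strip).lower()
        if word:                      # skip tokens that were all punct/digits
            hist[word] = hist.get(word, 0) + 1
    return hist

def by_frequency(hist):
    """Return every (word, count) pair, most frequent first.

    Ties are broken alphabetically; the caller slices as needed.
    """
    return sorted(hist.items(), key=lambda pair: (-pair[1], pair[0]))

hist = word_histogram_ci("The cat saw the Cat; the END 42")
print(by_frequency(hist)[:2])   # [('the', 3), ('cat', 2)]
```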


  PROBLEM: Reading a file backwards by record, line, or paragraph
  --------------------------------------------------------------------

  Reading a file line by line is a common task in Python, as in
  almost any language.  Files like server logs, configuration files,
  structured text databases, and others frequently arrange
  information into logical records, one per line.  Very often, the
  job of a program is to perform some calculation on each record
  in turn.

  Python provides a number of convenient methods on file-like
  objects for such line-by-line reading.  `FILE.readlines()`
  reads a whole file at once and returns a list of lines.  The
  technique is very fast, but requires the whole contents of the
  file be kept in memory.  For very large files, this can be a
  problem.  `FILE.readline()` is memory-friendly--it just reads a
  line at a time and can be called repeatedly until the EOF is
  reached--but it is also much slower.  The best solution for
  recent Python versions is `xreadlines.xreadlines()` or
  `FILE.xreadlines()` in Python 2.1+.  These techniques are
  memory-friendly, while still being fast and presenting a
  "virtual list" of lines (by way of Python's new
  generator/iterator interface).
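
  In later Python versions, including Python 3, iterating over a
  file object directly yields its lines lazily, so the same
  memory-friendly loop needs no helper module.  The sketch below
  uses Python 3 syntax; 'count_records()' is a hypothetical name,
  and an [io.StringIO] object stands in for a real file.

```python
import io

def count_records(fp):
    """Count non-blank lines without loading the whole file at once."""
    count = 0
    for line in fp:          # lazy: lines are read as the loop advances
        if line.strip():
            count += 1
    return count

sample = io.StringIO("alpha\nbeta\n\ngamma\n")   # stand-in for an open file
print(count_records(sample))   # prints 3
```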

  The above techniques work nicely for reading a file in its
  natural order, but what if you want to start at the end of a
  file and work backwards from there?  This need is frequently
  encountered when you want to read log files that have records
  appended over time (and when you want to look at the most
  recent records first).  It comes up in other situations also.
  There is a very easy technique if memory usage is not an issue:

      >>> open('lines','w').write('\n'.join([`n` for n in range(100)]))
      >>> fp = open('lines')
      >>> lines = fp.readlines()
      >>> lines.reverse()
      >>> for line in lines[1:5]:
      ...     # Processing suite here
      ...     print line,
      ...
      98
      97
      96
      95

  For large input files, however, this technique is not feasible.
  It would be nice to have something analogous to [xreadlines]
  here.  The example below provides a good starting point (it
  works equally well for any file-like object that supports
  `FILE.seek()` and `FILE.tell()`).

      #---------- read_backwards.py ----------#
      # Read blocks of a file from end to beginning.
      # Blocks may be defined by any delimiter, but the
      #  constants LINE and PARA are useful ones.
      # Works much like the file object method '.readline()':
      #  repeated calls continue to get "next" part, and
      #  function returns empty string once BOF is reached.

      # Define constants
      from os import linesep
      LINE = linesep
      PARA = linesep*2
      READSIZE = 1000

      # Global variables
      buffer = ''

      def read_backwards(fp, mode=LINE, sizehint=READSIZE, _init=[0]):
          """Read blocks of file backwards (return empty string when done)"""
          # Trick of mutable default argument to hold state between calls
          if not _init[0]:
              fp.seek(0,2)
              _init[0] = 1
          # Find a block (using global buffer)
          global buffer
          while 1:
              # first check for block in buffer
              delim = buffer.rfind(mode)
              if delim != -1:     # block is in buffer, return it
                  block = buffer[delim+len(mode):]
                  buffer = buffer[:delim]
                  return block+mode
              #-- BOF reached, return remainder (or empty string)
              elif fp.tell()==0:
                  block = buffer
                  buffer = ''
                  return block
              else:           # Read some more data into the buffer
                  readsize = min(fp.tell(),sizehint)
                  fp.seek(-readsize,1)
                  buffer = fp.read(readsize) + buffer
                  fp.seek(-readsize,1)

      #-- Self test of read_backwards()
      if __name__ == '__main__':
          # Let's create a test file to read in backwards
          fp = open('lines','wb')
          fp.write(LINE.join(['--- %d ---'%n for n in range(15)]))
          fp.close()     # make sure the data is flushed to disk
          # Now open for reading backwards
          fp = open('lines','rb')
          # Read the blocks in, one per call (block==line by default)
          block = read_backwards(fp)
          while block:
              print block,
              block = read_backwards(fp)

  Notice that -anything- could serve as a block delimiter.  The
  constants provided happen to work for lines and block
  paragraphs (and for block paragraphs only with the current OS's
  style of line breaks), but other delimiters could be used.  It
  would -not- be immediately possible to read backwards
  word-by-word--a space delimiter would come close, but would not
  be quite right for other whitespace.  However, reading a line
  (and maybe reversing its words) is generally good enough.

  Another enhancement is possible with Python 2.2+.  Using the
  new 'yield' keyword, 'read_backwards()' could be programmed as
  an iterator rather than as a multi-call function.  The
  performance will not differ significantly, but the function
  might be expressed more clearly (and a "list-like" interface
  like `FILE.readlines()` makes the application's loop simpler).
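
  One possible shape for such an iterator is sketched below, in
  Python 3 syntax and operating on a file opened in binary mode.
  It keeps its state in local variables rather than in a global
  buffer or a mutable default argument; the parameter name
  'delim' is an assumption of this sketch.  Readers who want to
  attempt question 1 unaided may prefer to skip it.

```python
import io

def read_backwards(fp, delim=b"\n", sizehint=1000):
    """Yield blocks of a binary file from last to first."""
    fp.seek(0, io.SEEK_END)
    pos = fp.tell()              # start of the part already buffered
    buffer = b""
    while True:
        cut = buffer.rfind(delim)
        if cut != -1:            # a whole block is in the buffer
            yield buffer[cut + len(delim):] + delim
            buffer = buffer[:cut]
        elif pos == 0:           # BOF reached: emit any remainder and stop
            if buffer:
                yield buffer
            return
        else:                    # read the chunk just before the buffer
            readsize = min(pos, sizehint)
            pos -= readsize
            fp.seek(pos)
            buffer = fp.read(readsize) + buffer

fp = io.BytesIO(b"\n".join(b"--- %d ---" % n for n in range(3)))
for block in read_backwards(fp):
    print(block)
```

  The application loop becomes a plain 'for' statement, and each
  call to 'read_backwards()' gets its own independent buffer.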

  QUESTIONS:

  1.  Write a generator-based version of 'read_backwards()' that
      uses the 'yield' keyword.  Modify the self-test code to
      utilize the generator instead.

  2.  Explore and explain some pitfalls with the use of a mutable
      default value as a function argument.  Explain also how the
      style allows functions to encapsulate data and contrast
      with the encapsulation of class instances.


SECTION 2 -- Standard Modules
------------------------------------------------------------------------

  TOPIC -- Basic String Transformations
  --------------------------------------------------------------------

  The module [string] forms the core of Python's text manipulation
  libraries, and it is the first place to look before turning to
  other modules.  Note that most of the functions in the [string]
  module have also been available as methods of string objects
  since Python 1.6.  Moreover, methods of string objects are a
  little faster to use than the corresponding module functions.
  A few new methods of string objects have no equivalents in the
  [string] module, but are still documented here.
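
  As a brief illustration, the method forms look like the
  following.  The sketch uses Python 3 syntax, where the method
  form is the only one available for these operations because
  the corresponding [string] module functions were removed; the
  sample data is made up.

```python
record = "  spam, eggs ,toast  "

# Split on commas, then trim whitespace from each field
fields = [field.strip() for field in record.split(",")]
print(fields)                    # ['spam', 'eggs', 'toast']

# Join the fields back together and uppercase the result
print("|".join(fields).upper())  # SPAM|EGGS|TOAST

# In Python 2 the same steps could also be spelled with module
# functions, e.g. string.split(record, ","), at a small speed cost.
```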

  SEE ALSO, [str], [UserString]