          else:
              left  = int(sys.argv[1])
              right = int(sys.argv[2])
              just  = sys.argv[3].upper()

              # Simplistic approach to finding initial paragraphs
              for p in sys.stdin.read().split('\n\n'):
                  print reformat_para(p,left,right,just),'\n'

  A number of enhancements are left to readers, if needed.  You
  might want to allow hanging indents or indented first lines, for
  example.  Or paragraphs meeting certain criteria might not be
  appropriate for wrapping (e.g., headers).  A custom application
  might also determine the input paragraphs differently, either
  by a different parsing of an input file, or by generating
  paragraphs internally in some manner.
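
  A hanging indent, say, can be grafted onto the earlier
  `reformat_para()` function by narrowing the wrap width and then
  padding every line after the first.  The sketch below is only
  illustrative--the helper name 'reformat_hanging' is invented
  here, and it assumes the 'right' margin minus the indent still
  leaves a sensible width:

      def reformat_hanging(para, left, right, just, indent=4):
          # Wrap to a narrower width, then pad continuation lines
          # to produce the hanging indent
          lines = reformat_para(para, left, right-indent, just).split('\n')
          return '\n'.join(lines[:1] +
                           [' '*indent + line for line in lines[1:]])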


  PROBLEM: Column statistics for delimited or flat-record files
  --------------------------------------------------------------------

  Data feeds, DBMS dumps, log files, and flat-file databases all
  tend to contain ontologically similar records--one per line--with
  a collection of fields in each record. Usually such fields are
  separated either by a specified delimiter or by specific column
  positions where fields are to occur.

  Parsing these structured text records is quite easy, and
  performing computations on fields is equally straightforward. But
  in working with a variety of such "structured text databases," it
  is easy to keep writing almost the same code over again for each
  variation in format and computation.

  The example below provides a generic framework for performing
  this kind of computation on a structured text database.

      #---------- fields_stats.py ----------#
      # Perform calculations on one or more of the
      # fields in a structured text database.

      import operator
      from types import *
      from xreadlines import xreadlines # req 2.1, but is much faster...
                                        # could use .readline() meth < 2.1
      #-- Symbolic Constants
      DELIMITED = 1
      FLATFILE = 2

      #-- Some sample "statistical" func (in functional programming style)
      nillFunc = lambda lst: None
      toFloat = lambda lst: map(float, lst)
      avg_lst = lambda lst: reduce(operator.add, toFloat(lst))/len(lst)
      sum_lst = lambda lst: reduce(operator.add, toFloat(lst))
      max_lst = lambda lst: reduce(max, toFloat(lst))

      class FieldStats:
          """Gather statistics about structured text database fields

          text_db may be either string (incl. Unicode) or file-like object
          style may be in (DELIMITED, FLATFILE)
          delimiter specifies the field separator in DELIMITED style text_db
          column_positions lists all field positions for FLATFILE style,
                           using one-based indexing (first column is 1).
                     E.g.: (1, 7, 40) would take fields one, two, three
                           from columns 1, 7, 40 respectively.
          field_funcs is a dictionary with column positions as keys,
                      and functions on lists as values.
               E.g.:  {1:avg_lst, 4:sum_lst, 5:max_lst} would specify the
                      average of column one, the sum of column 4, and the
                      max of column 5.  All other cols--incl 2,3, >=6--
                      are ignored.

          """
          def __init__(self,
                       text_db='',
                       style=DELIMITED,
                       delimiter=',',
                       column_positions=(1,),
                       field_funcs={} ):
              self.text_db = text_db
              self.style = style
              self.delimiter = delimiter
              self.column_positions = column_positions
              self.field_funcs = field_funcs

          def calc(self):
              """Calculate the column statistics
              """
              #-- 1st, create a list of lists for data (incl. unused flds)
              used_cols = self.field_funcs.keys()
              used_cols.sort()
              # one-based column naming: column[0] is always unused
              columns = []
              for n in range(1+used_cols[-1]):
                  # hint: '[[]]*num' creates refs to same list
                  columns.append([])

              #-- 2nd, fill lists used for calculated fields
              # might use a string directly for text_db
              if type(self.text_db) in (StringType,UnicodeType):
                  for line in self.text_db.split('\n'):
                      fields = self.splitter(line)
                      for col in used_cols:
                          field = fields[col-1]   # zero-based index
                          columns[col].append(field)
              else:   # Something file-like for text_db
                  for line in xreadlines(self.text_db):
                      fields = self.splitter(line)
                      for col in used_cols:
                          field = fields[col-1]   # zero-based index
                          columns[col].append(field)

              #-- 3rd, apply the field funcs to column lists
              results = [None] * (1+used_cols[-1])
              for col in used_cols:
                  results[col] = \
                       apply(self.field_funcs[col],(columns[col],))

              #-- Finally, return the result list
              return results

          def splitter(self, line):
              """Split a line into fields according to curr inst specs"""
              if self.style == DELIMITED:
                  return line.split(self.delimiter)
              elif self.style == FLATFILE:
                  fields = []
                  # Adjust offsets to Python zero-based indexing,
                  # and also add final position after the line
                  num_positions = len(self.column_positions)
                  offsets = [(pos-1) for pos in self.column_positions]
                  offsets.append(len(line))
                  for pos in range(num_positions):
                      start = offsets[pos]
                      end = offsets[pos+1]
                      fields.append(line[start:end])
                  return fields
              else:
                  raise ValueError, \
                        "Text database must be DELIMITED or FLATFILE"

      #-- Test data
      # First Name, Last Name, Salary, Years Seniority, Department
      delim = '''
      Kevin,Smith,50000,5,Media Relations
      Tom,Woo,30000,7,Accounting
      Sally,Jones,62000,10,Management
      '''.strip()     # no leading/trailing newlines

      # Comment     First     Last      Salary    Years  Dept
      flat = '''
      tech note     Kevin     Smith     50000     5      Media Relations
      more filler   Tom       Woo       30000     7      Accounting
      yet more...   Sally     Jones     62000     10     Management
      '''.strip()     # no leading/trailing newlines

      #-- Run self-test code
      if __name__ == '__main__':
          getdelim = FieldStats(delim, field_funcs={3:avg_lst,4:max_lst})
          print 'Delimited Calculations:'
          results = getdelim.calc()
          print '  Average salary -', results[3]
          print '  Max years worked -', results[4]

          getflat = FieldStats(flat, field_funcs={3:avg_lst,4:max_lst},
                                     style=FLATFILE,
                                     column_positions=(15,25,35,45,52))
          print 'Flat Calculations:'
          results = getflat.calc()
          print '  Average salary -', results[3]
          print '  Max years worked -', results[4]

  The example above includes some efficiency considerations that
  make it a good model for working with large data sets.  In the
  first place, class 'FieldStats' can (optionally) deal with a
  file-like object, rather than keeping the whole structured text
  database in memory. The generator `xreadlines.xreadlines()` is
  an extremely fast and efficient file reader, but it requires
  Python 2.1+--otherwise use `FILE.readline()` or
  `FILE.readlines()` (for either memory or speed efficiency,
  respectively). Moreover, only the data that is actually of
  interest is collected into lists, in order to save memory.
  However, rather than require multiple passes to collect
  statistics on multiple fields, as many field columns and
  summary functions as wanted can be used in one pass.
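
  On a pre-2.1 Python, for instance, the file-handling branch of
  `.calc()` might instead be written around an explicit
  `.readline()` loop.  A minimal sketch, with the function name
  'fill_columns' invented here for illustration:

      # Sketch: pre-2.1 replacement for the xreadlines loop in
      # FieldStats.calc() -- one line at a time stays memory
      # friendly, at some cost in speed
      def fill_columns(fileobj, splitter, used_cols, columns):
          line = fileobj.readline()
          while line:
              fields = splitter(line)
              for col in used_cols:
                  columns[col].append(fields[col-1])  # zero-based index
              line = fileobj.readline()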

  One possible improvement would be to allow multiple summary
  functions against the same field during a pass.  But that is
  left as an exercise for the reader.
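
  One way to do it would be to let each `field_funcs` value be a
  tuple of functions, e.g. '{3:(avg_lst,max_lst)}', and apply each
  in turn.  A sketch of how the third step of `.calc()` might
  change (this variation is mine, not the module's):

      #-- 3rd (variant), apply several funcs to each column list
      for col in used_cols:
          funcs = self.field_funcs[col]
          results[col] = tuple([apply(f, (columns[col],))
                                for f in funcs])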


  PROBLEM: Counting characters, words, lines, and paragraphs
  --------------------------------------------------------------------

  There is a wonderful utility under Unix-like systems called
  'wc'.  What it does is so basic, and so obvious, that it is
  hard to imagine working without it.  'wc' simply counts the
  characters, words, and lines of files (or STDIN).  A few
  command-line options control which results are displayed, but I
  rarely use them.

  In writing this chapter, I found myself on a system without
  'wc', and felt a remedy was in order.  The example below is
  actually an "enhanced" 'wc' since it also counts paragraphs
  (but it lacks the command-line switches).  Unlike the external
  'wc', it is easy to use the technique directly within Python
  and is available anywhere Python is.  The main trick--inasmuch
  as there is one--is a compact use of the `"".join()` and
  `"".split()` methods (`string.join()` and `string.split()` could
  also be used, for example, to be compatible with Python 1.5.2 or
  below).

      #---------- wc.py ----------#
      # Report the chars, words, lines, paragraphs
      # on STDIN or in wildcard filename patterns
      import sys, glob
      if len(sys.argv) > 1:
          c, w, l, p = 0, 0, 0, 0
          for pat in sys.argv[1:]:
              for file in glob.glob(pat):
                  s = open(file).read()
                  wc = len(s), len(s.split()), \
                       len(s.split('\n')), len(s.split('\n\n'))
                  print '\t'.join(map(str, wc)),'\t'+file
                  c, w, l, p = c+wc[0], w+wc[1], l+wc[2], p+wc[3]
          wc = (c,w,l,p)
          print '\t'.join(map(str, wc)), '\tTOTAL'
      else:
          s = sys.stdin.read()
          wc = len(s), len(s.split()), len(s.split('\n')), \
               len(s.split('\n\n'))
          print '\t'.join(map(str, wc)), '\tSTDIN'

  This little functionality could be wrapped up in a function,
  but it is almost too compact to bother with doing so.  Most of
  the work is in the interaction with the shell environment, with
  the counting basically taking only two lines.
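
  Should you want a reusable function anyway, a minimal wrapper
  might read as below (the name 'wc_string' is my own):

      def wc_string(s):
          "Return (chars, words, lines, paras) counts for a string"
          return len(s), len(s.split()), \
                 len(s.split('\n')), len(s.split('\n\n'))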

  The solution above is quite likely the "one obvious way to do
  it," and therefore Pythonic.  On the other hand, a slightly more
  adventurous reader might consider this assignment (if only for
  fun):

      >>> wc = map(len,[s]+map(s.split,(None,'\n','\n\n')))

  A real daredevil might be able to reduce the entire program to
  a single 'print' statement.
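
  For what it is worth, one such stunt--reading STDIN only--might
  look like the following (shown merely to prove the point):

      print '\t'.join(map(str, map(len,
          (lambda s: [s]+map(s.split, (None,'\n','\n\n')))
          (__import__('sys').stdin.read()))))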


  PROBLEM: Transmitting binary data as ASCII
  --------------------------------------------------------------------

  Many channels require that the information that travels over them
  is 7-bit ASCII. Any bytes with a high-order first bit of one will
  be handled unpredictably when transmitting data over protocols
  like Simple Mail Transfer Protocol (SMTP), Network News
  Transfer Protocol (NNTP), or HTTP (depending on content
  encoding), or even just when displaying them in many standard
  tools like editors. In order to encode 8-bit binary data as
  ASCII, a number of techniques have been invented over time.

  An obvious, but obese, encoding technique is to translate each
  binary byte into its hexadecimal digits. UUencoding is an older
  standard that developed around the need to transmit binary files
  over the Usenet and on BBSs. Binhex is a similar technique from
  the MacOS world. In recent years, base64--which is specified by
  RFC1521--has edged out the other styles of encoding. These three
  techniques are basically 4/3 encodings--that is, four ASCII bytes
  are used to represent three binary bytes--but they differ
  somewhat in line ending and header conventions (as well as in the
  encoding as such). Quoted printable is yet another format, but of
  variable encoding length. In quoted printable encoding, most
  plain ASCII bytes are left unchanged, but a few special
  characters and all high-bit bytes are escaped.
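
  An interactive session makes the size tradeoffs concrete; the
  low-level [binascii] module exposes the raw conversions:

      >>> import binascii
      >>> binascii.hexlify('ABC')      # hexadecimal: 2x expansion
      '414243'
      >>> binascii.b2a_base64('ABC')   # base64: 4/3 (plus a newline)
      'QUJD\n'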

  Python provides modules for all the encoding styles mentioned.
  The high-level wrappers [uu], [binhex], [base64], and [quopri]
  all operate on input and output file-like objects, encoding the
  data therein. They also each have slightly different method names
  and arguments. [binhex], for example, closes its output file
  after encoding, which makes it unusable in conjunction with a
  [cStringIO] file-like object. All of the high-level encoders
  utilize the services of the low-level C module [binascii].
  [binascii], in turn, implements the actual low-level block
  conversions, but assumes that it will be passed the right size
  blocks for a given encoding.
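
  For example, the [base64] wrapper reads one file-like object and
  writes another; with [StringIO] standing in for real files, its
  use looks like this (a quick illustration, not from the module
  sources):

      >>> import base64, StringIO
      >>> inp = StringIO.StringIO('\x00\x01binary')
      >>> out = StringIO.StringIO()
      >>> base64.encode(inp, out)
      >>> out.getvalue()
      'AAFiaW5hcnk=\n'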

  The standard library, therefore, does not contain quite the
  right intermediate-level functionality for when the goal is
  just encoding the binary data in arbitrary strings.  It is easy
  to wrap that up though:

      #---------- encode_binary.py ----------#
      # Provide encoders for arbitrary binary data
      # in Python strings.  Handles block size issues
      # transparently, and returns a string.
      # Precompression of the input string can reduce
      # or eliminate any size penalty for encoding.

      import sys
      import zlib
      import binascii

      UU = 45
      BASE64 = 57
      BINHEX = sys.maxint

      def ASCIIencode(s='', type=BASE64, compress=1):
          """ASCII encode a binary string"""
          # First, decide the encoding style
          if type == BASE64:   encode = binascii.b2a_base64
