CHAPTER I -- PYTHON BASICS
-------------------------------------------------------------------
This chapter discusses Python capabilities that are likely to
be used in text processing applications. For an introduction
to Python syntax and semantics per se, readers might want to
skip ahead to Appendix A (A Selective and Impressionistic
Short Review of Python); Guido van Rossum's _Python Tutorial_
at <http://python.org/doc/current/tut/tut.html> is also quite
excellent. The focus here occupies a somewhat higher level:
not the Python language narrowly, but also not yet specific to
text processing.
In Section 1.1, I look at some programming techniques that flow
out of the Python language itself, but that are usually not
obvious to Python beginners--and are sometimes not obvious even
to intermediate Python programmers. The programming techniques
that are discussed are ones that tend to be applicable to text
processing contexts--other programming tasks are likely to have
their own tricks and idioms that are not explicitly documented in
this book.
In Section 1.2, I document modules in the Python standard library
that you will probably use in your text processing application,
or at the very least want to keep in the back of your mind. A
number of other Python standard library modules are far enough
afield of text processing that you are unlikely to use them in
this type of application. Such remaining modules are documented
very briefly with one- or two-line descriptions. More details on
each module can be found in Python's standard documentation.
SECTION 1 -- Techniques and Patterns
------------------------------------------------------------------------
TOPIC -- Utilizing Higher-Order Functions in Text Processing
--------------------------------------------------------------------
This first topic merits a warning. It jumps feet-first into
higher-order functions (HOFs) at a fairly sophisticated level
and may be unfamiliar even to experienced Python programmers. Do
not be too frightened by this first topic--you can understand the
rest of the book without it. If the functional programming (FP)
concepts in this topic seem unfamiliar to you, I recommend you
jump ahead to Appendix A, especially its final section on FP
concepts.
In text processing, one frequently acts upon a series of chunks
of text that are, in a sense, homogeneous. Most often, these
chunks are lines, delimited by newline characters--but
sometimes other sorts of fields and blocks are relevant.
Moreover, Python has standard functions and syntax for reading
in lines from a file (sensitive to platform differences).
Obviously, these chunks are not entirely homogeneous--they can
contain varying data. But at the level we worry about during
processing, each chunk contains a natural parcel of instruction
or information.
As an example, consider an imperative style code fragment that
selects only those lines of text that match a criterion
'isCond()':
#*---------- Imperative style line selection ------------#
selected = []                  # temp list to hold matches
fp = open(filename)
for line in fp.readlines():    # Py2.2 -> "for line in fp:"
    if isCond(line):           # (2.2 version reads lazily)
        selected.append(line)
del line                       # Cleanup transient variable
There is nothing -wrong- with these few lines (see [xreadlines]
on efficiency issues). But it does take a few seconds to read
through them. In my opinion, even this small block of lines
does not parse as a -single thought-, even though its operation
really is such. Also the variable 'line' is slightly
superfluous (and it retains a value as a side effect after the
loop and also could conceivably step on a previously defined
value). In FP style, we could write the simpler:
#*---------- Functional style line selection ------------#
selected = filter(isCond, open(filename).readlines())
# Py2.2 -> filter(isCond, open(filename))
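A list comprehension (available since Python 2.0) expresses the same single thought and is often preferred in later Python versions. A minimal sketch, in which 'isCond()' and the sample lines are illustrative stand-ins for the predicate and file contents:

```python
# Sketch only: isCond() and the sample lines are stand-ins for illustration
def isCond(line):
    return line.startswith('Error')

lines = ["Error: disk full\n", "OK: done\n", "Error: timeout\n"]
selected = [line for line in lines if isCond(line)]
```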
In the concrete, a textual source that one frequently wants to
process as a list of lines is a log file. All sorts of
applications produce log files, most typically either ones that
cause system changes that might need to be examined or
long-running applications that perform actions intermittently.
For example, the PythonLabs Windows installer for Python 2.2
produces a file called 'INSTALL.LOG' that contains a list of
actions taken during the install. Below is a highly abridged
copy of this file from one of my computers:
#------------ INSTALL.LOG sample data file --------------#
Title: Python 2.2
Source: C:\DOWNLOAD\PYTHON-2.2.EXE | 02-23-2002 | 01:40:54 | 7074248
Made Dir: D:\Python22
File Copy: D:\Python22\UNWISE.EXE | 05-24-2001 | 12:59:30 | | ...
RegDB Key: Software\Microsoft\Windows\CurrentVersion\Uninstall\Py...
RegDB Val: Python 2.2
File Copy: D:\Python22\w9xpopen.exe | 12-21-2001 | 12:22:34 | | ...
Made Dir: D:\PYTHON22\DLLs
File Overwrite: C:\WINDOWS\SYSTEM\MSVCRT.DLL | | | | 295000 | 770c8856
RegDB Root: 2
RegDB Key: Software\Microsoft\Windows\CurrentVersion\App Paths\Py...
RegDB Val: D:\PYTHON22\Python.exe
Shell Link: C:\WINDOWS\Start Menu\Programs\Python 2.2\Uninstall Py...
Link Info: D:\Python22\UNWISE.EXE | D:\PYTHON22 | | 0 | 1 | 0 |
Shell Link: C:\WINDOWS\Start Menu\Programs\Python 2.2\Python ...
Link Info: D:\Python22\python.exe | D:\PYTHON22 | D:\PYTHON22\...
You can see that each action recorded belongs to one of several
types. A processing application would presumably handle each
type of action differently (especially since each action has
different data fields associated with it). It is easy enough
to write Boolean functions that identify line types, for example:
#*------- Boolean "predicative" functions on lines -------#
def isFileCopy(line):
return line[:10]=='File Copy:' # or line.startswith(...)
def isFileOverwrite(line):
return line[:15]=='File Overwrite:'
The string method `"".startswith()` is less error prone than an
initial slice for recent Python versions, but these examples
are compatible with Python 1.5. In a slightly more compact
functional programming style, you can also write these as:
#*----------- Functional style predicates ---------------#
isRegDBRoot = lambda line: line[:11]=='RegDB Root:'
isRegDBKey = lambda line: line[:10]=='RegDB Key:'
isRegDBVal = lambda line: line[:10]=='RegDB Val:'
Selecting lines of a certain type is done exactly as above:
#*----------- Select lines that fill predicate ----------#
lines = open(r'd:\python22\install.log').readlines()
regroot_lines = filter(isRegDBRoot, lines)
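Run against a few of the INSTALL.LOG lines shown earlier (inlined here so the sketch is self-contained), the selection behaves like this:

```python
isRegDBRoot = lambda line: line[:11] == 'RegDB Root:'

# A few lines from the INSTALL.LOG sample, inlined for illustration
lines = [
    "Made Dir: D:\\Python22\n",
    "RegDB Root: 2\n",
    "RegDB Val: Python 2.2\n",
]
regroot_lines = list(filter(isRegDBRoot, lines))
```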
But if you want to select upon multiple criteria, an FP style
can initially become cumbersome. For example, suppose you are
interested in all the "RegDB" lines; you could write a new custom
function for this filter:
#*--------------- Find the RegDB lines ------------------#
def isAnyRegDB(line):
    if   line[:11]=='RegDB Root:': return 1
    elif line[:10]=='RegDB Key:':  return 1
    elif line[:10]=='RegDB Val:':  return 1
    else:                          return 0
# For recent Pythons, line.startswith(...) is better
Programming a custom function for each combined condition can
produce a glut of named functions. More importantly, each such
custom function requires a modicum of work to write and has a
nonzero chance of introducing a bug. For conditions which
should be jointly satisfied, you can either write custom
functions or nest several filters within each other. For
example:
#*------------- Filter on two line predicates -----------#
shortline = lambda line: len(line) < 25
short_regvals = filter(shortline, filter(isRegDBVal, lines))
In this example, we rely on previously defined functions for the
filter. Any error in the filters will be in either 'shortline()'
or 'isRegDBVal()', but not independently in some third function
'isShortRegVal()'. Such nested filters, however, are difficult to
read--especially if more than two are involved.
Calls to `map()` are sometimes similarly nested if several
operations are to be performed on the same string. For a fairly
trivial example, suppose you wished to reverse, capitalize, and
normalize whitespace in lines of text. Creating the support
functions is straightforward, and they could be nested in
`map()` calls:
#*------------ Multiple line transformations ------------#
from string import upper, join, split
def flip(s):
    a = list(s)
    a.reverse()
    return join(a,'')
normalize = lambda s: join(split(s),' ')
cap_flip_norms = map(upper, map(flip, map(normalize, lines)))
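In Python 3 the `string` module functions used above are gone (string methods replace them), and `map()` returns a lazy iterator, so the same nest needs methods and an explicit `list()`. A sketch on made-up data:

```python
def flip(s):
    # Reverse the characters of s; slicing replaces the list/reverse/join dance
    return s[::-1]

normalize = lambda s: ' '.join(s.split())

lines = ["hello   world\n"]  # illustrative stand-in data
cap_flip_norms = list(map(str.upper, map(flip, map(normalize, lines))))
```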
This type of `map()` or `filter()` nest is difficult to read, and
should be avoided. Moreover, one can sometimes be drawn into
nesting alternating `map()` and `filter()` calls, making matters
still worse. For example, suppose you want to perform several
operations on each of the lines that meet several criteria. To
avoid this trap, many programmers fall back to a more verbose
imperative coding style that simply wraps the lists in a few
loops and creates some temporary variables for intermediate
results.
Within a functional programming style, it is nonetheless possible
to avoid the pitfall of excessive call nesting. The key to doing
this is an intelligent selection of a few combinatorial
-higher-order functions-. In general, a higher-order function is
one that takes as argument or returns as result a function
object. First-order functions just take some data as arguments
and produce a datum as an answer (perhaps a data-structure like a
list or dictionary). In contrast, the "inputs" and "outputs" of a
HOF are themselves function objects--ones generally intended to be
eventually called somewhere later in the program flow.
One example of a higher-order function is a -function factory-:
a function (or class) that returns a function, or collection of
functions, that are somehow "configured" at the time of their
creation. The "Hello World" of function factories is an
"adder" factory. Like "Hello World," an adder factory exists
just to show what can be done; it doesn't really -do- anything
useful by itself. Pretty much every explanation of function
factories uses an example such as:
>>> def adder_factory(n):
...     return lambda m, n=n: m+n
...
>>> add10 = adder_factory(10)
>>> add10
<function <lambda> at 0x00FB0020>
>>> add10(4)
14
>>> add10(20)
30
>>> add5 = adder_factory(5)
>>> add5(4)
9
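The `n=n` default argument in the lambda is a workaround for older Pythons, in which a lambda could not see names in the enclosing function's scope. With the nested scopes of Python 2.2 and later, the inner function closes over 'n' directly; a sketch:

```python
def adder_factory(n):
    # With nested scopes, the lambda closes over n directly;
    # the n=n default-argument trick is unnecessary
    return lambda m: m + n

add10 = adder_factory(10)
add5 = adder_factory(5)
```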
For text processing tasks, simple function factories are of
less interest than are -combinatorial- HOFs. The idea of a
combinatorial higher-order function is to take several (usually
first-order) functions as arguments and return a new function
that somehow synthesizes the operations of the argument
functions. Below is a simple library of combinatorial
higher-order functions that achieve surprisingly much in a
small number of lines:
#------------------- combinatorial.py -------------------#
from operator import mul, add, truth
apply_each = lambda fns, args=[]: map(apply, fns, [args]*len(fns))
bools = lambda lst: map(truth, lst)
bool_each = lambda fns, args=[]: bools(apply_each(fns, args))
conjoin = lambda fns, args=[]: reduce(mul, bool_each(fns, args))
all = lambda fns: lambda arg, fns=fns: conjoin(fns, (arg,))
both = lambda f,g: all((f,g))
all3 = lambda f,g,h: all((f,g,h))
and_ = lambda f,g: lambda x, f=f, g=g: f(x) and g(x)
disjoin = lambda fns, args=[]: reduce(add, bool_each(fns, args))
some = lambda fns: lambda arg, fns=fns: disjoin(fns, (arg,))
either = lambda f,g: some((f,g))
anyof3 = lambda f,g,h: some((f,g,h))
compose = lambda f,g: lambda x, f=f, g=g: f(g(x))
compose3 = lambda f,g,h: lambda x, f=f, g=g, h=h: f(g(h(x)))
ident = lambda x: x
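To see what a couple of these combinators do, here is a sketch of `both()` and `compose()` adapted for Python 3, where `apply()` is gone and `reduce()` lives in `functools` (the `all` combinator is renamed `all_` here, since `all` later became a builtin); the behavior matches the library above:

```python
from functools import reduce
from operator import mul, truth

# Python 3 adaptations of a few combinatorial.py definitions
bools   = lambda lst: [truth(x) for x in lst]
conjoin = lambda fns, args=(): reduce(mul, bools([f(*args) for f in fns]))
all_    = lambda fns: lambda arg: conjoin(fns, (arg,))  # 'all' now shadows a builtin
both    = lambda f, g: all_((f, g))
compose = lambda f, g: lambda x: f(g(x))

# Illustrative predicates and transforms
is_short  = lambda s: len(s) < 10
has_colon = lambda s: ':' in s
short_with_colon = both(is_short, has_colon)
upper_stripped   = compose(str.upper, str.strip)
```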
Even with just over a dozen lines, many of these combinatorial
functions are merely convenience functions that wrap other more
general ones. Let us take a look at how we can use these HOFs to
simplify some of the earlier examples. The same names are used
for results, so look above for comparisons:
#----- Some examples using higher-order functions -----#
# Don't nest filters, just produce func that does both
short_regvals = filter(both(shortline, isRegDBVal), lines)
# Don't multiply ad hoc functions, just describe need
regroot_lines = \
filter(some([isRegDBRoot, isRegDBKey, isRegDBVal]), lines)
# Don't nest transformations, make one combined transform
capFlipNorm = compose3(upper, flip, normalize)
cap_flip_norms = map(capFlipNorm, lines)
In the example, we bind the composed function 'capFlipNorm' for
readability. The corresponding `map()` line expresses just the
-single thought- of applying a common operation to all the lines.
But the binding also illustrates some of the flexibility of
combinatorial functions. By condensing the several operations
previously nested in several `map()` calls, we can save the
combined operation for reuse elsewhere in the program.
As a rule of thumb, I recommend not using more than one
`filter()` and one `map()` in any given line of code. If these
"list application" functions need to nest more deeply than this,
readability is preserved by saving results to intermediate names.
Successive lines of such functional programming style calls
themselves revert to a more imperative style--but a wonderful
thing about Python is the degree to which it allows seamless
combinations of different programming styles.
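A sketch of this mixed style, with the predicate and transform from earlier examples inlined so the fragment is self-contained: each line holds at most one `filter()` or `map()`, and intermediate names carry the thought forward:

```python
isRegDBVal = lambda line: line[:10] == 'RegDB Val:'
normalize  = lambda s: ' '.join(s.split())

# Illustrative data standing in for the readlines() of INSTALL.LOG
lines = ["RegDB Val: Python 2.2\n", "Made Dir: D:\\Python22\n"]

regval_lines = list(filter(isRegDBVal, lines))     # one filter per line
normalized   = list(map(normalize, regval_lines))  # one map per line
```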