CHAPTER I -- PYTHON BASICS
-------------------------------------------------------------------
This chapter discusses Python capabilities that are likely to
be used in text processing applications. For an introduction
to Python syntax and semantics per se, readers might want to
skip ahead to Appendix A (A Selective and Impressionistic
Short Review of Python); Guido van Rossum's _Python Tutorial_
at <http://python.org/doc/current/tut/tut.html> is also quite
excellent. The focus here occupies a somewhat higher level:
not the Python language narrowly, but also not yet specific to
text processing.
In Section 1.1, I look at some programming techniques that flow
out of the Python language itself, but that are usually not
obvious to Python beginners--and are sometimes not obvious even
to intermediate Python programmers. The programming techniques
that are discussed are ones that tend to be applicable to text
processing contexts--other programming tasks are likely to have
their own tricks and idioms that are not explicitly documented in
this book.
In Section 1.2, I document modules in the Python standard library
that you will probably use in your text processing application,
or at the very least want to keep in the back of your mind. A
number of other Python standard library modules are far enough
afield of text processing that you are unlikely to use them in
this type of application. Such remaining modules are documented
very briefly with one- or two-line descriptions. More details on
each module can be found in Python's standard documentation.
SECTION 1 -- Techniques and Patterns
------------------------------------------------------------------------
TOPIC -- Utilizing Higher-Order Functions in Text Processing
--------------------------------------------------------------------
This first topic merits a warning. It jumps feet-first into
higher-order functions (HOFs) at a fairly sophisticated level
and may be unfamiliar even to experienced Python programmers. Do
not be too frightened by this first topic--you can understand the
rest of the book without it. If the functional programming (FP)
concepts in this topic seem unfamiliar to you, I recommend you
jump ahead to Appendix A, especially its final section on FP
concepts.
In text processing, one frequently acts upon a series of chunks
of text that are, in a sense, homogeneous. Most often, these
chunks are lines, delimited by newline characters--but
sometimes other sorts of fields and blocks are relevant.
Moreover, Python has standard functions and syntax for reading
in lines from a file (sensitive to platform differences).
Obviously, these chunks are not entirely homogeneous--they can
contain varying data. But at the level we worry about during
processing, each chunk contains a natural parcel of instruction
or information.
As an example, consider an imperative style code fragment that
selects only those lines of text that match a criterion
'isCond()':
#*---------- Imperative style line selection ------------#
selected = []                  # temp list to hold matches
fp = open(filename)
for line in fp.readlines():    # Py2.2 -> "for line in fp:"
    if isCond(line):           # (2.2 version reads lazily)
        selected.append(line)
del line                       # Cleanup transient variable
There is nothing -wrong- with these few lines (see [xreadlines]
on efficiency issues). But it does take a few seconds to read
through them. In my opinion, even this small block of lines
does not parse as a -single thought-, even though its operation
really is such. Also the variable 'line' is slightly
superfluous (and it retains a value as a side effect after the
loop and also could conceivably step on a previously defined
value). In FP style, we could write the simpler:
#*---------- Functional style line selection ------------#
selected = filter(isCond, open(filename).readlines())
# Py2.2 -> filter(isCond, open(filename))
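A list comprehension (available since Python 2.0) expresses the same single thought and is often preferred in later Python versions. A minimal sketch, in which 'isCond()' and the sample lines are illustrative stand-ins for the predicate and file contents:

```python
# Sketch only: isCond() and the sample lines are stand-ins for illustration
def isCond(line):
    return line.startswith('Error')

lines = ["Error: disk full\n", "OK: done\n", "Error: timeout\n"]
selected = [line for line in lines if isCond(line)]
```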
In the concrete, a textual source that one frequently wants to
process as a list of lines is a log file. All sorts of
applications produce log files, most typically either ones that
cause system changes that might need to be examined or
long-running applications that perform actions intermittently.
For example, the PythonLabs Windows installer for Python 2.2
produces a file called 'INSTALL.LOG' that contains a list of
actions taken during the install. Below is a highly abridged
copy of this file from one of my computers:
#------------ INSTALL.LOG sample data file --------------#
Title: Python 2.2
Source: C:\DOWNLOAD\PYTHON-2.2.EXE | 02-23-2002 | 01:40:54 | 7074248
Made Dir: D:\Python22
File Copy: D:\Python22\UNWISE.EXE | 05-24-2001 | 12:59:30 | | ...
RegDB Key: Software\Microsoft\Windows\CurrentVersion\Uninstall\Py...
RegDB Val: Python 2.2
File Copy: D:\Python22\w9xpopen.exe | 12-21-2001 | 12:22:34 | | ...
Made Dir: D:\PYTHON22\DLLs
File Overwrite: C:\WINDOWS\SYSTEM\MSVCRT.DLL | | | | 295000 | 770c8856
RegDB Root: 2
RegDB Key: Software\Microsoft\Windows\CurrentVersion\App Paths\Py...
RegDB Val: D:\PYTHON22\Python.exe
Shell Link: C:\WINDOWS\Start Menu\Programs\Python 2.2\Uninstall Py...
Link Info: D:\Python22\UNWISE.EXE | D:\PYTHON22 | | 0 | 1 | 0 |
Shell Link: C:\WINDOWS\Start Menu\Programs\Python 2.2\Python ...
Link Info: D:\Python22\python.exe | D:\PYTHON22 | D:\PYTHON22\...
You can see that each action recorded belongs to one of several
types. A processing application would presumably handle each
type of action differently (especially since each action has
different data fields associated with it). It is easy enough
to write Boolean functions that identify line types, for example:
#*------- Boolean "predicative" functions on lines -------#
def isFileCopy(line):
return line[:10]=='File Copy:' # or line.startswith(...)
def isFileOverwrite(line):
return line[:15]=='File Overwrite:'
The string method `"".startswith()` is less error prone than an
initial slice for recent Python versions, but these examples
are compatible with Python 1.5. In a slightly more compact
functional programming style, you can also write these as:
#*----------- Functional style predicates ---------------#
isRegDBRoot = lambda line: line[:11]=='RegDB Root:'
isRegDBKey = lambda line: line[:10]=='RegDB Key:'
isRegDBVal = lambda line: line[:10]=='RegDB Val:'
Selecting lines of a certain type is done exactly as above:
#*----------- Select lines that fill predicate ----------#
lines = open(r'd:\python22\install.log').readlines()
regroot_lines = filter(isRegDBRoot, lines)
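Run against a few of the INSTALL.LOG lines shown earlier (inlined here so the sketch is self-contained), the selection behaves like this:

```python
isRegDBRoot = lambda line: line[:11] == 'RegDB Root:'

# A few lines from the INSTALL.LOG sample, inlined for illustration
lines = [
    "Made Dir: D:\\Python22\n",
    "RegDB Root: 2\n",
    "RegDB Val: Python 2.2\n",
]
regroot_lines = list(filter(isRegDBRoot, lines))
```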
But if you want to select upon multiple criteria, an FP style
can initially become cumbersome. For example, suppose you are
interested in all the "RegDB" lines; you could write a new custom
function for this filter:
#*--------------- Find the RegDB lines ------------------#
def isAnyRegDB(line):
    if   line[:11]=='RegDB Root:': return 1
    elif line[:10]=='RegDB Key:':  return 1
    elif line[:10]=='RegDB Val:':  return 1
    else:                          return 0
# For recent Pythons, line.startswith(...) is better
Programming a custom function for each combined condition can
produce a glut of named functions. More importantly, each such
custom function requires a modicum of work to write and has a
nonzero chance of introducing a bug. For conditions which
should be jointly satisfied, you can either write custom
functions or nest several filters within each other. For
example:
#*------------- Filter on two line predicates -----------#
shortline = lambda line: len(line) < 25
short_regvals = filter(shortline, filter(isRegDBVal, lines))
In this example, we rely on previously defined functions for the
filter. Any error in the filters will be in either 'shortline()'
or 'isRegDBVal()', but not independently in some third function
'isShortRegVal()'. Such nested filters, however, are difficult to
read--especially if more than two are involved.
Calls to `map()` are sometimes similarly nested if several
operations are to be performed on the same string. For a fairly
trivial example, suppose you wished to reverse, capitalize, and
normalize whitespace in lines of text. Creating the support
functions is straightforward, and they could be nested in
`map()` calls:
#*------------ Multiple line transformations ------------#
from string import upper, join, split
def flip(s):
    a = list(s)
    a.reverse()
    return join(a,'')
normalize = lambda s: join(split(s),' ')
cap_flip_norms = map(upper, map(flip, map(normalize, lines)))
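In Python 3 the `string` module functions used above are gone (string methods replace them), and `map()` returns a lazy iterator, so the same nest needs methods and an explicit `list()`. A sketch on made-up data:

```python
def flip(s):
    # Reverse the characters of s; slicing replaces the list/reverse/join dance
    return s[::-1]

normalize = lambda s: ' '.join(s.split())

lines = ["hello   world\n"]  # illustrative stand-in data
cap_flip_norms = list(map(str.upper, map(flip, map(normalize, lines))))
```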
This type of `map()` or `filter()` nest is difficult to read, and
should be avoided. Moreover, one can sometimes be drawn into
nesting alternating `map()` and `filter()` calls, making matters
still worse. For example, suppose you want to perform several
operations on each of the lines that meet several criteria. To
avoid this trap, many programmers fall back to a more verbose
imperative coding style that simply wraps the lists in a few
loops and creates some temporary variables for intermediate
results.
Within a functional programming style, it is nonetheless possible
to avoid the pitfall of excessive call nesting. The key to doing
this is an intelligent selection of a few combinatorial
-higher-order functions-. In general, a higher-order function is
one that takes as argument or returns as result a function
object. First-order functions just take some data as arguments
and produce a datum as an answer (perhaps a data-structure like a
list or dictionary). In contrast, the "inputs" and "outputs" of a
HOF are themselves function objects--ones generally intended to be
eventually called somewhere later in the program flow.
One example of a higher-order function is a -function factory-:
a function (or class) that returns a function, or collection of
functions, that are somehow "configured" at the time of their
creation. The "Hello World" of function factories is an
"adder" factory. Like "Hello World," an adder factory exists
just to show what can be done; it doesn't really -do- anything
useful by itself. Pretty much every explanation of function
factories uses an example such as:
>>> def adder_factory(n):
...     return lambda m, n=n: m+n
...
>>> add10 = adder_factory(10)
>>> add10
<function <lambda> at 0x00FB0020>
>>> add10(4)
14
>>> add10(20)
30
>>> add5 = adder_factory(5)
>>> add5(4)
9
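The `n=n` default argument in the lambda is a workaround for older Pythons, in which a lambda could not see names in the enclosing function's scope. With the nested scopes of Python 2.2 and later, the inner function closes over 'n' directly; a sketch:

```python
def adder_factory(n):
    # With nested scopes, the lambda closes over n directly;
    # the n=n default-argument trick is unnecessary
    return lambda m: m + n

add10 = adder_factory(10)
add5 = adder_factory(5)
```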
For text processing tasks, simple function factories are of
less interest than are -combinatorial- HOFs. The idea of a
combinatorial higher-order function is to take several (usually
first-order) functions as arguments and return a new function
that somehow synthesizes the operations of the argument
functions. Below is a simple library of combinatorial
higher-order functions that achieve surprisingly much in a
small number of lines:
#------------------- combinatorial.py -------------------#
from operator import mul, add, truth
apply_each = lambda fns, args=[]: map(apply, fns, [args]*len(fns))
bools = lambda lst: map(truth, lst)
bool_each = lambda fns, args=[]: bools(apply_each(fns, args))
conjoin = lambda fns, args=[]: reduce(mul, bool_each(fns, args))
all = lambda fns: lambda arg, fns=fns: conjoin(fns, (arg,))
both = lambda f,g: all((f,g))
all3 = lambda f,g,h: all((f,g,h))
and_ = lambda f,g: lambda x, f=f, g=g: f(x) and g(x)
disjoin = lambda fns, args=[]: reduce(add, bool_each(fns, args))
some = lambda fns: lambda arg, fns=fns: disjoin(fns, (arg,))
either = lambda f,g: some((f,g))
anyof3 = lambda f,g,h: some((f,g,h))
compose = lambda f,g: lambda x, f=f, g=g: f(g(x))
compose3 = lambda f,g,h: lambda x, f=f, g=g, h=h: f(g(h(x)))
ident = lambda x: x
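To see what a couple of these combinators do, here is a sketch of `both()` and `compose()` adapted for Python 3, where `apply()` is gone and `reduce()` lives in `functools` (the `all` combinator is renamed `all_` here, since `all` later became a builtin); the behavior matches the library above:

```python
from functools import reduce
from operator import mul, truth

# Python 3 adaptations of a few combinatorial.py definitions
bools   = lambda lst: [truth(x) for x in lst]
conjoin = lambda fns, args=(): reduce(mul, bools([f(*args) for f in fns]))
all_    = lambda fns: lambda arg: conjoin(fns, (arg,))  # 'all' now shadows a builtin
both    = lambda f, g: all_((f, g))
compose = lambda f, g: lambda x: f(g(x))

# Illustrative predicates and transforms
is_short  = lambda s: len(s) < 10
has_colon = lambda s: ':' in s
short_with_colon = both(is_short, has_colon)
upper_stripped   = compose(str.upper, str.strip)
```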
Even with just over a dozen lines, many of these combinatorial
functions are merely convenience functions that wrap other more
general ones. Let us take a look at how we can use these HOFs to
simplify some of the earlier examples. The same names are used
for results, so look above for comparisons:
#----- Some examples using higher-order functions -----#
# Don't nest filters, just produce func that does both
short_regvals = filter(both(shortline, isRegDBVal), lines)
# Don't multiply ad hoc functions, just describe need
regroot_lines = \
filter(some([isRegDBRoot, isRegDBKey, isRegDBVal]), lines)
# Don't nest transformations, make one combined transform
capFlipNorm = compose3(upper, flip, normalize)
cap_flip_norms = map(capFlipNorm, lines)
In the example, we bind the composed function 'capFlipNorm' for
readability. The corresponding `map()` line expresses just the
-single thought- of applying a common operation to all the lines.
But the binding also illustrates some of the flexibility of
combinatorial functions. By condensing the several operations
previously nested in several `map()` calls, we can save the
combined operation for reuse elsewhere in the program.
As a rule of thumb, I recommend not using more than one
`filter()` and one `map()` in any given line of code. If these
"list application" functions need to nest more deeply than this,
readability is preserved by saving results to intermediate names.
Successive lines of such functional programming style calls
themselves revert to a more imperative style--but a wonderful
thing about Python is the degree to which it allows seamless
combinations of different programming styles.
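A sketch of this mixed style, with the predicate and transform from earlier examples inlined so the fragment is self-contained: each line holds at most one `filter()` or `map()`, and intermediate names carry the thought forward:

```python
isRegDBVal = lambda line: line[:10] == 'RegDB Val:'
normalize  = lambda s: ' '.join(s.split())

# Illustrative data standing in for the readlines() of INSTALL.LOG
lines = ["RegDB Val: Python 2.2\n", "Made Dir: D:\\Python22\n"]

regval_lines = list(filter(isRegDBVal, lines))     # one filter per line
normalized   = list(map(normalize, regval_lines))  # one map per line
```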