📄 chap2.txt
字号:
CHAPTER II -- BASIC STRING OPERATIONS
-------------------------------------------------------------------
The cheapest, fastest and most reliable components of a
computer system are those that aren't there.
--Gordon Bell, Encore Computer Corporation
If you are writing programs in Python to accomplish text
processing tasks, most of what you need to know is in this
chapter. Sure, you will probably need to know how to do some
basic things with pipes, files, and arguments to get your text
to process (covered in Chapter 1); but for actually
-processing- the text you have gotten, the [string] module and
string methods--and Python's basic data structures--do most
all of what you need done, almost all the time. To a lesser
extent, the various custom modules to perform encodings,
encryptions, and compressions are handy to have around (and you
certainly do not want the work of implementing them yourself).
But at the heart of text processing are basic transformations of
bits of text. That's what [string] functions and string
methods do.
There are a lot of interesting techniques elsewhere in this
book. I wouldn't have written about them if I did not find
them important. But be cautious before doing interesting
things. Specifically, given a fixed task in mind, before
cracking this book open to any of the other chapters, consider
very carefully whether your problem can be solved using the
techniques in this chapter. If you can answer this question
affirmatively, you should usually eschew the complications of
using the higher-level modules and techniques that other
chapters discuss. By all means read all of this book for
the insight and edification that I hope it provides; but still
focus on the "Zen of Python," and prefer simple to complex when
simple is enough.
This chapter does several things. Section 2.1 looks at a number
of common problems in text processing that can (and should) be
solved using (predominantly) the techniques documented in this
chapter. Each of these "Problems" presents working solutions that
can often be adopted with little change to real-life jobs. But a
larger goal is to provide readers with a starting point for
adaptation of the examples. It is not my goal to provide mere
collections of packaged utilities and modules--plenty of those
exist on the Web, and resources like the Vaults of Parnassus
<http://www.vex.net/parnassus/> and the Python Cookbook
<http://aspn.activestate.com/ASPN/Python/Cookbook/> are worth
investigating as part of any project/task (and new and better
utilities will be written between the time I write this and when
you read it). It is better for readers to receive a solid
foundation and starting point from which to develop the
functionality they need for their own projects and tasks. And
even better than spurring adaptation, these examples aim to
encourage contemplation. In presenting examples, this book tries
to embody a way of thinking about problems and an attitude
towards solving them. More than any individual technique, such
ideas are what I would most like to share with readers.
Section 2.2 is a "reference with commentary" on the Python
standard library modules for doing basic text manipulations. The
discussions interspersed with each module try to give some
guidance on why you would want to use a given module or function,
and the reference documentation tries to contain more examples of
actual typical usage than does a plain reference. In many cases,
the examples and discussion of individual functions addresses
common and productive design patterns in Python. The
cross-references are intended to contextualize a given function
(or other thing) in terms of related ones (and to help you decide
which is right for you). The actual listing of functions,
constants, classes, and the like is in alphabetical order within
type of thing.
Section 2.3 in many ways continues Section 2.1, but also provides
some aids for using this book in a learning context. The
problems and solutions presented in Section 2.3 are somewhat more
open-ended than those in Section 2.1. As well, each section
labeled as "Discussion" is followed by one labeled
"Questions." These questions are ones that could be assigned
by a teacher to students; but they are also intended to be
issues that general readers will enjoy and benefit from
contemplating. In many cases, the questions point to
limitations of the approaches initially presented, and ask
readers to think about ways to address or move beyond these
limitations--exactly what readers need to do when writing their
own custom code to accomplish outside tasks. However, each
Discussion in Section 2.3 should stand on its own, even if the
Questions are skipped over by the reader.
SECTION 1 -- Some Common Tasks
------------------------------------------------------------------------
PROBLEM: Quickly sorting lines on custom criteria
--------------------------------------------------------------------
Sorting is one of the real meat-and-potatoes algorithms of text
processing and, in fact, of most programming. Fortunately for
Python developers, the native `[].sort` method is extraordinarily
fast. Moreover, Python lists with almost any heterogeneous
objects as elements can be sorted--Python cannot rely on the
uniform arrays of a language like C (an unfortunate exception to
this general power was introduced in recent Python versions where
comparisons of complex numbers raise a 'TypeError'; and
'[1+1j,2+2j].sort()' dies for the same reason; Unicode strings in
lists can cause similar problems).
SEE ALSO, [complex]
+++
The list sort method is wonderful when you want to sort items in
their "natural" order--or in the order that Python considers
natural, in the case of items of varying types. Unfortunately, a
lot of times, you want to sort things in "unnatural" orders. For
lines of text, in particular, any order that is not simple
alphabetization of the lines is "unnatural." But often text lines
contain meaningful bits of information in positions other than
the first character position: A last name may occur as the second
word of a list of people (for example, with first name as the
first word); an IP address may occur several fields into a server
log file; a money total may occur at position 70 of each line;
and so on. What if you want to sort lines based on this style of
meaningful order that Python doesn't quite understand?
The list sort method `[].sort()` supports an optional custom
comparison function argument. The job this function has is to
return -1 if the first thing should come first, return 0 if the
two things are equal order-wise, and return 1 if the first thing
should come second. The built-in function `cmp()` does this in a
manner identical to the default `[].sort()` (except in terms of
speed, 'lst.sort()' is much faster than 'lst.sort(cmp)'). For
short lists and quick solutions, a custom comparison function is
probably the best thing. In a lot of cases, one can even get by
with an in-line 'lambda' function as the custom comparison
function, which is a pleasant and handy idiom.
When it comes to speed, however, use of custom comparison
functions is fairly awful. Part of the problem is Python's
function call overhead, but a lot of other factors contribute to
the slowness. Fortunately, a technique called "Schwartzian
Transforms" can make for much faster custom sorts. Schwartzian
Transforms are so named after Randal Schwartz, who proposed the
technique for working with Perl; but the technique is equally
applicable to Python.
The pattern involved in the Schwartzian Transform technique
consists of three steps (these can more precisely be called the
Guttman-Rosler Transform, which is based on the Schwartzian
Transform):
1. Transform the list in a reversible way into one that sorts
"naturally."
2. Call Python's native `[].sort()` method.
3. Reverse the transformation in (1) to restore the
original list items (in new sorted order).
The reason this technique works is that, for a list of size N,
it only requires O(2N) transformation operations, which is easy
to amortize over the necessary O(N log N) compare/flip
operations for large lists. The sort dominates computational
time, so anything that makes the sort more efficient is a win
in the limit case (this limit is reached quickly).
Below is an example of a simple, but plausible, custom sorting
algorithm. The sort is on the fourth and subsequent words of
a list of input lines. Lines that are shorter than four words
sort to the bottom. Running the test against a file with about
20,000 lines--about 1 megabyte--performed the Schwartzian
Transform sort in less than 2 seconds, while taking over 12
seconds for the custom comparison function sort (outputs were
verified as identical). Any number of factors will change the
exact relative timings, but a better than six times gain can
generally be expected.
#---------- schwartzian_sort.py ----------#
# Timing test for "sort on fourth word"
# Specifically, two lines >= 4 words will be sorted
# lexographically on the 4th, 5th, etc.. words.
# Any line with fewer than four words will be sorted to
# the end, and will occur in "natural" order.
import sys, string, time
wrerr = sys.stderr.write
# naive custom sort
def fourth_word(ln1,ln2):
lst1 = string.split(ln1)
lst2 = string.split(ln2)
#-- Compare "long" lines
if len(lst1) >= 4 and len(lst2) >= 4:
return cmp(lst1[3:],lst2[3:])
#-- Long lines before short lines
elif len(lst1) >= 4 and len(lst2) < 4:
return -1
#-- Short lines after long lines
elif len(lst1) < 4 and len(lst2) >= 4:
return 1
else: # Natural order
return cmp(ln1,ln2)
# Don't count the read itself in the time
lines = open(sys.argv[1]).readlines()
# Time the custom comparison sort
start = time.time()
lines.sort(fourth_word)
end = time.time()
wrerr("Custom comparison func in %3.2f secs\n" % (end-start))
# open('tmp.custom','w').writelines(lines)
# Don't count the read itself in the time
lines = open(sys.argv[1]).readlines()
# Time the Schwartzian sort
start = time.time()
for n in range(len(lines)): # Create the transform
lst = string.split(lines[n])
if len(lst) >= 4: # Tuple w/ sort info first
lines[n] = (lst[3:], lines[n])
else: # Short lines to end
lines[n] = (['\377'], lines[n])
lines.sort() # Native sort
for n in range(len(lines)): # Restore original lines
lines[n] = lines[n][1]
end = time.time()
wrerr("Schwartzian transform sort in %3.2f secs\n" % (end-start))
# open('tmp.schwartzian','w').writelines(lines)
Only one particular example is presented, but readers should be
able to generalize this technique to any sort they need to
perform frequently or on large files.
PROBLEM: Reformatting paragraphs of text
--------------------------------------------------------------------
While I mourn the decline of plaintext ASCII as a communication
format--and its eclipse by unnecessarily complicated and large
(and often proprietary) formats--there is still plenty of life
left in text files full of prose. READMEs, HOWTOs, email,
Usenet posts, and this book itself are written in plaintext (or
at least something close enough to plaintext that generic
processing techniques are valuable). Moreover, many formats like
HTML and LaTeX are frequently enough hand-edited that their
plaintext appearance is important.
One task that is extremely common when working with prose text
files is reformatting paragraphs to conform to desired margins.
Python 2.3 adds the module [textwrap], which performs more
limited reformatting than the code below. Most of the time, this
task gets done within text editors, which are indeed quite
capable of performing the task. However, sometimes it would be
nice to automate the formatting process. The task is simple
enough that it is slightly surprising that Python has no standard
module function to do this. There -is- the class
`formatter.DumbWriter`, or the possibility of inheriting from and
customizing `formatter.AbstractWriter`. These classes are
discussed in Chapter 5; but frankly, the amount of customization
and sophistication needed to use these classes and their many
methods is way out of proportion for the task at hand.
Below is a simple solution that can be used either as a
command-line tool (reading from STDIN and writing to STDOUT) or
by import to a larger application.
#---------- reformat_para.py ----------#
# Simple paragraph reformatter. Allows specification
# of left and right margins, and of justification style
# (using constants defined in module).
LEFT,RIGHT,CENTER = 'LEFT','RIGHT','CENTER'
def reformat_para(para='',left=0,right=72,just=LEFT):
words = para.split()
lines = []
line = ''
word = 0
end_words = 0
while not end_words:
if len(words[word]) > right-left: # Handle very long words
line = words[word]
word +=1
if word >= len(words):
end_words = 1
else: # Compose line of words
while len(line)+len(words[word]) <= right-left:
line += words[word]+' '
word += 1
if word >= len(words):
end_words = 1
break
lines.append(line)
line = ''
if just==CENTER:
r, l = right, left
return '\n'.join([' '*left+ln.center(r-l) for ln in lines])
elif just==RIGHT:
return '\n'.join([line.rjust(right) for line in lines])
else: # left justify
return '\n'.join([' '*left+line for line in lines])
if __name__=='__main__':
import sys
if len(sys.argv) <> 4:
print "Please specify left_margin, right_marg, justification"
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -