    elif type == UU:     encode = binascii.b2a_uu
    elif type == BINHEX: encode = binascii.b2a_hqx
    else: raise ValueError, "Encoding must be in UU, BASE64, BINHEX"
    # Second, compress the source if specified
    if compress: s = zlib.compress(s)
    # Third, encode the string, block-by-block
    offset = 0
    blocks = []
    while 1:
        blocks.append(encode(s[offset:offset+type]))
        offset += type
        if offset > len(s):
            break
    # Fourth, return the concatenated blocks
    return ''.join(blocks)

def ASCIIdecode(s='', type=BASE64, compress=1):
    """Decode ASCII to a binary string"""
    # First, decide the encoding style
    if type == BASE64:   s = binascii.a2b_base64(s)
    elif type == BINHEX: s = binascii.a2b_hqx(s)
    elif type == UU:
        s = ''.join([binascii.a2b_uu(line) for line in s.split('\n')])
    # Second, decompress the source if specified
    if compress: s = zlib.decompress(s)
    # Third, return the decoded binary string
    return s

# Encode/decode STDIN for self-test
if __name__ == '__main__':
    decode, TYPE = 0, BASE64
    for arg in sys.argv:
        if   arg.lower()=='-d':     decode = 1
        elif arg.upper()=='UU':     TYPE=UU
        elif arg.upper()=='BINHEX': TYPE=BINHEX
        elif arg.upper()=='BASE64': TYPE=BASE64
    if decode:
        print ASCIIdecode(sys.stdin.read(),type=TYPE)
    else:
        print ASCIIencode(sys.stdin.read(),type=TYPE)

The example above does not attach any headers or delimit the
encoded block (by design); for that, a wrapper like [uu],
[mimify], or [MimeWriter] is a better choice, as is a custom
wrapper around 'encode_binary.py'.
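
As a rough illustration of what such a custom wrapper might look
like, here is a minimal sketch. It is written for a Python 3
interpreter (unlike the listing above) and uses the standard
[base64] and [zlib] modules directly. The delimiter strings and
the names 'wrap()' and 'unwrap()' are invented for this example;
they are not part of any standard format.

```python
import base64
import zlib

# Hypothetical delimiter lines, chosen only for this illustration
BEGIN = "-----BEGIN ENCODED BLOCK-----"
END = "-----END ENCODED BLOCK-----"

def wrap(data, compress=True):
    """Encode bytes as base64 text framed by delimiter lines."""
    if compress:
        data = zlib.compress(data)
    # encodebytes() emits newline-terminated lines of at most 76 characters
    body = base64.encodebytes(data).decode("ascii")
    return BEGIN + "\n" + body + END + "\n"

def unwrap(text, compress=True):
    """Recover the bytes framed by the delimiter lines."""
    start = text.index(BEGIN) + len(BEGIN)
    stop = text.index(END, start)
    # decodebytes() skips the newlines left between the delimiters
    data = base64.decodebytes(text[start:stop].encode("ascii"))
    if compress:
        data = zlib.decompress(data)
    return data
```

A receiver can then locate the block inside a larger message by
searching for the two delimiter lines, which the bare encoder
above does not allow.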
PROBLEM: Creating word or letter histograms
--------------------------------------------------------------------
A histogram is an analysis of the relative occurrence frequency
of each of a number of possible values. In text processing, the
occurrences in question are almost always either words or byte
values. Creating histograms is quite simple using Python
dictionaries, but the technique is not always obvious at first.
The example below is fairly general, provides several utility
functions associated with histograms, and can also be run from
the command line.

#---------- histogram.py ----------#
# Create occurrence counts of words or characters
# A few utility functions for presenting results
# Avoids requirement of recent Python features
from string import split, maketrans, translate, punctuation, digits
import sys
from types import *
import types

def word_histogram(source):
    """Create histogram of normalized words (no punct or digits)"""
    hist = {}
    trans = maketrans('','')
    if type(source) in (StringType,UnicodeType):  # String-like src
        for word in split(source):
            word = translate(word, trans, punctuation+digits)
            if len(word) > 0:
                hist[word] = hist.get(word,0) + 1
    elif hasattr(source,'read'):                  # File-like src
        try:
            from xreadlines import xreadlines     # Check for module
            for line in xreadlines(source):
                for word in split(line):
                    word = translate(word, trans, punctuation+digits)
                    if len(word) > 0:
                        hist[word] = hist.get(word,0) + 1
        except ImportError:                       # Older Python ver
            line = source.readline()          # Slow but mem-friendly
            while line:
                for word in split(line):
                    word = translate(word, trans, punctuation+digits)
                    if len(word) > 0:
                        hist[word] = hist.get(word,0) + 1
                line = source.readline()
    else:
        raise TypeError, \
              "source must be a string-like or file-like object"
    return hist

def char_histogram(source, sizehint=1024*1024):
    """Create histogram of characters, reading files in chunks"""
    hist = {}
    if type(source) in (StringType,UnicodeType):  # String-like src
        for char in source:
            hist[char] = hist.get(char,0) + 1
    elif hasattr(source,'read'):                  # File-like src
        chunk = source.read(sizehint)
        while chunk:
            for char in chunk:
                hist[char] = hist.get(char,0) + 1
            chunk = source.read(sizehint)
    else:
        raise TypeError, \
              "source must be a string-like or file-like object"
    return hist
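
def most_common(hist, num=1):
    """Return the 'num' most frequent items as (count, thing) pairs"""
    pairs = []
    for pair in hist.items():
        pairs.append((pair[1],pair[0]))
    pairs.sort()
    pairs.reverse()
    return pairs[:num]

def first_things(hist, num=1):
    """Return the first 'num' items in sort order as (thing, count) pairs"""
    pairs = []
    things = hist.keys()
    things.sort()
    for thing in things:        # keys are unique, so pairs stay sorted
        pairs.append((thing,hist[thing]))
    return pairs[:num]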

if __name__ == '__main__':
    if len(sys.argv) > 1:
        hist = word_histogram(open(sys.argv[1]))
    else:
        hist = word_histogram(sys.stdin)
    print "Ten most common words:"
    for pair in most_common(hist, 10):
        print '\t', pair[1], pair[0]
    print "First ten words alphabetically:"
    for pair in first_things(hist, 10):
        print '\t', pair[0], pair[1]
    # a more practical command-line version might use:
    # for pair in most_common(hist,len(hist)):
    #     print pair[1],'\t',pair[0]

Several of the design choices are somewhat arbitrary. Words
have all their punctuation stripped to identify "real" words.
But on the other hand, words are still case-sensitive, which
may not be what is desired. The sorting functions
'first_things()' and 'most_common()' only return an initial
sublist. Perhaps it would be better to return the whole list,
and let the user slice the result. It is simple to customize
around these sorts of issues, though.
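
To make two of those customizations concrete, the sketch below
folds case before counting and returns the complete sorted list
rather than an initial slice. It assumes a Python 3 interpreter
and uses `collections.Counter`, which the listing above does not
use; the names 'word_histogram_ci()' and 'by_frequency()' are
invented for this illustration.

```python
from collections import Counter
import string

# Translation table that deletes punctuation and digits
_STRIP = str.maketrans("", "", string.punctuation + string.digits)

def word_histogram_ci(text):
    """Count words after stripping punctuation/digits and lowercasing."""
    words = (w.translate(_STRIP).lower() for w in text.split())
    # Skip tokens that were nothing but punctuation or digits
    return Counter(w for w in words if w)

def by_frequency(hist):
    """Return every (count, word) pair, most common first."""
    return sorted(((n, w) for w, n in hist.items()), reverse=True)
```

A caller who wants only the top ten can slice the result, as in
'by_frequency(hist)[:10]'.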
PROBLEM: Reading a file backwards by record, line, or paragraph
--------------------------------------------------------------------
Reading a file line by line is a common task in Python, as in
almost any language. Files like server logs, configuration files,
structured text databases, and others frequently arrange
information into logical records, one per line. Very often, the
job of a program is to perform some calculation on each record
in turn.

Python provides a number of convenient methods on file-like
objects for such line-by-line reading. `FILE.readlines()`
reads a whole file at once and returns a list of lines. The
technique is very fast, but requires the whole contents of the
file be kept in memory. For very large files, this can be a
problem. `FILE.readline()` is memory-friendly--it just reads a
line at a time and can be called repeatedly until the EOF is
reached--but it is also much slower. The best solution for
recent Python versions is `xreadlines.xreadlines()` or
`FILE.xreadlines()` in Python 2.1+. These techniques are
memory-friendly, while still being fast and presenting a
"virtual list" of lines (by way of Python's new
generator/iterator interface).
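
In later Python versions the file object is itself such an
iterator, so the memory-friendly loop needs no helper at all. The
sketch below assumes a Python 3 interpreter; the function name
'count_matching()' and the sample data are invented for this
illustration, and an in-memory `io.StringIO` stands in for a real
log file.

```python
import io

def count_matching(fp, needle):
    """Count lines containing needle, holding one line in memory at a time."""
    total = 0
    for line in fp:              # the file object yields lines lazily
        if needle in line:
            total += 1
    return total

# Any file-like object works; StringIO keeps the example self-contained
log = io.StringIO("GET /index\nPOST /form\nGET /about\n")
print(count_matching(log, "GET"))
```
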
The above techniques work nicely for reading a file in its
natural order, but what if you want to start at the end of a
file and work backwards from there? This need is frequently
encountered when you want to read log files that have records
appended over time (and when you want to look at the most
recent records first). It comes up in other situations also.
There is a very easy technique if memory usage is not an issue:
>>> open('lines','w').write('\n'.join([`n` for n in range(100)]))
>>> fp = open('lines')
>>> lines = fp.readlines()
>>> lines.reverse()
>>> for line in lines[1:5]:
...     # Processing suite here
...     print line,
...
98
97
96
95
For large input files, however, this technique is not feasible.
It would be nice to have something analogous to [xreadlines]
here. The example below provides a good starting point (it works
equally well for any file-like object that supports '.seek()',
'.tell()', and '.read()').

#---------- read_backwards.py ----------#
# Read blocks of a file from end to beginning.
# Blocks may be defined by any delimiter, but the
# constants LINE and PARA are useful ones.
# Works much like the file object method '.readline()':
# repeated calls continue to get "next" part, and
# function returns empty string once BOF is reached.
# Define constants
from os import linesep
LINE = linesep
PARA = linesep*2
READSIZE = 1000
# Global variables
buffer = ''

def read_backwards(fp, mode=LINE, sizehint=READSIZE, _init=[0]):
    """Read blocks of file backwards (return empty string when done)"""
    # Trick of mutable default argument to hold state between calls
    if not _init[0]:
        fp.seek(0,2)
        _init[0] = 1
    # Find a block (using global buffer)
    global buffer
    while 1:
        # first check for block in buffer
        delim = buffer.rfind(mode)
        if delim <> -1:     # block is in buffer, return it
            block = buffer[delim+len(mode):]
            buffer = buffer[:delim]
            return block+mode
        #-- BOF reached, return remainder (or empty string)
        elif fp.tell()==0:
            block = buffer
            buffer = ''
            return block
        else:               # Read some more data into the buffer
            readsize = min(fp.tell(),sizehint)
            fp.seek(-readsize,1)
            buffer = fp.read(readsize) + buffer
            fp.seek(-readsize,1)

#-- Self test of read_backwards()
if __name__ == '__main__':
    # Let's create a test file to read in backwards
    fp = open('lines','wb')
    fp.write(LINE.join(['--- %d ---'%n for n in range(15)]))
    fp.close()          # make sure the data is flushed to disk
    # Now open for reading backwards
    fp = open('lines','rb')
    # Read the blocks in, one per call (block==line by default)
    block = read_backwards(fp)
    while block:
        print block,
        block = read_backwards(fp)

Notice that -anything- could serve as a block delimiter. The
constants provided happen to work for lines and for block
paragraphs (and for block paragraphs only with the current OS's
style of line breaks), but other delimiters could be used. It
would -not- be immediately possible to read backwards
word-by-word--a space delimiter would come close, but would not
be quite right for other whitespace. However, reading a line (and
maybe reversing its words) is generally good enough.
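
The line-then-words approach might be sketched as follows,
assuming Python 3 and an input of lines that have already been
produced in reverse order (by whatever means). The name
'words_backwards()' is invented for this illustration.

```python
def words_backwards(lines):
    """Yield words last-to-first, given lines already in reverse order."""
    for line in lines:
        # split() with no argument splits on any run of whitespace,
        # which is what a single-space delimiter would get wrong
        for word in reversed(line.split()):
            yield word
```
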
Another enhancement is possible with Python 2.2+. Using the
new 'yield' keyword, 'read_backwards()' could be programmed as
an iterator rather than as a multi-call function. The
performance will not differ significantly, but the function
might be expressed more clearly (and a "list-like" interface
like `FILE.readlines()` makes the application's loop simpler).
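
One way the generator form might look is sketched below. It is
written for a Python 3 interpreter rather than 2.2, assumes the
file was opened in binary mode, and keeps its buffer and position
in local variables instead of a global and a mutable default. The
parameter name 'delim' is our own substitute for 'mode'. Like the
listing above, it appends the delimiter to every block except the
one at the very beginning of the file.

```python
import io
import os

def read_backwards(fp, delim=b"\n", sizehint=1000):
    """Yield delimiter-separated blocks of a binary file, last block first."""
    fp.seek(0, os.SEEK_END)
    pos = fp.tell()              # how much of the file is still unread
    buffer = b""
    while True:
        cut = buffer.rfind(delim)
        if cut != -1:
            # A complete block sits at the tail of the buffer
            yield buffer[cut + len(delim):] + delim
            buffer = buffer[:cut]
        elif pos == 0:
            # Beginning of file: whatever remains is the first block
            if buffer:
                yield buffer
            return
        else:
            # Pull the preceding chunk in front of the buffer
            readsize = min(pos, sizehint)
            pos -= readsize
            fp.seek(pos)
            buffer = fp.read(readsize) + buffer
```

The application's loop then reduces to 'for block in
read_backwards(fp):', and two files can be read backwards at the
same time because each generator carries its own state.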
QUESTIONS:
1. Write a generator-based version of 'read_backwards()' that
uses the 'yield' keyword. Modify the self-test code to
utilize the generator instead.
2. Explore and explain some pitfalls with the use of a mutable
default value as a function argument. Explain also how the
style allows functions to encapsulate data and contrast
with the encapsulation of class instances.
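
As a starting point for the second question, the sketch below
(Python 3; all names are invented for this illustration) shows the
shared-state behavior in isolation, next to a class that keeps the
same kind of state per instance. The same mechanism is why
'read_backwards()' as listed can only work through one file per
process: its '_init' flag and global buffer are never reset.

```python
def append_item(item, bucket=[]):
    """Pitfall: the default list is created once and shared by every call."""
    bucket.append(item)
    return bucket

def next_id(_state=[0]):
    """Return 1, 2, 3, ... using the shared default list as hidden state."""
    _state[0] += 1
    return _state[0]

class IdSource:
    """The same counter as a class: each instance owns its own state."""
    def __init__(self):
        self.last = 0

    def next_id(self):
        self.last += 1
        return self.last
```

Two calls to 'append_item()' without a second argument return the
same, growing list, whereas two 'IdSource' instances count
independently of each other.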
SECTION 2 -- Standard Modules
------------------------------------------------------------------------
TOPIC -- Basic String Transformations
--------------------------------------------------------------------
The module [string] forms the core of Python's text manipulation
libraries, and is certainly the place to look before other
modules. Note that most of the functions in the [string] module
are also available as methods of string objects in Python 1.6+.
Moreover, the methods of string objects are a little faster than
the corresponding module functions. A few newer methods of string
objects have no equivalent in the [string] module, but are still
documented here.
SEE ALSO, [str], [UserString]
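
For a feel of the method style, here is a small sketch that
assumes a Python 3 interpreter, where the method forms are the
usual spelling. The function name 'summarize()' is invented for
this illustration; '.startswith()' is one of the methods that has
no [string] module counterpart.

```python
def summarize(line):
    """Return (words, shouted, is_comment) using only string methods."""
    text = line.strip()                  # drop surrounding whitespace
    return text.split(), text.upper(), text.startswith("#")
```
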