else:
    left = int(sys.argv[1])
    right = int(sys.argv[2])
    just = sys.argv[3].upper()

    # Simplistic approach to finding initial paragraphs
    for p in sys.stdin.read().split('\n\n'):
        print reformat_para(p,left,right,just),'\n'
A number of enhancements are left to readers, if needed. You
might want to allow hanging indents or indented first lines, for
example. Or paragraphs meeting certain criteria might not be
appropriate for wrapping (e.g., headers). A custom application
might also determine the input paragraphs differently, either
by a different parsing of an input file, or by generating
paragraphs internally in some manner.
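As a minimal sketch of the header heuristic just mentioned (the
function name and the specific test are illustrative assumptions,
building on the reformat_para() function defined above):

def smart_reformat(p, left=0, right=72, just='LEFT'):
    # Guess that a short, single-line paragraph without a final
    # period is a header, and leave it unwrapped (an assumed
    # heuristic, not part of the example above)
    if len(p) <= right and '\n' not in p and not p.endswith('.'):
        return p
    return reformat_para(p, left, right, just)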
PROBLEM: Column statistics for delimited or flat-record files
--------------------------------------------------------------------
Data feeds, DBMS dumps, log files, and flat-file databases all
tend to contain ontologically similar records--one per line--with
a collection of fields in each record. Usually such fields are
separated either by a specified delimiter or by specific column
positions where fields are to occur.
Parsing these structured text records is quite easy, and
performing computations on fields is equally straightforward. But
in working with a variety of such "structured text databases," it
is easy to keep writing almost the same code over again for each
variation in format and computation.
The example below provides a generic framework for performing
this kind of computation on a structured text database.
#---------- fields_stats.py ----------#
# Perform calculations on one or more of the
# fields in a structured text database.
import operator
from types import *
from xreadlines import xreadlines # req 2.1, but is much faster...
                                  # could use .readline() meth < 2.1
#-- Symbolic Constants
DELIMITED = 1
FLATFILE = 2
#-- Some sample "statistical" funcs (in functional programming style)
nillFunc = lambda lst: None
toFloat = lambda lst: map(float, lst)
avg_lst = lambda lst: reduce(operator.add, toFloat(lst))/len(lst)
sum_lst = lambda lst: reduce(operator.add, toFloat(lst))
max_lst = lambda lst: reduce(max, toFloat(lst))
class FieldStats:
    """Gather statistics about structured text database fields

    text_db may be either string (incl. Unicode) or file-like object
    style may be in (DELIMITED, FLATFILE)
    delimiter specifies the field separator in DELIMITED style text_db
    column_positions lists all field positions for FLATFILE style,
        using one-based indexing (first column is 1).
           E.g.: (1, 7, 40) would take fields one, two, three
           from columns 1, 7, 40 respectively.
    field_funcs is a dictionary with column positions as keys,
        and functions on lists as values.
           E.g.: {1:avg_lst, 4:sum_lst, 5:max_lst} would specify the
           average of column one, the sum of column 4, and the
           max of column 5.  All other cols--incl 2,3, >=6--
           are ignored.
    """
    def __init__(self,
                 text_db='',
                 style=DELIMITED,
                 delimiter=',',
                 column_positions=(1,),
                 field_funcs={}):
        self.text_db = text_db
        self.style = style
        self.delimiter = delimiter
        self.column_positions = column_positions
        self.field_funcs = field_funcs

    def calc(self):
        """Calculate the column statistics"""
        #-- 1st, create a list of lists for data (incl. unused flds)
        used_cols = self.field_funcs.keys()
        used_cols.sort()
        # one-based column naming: column[0] is always unused
        columns = []
        for n in range(1+used_cols[-1]):
            # hint: '[[]]*num' creates refs to same list
            columns.append([])
        #-- 2nd, fill lists used for calculated fields
        # might use a string directly for text_db
        if type(self.text_db) in (StringType, UnicodeType):
            for line in self.text_db.split('\n'):
                fields = self.splitter(line)
                for col in used_cols:
                    field = fields[col-1]     # zero-based index
                    columns[col].append(field)
        else:   # Something file-like for text_db
            for line in xreadlines(self.text_db):
                fields = self.splitter(line)
                for col in used_cols:
                    field = fields[col-1]     # zero-based index
                    columns[col].append(field)
        #-- 3rd, apply the field funcs to column lists
        results = [None] * (1+used_cols[-1])
        for col in used_cols:
            results[col] = \
                apply(self.field_funcs[col], (columns[col],))
        #-- Finally, return the result list
        return results

    def splitter(self, line):
        """Split a line into fields according to curr inst specs"""
        if self.style == DELIMITED:
            return line.split(self.delimiter)
        elif self.style == FLATFILE:
            fields = []
            # Adjust offsets to Python zero-based indexing,
            # and also add final position after the line
            num_positions = len(self.column_positions)
            offsets = [(pos-1) for pos in self.column_positions]
            offsets.append(len(line))
            for pos in range(num_positions):
                start = offsets[pos]
                end = offsets[pos+1]
                fields.append(line[start:end])
            return fields
        else:
            raise ValueError, \
                  "Text database must be DELIMITED or FLATFILE"
#-- Test data
# First Name, Last Name, Salary, Years Seniority, Department
delim = '''
Kevin,Smith,50000,5,Media Relations
Tom,Woo,30000,7,Accounting
Sally,Jones,62000,10,Management
'''.strip() # no leading/trailing newlines
# Comment      First     Last      Salary    Years  Dept
flat = '''
tech note     Kevin     Smith     50000     5      Media Relations
more filler   Tom       Woo       30000     7      Accounting
yet more...   Sally     Jones     62000     10     Management
'''.strip() # no leading/trailing newlines
#-- Run self-test code
if __name__ == '__main__':
    getdelim = FieldStats(delim, field_funcs={3:avg_lst,4:max_lst})
    print 'Delimited Calculations:'
    results = getdelim.calc()
    print '  Average salary -', results[3]
    print '  Max years worked -', results[4]
    getflat = FieldStats(flat, field_funcs={3:avg_lst,4:max_lst},
                         style=FLATFILE,
                         column_positions=(15,25,35,45,52))
    print 'Flat Calculations:'
    results = getflat.calc()
    print '  Average salary -', results[3]
    print '  Max years worked -', results[4]
The example above includes some efficiency considerations that
make it a good model for working with large data sets. In the
first place, class 'FieldStats' can (optionally) deal with a
file-like object, rather than keeping the whole structured text
database in memory. The `xreadlines.xreadlines()` iterator is
an extremely fast and efficient file reader, but it requires
Python 2.1+; under earlier versions, use `FILE.readline()` or
`FILE.readlines()` (for memory or speed efficiency,
respectively). Moreover, only the data that is actually of
interest is collected into lists, in order to save memory.
And rather than requiring multiple passes to collect statistics
on multiple fields, as many field columns and summary functions
as desired can be used in a single pass.
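For instance, a large data feed can be processed straight from
disk rather than held in memory; the filename here is
hypothetical:

# Pass an open file object rather than an in-memory string
fs = FieldStats(open('salaries.txt'), field_funcs={3:avg_lst,4:max_lst})
results = fs.calc()
print '  Average salary -', results[3]
print '  Max years worked -', results[4]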
One possible improvement would be to allow multiple summary
functions against the same field during a pass. That is left as
an exercise for interested readers, but one minimal approach is
sketched below.
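For instance (the name 'multi_lst' is an illustration, not part
of the example above):

def multi_lst(*funcs):
    # Bundle several list-functions into a single function-on-a-list;
    # the 'funcs=funcs' default works without nested scopes (pre-2.2)
    return lambda lst, funcs=funcs: [f(lst) for f in funcs]

# E.g., field_funcs={3:multi_lst(avg_lst,max_lst)} would make
# results[3] the pair [average, maximum] for column three.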
PROBLEM: Counting characters, words, lines, and paragraphs
--------------------------------------------------------------------
There is a wonderful utility under Unix-like systems called
'wc'. What it does is so basic, and so obvious, that it is
hard to imagine working without it. 'wc' simply counts the
characters, words, and lines of files (or STDIN). A few
command-line options control which results are displayed, but I
rarely use them.
In writing this chapter, I found myself on a system without
'wc', and felt a remedy was in order. The example below is
actually an "enhanced" 'wc' since it also counts paragraphs
(but it lacks the command-line switches). Unlike the external
'wc', the technique is easy to use directly within Python, and
it is available anywhere Python is. The main trick--inasmuch
as there is one--is a compact use of the `"".join()` and
`"".split()` methods (`string.join()` and `string.split()` could
also be used, for example, to be compatible with Python 1.5.2 or
below).
#---------- wc.py ----------#
# Report the chars, words, lines, paragraphs
# on STDIN or in wildcard filename patterns
import sys, glob
if len(sys.argv) > 1:
    c, w, l, p = 0, 0, 0, 0
    for pat in sys.argv[1:]:
        for file in glob.glob(pat):
            s = open(file).read()
            wc = len(s), len(s.split()), \
                 len(s.split('\n')), len(s.split('\n\n'))
            print '\t'.join(map(str, wc)), '\t'+file
            c, w, l, p = c+wc[0], w+wc[1], l+wc[2], p+wc[3]
    wc = (c, w, l, p)
    print '\t'.join(map(str, wc)), '\tTOTAL'
else:
    s = sys.stdin.read()
    wc = len(s), len(s.split()), len(s.split('\n')), \
         len(s.split('\n\n'))
    print '\t'.join(map(str, wc)), '\tSTDIN'
This little functionality could be wrapped up in a function, but
it is almost too compact to be worth the bother. Most of the
work is in the interaction with the shell environment, with the
counting itself taking only two lines.
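If a reusable wrapper is wanted anyway, a minimal sketch might
be:

def wc(s):
    """Return (chars, words, lines, paragraphs) of a string"""
    return len(s), len(s.split()), \
           len(s.split('\n')), len(s.split('\n\n'))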
The wc.py solution is quite likely the "one obvious way to do
it," and therefore Pythonic. On the other hand, a slightly more
adventurous reader might consider this assignment (if only for
fun):
>>> wc = map(len,[s]+map(s.split,(None,'\n','\n\n')))
A real daredevil might be able to reduce the entire program to
a single 'print' statement.
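For instance, restricting attention to STDIN, such a statement
might look like the following sketch (correct, but decidedly not
the "one obvious way"):

print '\t'.join(map(str, map(len, (lambda s: [s]+map(s.split,
    (None,'\n','\n\n')))(__import__('sys').stdin.read())))), '\tSTDIN'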
PROBLEM: Transmitting binary data as ASCII
--------------------------------------------------------------------
Many channels require that the information that travels over them
be 7-bit ASCII. Any byte whose high-order bit is set may be
handled unpredictably when transmitted over protocols like
Simple Mail Transfer Protocol (SMTP), Network News Transfer
Protocol (NNTP), or HTTP (depending on content encoding), or
even just when displayed in many standard tools like editors.
In order to encode 8-bit binary data as ASCII, a number of
techniques have been invented over time.
An obvious, but obese, encoding technique is to translate each
binary byte into its hexadecimal digits. UUencoding is an older
standard that developed around the need to transmit binary files
over the Usenet and on BBSs. Binhex is a similar technique from
the MacOS world. In recent years, base64--which is specified by
RFC1521--has edged out the other styles of encoding. All of the
techniques are basically 4/3 encodings--that is, four ASCII bytes
are used to represent three binary bytes--but they differ
somewhat in line ending and header conventions (as well as in the
encoding as such). Quoted printable is yet another format, but of
variable encoding length. In quoted printable encoding, most
plain ASCII bytes are left unchanged, but a few special
characters and all high-bit bytes are escaped.
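As a concrete illustration of the 4/3 ratio (an interactive
sketch using the [binascii] module discussed below):

>>> import binascii
>>> binascii.b2a_base64('\xff\x00\xff')   # three binary bytes in
'/wD/\n'
>>> binascii.hexlify('\xff\x00\xff')      # hex doubles the size instead
'ff00ff'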
Python provides modules for all the encoding styles mentioned.
The high-level wrappers [uu], [binhex], [base64], and [quopri]
all operate on input and output file-like objects, encoding the
data therein. They also each have slightly different method names
and arguments. [binhex], for example, closes its output file
after encoding, which makes it unusable in conjunction with a
[cStringIO] file-like object. All of the high-level encoders
utilize the services of the low-level C module [binascii].
[binascii], in turn, implements the actual low-level block
conversions, but assumes that it will be passed the right size
blocks for a given encoding.
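For instance, [base64]'s file-oriented interface can be
exercised with in-memory [StringIO] objects (an illustrative
session):

>>> import base64, StringIO
>>> inp = StringIO.StringIO('\x00\x01\xff binary data')
>>> out = StringIO.StringIO()
>>> base64.encode(inp, out)
>>> out.getvalue()
'AAH/IGJpbmFyeSBkYXRh\n'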
The standard library, therefore, does not contain quite the
right intermediate-level functionality for when the goal is
just encoding the binary data in arbitrary strings. It is easy
to wrap that up though:
#---------- encode_binary.py ----------#
# Provide encoders for arbitrary binary data
# in Python strings. Handles block size issues
# transparently, and returns a string.
# Precompression of the input string can reduce
# or eliminate any size penalty for encoding.
import sys
import zlib
import binascii
UU = 45
BASE64 = 57
BINHEX = sys.maxint
def ASCIIencode(s='', type=BASE64, compress=1):
    """ASCII encode a binary string"""
    # First, decide the encoding style
    if type == BASE64: encode = binascii.b2a_base64