PROBLEM: Detecting doubled words
--------------------------------------------------------------------
A common typo in prose texts is doubled words (hopefully they
have been edited out of this book except in those few cases
where they are intended). The same error occurs to a lesser
extent in programming language code, configuration files, or
data feeds. Regular expressions are well-suited to detecting
this occurrence, which just amounts to a backreference to a
word pattern. It's easy to wrap the regex in a small utility
with a few extra features:
#---------- dupwords.py ----------#
# Detect doubled words and display with context
# Include words doubled across lines but within paras
import sys, re, glob
for pat in sys.argv[1:]:
    for file in glob.glob(pat):
        newfile = 1
        for para in open(file).read().split('\n\n'):
            dups = re.findall(r'(?m)(^.*(\b\w+\b)\s*\b\2\b.*$)', para)
            if dups:
                if newfile:
                    print '%s\n%s\n' % ('-'*70, file)
                    newfile = 0
                for dup in dups:
                    print '[%s] -->' % dup[1], dup[0]
This particular version grabs the line or lines on which
duplicates occur and prints them for context (along with a prompt
for the duplicate itself). Variations are straightforward. The
assumption made by 'dupwords.py' is that a doubled word that
spans a line (from the end of one to the beginning of another,
ignoring whitespace) is a real doubling; but a duplicate that
spans paragraphs is not likewise noteworthy.
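To see the cross-line case concretely, here is a minimal sketch
(the sample string is invented for illustration) that applies the
same regular expression to an in-memory string rather than to
files:
#*------ Sketch: doubled words in a sample string -----#
import re
sample = ("This line has has a doubled word.\n"
          "And a word that spans\n"
          "spans two lines.")
for line, word in re.findall(r'(?m)(^.*(\b\w+\b)\s*\b\2\b.*$)', sample):
    print '[%s] -->' % word, line
# Reports 'has' (doubled within one line) and 'spans' (doubled
# across the line break, which the \s* in the pattern is free
# to match).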
PROBLEM: Checking for server errors
--------------------------------------------------------------------
Web servers are a ubiquitous source of information nowadays.
But finding URLs that lead to real documents is largely
hit-or-miss. Every Web maintainer seems to reorganize her site
every month or two, thereby breaking bookmarks and hyperlinks.
As bad as the chaos is for plain Web surfers, it is worse for
robots faced with the difficult task of recognizing the
difference between content and errors. By-the-by, it is easy
to accumulate downloaded Web pages that consist of error
messages rather than desired content.
In principle, Web servers can and should return error codes
indicating server errors. But in practice, Web servers almost
always return dynamically generated results pages for erroneous
requests. Such pages are basically perfectly normal HTML pages
that just happen to contain text like "Error 404: File not
found!" Most of the time these pages are a bit fancier than
this, containing custom graphics and layout, links to site
homepages, JavaScript code, cookies, meta tags, and all sorts
of other stuff. It is actually quite amazing just how much
many Web servers send in response to requests for nonexistent
URLs.
Below is a very simple Python script to examine just what Web
servers return on valid or invalid requests. Getting an error
page is usually as simple as asking for a page called
'http://somewebsite.com/phony-url' or the like (anything that
doesn't really exist). [urllib] is discussed in Chapter 5, but
its details are not important here.
#---------- url_examine.py ----------#
import sys
from urllib import urlopen
if len(sys.argv) > 1:
    fpin = urlopen(sys.argv[1])
    print fpin.geturl()
    print fpin.info()
    print fpin.read()
else:
    print "No specified URL"
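For example, pointing the script at a URL that does not exist (the
same phony address used later in this section) shows exactly what
one particular server sends back; no output is reproduced here,
since every server's error page looks different:
#*------ Using url_examine.py -----#
% python url_examine.py http://gnosis.cx/nonesuch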
Given the diversity of error pages you might receive, it is
difficult or impossible to create a regular expression (or any
program) that determines with certainty whether a given HTML
document is an error page. Furthermore, some sites choose to
generate pages that are not really quite errors, but not
really quite content either (e.g., generic directories of site
information with suggestions on how to get to content). But
some heuristics come quite close to separating content from
errors. One noteworthy heuristic is that the interesting
errors are almost always 404 or 403 (not a sure thing, but good
enough to make smart guesses). Below is a utility to rate the
"error probability" of HTML documents:
#---------- error_page.py ----------#
import re, sys
page = sys.stdin.read()
# Mapping from patterns to probability contribution of pattern
err_pats = {r'(?is)<TITLE>.*?(404|403).*?ERROR.*?</TITLE>': 0.95,
            r'(?is)<TITLE>.*?ERROR.*?(404|403).*?</TITLE>': 0.95,
            r'(?is)<TITLE>ERROR</TITLE>': 0.30,
            r'(?is)<TITLE>.*?ERROR.*?</TITLE>': 0.10,
            r'(?is)<META .*?(404|403).*?ERROR.*?>': 0.80,
            r'(?is)<META .*?ERROR.*?(404|403).*?>': 0.80,
            r'(?is)<TITLE>.*?File Not Found.*?</TITLE>': 0.80,
            r'(?is)<TITLE>.*?Not Found.*?</TITLE>': 0.40,
            r'(?is)<BODY.*(404|403).*</BODY>': 0.10,
            r'(?is)<H1>.*?(404|403).*?</H1>': 0.15,
            r'(?is)<BODY.*not found.*</BODY>': 0.10,
            r'(?is)<H1>.*?not found.*?</H1>': 0.15,
            r'(?is)<BODY.*the requested URL.*</BODY>': 0.10,
            r'(?is)<BODY.*the page you requested.*</BODY>': 0.10,
            r'(?is)<BODY.*page.{1,50}unavailable.*</BODY>': 0.10,
            r'(?is)<BODY.*request.{1,50}unavailable.*</BODY>': 0.10,
            r'(?i)does not exist': 0.10,
           }
err_score = 0
for pat, prob in err_pats.items():
    if err_score > 0.9: break
    if re.search(pat, page):
        # print pat, prob
        err_score += prob
if err_score > 0.90:   print 'Page is almost surely an error report'
elif err_score > 0.75: print 'It is highly likely page is an error report'
elif err_score > 0.50: print 'Better-than-even odds page is error report'
elif err_score > 0.25: print 'Fair indication page is an error report'
else:                  print 'Page is probably real content'
Tested against a fair number of sites, a collection like this of
regular expression searches and threshold confidences works
quite well. Within the author's own judgment of just what is
really an error page, 'error_page.py' has produced no false
positives and has always reached at least the lowest warning
level for every true error page.
The patterns chosen are all fairly simple, and both the
patterns and their weightings were determined entirely
subjectively by the author. But something like this weighted
hit-or-miss technique can be used to solve many "fuzzy logic"
matching problems (most having nothing to do with Web server
errors).
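For instance, the scoring loop can be packaged as a reusable
function. The sketch below is not one of this section's listings,
and the names 'score_text' and 'cap' are invented for
illustration:
#*------ Sketch: generalized weighted pattern scoring -----#
import re
def score_text(text, weighted_pats, cap=1.0):
    "Add up the weights of every pattern that matches, up to 'cap'"
    score = 0.0
    for pat, weight in weighted_pats.items():
        if score >= cap:
            break                  # already certain enough
        if re.search(pat, text):
            score += weight
    return min(score, cap)
Only the dictionary of patterns and weights changes from one fuzzy
matching problem to the next; the loop itself stays the same.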
Code like that above can form a general approach to more
complete applications. But for what it is worth, the scripts
'url_examine.py' and 'error_page.py' may be used directly
together by piping from the first to the second. For example:
#*------ Using url_examine.py and error_page.py -----#
% python url_examine.py http://gnosis.cx/nonesuch | python error_page.py
Page is almost surely an error report
PROBLEM: Reading lines with continuation characters
--------------------------------------------------------------------
Many configuration files and other types of computer code are
line oriented, but also have a facility to treat multiple lines
as if they were a single logical line. In processing such a
file it is usually desirable as a first step to turn all these
logical lines into actual newline-delimited lines (or more
likely, to transform both single and continued lines as
homogeneous list elements to iterate through later). A
continuation character is generally required to be the -last-
thing on a line before a newline, or possibly the last thing
other than some whitespace. A small (and very partial) table
of continuation characters used by some common and uncommon
formats is listed below:
#*----- Common continuation characters -----#
\ Python, JavaScript, C/C++, Bash, TCL, Unix config
_ Visual Basic, PAW
& Lyris, COBOL, IBIS
; Clipper, TOP
- XSPEC, NetREXX
= Oracle Express
Most of the formats listed are programming languages, and
parsing them takes quite a bit more than just identifying the
lines. More often, it is configuration files of various sorts
that are of interest in simple parsing, and most of the time
these files use a common Unix-style convention of using
trailing backslashes for continuation lines.
One -could- manage to parse logical lines with a [string]
module approach that looped through lines and performed
concatenations when needed. But a greater elegance is served
by reducing the problem to a single regular expression. The
module below provides this:
#---------- logical_lines.py ----------#
# Determine the logical lines in a file that might have
# continuation characters. 'logical_lines()' returns a
# list. The self-test prints the logical lines as
# physical lines (for all specified files and options).
import re
def logical_lines(s, continuation='\\', strip_trailing_space=0):
    c = continuation
    if strip_trailing_space:
        s = re.sub(r'(?m)(%s)(\s+)$' % re.escape(c), r'\1', s)
    pat_log = r'(?sm)^.*?$(?<!%s)' % re.escape(c)  # e.g. (?sm)^.*?$(?<!\\)
    return [t.replace(c+'\n', '') for t in re.findall(pat_log, s)]
if __name__ == '__main__':
    import sys
    files, strip, contin = ([], 0, '\\')
    for arg in sys.argv[1:]:
        if arg[:-1] == '--continue=':  contin = arg[-1]
        elif arg[:-1] == '-c':         contin = arg[-1]
        elif arg in ('--strip', '-s'): strip = 1
        else:                          files.append(arg)
    if not files: files.append(sys.stdin)
    for file in files:
        if hasattr(file, 'read'):      # sys.stdin was appended directly
            s = file.read()
        else:
            s = open(file).read()
        print '\n'.join(logical_lines(s, contin, strip))
The comment in the 'pat_log' definition shows a bit just how
cryptic regular expressions can be at times. The comment is
the pattern that is used for the default value of
'continuation'. But as dense as it is with symbols, you can
still read it by proceeding slowly, left to right. Let us try
a version of the same line with the verbose modifier and
comments:
>>> pat = r'''
... (?x) # This is the verbose version
... (?s) # In the pattern, let "." match newlines, if needed
... (?m) # Allow ^ and $ to match every begin- and end-of-line
... ^ # Start the match at the beginning of a line
... .*? # Non-greedily grab everything until the first place
... # where the rest of the pattern matches (if possible)
... $ # End the match at an end-of-line
... (?<! # Only count as a match if the enclosed pattern was not
... # the immediately last thing seen (negative lookbehind)
... \\) # It wasn't an (escaped) backslash'''
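Saved as 'logical_lines.py', the function is easy to try out
interactively. The config-style sample string below is invented
for illustration:
#*------ Sketch: folding continuation lines -----#
from logical_lines import logical_lines
config = ("OPTIONS = --verbose \\\n"
          "--logfile=out.txt\n"
          "MODE = simple")
for line in logical_lines(config):
    print line
# OPTIONS = --verbose --logfile=out.txt
# MODE = simple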
PROBLEM: Identifying URLs and email addresses in texts
--------------------------------------------------------------------
A neat feature of many Internet and news clients is their
automatic identification of resources that the applications can
act upon. For URL resources, this usually means making the links
"clickable"; for an email address it usually means launching a
new letter to the person at the address. Depending on the nature
of an application, you could perform other sorts of actions for
each identified resource. For a text processing application, the
use of a resource is likely to be something more batch-oriented:
extraction, transformation, indexing, or the like.
Fully and precisely implementing RFC 822 (for email addresses)
or RFC 1738 (for URLs) is possible within regular expressions.
But doing so is probably even more work than is really needed
to identify 99% of resources. Moreover, a significant number
of resources in the "real world" are not strictly compliant
with the relevant RFCs--most applications give a certain leeway
to "almost correct" resource identifiers. The utility below
tries to strike approximately the same balance as other
well-implemented and practical applications: get -almost-
everything that was intended to look like a resource, and
-almost- nothing that was intended not to:
#---------- find_urls.py ----------#
# Functions to identify and extract URLs and email addresses
import re, fileinput
pat_url = re.compile(r'''
    (?x)(              # verbose identify URLs within text
    (http|ftp|gopher)  # make sure we find a resource type
    ://                # ...needs to be followed by colon-slash-slash
    (\w+[:.]?){2,}     # at least two domain groups, e.g. (gnosis.)(cx)
    (/?|               # could be just the domain name (maybe w/ slash)
    [^ \n\r"]+         # or stuff then space, newline, tab, quote
    [\w/])             # resource name ends in alphanumeric or slash
    (?=[\s\.,>)'"\]])  # assert: followed by white or clause ending
    )                  # end of match group
    ''')
pat_email = re.compile(r'''
    (?xm)              # verbose identify email addresses (multiline)
    (?=^.{11}          # Mail header matcher
    (?<!Message-ID:|   # rule out Message-ID's as best possible
        In-Reply-To))  # ...and also In-Reply-To
    (.*?)(             # must grab to email to allow prior lookbehind
    ([A-Za-z0-9-]+\.)? # maybe an initial part: DAVID.mertz@gnosis.cx
    [A-Za-z0-9-]+      # definitely some local user: MERTZ@gnosis.cx
    @                  # ...needs an at sign in the middle
    (\w+\.?){2,}       # at least two domain groups, e.g. (gnosis.)(cx)
    (?=[\s\.,>)'"\]])  # assert: followed by white or clause ending
    )                  # end of match group
    ''')
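A quick way to exercise 'pat_url' is to run it over a string and
pull out group 1 of each match. The sample text below is invented
for illustration (gnosis.cx is simply the address used elsewhere
in this section):
#*------ Sketch: extracting URLs with pat_url -----#
from find_urls import pat_url
text = "See http://gnosis.cx/ for more, or ftp://ftp.example.com/pub. "
for match in pat_url.finditer(text):
    print match.group(1)
# http://gnosis.cx/
# ftp://ftp.example.com/pub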