u'A:b:C:d:X:y:Z'
>>> ':'.join(re.findall(ur'(?u)\b\w\b', u))
u'\u05d0:A:b:C:d:\u03c9:X:y:Z'
-*-
Backreferencing in replacement patterns is very powerful, but
a complex regular expression can easily accumulate many groups,
and keeping their numbering straight becomes confusing. It is
often more legible to refer to the parts of a replacement
pattern in sequential order. To handle this issue, Python's
[re] patterns allow "grouping without backreferencing."
A group that should not also be treated as a backreference has
a question mark colon at the beginning of the group, as in
'(?:pattern)'. In fact, you can use this syntax even when your
backreferences are in the search pattern itself:
>>> from re_new import re_new
>>> s = 'A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93'
>>> re_new(r'([A-Z])(?:-[a-z]{3}-)([0-9]*)', r'\1\2', s)
{A37} # B:abcd:142 # {C66} # {D93}
>>> # Groups that are not of interest excluded from backref
...
>>> re_new(r'([A-Z])(-[a-z]{3}-)([0-9]*)', r'\1\2', s)
{A-xyz-} # B:abcd:142 # {C-wxy-} # {D-qrs-}
>>> # One could lose track of groups in a complex pattern
...
-*-
Python offers a particularly handy syntax for really complex
pattern backreferences. Rather than juggling the numbering of
matched groups, you can give them names. Earlier we pointed out
the syntax for named backreferences within the pattern itself;
for example, '(?P=name)'. However, a slightly different syntax
is necessary in replacement patterns. For that, we use the '\g'
operator along with angle brackets and a name. For example:
>>> from re_new import re_new
>>> s = "A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93"
>>> re_new(r'(?P<prefix>[A-Z])(-[a-z]{3}-)(?P<id>[0-9]*)',
... r'\g<prefix>\g<id>', s)
{A37} # B:abcd:142 # {C66} # {D93}
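Incidentally, the '\g<...>' syntax also accepts plain group
numbers. This is handy when a backreference like '\1' would be
followed immediately by a literal digit (a small illustration,
reusing the unnamed-group pattern from earlier):
>>> re_new(r'([A-Z])(-[a-z]{3}-)([0-9]*)', r'\g<1>\g<3>', s)
{A37} # B:abcd:142 # {C66} # {D93}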
-*-
Another trick of advanced regular expression tools is
"lookahead assertions." These are similar to regular grouped
subexpressions, except that they do not actually grab what they
match. There are two advantages to using lookahead assertions.
On the one hand, a lookahead assertion can function in a
similar way to a group that is not backreferenced; that is, you
can match something without counting it in backreferences.
More significantly, however, a lookahead assertion can specify
that the next chunk of a pattern has a certain form, but let a
different (more general) subexpression actually grab it
(usually for purposes of backreferencing that other
subexpression).
There are two kinds of lookahead assertions: positive and
negative. As you would expect, a positive assertion specifies
that something does come next, and a negative one specifies
that something does not come next. Emphasizing their
connection with non-backreferenced groups, the syntax for
lookahead assertions is similar: '(?=pattern)' for positive
assertions, and '(?!pattern)' for negative assertions.
>>> from re_new import re_new
>>> s = 'A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93'
>>> # Assert that three lowercase letters occur after CAP-DASH
...
>>> re_new(r'([A-Z]-)(?=[a-z]{3})([\w\d]*)', r'\2\1', s)
{xyz37A-} # B-ab6142 # C-Wxy66 # {qrs93D-}
>>> # Assert three lowercase letters do NOT occur after CAP-DASH
...
>>> re_new(r'([A-Z]-)(?![a-z]{3})([\w\d]*)', r'\2\1', s)
A-xyz37 # {ab6142B-} # {Wxy66C-} # D-qrs93
-*-
Along with lookahead assertions, Python 2.0+ adds "lookbehind
assertions." The idea is similar--a pattern is of interest
only if it is (or is not) preceded by some other pattern.
Lookbehind assertions are somewhat more restricted than
lookahead assertions because they may only look backwards by a
fixed number of character positions. In other words, no
general quantifiers are allowed in lookbehind assertions.
Still, some patterns are most easily expressed using lookbehind
assertions.
As with lookahead assertions, lookbehind assertions come in a
negative and a positive flavor. The former assures that a certain
pattern does -not- precede the match, the latter assures that
the pattern -does- precede the match.
>>> from re_show import re_show
>>> re_show('Man', 'Manhandled by The Man')
{Man}handled by The {Man}
>>> re_show('(?<=The )Man', 'Manhandled by The Man')
Manhandled by The {Man}
>>> re_show('(?<!The )Man', 'Manhandled by The Man')
{Man}handled by The Man
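The fixed-width restriction is enforced when the pattern is
compiled; a quantifier inside a lookbehind simply raises an
error (the pattern below exists only to illustrate the failure):
>>> import re
>>> re.compile('(?<=The+ )Man')
Traceback (most recent call last):
  ...
sre_constants.error: look-behind requires fixed-width pattern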
-*-
In the later examples we have started to see just how
complicated regular expressions can get. These examples are
not the half of it. It is possible to do some almost absurdly
difficult-to-understand things with regular expressions (but
ones that are nonetheless useful).
Python's "verbose" modifier ("(?x)") provides two basic
facilities for clarifying expressions. One is allowing regular
expressions to continue over multiple lines (by ignoring
unescaped whitespace, such as indentation, trailing spaces, and
newlines). The second is allowing comments, introduced by '#',
within regular expressions. When patterns get complicated, do
both!
The example below is fairly typical of a complicated, but
well-structured and well-commented, regular expression:
>>> from re_show import re_show
>>> s = '''The URL for my site is: http://mysite.com/mydoc.html. You
... might also enjoy ftp://yoursite.com/index.html for a good
... place to download files.'''
>>> pat = r''' (?x)( # verbose identify URLs within text
... (http|ftp|gopher) # make sure we find a resource type
... :// # ...needs to be followed by colon-slash-slash
... [^ \n\r]+ # anything but space, newline, CR is in URL
... \w # URL always ends in alphanumeric char
... (?=[\s\.,]) # assert: followed by whitespace/period/comma
... ) # end of match group'''
>>> re_show(pat, s)
The URL for my site is: {http://mysite.com/mydoc.html}. You
might also enjoy {ftp://yoursite.com/index.html} for a good
place to download files.
SECTION 1 -- Some Common Tasks
------------------------------------------------------------------------
PROBLEM: Making a text block flush left
--------------------------------------------------------------------
For visual clarity or to identify the role of text, blocks of
text are often indented--especially in prose-oriented documents
(but log files, configuration files, and the like might also
have unused initial fields). For downstream purposes,
indentation is often irrelevant, or even outright
incorrect, since the indentation is not part of the text itself
but only a decoration of the text. However, performing the most
naive transformation of indented text--simply removing leading
whitespace from every line--often makes matters even worse.
While block indentation may be
decoration, the relative indentations of lines within blocks
may serve important or essential functions (for example, the
blocks of text might be Python source code).
The general procedure for maximally unindenting a block of
text is fairly simple. But it is easy to throw more
code at it than is needed, and arrive at some inelegant and
slow nested loops of `string.find()` and `string.replace()`
operations. A bit of cleverness in the use of regular
expressions--combined with the conciseness of a functional
programming (FP) style--can give you a quick, short, and direct
transformation.
#---------- flush_left.py ----------#
# Remove as many leading spaces as possible from whole block
from re import findall, sub
# What is the minimum line indentation of a block?
indent = lambda s: reduce(min, map(len, findall(r'(?m)^ *(?=\S)', s)))
# Remove the block-minimum indentation from each line
flush_left = lambda s: sub('(?m)^ {%d}' % indent(s), '', s)

if __name__ == '__main__':
    import sys
    print flush_left(sys.stdin.read())
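A quick doctest-style check of the behavior (the two-line block
here is hypothetical; notice that only the four spaces common to
both lines are removed, preserving the relative indentation):
>>> from flush_left import flush_left
>>> print flush_left('    def f():\n        return 1\n'),
def f():
    return 1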
The 'flush_left()' function assumes that blocks are indented
with spaces. If tabs are used--or tabs and spaces are
mixed--an initial pass through the utility 'untabify.py' (which
can be found at '$PYTHONPATH/tools/scripts/') can convert
blocks to space-only indentation.
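If installing 'untabify.py' is inconvenient, the standard string
method 'expandtabs()' can stand in for it (a minimal sketch,
assuming the default tab stop of eight columns suits the source
text; the helper name here is ours):
#---------- untabify sketch ----------#
def untabify(s, tabstop=8):
    # expandtabs() restarts its column count at each newline,
    # so a whole multiline block converts in a single call
    return s.expandtabs(tabstop)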
A helpful adjunct to 'flush_left()' is likely to be the
'reformat_para()' function that was presented in Chapter 2,
Problem 2. Between the two of these, you could get a good part of
the way towards a "batch-oriented word processor." (What other
capabilities would be most useful?)
PROBLEM: Summarizing command-line option documentation
--------------------------------------------------------------------
Documentation of command-line options to programs is usually
in semi-standard formats in places like manpages, docstrings,
READMEs and the like. In general, within documentation you
expect to see command-line options indented a bit, followed by
a bit more indentation, followed by one or more lines of
description, and usually ended by a blank line. This style is
readable for users browsing documentation, but has sufficient
complexity and variability that regular expressions are well
suited to finding the right descriptions (simple string methods
fall short).
A specific scenario where you might want a summary of
command-line options is as an aid to understanding
configuration files that call multiple child commands. The
file '/etc/inetd.conf' on Unix-like systems is a good example
of such a configuration file. Moreover, configuration files
themselves often have enough complexity and variability within
them that simple string methods have difficulty parsing them.
The utility below will look for every service launched by
'/etc/inetd.conf' and present to STDOUT summary documentation
of all the options used when the services are started.
#---------- show_services.py ----------#
import re, os, string, sys

def show_opts(cmdline):
    args = string.split(cmdline)
    cmd = args[0]
    if len(args) > 1:
        opts = args[1:]
        # might want to check error output, so use popen3()
        (in_, out_, err) = os.popen3('man %s | col -b' % cmd)
        manpage = out_.read()
        if len(manpage) > 2:        # found actual documentation
            print '\n%s' % cmd
            for opt in opts:
                pat_opt = r'(?sm)^\s*' + opt + r'.*?(?=\n\n)'
                opt_doc = re.search(pat_opt, manpage)
                if opt_doc is not None:
                    print opt_doc.group()
                else:               # try harder for something relevant
                    mentions = []
                    for para in string.split(manpage, '\n\n'):
                        if re.search(opt, para):
                            mentions.append('\n%s' % para)
                    if not mentions:
                        print '\n ', opt, ' '*9, 'Option docs not found'
                    else:
                        print '\n ', opt, ' '*9, 'Mentioned in below para:'
                        print '\n'.join(mentions)
        else:                       # no manpage available
            print cmdline
            print '  No documentation available'

def services(fname):
    conf = open(fname).read()
    pat_srv = r'''(?xm)(?=^[^#])        # lines that are not commented out
                  (?:(?:[\w/]+\s+){6})  # first six fields ignored
                  (.*$)                 # rest of line is service launch'''
    return re.findall(pat_srv, conf)

if __name__ == '__main__':
    for service in services(sys.argv[1]):
        show_opts(service)
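The utility takes the configuration file as its single argument
(a hypothetical invocation; output depends entirely on which
manpages are installed locally):

  % python show_services.py /etc/inetd.conf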
The particular tasks performed by 'show_opts()' and 'services()'
are somewhat specific to Unix-like systems, but the general
techniques are more broadly applicable. For example, the
particular comment character and number of fields in
'/etc/inetd.conf' might be different for other launch scripts,
but the use of regular expressions to find the launch commands
would apply elsewhere. If the 'man' and 'col' utilities are not
on the relevant system, you might do something equivalent, such
as reading in the docstrings from Python modules with similar
option descriptions (most of the samples in '$PYTHONPATH/tools/'
use compatible documentation, for example).
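For instance, a minimal fallback along those lines (a sketch;
this helper is not part of the utility above) might substitute a
module's docstring for the manpage text:
#---------- docstring fallback sketch ----------#
def module_doc(modname):
    # Return a module's docstring as ersatz "manpage" text,
    # or an empty string if the module cannot be imported
    try:
        return __import__(modname).__doc__ or ''
    except ImportError:
        return ''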
Another thing worth noting is that even where regular expressions
are used in parsing some data, you need not do everything with
regular expressions. The simple `string.split()` operation to
identify paragraphs in 'show_opts()' is still the quickest and
easiest technique, even though `re.split()` could do the same
thing.
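Both spellings below split on the same blank-line boundary and
give identical lists (a small illustration, separate from the
utility above):
>>> import re, string
>>> text = 'para one\n\npara two\n\npara three'
>>> string.split(text, '\n\n')
['para one', 'para two', 'para three']
>>> re.split('\n\n', text)
['para one', 'para two', 'para three']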
Note: Along the lines of paragraph splitting, here is a thought
problem. What is a regular expression that matches every whole
paragraph that contains within it some smaller pattern 'pat'? For
purposes of the puzzle, assume that a paragraph is some text that
both starts and ends with doubled newlines ("\n\n").
PROBLEM: Detecting duplicate words
--------------------------------------------------------------------