CHAPTER III -- REGULAR EXPRESSIONS
-------------------------------------------------------------------
Regular expressions allow extremely valuable text processing
techniques, but ones that warrant careful explanation. Python's
[re] module, in particular, allows numerous enhancements to basic
regular expressions (such as named backreferences, lookahead
assertions, backreference skipping, non-greedy quantifiers, and
others). A solid introduction to the subtleties of regular
expressions is valuable to programmers engaged in text processing
tasks.
The opening section of this chapter contains a tutorial on regular
expressions that allows a reader unfamiliar with regular
expressions to move quickly from simple to complex elements of
regular expression syntax. This tutorial is aimed primarily at
beginners, but programmers familiar with regular expressions in
other programming tools can benefit from a quick read of the
tutorial, which explicates the particular regular expression
dialect in Python.
It is important to note up-front that regular expressions,
while very powerful, also have limitations. In brief, regular
expressions cannot match patterns that nest to arbitrary
depths. If that statement does not make sense, read Chapter 4,
which discusses parsers--to a large extent, parsing exists to
address the limitations of regular expressions. In general, if
you have doubts about whether a regular expression is
sufficient for your task, try to understand the examples in
Chapter 4, particularly the discussion of how you might spell a
floating point number.
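A small sketch may make the nesting limitation concrete. The names
'two_levels' and 'balanced()' are illustrative assumptions, not part
of [re] or of this book's utilities. The pattern accepts parentheses
only to a depth fixed in advance, while a simple counter copes with
any depth:

```python
import re

# Illustrative pattern: accepts parentheses nested at most two levels deep.
# Each additional level of nesting would require a longer pattern.
two_levels = re.compile(r'^(?:[^()]|\((?:[^()]|\([^()]*\))*\))*$')

def balanced(text):
    """Hypothetical helper: True if parentheses balance at any depth."""
    depth = 0
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:        # a closer with no matching opener
                return False
    return depth == 0

shallow = 'f(a, g(b))'           # nested two deep
deep = 'f(a, g(b, h(c)))'        # nested three deep

print(bool(two_levels.match(shallow)), balanced(shallow))
print(bool(two_levels.match(deep)), balanced(deep))
```

The pattern rejects the three-level string even though it is well
formed; the counter accepts both.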
Section 3.1 examines a number of text processing problems that
are solved most naturally using regular expressions. As in
other chapters, the solutions presented to problems can
generally be adopted directly as little utilities for performing
tasks. However, as elsewhere, the larger goal in presenting
problems and solutions is to address a style of thinking about
a wider class of problems than those whose solutions are
presented directly in this book. Readers who are interested
in a range of ready utilities and modules will probably want to
check additional resources on the Web, such as the Vaults of
Parnassus <http://www.vex.net/parnassus/> and the Python
Cookbook <http://aspn.activestate.com/ASPN/Python/Cookbook/>.
Section 3.2 is a "reference with commentary" on the Python
standard library modules for doing regular expression tasks.
Several utility modules and backward-compatibility regular
expression engines are available, but for most readers, the only
important module will be [re] itself. The discussions
interspersed with each module try to give some guidance on why
you would want to use a given module or function, and the
reference documentation tries to contain more examples of actual
typical usage than does a plain reference. In many cases, the
examples and discussion of individual functions address common
and productive design patterns in Python. The cross-references
are intended to contextualize a given function (or other thing)
in terms of related ones (and to help a reader decide which is
right for her). The actual listings of functions, constants,
classes, and the like are in alphabetical order within each
category.
SECTION 0 -- A Regular Expression Tutorial
------------------------------------------------------------------------
Some people, when confronted with a problem, think "I know,
I'll use regular expressions." Now they have two problems.
-- Jamie Zawinski, '<alt.religion.emacs>' (08/12/1997)
TOPIC -- Just What is a Regular Expression, Anyway?
--------------------------------------------------------------------
Many readers will have some background with regular
expressions, but some will not have any. Those with
experience using regular expressions in other languages (or in
Python) can probably skip this tutorial section. But readers
new to regular expressions (affectionately called 'regexes' by
users) should read this section; even some with experience can
benefit from a refresher.
A regular expression is a compact way of describing complex
patterns in texts. You can use them to search for patterns
and, once found, to modify the patterns in complex ways. They
can also be used to launch programmatic actions that depend on
patterns.
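As a rough sketch of these three uses (the sample text and variable
names are chosen here for illustration only), the [re] module can
search for a pattern, replace it, or call a function on each match:

```python
import re

text = 'Mary had a little lamb'

# Search: locate the first occurrence of the pattern.
found = re.search('lamb', text)
print(found.group(), found.start())

# Modify: replace every occurrence with other text.
changed = re.sub('lamb', 'goat', text)
print(changed)

# Act on a pattern: call a function for each match and use its result.
shouted = re.sub('lamb', lambda m: m.group().upper(), text)
print(shouted)
```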
Jamie Zawinski's tongue-in-cheek comment in the epigraph is
worth thinking about. Regular expressions are amazingly
powerful and deeply expressive. That is the very reason that
writing them is just as error-prone as writing any other
complex programming code. It is always better to solve a
genuinely simple problem in a simple way; when you go beyond
simple, think about regular expressions.
A large number of tools other than Python incorporate regular
expressions as part of their functionality. Unix-oriented
command-line tools like 'grep', 'sed', and 'awk' are mostly
wrappers for regular expression processing. Many text editors
allow search and/or replacement based on regular expressions.
Many programming languages, especially other scripting languages
such as Perl and TCL, build regular expressions into the heart of
the language. Even most command-line shells, such as Bash or the
Windows-console, allow restricted regular expressions as part of
their command syntax.
There are some variations in regular expression syntax between
different tools that use them, but for the most part regular
expressions are a "little language" that gets embedded inside
bigger languages like Python. The examples in this tutorial
section (and the documentation in the rest of the chapter) will
focus on Python syntax, but most of this chapter transfers
easily to working with other programming languages and tools.
As with most of this book, examples will be illustrated by use of
Python interactive shell sessions that readers can type
themselves, so that they can play with variations on the
examples. However, the [re] module has little reason to include a
function that simply illustrates matches in the shell. Therefore,
the examples assume that the small wrapper program below is
available:
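#---------- re_show.py ----------#
import re

def re_show(pat, s):
    # Wrap each match of pat in curly braces; re.M makes "^" and "$"
    # match at the start and end of every line of s.
    # Under Python 3.7+, an empty match directly after a non-empty one
    # is also wrapped, so a pattern like '.*' prints a trailing '{}'.
    print(re.compile(pat, re.M).sub(r"{\g<0>}", s.rstrip()), '\n')

s = '''Mary had a little lamb
And everywhere that Mary
went, the lamb was sure
to go'''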
Place the code in an external module and 'import' it. Those
new to regular expressions need not worry about what the above
function does for now. It is enough to know that the first
argument to 're_show()' will be a regular expression pattern,
and the second argument will be a string to be matched against.
Matching treats each line of the string separately for purposes
of matching beginnings and ends of lines.
The illustrated matches will be whatever is contained between
curly braces (and is typographically marked for emphasis).
TOPIC -- Matching Patterns in Text: The Basics
--------------------------------------------------------------------
The very simplest pattern matched by a regular expression is a
literal character or a sequence of literal characters. Anything
in the target text that consists of exactly those characters in
exactly the order listed will match. A lowercase character is not
identical with its uppercase version, and vice versa. A space in
a regular expression, by the way, matches a literal space in the
target (this is unlike most programming languages or command-line
tools, where a variable number of spaces separate keywords).
>>> from re_show import re_show, s
>>> re_show('a', s)
M{a}ry h{a}d {a} little l{a}mb
And everywhere th{a}t M{a}ry
went, the l{a}mb w{a}s sure
to go
>>> re_show('Mary', s)
{Mary} had a little lamb
And everywhere that {Mary}
went, the lamb was sure
to go
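The points about case and spaces can be checked directly with
're.findall()', which returns a list of the matched strings. This is
only a sketch; the text is repeated here so that the example stands
alone:

```python
import re

s = '''Mary had a little lamb
And everywhere that Mary
went, the lamb was sure
to go'''

# Case matters: 'mary' never occurs, but 'Mary' occurs twice.
print(re.findall('mary', s))
print(re.findall('Mary', s))

# A space in the pattern matches exactly one literal space.
print(re.findall('a little', s))
print(re.findall('a  little', s))   # two spaces in the pattern: no match
```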
-*-
A number of characters have special meanings to regular
expressions. A symbol with a special meaning can be matched,
but to do so it must be prefixed with the backslash character
(this includes the backslash character itself: to match one
backslash in the target, the regular expression should include
'\\'). In Python, a special way of quoting a string is
available that does not process backslash escapes. Since
regular expressions use many of the same backslash-prefixed
codes as do Python strings, it is usually easier to compose
regular expression strings by quoting them as "raw strings"
with an initial "r".
>>> from re_show import re_show
>>> s = '''Special characters must be escaped.*'''
>>> re_show(r'.*', s)
{Special characters must be escaped.*}
>>> re_show(r'\.\*', s)
Special characters must be escaped{.*}
>>> re_show('\\\\', r'Python \ escaped \ pattern')
Python {\} escaped {\} pattern
>>> re_show(r'\\', r'Regex \ escaped \ pattern')
Regex {\} escaped {\} pattern
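The following sketch shows why the last two patterns above are the
same regular expression, and also uses 're.escape()', a standard [re]
function that adds whatever backslashes are needed to match a string
literally. The variable names are illustrative only:

```python
import re

plain = '\\\\'   # ordinary string: four typed backslashes become two characters
raw = r'\\'      # raw string: the two typed backslashes are kept as they are

print(len(plain), len(raw))
print(plain == raw)

# re.escape() builds a pattern that matches its argument literally.
literal = re.escape('.*')
print(literal)
print(re.findall(literal, 'Special characters must be escaped.*'))
```

Either spelling hands the regular expression engine two backslashes,
which it reads as one literal backslash to match.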
-*-
Two special characters are used to mark the beginning and end
of a line: caret ("^") and dollar sign ("$"). To match a caret
or dollar sign as a literal character, it must be escaped (i.e.,
preceded by a backslash "\").
An interesting thing about the caret and dollar sign is that
they match zero-width patterns. That is, the length of the
string matched by a caret or dollar sign by itself is zero (but
the rest of the regular expression can still depend on the
zero-width match). Many regular expression tools provide
another zero-width pattern for word-boundary ("\b"). Words
might be divided by whitespace like spaces, tabs, newlines, or
other characters like nulls; the word-boundary pattern matches
the actual point where a word starts or ends, not the
particular whitespace characters.
>>> from re_show import re_show, s
>>> re_show(r'^Mary', s)
{Mary} had a little lamb
And everywhere that Mary
went, the lamb was sure
to go
>>> re_show(r'Mary$', s)
Mary had a little lamb
And everywhere that {Mary}
went, the lamb was sure
to go
>>> re_show(r'$','Mary had a little lamb')
Mary had a little lamb{}
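The word-boundary pattern is not shown in the session above, so here
is a brief sketch of it. The sample sentence is invented for
illustration:

```python
import re

text = 'the lamb went there, then to the theater'

# Without boundaries, 'the' also matches inside longer words.
print(len(re.findall(r'the', text)))

# With \b on both sides, only the standalone word matches.
print(len(re.findall(r'\bthe\b', text)))

# The boundary has zero width: the match spans only the three letters.
m = re.search(r'\bthe\b', text)
print(m.span())
```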
-*-
In regular expressions, a period can stand for any character.
Normally, the newline character is not included, but optional
switches can force inclusion of the newline character also (see
later documentation of [re] module functions). Using a period
in a pattern is a way of requiring that "something" occurs
here, without having to decide what.
Readers who are familiar with DOS command-line wildcards will
know the question mark as filling the role of "some character"
in command masks. But in regular expressions, the
question mark has a different meaning, and the period is used
as a wildcard.
>>> from re_show import re_show, s
>>> re_show(r'.a', s)
{Ma}ry {ha}d{ a} little {la}mb
And everywhere t{ha}t {Ma}ry
went, the {la}mb {wa}s sure
to go
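The newline exception mentioned above can be seen in a two-line
string. The 're.DOTALL' flag (also spelled 're.S') is one of the
optional switches; this is a minimal sketch rather than a full
treatment of flags:

```python
import re

two_lines = 'lamb\nwent'

# By default the period does not match a newline.
print(re.findall(r'b.w', two_lines))

# With re.DOTALL the period matches a newline as well.
print(re.findall(r'b.w', two_lines, re.DOTALL))
```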
-*-
A regular expression can have literal characters in it and also
zero-width positional patterns. Each literal character or positional
pattern is an atom in a regular expression. One may also group
several atoms together into a small regular expression that is
part of a larger regular expression. One might be inclined to
call such a grouping a "molecule," but normally it is also
called an atom.
In older Unix-oriented tools like grep, subexpressions must be
grouped with escaped parentheses, for example, '\(Mary\)'. In
Python (as with most more recent tools), grouping is done with
bare parentheses, but matching a literal parenthesis requires
escaping it in the pattern.
>>> from re_show import re_show, s
>>> re_show(r'(Mary)( )(had)', s)
{Mary had} a little lamb
And everywhere that Mary
went, the lamb was sure
to go
>>> re_show(r'\(.*\)', 'spam (and eggs)')
spam {(and eggs)}
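Grouping does more than bracket atoms: each parenthesized group can
be retrieved from a match object afterward. A short sketch using the
same patterns follows; the second pattern adds one bare group inside
the escaped parentheses as an illustration:

```python
import re

m = re.search(r'(Mary)( )(had)', 'Mary had a little lamb')
print(m.group(0))    # the whole match
print(m.groups())    # one entry per parenthesized group

# Escaped parentheses match literally; the bare pair inside them groups.
m2 = re.search(r'\((.*)\)', 'spam (and eggs)')
print(m2.group(0))
print(m2.group(1))
```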
-*-
Rather than name only a single character, a pattern in a
regular expression can match any of a set of characters.
A set of characters can be given as a simple list inside square
brackets, for example, '[aeiou]' will match any single lowercase
vowel. For letter or number ranges it may also have the first and
last letter of a range, with a dash in the middle; for example,
'[A-Ma-m]' will match any lowercase or uppercase letter in the
first half of the alphabet.
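Both kinds of character set can be tried on the first line of the
sample text. This is only a sketch, using 're.findall()' to list
each single-character match:

```python
import re

line = 'Mary had a little lamb'

# Any single lowercase vowel.
vowels = re.findall(r'[aeiou]', line)
print(vowels)

# Any letter, either case, from the first half of the alphabet.
first_half = re.findall(r'[A-Ma-m]', line)
print(''.join(first_half))
```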
Python (as with many tools) provides escape-style shortcuts to
the most commonly used character classes, such as '\s' for a
whitespace character and '\d' for a digit. One could always