define these character classes with square brackets, but the
shortcuts can make regular expressions more compact and more
readable.
>>> from re_show import re_show, s
>>> re_show(r'[a-z]a', s)
Mary {ha}d a little {la}mb
And everywhere t{ha}t Mary
went, the {la}mb {wa}s sure
to go
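As a minimal sketch of the shortcut idea, the following compares
two common shortcuts with their bracketed equivalents. It uses
Python's standard re module rather than re_show, and the sample
text is invented for illustration.

```python
import re

# Invented sample text, for illustration only.
text = "Order 66 shipped on 2003-06-01 to Mary_L"

# \d is shorthand for [0-9] on ASCII text like this.
digits_short = re.findall(r'\d+', text)
digits_long = re.findall(r'[0-9]+', text)

# \w is shorthand for [A-Za-z0-9_]; re.ASCII keeps \w from also
# matching non-ASCII letters, so the two patterns agree.
words_short = re.findall(r'\w+', text, flags=re.ASCII)
words_long = re.findall(r'[A-Za-z0-9_]+', text)

print(digits_short)
print(digits_short == digits_long, words_short == words_long)
```

The shortcut and the bracketed class find the same substrings
here; the shortcut is simply shorter to read.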
-*-
The caret symbol can actually have two different meanings in regular
expressions. Most of the time, it means to match the zero-length
pattern for line beginnings. But if it is used at the beginning of a
character class, it reverses the meaning of the character class.
Everything not included in the listed character set is matched.
>>> from re_show import re_show, s
>>> re_show(r'[^a-z]a', s)
{Ma}ry had{ a} little lamb
And everywhere that {Ma}ry
went, the lamb was sure
to go
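The two meanings of the caret can be seen side by side in a short
sketch using the standard re module. The string s below is
assumed to be the same nursery rhyme that re_show exports; the
variable names are chosen for illustration.

```python
import re

# Assumed to match the sample string exported by re_show.
s = ("Mary had a little lamb\n"
     "And everywhere that Mary\n"
     "went, the lamb was sure\n"
     "to go")

# Outside a class, '^' is a zero-length anchor. With re.MULTILINE
# it matches at the beginning of every line.
line_starts = re.findall(r'^\w+', s, flags=re.MULTILINE)

# At the start of a class, '^' negates the class: any character
# that is NOT a lowercase letter, followed by 'a'.
negated = re.findall(r'[^a-z]a', s)

print(line_starts)
print(negated)
```

Note that the negated class also matches the space in " a", since
a space is not a lowercase letter.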
-*-
Using character classes is a way of indicating that either one
thing or another thing can occur in a particular spot. But
what if you want to specify that either of two whole
subexpressions occur in a position in the regular expression?
For that, you use the alternation operator, the vertical bar
("|"). This is the symbol that is also used to indicate a pipe
in Unix/DOS shells and is sometimes called the pipe character.
The pipe character in a regular expression indicates an
alternation between everything in the group enclosing it. What
this means is that even if there are several groups to the left
and right of a pipe character, the alternation greedily asks
for everything on both sides. To select the scope of the
alternation, you must define a group that encompasses the
patterns that may match. The example illustrates this:
>>> from re_show import re_show
>>> s2 = 'The pet store sold cats, dogs, and birds.'
>>> re_show(r'cat|dog|bird', s2)
The pet store sold {cat}s, {dog}s, and {bird}s.
>>> s3 = '=first first= # =second second= # =first= # =second='
>>> re_show(r'=first|second=', s3)
{=first} first= # =second {second=} # {=first}= # ={second=}
>>> re_show(r'(=)(first)|(second)(=)', s3)
{=first} first= # =second {second=} # {=first}= # ={second=}
>>> re_show(r'=(first|second)=', s3)
=first first= # =second second= # {=first=} # {=second=}
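The effect of grouping on the scope of an alternation can be
reproduced with the standard re module alone. The sketch below
assumes a hypothetical stand-in named show() that wraps each match
in braces, as the re_show output does; it is not the book's own
helper.

```python
import re

def show(pat, s):
    # Hypothetical stand-in for re_show: return s with every
    # match of pat wrapped in curly braces.
    return re.sub(pat, lambda m: '{' + m.group(0) + '}', s,
                  flags=re.MULTILINE)

s3 = '=first first= # =second second= # =first= # =second='

# Ungrouped: the alternatives are '=first' and 'second='.
unscoped = show(r'=first|second=', s3)

# Grouped: the alternatives are just 'first' and 'second', and
# the equal signs are required on both sides in either case.
scoped = show(r'=(first|second)=', s3)

print(unscoped)
print(scoped)
```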
-*-
One of the most powerful and common things you can do with
regular expressions is to specify how many times an atom occurs
in a complete regular expression. Sometimes you want to
specify something about the occurrence of a single character,
but very often you are interested in specifying the occurrence
of a character class or a grouped subexpression.
There is only one quantifier included with "basic" regular
expression syntax, the asterisk ("*"); in English this has the
meaning "some or none" or "zero or more." If you want to
specify that any number of an atom may occur as part of a
pattern, follow the atom by an asterisk.
Without quantifiers, grouping expressions doesn't really serve
much purpose, but once we can add a quantifier to a
subexpression we can say something about the occurrence of the
subexpression as a whole. Take a look at the example:
>>> from re_show import re_show
>>> s = '''Match with zero in the middle: @@
... Subexpression occurs, but...: @=!=ABC@
... Lots of occurrences: @=!==!==!==!==!=@
... Must repeat entire pattern: @=!==!=!==!=@'''
>>> re_show(r'@(=!=)*@', s)
Match with zero in the middle: {@@}
Subexpression occurs, but...: @=!=ABC@
Lots of occurrences: {@=!==!==!==!==!=@}
Must repeat entire pattern: @=!==!=!==!=@
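To see why the group matters, the following sketch compares the
grouped pattern with an ungrouped variant in which the asterisk
applies only to the last atom. It uses the standard re module,
and the sample strings are invented for illustration.

```python
import re

# With the group, '*' repeats the whole subexpression '=!='.
grouped = re.compile(r'@(=!=)*@')
# Without it, '*' applies only to the single preceding atom,
# the final '='.
ungrouped = re.compile(r'@=!=*@')

samples = ['@@', '@=!=@', '@=!==!=@', '@=!===@', '@=!@']

grouped_ok = [bool(grouped.fullmatch(t)) for t in samples]
ungrouped_ok = [bool(ungrouped.fullmatch(t)) for t in samples]

for text, g, u in zip(samples, grouped_ok, ungrouped_ok):
    print(text, g, u)
```

Only '@=!=@' is accepted by both patterns; everywhere else the two
disagree about what is being repeated.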
TOPIC -- Matching Patterns in Text: Intermediate
--------------------------------------------------------------------
In a certain way, the lack of any quantifier symbol after an atom
quantifies the atom anyway: It says the atom occurs exactly once.
Extended regular expressions add a few other useful numbers to
"once exactly" and "zero or more times." The plus sign ("+")
means "one or more times" and the question mark ("?") means
"zero or one times." These quantifiers are by far the most
common enumerations you wind up using.
If you think about it, you can see that the extended regular
expressions do not actually let you "say" anything the basic
ones do not. They just let you say it in a shorter and more
readable way. For example, '(ABC)+' is equivalent to
'(ABC)(ABC)*', and 'X(ABC)?Y' is equivalent to 'XABCY|XY'. If
the atoms being quantified are themselves complicated grouped
subexpressions, the question mark and plus sign can make things
a lot shorter.
>>> from re_show import re_show
>>> s = '''AAAD
... ABBBBCD
... BBBCD
... ABCCD
... AAABBBC'''
>>> re_show(r'A+B*C?D', s)
{AAAD}
{ABBBBCD}
BBBCD
ABCCD
AAABBBC
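The equivalences claimed above can be spot-checked with the
standard re module. The helper name agree() and the candidate
strings are assumptions made for this sketch, and agreement on a
handful of strings is a sanity check rather than a proof.

```python
import re

candidates = ['', 'ABC', 'ABCABC', 'ABCAB',
              'XY', 'XABCY', 'XABCABCY']

def agree(pat_a, pat_b, texts):
    # True if both patterns accept exactly the same candidates
    # when each must match the whole string.
    return all(
        bool(re.fullmatch(pat_a, t)) == bool(re.fullmatch(pat_b, t))
        for t in texts
    )

plus_same = agree(r'(ABC)+', r'(ABC)(ABC)*', candidates)
optional_same = agree(r'X(ABC)?Y', r'XABCY|XY', candidates)

print(plus_same, optional_same)
```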
-*-
Using extended regular expressions, you can specify arbitrary
pattern occurrence counts using a more verbose syntax than the
question mark, plus sign, and asterisk quantifiers. The curly
braces ("{" and "}") can surround a precise count of how many
occurrences you are looking for.
The most general form of the curly-brace quantification uses two
range arguments (the first must be no larger than the second, and
both must be non-negative integers). The occurrence count is
specified this way to fall between the minimum and maximum
indicated (inclusive). As shorthand, either argument may be left
empty: If so, the minimum/maximum is specified as zero/infinity,
respectively. If only one argument is used (with no comma in
there), exactly that number of occurrences is matched.
>>> from re_show import re_show
>>> s2 = '''aaaaa bbbbb ccccc
... aaa bbb ccc
... aaaaa bbbbbbbbbbbbbb ccccc'''
>>> re_show(r'a{5} b{,6} c{4,8}', s2)
{aaaaa bbbbb ccccc}
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc
>>> re_show(r'a+ b{3,} c?', s2)
{aaaaa bbbbb c}cccc
{aaa bbb c}cc
{aaaaa bbbbbbbbbbbbbb c}cccc
>>> re_show(r'a{5} b{6,} c{4,8}', s2)
aaaaa bbbbb ccccc
aaa bbb ccc
{aaaaa bbbbbbbbbbbbbb ccccc}
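The four forms of the curly-brace quantifier can be compared
directly by asking which run lengths each one accepts. This is a
small sketch using the standard re module; the helper name
counts_matched() is an assumption of the example.

```python
import re

def counts_matched(pattern, max_len=10):
    # Run lengths n (0..max_len) for which the pattern matches
    # the string 'b' * n in full.
    return [n for n in range(max_len + 1)
            if re.fullmatch(pattern, 'b' * n)]

exact = counts_matched(r'b{3}')      # one argument, no comma
at_most = counts_matched(r'b{,6}')   # minimum left empty
at_least = counts_matched(r'b{8,}')  # maximum left empty
between = counts_matched(r'b{4,8}')  # both arguments given

print(exact)
print(at_most)
print(at_least)
print(between)
```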
-*-
One powerful option in creating search patterns is specifying
that a subexpression that was matched earlier in a regular
expression is matched again later in the expression. We do
this using backreferences. Backreferences are named by the
numbers 1 through 99, preceded by the backslash/escape
character when used in this manner. These backreferences refer
to each successive group in the match pattern, as in
'(one)(two)(three) \1\2\3'. Each numbered backreference refers
to the group that, in this example, has the word corresponding
to the number.
It is important to note something the example illustrates. What
gets matched by a backreference is the same literal string
matched the first time, even if the pattern that matched the
string could have matched other strings. Simply repeating the
same grouped subexpression later in the regular expression does
not match the same targets as using a backreference (but you have
to decide what it is you actually want to match in either case).
Backreferences refer back to whatever occurred in the previous
grouped expressions, in the order those grouped expressions
occurred. Up to 99 numbered backreferences may be used. However,
Python also allows naming backreferences, which can make it much
clearer what the backreferences are pointing to. The initial
pattern group must be written '(?P<name>...)', and the corresponding
backreference is written '(?P=name)'.
>>> from re_show import re_show
>>> s2 = '''jkl abc xyz
... jkl xyz abc
... jkl abc abc
... jkl xyz xyz
... '''
>>> re_show(r'(abc|xyz) \1', s2)
jkl abc xyz
jkl xyz abc
jkl {abc abc}
jkl {xyz xyz}
>>> re_show(r'(abc|xyz) (abc|xyz)', s2)
jkl {abc xyz}
jkl {xyz abc}
jkl {abc abc}
jkl {xyz xyz}
>>> re_show(r'(?P<let3>abc|xyz) (?P=let3)', s2)
jkl abc xyz
jkl xyz abc
jkl {abc abc}
jkl {xyz xyz}
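The same comparison can be made with the standard re module,
which also shows how a named group is read back from a match
object. The helper name hits() is an assumption of this sketch.

```python
import re

lines = ['jkl abc xyz', 'jkl xyz abc', 'jkl abc abc', 'jkl xyz xyz']

numbered = re.compile(r'(abc|xyz) \1')
named = re.compile(r'(?P<let3>abc|xyz) (?P=let3)')
repeated = re.compile(r'(abc|xyz) (abc|xyz)')

def hits(pattern):
    # Lines on which the compiled pattern matches somewhere.
    return [line for line in lines if pattern.search(line)]

numbered_hits = hits(numbered)
named_hits = hits(named)
repeated_hits = hits(repeated)

# A named group can be retrieved by name from the match object.
first_named = named.search('jkl xyz xyz').group('let3')

print(numbered_hits)
print(named_hits == numbered_hits, len(repeated_hits), first_named)
```

The backreference forms accept only the lines where the same
literal string occurs twice; repeating the group accepts all four.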
-*-
Quantifiers in regular expressions are greedy. That is, they
match as much as they possibly can.
Probably the easiest mistake to make in composing regular
expressions is to match too much. When you use a quantifier,
you want it to match everything (of the right sort) up to the
point where you want to finish your match. But when using the
'*', '+', or numeric quantifiers, it is easy to forget that the
last bit you are looking for might occur later in a line than
the one you are interested in.
>>> from re_show import re_show
>>> s2 = '''-- I want to match the words that start
... -- with 'th' and end with 's'.
... this
... thus
... thistle
... this line matches too much
... '''
>>> re_show(r'th.*s', s2)
-- I want to match {the words that s}tart
-- wi{th 'th' and end with 's}'.
{this}
{thus}
{this}tle
{this line matches} too much
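A minimal sketch of the same greedy behavior with the standard re
module follows. The second sample string is invented for
illustration.

```python
import re

line = 'this line matches too much'

# '.*' first runs to the end of the line, then backs off only as
# far as needed to find an 's', so the match ends at the LAST 's'.
greedy = re.search(r'th.*s', line).group(0)

# Numeric quantifiers are greedy too: {2,4} takes four when it can.
counted = re.search(r'a{2,4}', 'aaaaaa').group(0)

print(repr(greedy))
print(repr(counted))
```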
-*-
Often if you find that regular expressions are matching too much,
a useful procedure is to reformulate the problem in your mind.
Rather than thinking about, "What am I trying to match later in
the expression?" ask yourself, "What do I need to avoid matching
in the next part?" This often leads to more parsimonious pattern
matches. Often the way to avoid a pattern is to use the
complement operator and a character class. Look at the example,
and think about how it works.
The trick here is that there are two different ways of
formulating almost the same sequence. Either you can think you
want to keep matching -until- you get to XYZ, or you can think you
want to keep matching -unless- you get to XYZ. These are subtly
different.
For people who have thought about basic probability, the same
pattern occurs. The chance of rolling a 6 on a die in one roll is
1/6. What is the chance of rolling a 6 in six rolls? A naive
calculation puts the chance at 1/6+1/6+1/6+1/6+1/6+1/6, or 100
percent. This is wrong, of course (after all, the chance after
twelve rolls isn't 200 percent). The correct calculation is, "How
do I avoid rolling a 6 for six rolls?" (i.e.,
5/6*5/6*5/6*5/6*5/6*5/6, or about 33 percent). The chance of
getting a 6 is the same as the chance of not avoiding it (or about
67 percent). In fact, if you imagine transcribing a series of die
rolls, you could apply a regular expression to the written
record, and similar thinking applies.
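The die-roll arithmetic can be checked by brute force. The sketch
below assumes each sequence of six rolls is transcribed as a
six-character string of digits, then uses a negated character
class to count the records that avoid a 6. The variable names are
chosen for illustration.

```python
import re
from fractions import Fraction
from itertools import product

# Every possible written record of six rolls: 6**6 strings.
records = [''.join(r) for r in product('123456', repeat=6)]

# "Avoid a 6 for six rolls": six characters, none of them '6'.
no_six = sum(1 for rec in records
             if re.fullmatch(r'[^6]{6}', rec))

p_avoid = Fraction(no_six, len(records))
p_hit = 1 - p_avoid

print(p_avoid, float(p_avoid))
print(p_hit, float(p_hit))
```

The count of avoiding records works out to (5/6)**6 of the total,
roughly 33.5 percent, leaving roughly 66.5 percent for at least
one 6.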
>>> from re_show import re_show
>>> s2 = '''-- I want to match the words that start
... -- with 'th' and end with 's'.
... this
... thus
... thistle
... this line matches too much
... '''
>>> re_show(r'th[^s]*.', s2)
-- I want to match {the words} {that s}tart
-- wi{th 'th' and end with 's}'.
{this}
{thus}
{this}tle
{this} line matches too much
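The difference between the "until" and "unless" formulations can
be laid side by side with the standard re module. This is a
sketch on a shortened version of the sample text. The last
pattern, which ends in a literal 's' instead of '.', is a variant
added here for comparison and is not from the example above.

```python
import re

s2 = "this\nthus\nthistle\nthis line matches too much"

# "Match until the last s": greedy '.*' overshoots on the last line.
greedy_any = re.findall(r'th.*s', s2)

# "Match unless you hit an s": the negated class stops at the
# first 's', which the final '.' then consumes.
stop_at_s = re.findall(r'th[^s]*.', s2)

# Caveat: the final '.' accepts any character, so text with no
# 's' at all can still match. A literal 's' is stricter.
loose = re.findall(r'th[^s]*.', 'the end')
strict = re.findall(r'th[^s]*s', 'the end')

print(greedy_any)
print(stop_at_s)
print(loose, strict)
```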
-*-
Not all tools that use regular expressions allow you to modify
target strings. Some simply locate the matched pattern; the
most widely used regular expression tool is probably grep,