which is a tool for searching only. Text editors, for example,
may or may not allow replacement in their regular expression
search facility.
Python, being a general programming language, allows
sophisticated replacement patterns to accompany matches. Since
Python strings are immutable, [re] functions do not modify string
objects in place, but instead return the modified versions. But
as with functions in the [string] module, one can always rebind a
particular variable to the new string object that results from
[re] modification.
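For instance, a minimal illustration of such rebinding:
>>> import re
>>> s = 'Monty Python'
>>> s = re.sub('Python', 'Perl', s)   # rebind s to the new string
>>> s
'Monty Perl'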
Replacement examples in this tutorial will call a function
're_new()' that is a wrapper for the module function `re.sub()`.
Original strings will be defined above the call, and the modified
results will appear below the call and with the same style of
additional markup of changed areas as 're_show()' used. Be
careful to notice that the curly braces in the results displayed
will not be returned by standard [re] functions, but are only
added here for emphasis (as is the typography). Simply import the
following function in the examples below:
#---------- re_new.py ----------#
import re
def re_new(pat, rep, s):
    # wrap each replacement in curly braces to highlight what changed
    print re.sub(pat, '{'+rep+'}', s)
-*-
Let us take a look at a couple of modification examples that
build on what we have already covered. This one simply
substitutes some literal text for some other literal text. Notice
that `string.replace()` can achieve the same result and will be
faster in doing so.
>>> from re_new import re_new
>>> s = 'The zoo had wild dogs, bobcats, lions, and other wild cats.'
>>> re_new('cat','dog',s)
The zoo had wild dogs, bob{dog}s, lions, and other wild {dog}s.
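For comparison, the same replacement can be done with
`string.replace()` (no braces appear, of course, since no wrapper
function is involved):
>>> import string
>>> string.replace(s, 'cat', 'dog')
'The zoo had wild dogs, bobdogs, lions, and other wild dogs.'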
-*-
Most of the time, if you are using regular expressions to modify a
target text, you will want to match more general patterns than just
literal strings. Whatever is matched is what gets replaced (even if it
is several different strings in the target):
>>> from re_new import re_new
>>> s = 'The zoo had wild dogs, bobcats, lions, and other wild cats.'
>>> re_new('cat|dog','snake',s)
The zoo had wild {snake}s, bob{snake}s, lions, and other wild {snake}s.
>>> re_new(r'[a-z]+i[a-z]*','nice',s)
The zoo had {nice} dogs, bobcats, {nice}, and other {nice} cats.
-*-
It is nice to be able to insert a fixed string everywhere a
pattern occurs in a target text. But frankly, doing that is
not very context sensitive. A lot of times, we do not want
just to insert fixed strings, but rather to insert something
that bears much more relation to the matched patterns.
Fortunately, backreferences come to our rescue here. One can
use backreferences in the pattern matches themselves, but it is
even more useful to be able to use them in replacement
patterns. By using replacement backreferences, one can pick
and choose from the matched patterns to use just the parts of
interest.
As well as backreferencing, the examples below illustrate the
importance of whitespace in regular expressions. In most
programming code, whitespace is merely aesthetic. But the
examples differ solely in an extra space within the arguments
to the second call--and the return value is importantly
different.
>>> from re_new import re_new
>>> s = 'A37 B4 C107 D54112 E1103 XXX'
>>> re_new(r'([A-Z])([0-9]{2,4})',r'\2:\1',s)
{37:A} B4 {107:C} {5411:D}2 {1103:E} XXX
>>> re_new(r'([A-Z])([0-9]{2,4}) ',r'\2:\1 ',s)
{37:A }B4 {107:C }D54112 {1103:E }XXX
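Keep in mind that outside the 're_new()' wrapper, `re.sub()`
simply returns the modified string, without the emphasis braces:
>>> import re
>>> re.sub(r'([A-Z])([0-9]{2,4}) ', r'\2:\1 ', s)
'37:A B4 107:C D54112 1103:E XXX'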
-*-
This tutorial has already warned about the danger of matching
too much with regular expression patterns. But the danger is
so much more serious when one does modifications, that it is
worth repeating. If you replace a pattern that matches a
larger string than you thought of when you composed the
pattern, you have potentially deleted some important data from
your target.
It is always a good idea to try out regular expressions on
diverse target data that is representative of production usage.
Make sure you are matching what you think you are matching. A
stray quantifier or wildcard can make a surprisingly wide
variety of texts match what you thought was a specific pattern.
And sometimes you just have to stare at your pattern for a
while, or find another set of eyes, to figure out what is
really going on even after you see what matches. Familiarity
might breed contempt, but it also instills competence.
TOPIC -- Advanced Regular Expression Extensions
--------------------------------------------------------------------
Some very useful enhancements to basic regular expressions are
included with Python (and with many other tools). Many of
these do not strictly increase the power of Python's regular
expressions, but they -do- manage to make expressing them far
more concise and clear.
Earlier in the tutorial, the problems of matching too much were
discussed, and some workarounds were suggested. Python is nice
enough to make this easier by providing optional "non-greedy"
quantifiers. These quantifiers grab as little as possible
while still matching whatever comes next in the pattern
(instead of as much as possible).
Non-greedy quantifiers have the same syntax as regular greedy
ones, except with the quantifier followed by a question mark.
For example, a non-greedy pattern might look like:
'A[A-Z]*?B'. In English, this means "match an A, followed by
only as many capital letters as are needed to find a B."
One little thing to look out for is the fact that the pattern
'[A-Z]*?.' will always match zero capital letters: a longer
match is never needed to satisfy the following "any character"
pattern. If you use non-greedy quantifiers, watch out for
matching too little, which is a symmetric danger.
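A quick check of this, using a group to expose what the
non-greedy quantifier actually consumed (an empty string for
each match):
>>> import re
>>> re.findall(r'([A-Z]*?).', 'AB')
['', '']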
>>> from re_show import re_show
>>> s = '''-- I want to match the words that start
... -- with 'th' and end with 's'.
... this line matches just right
... this # thus # thistle'''
>>> re_show(r'th.*s',s)
-- I want to match {the words that s}tart
-- wi{th 'th' and end with 's}'.
{this line matches jus}t right
{this # thus # this}tle
>>> re_show(r'th.*?s',s)
-- I want to match {the words} {that s}tart
-- wi{th 'th' and end with 's}'.
{this} line matches just right
{this} # {thus} # {this}tle
>>> re_show(r'th.*?s ',s)
-- I want to match {the words }that start
-- with 'th' and end with 's'.
{this }line matches just right
{this }# {thus }# thistle
-*-
Modifiers can be used in regular expressions or as arguments to
many of the functions in [re]. A modifier affects, in one way
or another, the interpretation of a regular expression pattern.
A modifier, unlike an atom, is global to the particular
match--in itself, a modifier doesn't match anything, it instead
constrains or directs what the atoms match.
When used directly within a regular expression pattern, one or
more modifiers begin the whole pattern, as in '(?Limsux)'. For
example, to match the word 'cat' without regard to the case of
the letters, one could use '(?i)cat'. The same modifiers may
also be passed in as an optional flags argument, combining the
corresponding [re] constants with the bitwise-or operator '|',
but only to some functions in the [re] module, not to all. For
example, the two calls below are equivalent:
>>> import re
>>> re.search(r'(?Li)cat','The Cat in the Hat').start()
4
>>> re.search(r'cat','The Cat in the Hat',re.L|re.I).start()
4
However, some function calls in [re] have no argument for
modifiers. In such cases, you should either use the modifier
prefix pseudo-group or pre-compile the regular expression
rather than use it in string form. For example:
>>> import re
>>> re.split(r'(?i)th','Brillig and The Slithy Toves')
['Brillig and ', 'e Sli', 'y Toves']
>>> re.split(re.compile('th',re.I),'Brillig and The Slithy Toves')
['Brillig and ', 'e Sli', 'y Toves']
See the [re] module documentation for details on which
functions take which arguments.
-*-
The modifiers listed below are used in [re] expressions. Users
of other regular expression tools may be accustomed to a 'g'
option for "global" matching. These other tools take a line of
text as their default unit, and "global" means to match
multiple lines. Python takes the actual passed string as its
unit, so "global" is simply the default. To operate on a
single line, either the regular expressions have to be tailored
to look for appropriate begin-line and end-line characters, or
the strings being operated on should be split first using
`string.split()` or other means.
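For example, to treat each line separately, one might split
first and then apply the pattern per line (a minimal sketch):
>>> import re, string
>>> s = 'the cat\nthe hat'
>>> [re.sub('at', 'AT', line) for line in string.split(s, '\n')]
['the cAT', 'the hAT']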
#*--------- Regular expression modifiers ---------------#
* L (re.L) - Locale customization of \w, \W, \b, \B
* i (re.I) - Case-insensitive match
* m (re.M) - Treat string as multiple lines
* s (re.S) - Treat string as single line
* u (re.U) - Unicode customization of \w, \W, \b, \B
* x (re.X) - Enable verbose regular expressions
The single-line option ("s") allows the wildcard to match a
newline character (it won't otherwise). The multiple-line
option ("m") causes "^" and "$" to match the beginning and end
of each line in the target, not just the begin/end of the
target as a whole (the default). The insensitive option ("i")
ignores differences between the case of letters. The Locale
and Unicode options ("L" and "u") give different
interpretations to the word-boundary ("\b") and alphanumeric
("\w") escaped patterns--and their inverse forms ("\B" and
"\W").
The verbose option ("x") is somewhat different from the others.
Verbose regular expressions may contain nonsignificant
whitespace and inline comments. In a sense, this is also just
a different interpretation of regular expression patterns, but
it allows you to produce far more easily readable complex
patterns. Some examples follow in the sections below.
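As a brief preview of the verbose style, here is a minimal
sketch (the pattern and target are invented for illustration):
>>> import re
>>> pat = r'''(?x)
...     \d{3}        # three digits, e.g., a prefix
...     -            # a literal hyphen
...     \d{4}        # four more digits
... '''
>>> re.search(pat, 'call 555-1212').group()
'555-1212'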
-*-
Let's take a look first at how case-insensitive and single-line
options change the match behavior.
>>> from re_show import re_show
>>> s = '''MAINE # Massachusetts # Colorado #
... mississippi # Missouri # Minnesota #'''
>>> re_show(r'M.*[ise] ', s)
{MAINE # Massachusetts }# Colorado #
mississippi # {Missouri }# Minnesota #
>>> re_show(r'(?i)M.*[ise] ', s)
{MAINE # Massachusetts }# Colorado #
{mississippi # Missouri }# Minnesota #
>>> re_show(r'(?si)M.*[ise] ', s)
{MAINE # Massachusetts # Colorado #
mississippi # Missouri }# Minnesota #
Looking back to the definition of 're_show()', we can see it
was defined to explicitly use the multiline option. So
patterns displayed with 're_show()' will always be multiline.
Let us look at a couple of examples that use `re.findall()`
instead.
>>> from re_show import re_show
>>> s = '''MAINE # Massachusetts # Colorado #
... mississippi # Missouri # Minnesota #'''
>>> re_show(r'(?im)^M.*[ise] ', s)
{MAINE # Massachusetts }# Colorado #
{mississippi # Missouri }# Minnesota #
>>> import re
>>> re.findall(r'(?i)^M.*[ise] ', s)
['MAINE # Massachusetts ']
>>> re.findall(r'(?im)^M.*[ise] ', s)
['MAINE # Massachusetts ', 'mississippi # Missouri ']
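Equivalently, since `re.findall()` accepts a pre-compiled
pattern, the flag constants can be combined with '|' instead of
the inline modifiers:
>>> re.findall(re.compile(r'^M.*[ise] ', re.I|re.M), s)
['MAINE # Massachusetts ', 'mississippi # Missouri ']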
-*-
Matching word characters and word boundaries depends on exactly
what gets counted as alphanumeric. Character encodings for
letters outside the (US-English) ASCII range differ among
national alphabets. A Python installation is configured for a
particular locale, and regular expressions can optionally use
the current one to match words.
Of greater long-term significance is the [re] module's ability
(after Python 2.0) to look at the Unicode categories of
characters and decide whether a character is alphabetic based on
that category. Locale settings work well enough for European
diacritics, but for non-Roman scripts, Unicode is clearer and
less error-prone.
The "u" modifier controls whether Unicode alphabetic characters
are recognized or merely ASCII ones:
>>> import re
>>> alef, omega = unichr(1488), unichr(969)
>>> u = alef +' A b C d '+omega+' X y Z'
>>> u, len(u.split()), len(u)
(u'\u05d0 A b C d \u03c9 X y Z', 9, 17)
>>> ':'.join(re.findall(ur'\b\w\b', u))
u'A:b:C:d:X:y:Z'
>>> ':'.join(re.findall(ur'(?u)\b\w\b', u))
u'\u05d0:A:b:C:d:\u03c9:X:y:Z'