<BLOCKQUOTE>
The <TT>\B</TT> meta-sequence will
match everywhere except at a word boundary.
</BLOCKQUOTE>
<BLOCKQUOTE>
<B>Quoting Meta-Characters:</B> You can match meta-characters
literally by enclosing them in a <TT>\Q..\E</TT>
sequence. This will let you avoid using the backslash character
to escape all meta-characters, and your code will be easier to
read.
</BLOCKQUOTE>
<BLOCKQUOTE>
<B>Extended Syntax:</B> The <TT>(?...)</TT> sequence lets you use an extended
version of the regular expression syntax. The different options
are discussed in the section, "Example: Extension Syntax,"
later in this chapter.
</BLOCKQUOTE>
<BLOCKQUOTE>
<B>Combinations:</B> Any of the preceding components can be combined
with any other to create simple or complex patterns.
</BLOCKQUOTE>
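<P>
As a brief illustration of two of the components listed above, the
following sketch tries the <TT>\B</TT> meta-sequence and the
<TT>\Q..\E</TT> quoting sequence on a few sample strings. The strings
and variable names are made up for this example.
<BLOCKQUOTE>
<PRE>
use strict;
use warnings;

# \B matches only where there is no word boundary, so
# /\Bcat/ needs a word character just before "cat".
my $inside = ("tomcat"  =~ m/\Bcat/) ? 1 : 0;   # 1: "m" precedes "cat"
my $atEdge = ("cat nap" =~ m/\Bcat/) ? 1 : 0;   # 0: "cat" starts a word

# \Q..\E quotes the meta-characters between them, so the
# dot and the plus sign are matched literally.
my $literal = ("f(x) = 1.5+x" =~ m/\Q1.5+x\E/) ? 1 : 0;   # 1
my $noDot   = ("f(x) = 125+x" =~ m/\Q1.5+x\E/) ? 1 : 0;   # 0: "." is literal

print "$inside $atEdge $literal $noDot\n";   # 1 0 1 0
</PRE>
</BLOCKQUOTE>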
<P>
The power of patterns is that you don't always know in advance
the value of the string that you will be searching. If you need
to match the first word in a string that was read in from a file,
you probably have no idea how long it might be; therefore, you
need to build a pattern. You might start with the <TT>\w</TT>
symbolic character class, which will match any single alphanumeric
or underscore character. So, assuming that the string is in the
<TT>$_</TT> variable, you can match
a one-character word like this:
<BLOCKQUOTE>
<PRE>
m/\w/;
</PRE>
</BLOCKQUOTE>
<P>
If you need to match both a one-character word and a two-character
word, you can do this:
<BLOCKQUOTE>
<PRE>
m/\w|\w\w/;
</PRE>
</BLOCKQUOTE>
<P>
This pattern says to match a single word character or two consecutive
word characters. You could continue to add alternation components
to match the different lengths of words that you might expect
to see, but there is a better way.
<P>
You can use the <TT>+</TT> quantifier
to say that the match should succeed only if the component is
matched one or more times. It is used this way:
<BLOCKQUOTE>
<PRE>
m/\w+/;
</PRE>
</BLOCKQUOTE>
<P>
If the value of <TT>$_</TT> was <TT>"AAA BBB"</TT>,
then <TT>m/\w+/;</TT>
would match the <TT>"AAA"</TT>
in the string. If <TT>$_</TT> was
blank, full of white space, or full of other non-word characters,
the match would fail and a false value would be returned.
<P>
The preceding pattern will let you determine if <TT>$_</TT>
contains a word but does not let you know what the word is. In
order to accomplish that, you need to enclose the matching components
inside parentheses. For example:
<BLOCKQUOTE>
<PRE>
m/(\w+)/;
</PRE>
</BLOCKQUOTE>
<P>
By doing this, you force Perl to store the matched string into
the $1 variable. The $1 variable can be considered as pattern
memory.
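<P>
The steps above can be put together in a short sketch. The
<TT>firstWord</TT> function is a hypothetical helper written only for
this illustration; it returns whatever the parentheses stored in
<TT>$1</TT>, or an undefined value when the string contains no word.
<BLOCKQUOTE>
<PRE>
use strict;
use warnings;

# Hypothetical helper: return the first word of a string,
# or undef when the string holds no word characters.
sub firstWord {
    my ($text) = @_;
    return $1 if $text =~ m/(\w+)/;   # $1 holds what the parentheses matched
    return undef;
}

print firstWord("AAA BBB"), "\n";                                  # AAA
print defined(firstWord("  ...  ")) ? "word" : "no word", "\n";    # no word
</PRE>
</BLOCKQUOTE>
<P>
Checking the result of the match before reading <TT>$1</TT> matters,
because a failed match leaves <TT>$1</TT> holding whatever an earlier
successful match put there.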
<P>
This introduction to pattern components describes most of the
details you need to know in order to create your own patterns
or regular expressions. However, some of the components deserve
a bit more study. The next few sections look at character classes,
quantifiers, pattern memory, pattern precedence, and the extension
syntax. Then the rest of the chapter is devoted to showing specific
examples of when to use the different components.
<H3><A NAME="ExampleCharacterClasses">
Example: Character Classes</A></H3>
<P>
A character class defines a type of character. The character class
<TT>[0123456789]</TT> defines the
class of decimal digits, and <TT>[0-9a-f]</TT>
defines the class of hexadecimal digits. Notice that you can use
a dash to define a range of consecutive characters. Character
classes let you match any of a range of characters; you don't
know in advance which character will be matched. This capability
to match non-specific characters is what meta-characters are all
about.
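<P>
Here is a small sketch of the hexadecimal class in use. It assumes the
<TT>g</TT> (global) match option, which in list context returns every
match instead of only the first one. The sample string is invented for
the example.
<BLOCKQUOTE>
<PRE>
use strict;
use warnings;

# Collect every lowercase hexadecimal digit in the string by
# matching the class [0-9a-f] repeatedly with the g option.
my $text  = "color: #ff8800;";
my @found = ($text =~ m/[0-9a-f]/g);
my $count = scalar @found;

print "$count hex digits: @found\n";   # 7 hex digits: c f f 8 8 0 0
</PRE>
</BLOCKQUOTE>
<P>
Notice that the <TT>c</TT> in <TT>color</TT> is counted too. The class
only describes which characters are acceptable; it knows nothing about
where a hexadecimal number starts or ends.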
<P>
You can use variable interpolation inside the character class,
but you must be careful when doing so. For example,
<BLOCKQUOTE>
<PRE>
$_ = "AAABBBCCC";
$charList = "ADE";
print "matched" if m/[$charList]/;
</PRE>
</BLOCKQUOTE>
<P>
will display
<BLOCKQUOTE>
<PRE>
matched
</PRE>
</BLOCKQUOTE>
<P>
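This is because the variable interpolation results in a character
class of <TT>[ADE]</TT>. If you use
the variable as one-half of a character range, you need to ensure
that the range is valid: its starting character must not come after
its ending character. Mixing letters and digits is an easy way to get
this wrong. For example,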
<BLOCKQUOTE>
<PRE>
$_ = "AAABBBCCC";
$charList = "ADE";
print "matched" if m/[$charList-9]/;
</PRE>
</BLOCKQUOTE>
<P>
will result in an error message similar to the following when executed
(the exact wording depends on your version of Perl):
<BLOCKQUOTE>
<PRE>
/[ADE-9]/: invalid [] range in regexp at test.pl line 4.
</PRE>
</BLOCKQUOTE>
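<P>
If the character list comes from outside your program, you may prefer
to trap this error instead of letting it end the script. One possible
approach, sketched here, is to wrap the match in an <TT>eval</TT>
block. This works because a pattern that contains a variable is
compiled when the statement runs. The <TT>$ok</TT> variable is just a
name chosen for this example.
<BLOCKQUOTE>
<PRE>
use strict;
use warnings;

my $charList = "ADE";

# The pattern is compiled at run time because it contains a
# variable, so eval can trap the error instead of ending the program.
my $ok = eval { "AAABBBCCC" =~ m/[$charList-9]/; 1 };

if ($ok) {
    print "pattern compiled\n";
}
else {
    print "bad pattern: $@";   # $@ holds the text of the error message
}
</PRE>
</BLOCKQUOTE>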
<P>
At times, it's necessary to match on any character except for
a given character list. This is done by complementing the character
class with the caret. For example,
<BLOCKQUOTE>
<PRE>
$_ = "AAABBBCCC";
print "matched" if m/[^ABC]/;
</PRE>
</BLOCKQUOTE>
<P>
will display nothing. This match returns true only if a character
besides <TT>A</TT>, <TT>B</TT>,
or <TT>C</TT> is in the searched string.
If you complement a list with just the letter <TT>A</TT>,
<BLOCKQUOTE>
<PRE>
$_ = "AAABBBCCC";
print "matched" if m/[^A]/;
</PRE>
</BLOCKQUOTE>
<P>
then the string <TT>"matched"</TT>
will be displayed because <TT>B</TT>
and <TT>C</TT> are part of the string; in
other words, the string contains a character besides the letter <TT>A</TT>.
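<P>
The two complemented classes can be compared side by side. This is
only a sketch; the variable names are invented for the example.
<BLOCKQUOTE>
<PRE>
use strict;
use warnings;

my $text = "AAABBBCCC";

# [^ABC] needs one character that is not A, B, or C; there is none.
my $otherThanABC = ($text =~ m/[^ABC]/) ? "matched" : "not matched";

# [^A] needs one character that is not A; the first B qualifies.
my $otherThanA   = ($text =~ m/[^A]/)   ? "matched" : "not matched";

print "$otherThanABC, $otherThanA\n";   # not matched, matched
</PRE>
</BLOCKQUOTE>
<P>
Keep in mind that a complemented class still has to match one
character. It does not mean "none of these characters appear anywhere
in the string."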
<P>
Perl has shortcuts for some character classes that are frequently
used. Here is a list of what I call symbolic character classes:
<BR>
<p>
<CENTER>
<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR><TD WIDTH=67><CENTER><TT><B><FONT FACE="Courier">\w</FONT></B></TT></CENTER>
</TD><TD WIDTH=523>This symbol matches any alphanumeric character or the underscore character. It is equivalent to the character class <TT>[a-zA-Z0-9_]</TT>.
</TD></TR>
<TR><TD WIDTH=67><CENTER><TT><B><FONT FACE="Courier">\W</FONT></B></TT></CENTER>
</TD><TD WIDTH=523>This symbol matches every character that the <TT>\w</TT> symbol does not. In other words, it is the complement of <TT>\w</TT>. It is equivalent to <TT>[^a-zA-Z0-9_]</TT>.
</TD></TR>
<TR><TD WIDTH=67><CENTER><TT><B><FONT FACE="Courier">\s</FONT></B></TT></CENTER>
</TD><TD WIDTH=523>This symbol matches any whitespace character, such as a space, tab, or newline. It is roughly equivalent to <TT>[ \t\n\r\f]</TT>.
</TD></TR>
<TR><TD WIDTH=67><CENTER><TT><B><FONT FACE="Courier">\S</FONT></B></TT></CENTER>
</TD><TD WIDTH=523>This symbol matches any non-whitespace character. It is roughly equivalent to <TT>[^ \t\n\r\f]</TT>.
</TD></TR>
<TR><TD WIDTH=67><CENTER><TT><B><FONT FACE="Courier">\d</FONT></B></TT></CENTER>
</TD><TD WIDTH=523>This symbol matches any digit. It is equivalent to <TT>[0-9]</TT>.
</TD></TR>
<TR><TD WIDTH=67><CENTER><TT><B><FONT FACE="Courier">\D</FONT></B></TT></CENTER>
</TD><TD WIDTH=523>This symbol matches any non-digit character. It is equivalent to <TT>[^0-9]</TT>.
</TD></TR>
</TABLE>
</CENTER>
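<P>
To see the shortcuts at work, the following sketch pulls three pieces
out of a sample line. The line and the variable names are invented for
this illustration, and the patterns use parentheses and the
<TT>+</TT> quantifier to capture a run of characters.
<BLOCKQUOTE>
<PRE>
use strict;
use warnings;

my $line = "Order 66\tshipped";

my ($word)   = $line =~ m/(\w+)/;        # first run of word characters
my ($number) = $line =~ m/(\d+)/;        # first run of digits
my ($tail)   = $line =~ m/\d\s(\S+)/;    # non-whitespace after a digit and a tab

print "$word|$number|$tail\n";   # Order|66|shipped
</PRE>
</BLOCKQUOTE>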
<P>
<P>
You can use these symbols inside other character classes, but
not as endpoints of a range. For example, you can do the following:
<BLOCKQUOTE>
<PRE>
$_ = "\tAAA";
print "matched" if m/[\d\s]/;
</PRE>
</BLOCKQUOTE>
<P>
which will display
<BLOCKQUOTE>
<PRE>
matched
</PRE>
</BLOCKQUOTE>
<P>
because the value of <TT>$_</TT> includes
the tab character.<BR>
<p>
<CENTER>
<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR><TD><B>Tip</B></TD></TR>
<TR><TD>
<BLOCKQUOTE>
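Most meta-characters that appear inside the square brackets that define a character class are used in their literal sense. They lose their meta-meaning. The exceptions are the caret (when it is the first character), the dash (between two characters), the closing square bracket, and the backslash. This may be a little confusing at first. In fact, I have a tendency to forget this when evaluating
patterns.</BLOCKQUOTE>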
</TD></TR>
</TABLE>
</CENTER>
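<P>
A quick sketch makes the tip concrete. Inside the brackets below, the
dot, asterisk, plus sign, and question mark stand only for themselves.
The sample strings are invented for the example.
<BLOCKQUOTE>
<PRE>
use strict;
use warnings;

# Inside the brackets, the dot, asterisk, plus sign, and question
# mark are ordinary characters, so the class matches one of them.
my $hasOperator = ("a+b" =~ m/[.*+?]/) ? 1 : 0;   # 1: the literal "+"
my $plainText   = ("abc" =~ m/[.*+?]/) ? 1 : 0;   # 0: "." is not a wildcard here

print "$hasOperator $plainText\n";   # 1 0
</PRE>
</BLOCKQUOTE>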
<P>
<p>
<CENTER>
<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR><TD><B>Note</B></TD></TR>
<TR><TD>
<BLOCKQUOTE>
I think that most of the confusion regarding regular expressions lies in the fact that each character of a pattern might have several possible meanings. The caret could be an anchor, it could be a literal caret, or it could be used to complement a character
class. Therefore, it is vital that you decide which context any given pattern character or symbol is in before assigning a meaning to it.</BLOCKQUOTE>
</TD></TR>
</TABLE>
</CENTER>
<P>
<H3><A NAME="ExampleQuantifiers">
Example: Quantifiers</A></H3>
<P>
Perl provides several different quantifiers that let you specify
how many times a given component must be present before the match
is true. They are used when you don't know in advance how many
characters need to be matched. Table 10.6 lists the different
quantifiers that can be used.<BR>
<P>
<CENTER><B>Table 10.6 The Six Types of Quantifiers</B></CENTER>
<p>
<CENTER>
<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR><TD WIDTH=91><CENTER><I>Quantifier</I></CENTER></TD><TD WIDTH=492><I>Description</I>
</TD></TR>
<TR><TD WIDTH=91><CENTER>*</CENTER></TD><TD WIDTH=492>The component must be present zero or more times.
</TD></TR>
<TR><TD WIDTH=91><CENTER>+</CENTER></TD><TD WIDTH=492>The component must be present one or more times.
</TD></TR>
<TR><TD WIDTH=91><CENTER>?</CENTER></TD><TD WIDTH=492>The component must be present zero or one times.
</TD></TR>
<TR><TD WIDTH=91><CENTER>{n}</CENTER></TD><TD WIDTH=492>The component must be present n times.
</TD></TR>
<TR><TD WIDTH=91><CENTER>{n,}</CENTER></TD><TD WIDTH=492>The component must be present at least n times.
</TD></TR>
<TR><TD WIDTH=91><CENTER>{n,m}</CENTER></TD><TD WIDTH=492>The component must be present at least n times and no more than m times.
</TD></TR>
</TABLE>
</CENTER>
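<P>
The six quantifiers in Table 10.6 can be tried out with a short
sketch. Each pattern here is anchored with <TT>^</TT> and <TT>$</TT>
so that the quantified <TT>b</TT> has to account for everything
between the <TT>a</TT> and the <TT>c</TT>. The test strings are
invented for the example.
<BLOCKQUOTE>
<PRE>
use strict;
use warnings;

# Each pattern is anchored with ^ and $ so the quantifier
# must account for the whole string.
my @results = (
    ("ac"    =~ m/^ab*c$/)     ? 1 : 0,   # *      zero b's are allowed
    ("ac"    =~ m/^ab+c$/)     ? 1 : 0,   # +      at least one b is needed
    ("abc"   =~ m/^ab?c$/)     ? 1 : 0,   # ?      zero or one b
    ("abbc"  =~ m/^ab{2}c$/)   ? 1 : 0,   # {n}    exactly two b's
    ("abc"   =~ m/^ab{2,}c$/)  ? 1 : 0,   # {n,}   two or more are needed
    ("abbbc" =~ m/^ab{1,3}c$/) ? 1 : 0,   # {n,m}  between one and three
);

print "@results\n";   # 1 0 1 1 0 1
</PRE>
</BLOCKQUOTE>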
<P>
<P>
If you need to match a word whose length is unknown, you need
to use the <TT>+</TT> quantifier.
You can't use an <TT>*</TT> because
a zero length word makes no sense. So, the match statement might
look like this:
<BLOCKQUOTE>
<PRE>
m/\w+/;
</PRE>
</BLOCKQUOTE>
<P>
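This pattern will match <TT>"QQQ"</TT>
and <TT>"AAAAA"</TT> but
not <TT>""</TT>. Because the pattern is not anchored, it will also
find the word in <TT>" BBB"</TT> by skipping over the leading space.
If you anchor the pattern to the beginning of the string with the
caret, however, <TT>m/^\w+/</TT> will not match <TT>" BBB"</TT>.
In order to account for the leading white
space, which may or may not be at the beginning of a string, you
need to use the asterisk (<TT>*</TT>)
quantifier in conjunction with the <TT>\s</TT>
symbolic character class in the following way:
<BLOCKQUOTE>
<PRE>
m/^\s*\w+/;
</PRE>
</BLOCKQUOTE>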
<p>
<CENTER>
<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR><TD><B>Tip</B></TD></TR>
<TR><TD>
<BLOCKQUOTE>
Be careful when using the <TT>*</TT> quantifier because it can match an empty string, which might not be your intention. The pattern <TT>/b*/</TT> will match any string, even one without any <TT>b</TT> characters.
</BLOCKQUOTE>
</TD></TR>
</TABLE>
</CENTER>
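<P>
The following sketch shows the difference between the two quantifiers
on a string that contains no <TT>b</TT> characters at all. The
variable names are invented for the example.
<BLOCKQUOTE>
<PRE>
use strict;
use warnings;

my $text = "aaa";

my $star = ($text =~ m/b*/) ? 1 : 0;   # 1: zero b's is still a match
my $plus = ($text =~ m/b+/) ? 1 : 0;   # 0: at least one b is required

# The successful match is an empty string at the very start.
$text =~ m/(b*)/;
my $length = length($1);

print "$star $plus $length\n";   # 1 0 0
</PRE>
</BLOCKQUOTE>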
<P>
<P>
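At times, you may need to match an exact number of components.
The following match statement will be true only if at least five words,
each followed by white space, are present in the <TT>$_</TT> variable.
The newline at the end of the string supplies the white space after
the fifth word:
<BLOCKQUOTE>
<PRE>
$_ = "AA AB AC AD AE\n";
m/(\w+\s+){5}/;
</PRE>
</BLOCKQUOTE>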
<P>
In this example, we are matching at least one word character followed