/[0123456789]/</FONT></TT>
</BLOCKQUOTE>
<P>
The <TT><FONT FACE="Courier">[]</FONT></TT> character class can be combined
with other items in a pattern. The first two sample statements below
do the same thing; the third matches any single letter:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/a[0123456789]/ # matches a, followed by any digit<BR>
/a[0-9]/ # matches a, followed by any digit<BR>
/[a-zA-Z]/ # matches any letter of the alphabet</FONT></TT>
</BLOCKQUOTE>
<P>
The range <TT><FONT FACE="Courier">[a-z]</FONT></TT> matches any
lowercase letter, and the range <TT><FONT FACE="Courier">[A-Z]</FONT></TT>
matches any uppercase letter. The following pattern matches <TT><FONT FACE="Courier">aA</FONT></TT>,
<TT><FONT FACE="Courier">bX</FONT></TT>, and so on:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/[a-z][A-Z]/</FONT></TT>
</BLOCKQUOTE>
<P>
To match three letters in a row, it would be very cumbersome
to write something like this:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/[a-zA-Z][a-zA-Z][a-zA-Z]/</FONT></TT>
</BLOCKQUOTE>
<P>
This is where the special characters in Perl pattern searching
come into play.
<H3><A NAME="SpecialCharactersinPerlPatternSearc">Special Characters
in Perl Pattern Searches</A></H3>
<P>
Here is a list of the most common special characters in search patterns
(I'll go into the details of how they work later):
<UL>
<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">.</FONT></TT>
character matches any single character except a newline.
<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">+</FONT></TT>
character matches one or more occurrences of the preceding character.
<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">?</FONT></TT>
character matches zero or one occurrence of the preceding character.
<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">*</FONT></TT>
character matches zero or more occurrences of the preceding character.
<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">-</FONT></TT>
character specifies a range of characters inside a class.
<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">[]</FONT></TT>
characters define a class of characters.
<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">^</FONT></TT>
character matches the beginning of a line.
<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">$</FONT></TT>
character matches the end of a line.
<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">{}</FONT></TT>
characters specify how many times the preceding character must occur.
<LI><FONT COLOR=#000000>The</FONT> <TT><FONT FACE="Courier">|</FONT></TT>
character is the <TT><FONT FACE="Courier">OR</FONT></TT> operator
for choosing between alternative patterns.
</UL>
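<P>
As a rough illustration of the <TT><FONT FACE="Courier">{}</FONT></TT>
characters from the list above, the following sketch replaces the repeated
three-letter class with a single repetition count. The sample words are
hypothetical and were chosen only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample words, chosen only for illustration.
my @samples = ('ab', 'abc', 'abcd', 'a1b2');

my @hits;
foreach my $word (@samples) {
    # {3} repeats the preceding character class exactly three times,
    # so this behaves like /[a-zA-Z][a-zA-Z][a-zA-Z]/.
    push @hits, $word if $word =~ /[a-zA-Z]{3}/;
}

print "Three letters in a row: @hits\n";
```

<P>
Only the words that contain three consecutive letters are kept, so
<TT><FONT FACE="Courier">ab</FONT></TT> and <TT><FONT FACE="Courier">a1b2</FONT></TT>
are skipped.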
<P>
The plus (<TT><FONT FACE="Courier">+</FONT></TT>) character specifies
"one or more of the preceding character." Patterns
containing <TT><FONT FACE="Courier">+</FONT></TT> always try to
match as many characters as they can. For example, the pattern <TT><FONT FACE="Courier">/ka+/</FONT></TT>
matches each of these strings; the comment shows the portion that is matched:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">kamran # matches "ka"<BR>
kaamran # matches "kaa"<BR>
kaaaamran # matches "kaaaa"</FONT></TT>
</BLOCKQUOTE>
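<P>
A minimal sketch of this greedy behavior follows. It assumes capturing
parentheses, which are not covered in this section: whatever the part of
the pattern inside <TT><FONT FACE="Courier">()</FONT></TT> matches is
saved in <TT><FONT FACE="Courier">$1</FONT></TT>.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @matched;
foreach my $name ('kamran', 'kaamran', 'kaaaamran') {
    # The parentheses capture the text matched by ka+ into $1.
    if ($name =~ /(ka+)/) {
        push @matched, $1;
        print "$name: matched \"$1\"\n";
    }
}
```

<P>
In each case the <TT><FONT FACE="Courier">+</FONT></TT> takes every
available <TT><FONT FACE="Courier">a</FONT></TT> before the match stops.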
<P>
Another way to use the <TT><FONT FACE="Courier">+</FONT></TT>
operator is for matching more than one space. For example, Listing
7.7 takes an input line and splits the words into an array. Because
the pattern treats a run of consecutive spaces as a single separator,
the resulting array contains no empty items. The match
<TT><FONT FACE="Courier">/ +/</FONT></TT> specifies "one
or more spaces."
<HR>
<BLOCKQUOTE>
<B>Listing 7.7. Using the pattern matching </B><TT><B><FONT FACE="Courier">+</FONT></B></TT><B>
operator.<BR>
</B>
</BLOCKQUOTE>
<BLOCKQUOTE>
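<TT><FONT FACE="Courier">#!/usr/bin/perl<BR>
$input = &lt;STDIN&gt;; # read one line from standard input<BR>
chomp ($input); # remove the trailing newline, if there is one<BR>
@words = split (/ +/, $input); # split on runs of one or more spaces<BR>
foreach $i (@words) {<BR>
print $i . "\n";<BR>
}</FONT></TT>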
</BLOCKQUOTE>
<HR>
<P>
If you do not use the <TT><FONT FACE="Courier">+</FONT></TT> sign
to signify more than one space in the pattern, you'll wind up
with an empty array item for each space that immediately follows
another space. The pattern <TT><FONT FACE="Courier">/ /</FONT></TT>
ends the current word at every single space.
If there are two spaces together, the second space ends
an empty word. By using the <TT><FONT FACE="Courier">+</FONT></TT>
sign, you are saying that "one or more spaces together"
separate one word from the next.<P>
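<P>
The difference can be sketched with a short comparison. The input string
here is a hypothetical example with two spaces between the first two words.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: two spaces between "one" and "two".
my $input = 'one  two three';

# A single space as the separator leaves an empty item between the two spaces.
my @single = split(/ /, $input);

# One or more spaces as the separator treats the run as a single break.
my @multi = split(/ +/, $input);

print scalar(@single), " items with / /\n";
print scalar(@multi), " items with / +/\n";
```

<P>
The first split yields four items, one of them empty, and the second
yields the three words.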
<CENTER>
<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR VALIGN=TOP><TD ><B>Tip</B></TD></TR>
<TR VALIGN=TOP><TD >
<BLOCKQUOTE>
If you are going to repeatedly search one scalar variable, call the <TT><FONT FACE="Courier">study()</FONT></TT> function on the scalar. The syntax is <TT><FONT FACE="Courier">study ($scalar);</FONT></TT>. Only one variable can be used with <TT><FONT FACE="Courier">study()</FONT></TT> at one time. Note that in Perl 5.16 and later, <TT><FONT FACE="Courier">study()</FONT></TT> does nothing.
</BLOCKQUOTE>
</TD></TR>
</TABLE></CENTER>
<P>
<P>
The asterisk (<TT><FONT FACE="Courier">*</FONT></TT>) special
character matches zero or more occurrences of the preceding character.
The asterisk can also be used with the <TT><FONT FACE="Courier">[]</FONT></TT>
classes:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/9*/ # matches an empty string, 9, 99, 999, ... and so on<BR>
/79*/ # matches 7, 79, 799, 7999, ... and so on<BR>
/ab*/ # matches a, ab, abb, abbb, ... and so on</FONT></TT>
</BLOCKQUOTE>
<P>
Because the asterisk matches zero or more occurrences, the pattern
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/[0-9]*/</FONT></TT>
</BLOCKQUOTE>
<P>
will match a string of digits or the empty string, so it succeeds even
on a line that contains no digits at all! So do not confuse the asterisk
with the plus operator. Consider this statement:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">@words = split (/[\t\n ]*/, $list);</FONT></TT>
</BLOCKQUOTE>
<P>
This matches zero or more occurrences of the space, newline, or
tab character. Because zero occurrences also count as a match, the
pattern matches the empty string between any two characters, so Perl
splits the input at every character. You'll wind up with an array of strings,
each of them one character long, holding the non-whitespace characters in
the input line.
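<P>
A short sketch makes the contrast with the plus operator concrete. The
value assigned to <TT><FONT FACE="Courier">$list</FONT></TT> is a
hypothetical example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input, chosen only for illustration.
my $list = "ab cd";

# With *, the separator may be empty, so the string breaks at every character.
my @chars = split(/[\t\n ]*/, $list);

# With +, the separator must contain at least one whitespace character.
my @words = split(/[\t\n ]+/, $list);

print "With *: ", join(':', @chars), "\n";
print "With +: ", join(':', @words), "\n";
```

<P>
The first line of output shows the individual letters and the second
shows the two words.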
<P>
The <TT><FONT FACE="Courier">?</FONT></TT> character matches zero
or one occurrence of the preceding character. For example, the
following pattern will match <TT><FONT FACE="Courier">Apple</FONT></TT>
or <TT><FONT FACE="Courier">Aple</FONT></TT>, but not <TT><FONT FACE="Courier">Appple</FONT></TT>:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/App?le/</FONT></TT>
</BLOCKQUOTE>
<P>
Let's look at a sample program that counts the uses of scalars, arrays,
hashes, and possibly file handles in its input. The code in Listing 7.8
will be enhanced in the next two sections. For the moment, let's
use the code in Listing 7.8 to see how the asterisk operator works
in pattern matches.
<HR>
<BLOCKQUOTE>
<B>Listing 7.8. Using the asterisk operator.<BR>
</B>
</BLOCKQUOTE>
<BLOCKQUOTE>
<TT><FONT FACE="Courier"> 1 #!/usr/bin/perl<BR>
 2 # We will finish this program in the next section.<BR>
 3 $scalars = 0;<BR>
 4 $hashes = 0;<BR>
 5 $arrays = 0;<BR>
 6 $handles = 0;<BR>
 7<BR>
 8 while (&lt;STDIN&gt;) {<BR>
 9 @words = split (/[\(\)\t ]+/);<BR>
10 foreach $token (@words) {<BR>
11 if ($token =~ /\$[_a-zA-Z][_0-9a-zA-Z]*/) {<BR>
12 # print ("$token is a legal scalar variable\n");<BR>
13 $scalars++;<BR>
14 } elsif ($token =~ /@[_a-zA-Z][_0-9a-zA-Z]*/) {<BR>
15 # print ("$token is a legal array variable\n");<BR>
16 $arrays++;<BR>
17 } elsif ($token =~ /%[_a-zA-Z][_0-9a-zA-Z]*/) {<BR>
18 # print ("$token is a legal hash variable\n");<BR>
19 $hashes++;<BR>
20 } elsif ($token =~ /\&lt;[A-Z][_0-9A-Z]*\&gt;/) {<BR>
21 # print ("$token is probably a file handle\n");<BR>
22 $handles++;<BR>
23 }<BR>
24 }<BR>
25 }<BR>
26<BR>
27 print " This file used scalars $scalars times\n";<BR>
28 print " This file used arrays $arrays times\n";<BR>
29 print " This file used hashes $hashes times\n";<BR>
30 print " This file used handles $handles times\n";</FONT></TT>
</BLOCKQUOTE>
<HR>
<P>
Lines 9 and 10 split the incoming stream into words. Note how
the pattern in line 9 splits words at spaces, tabs, and
parentheses. At line 11, we are looking for a word that starts
with a <TT><FONT FACE="Courier">$</FONT></TT>, has a letter
or underscore as the first character, and is followed
by any number of letters, digits, or underscores.
<P>
At lines 14 and 17, the same pattern is applied, except that
an at (<TT><FONT FACE="Courier">@</FONT></TT>) sign and a percent
(<TT><FONT FACE="Courier">%</FONT></TT>) sign are looked for instead
of a dollar (<TT><FONT FACE="Courier">$</FONT></TT>) sign in order
to search for arrays and hashes, respectively. At line 20, the
file handle is assumed to be a word in all caps enclosed in angle
brackets, starting with an uppercase letter and followed by
uppercase letters, digits, or underscores.
<P>
The patterns in the previous listing match a legal name anywhere
in a word. However, we want the search to be limited to word boundaries.
For example, right now the script cannot distinguish between the
following three lines of input because they all match the pattern <TT><FONT FACE="Courier">/\$[_a-zA-Z][_0-9a-zA-Z]*/</FONT></TT>
somewhere in them:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">$catacomb<BR>
OBJ::$catacomb<BR>
#$catacomb#</FONT></TT>
</BLOCKQUOTE>
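<P>
The problem can be sketched as follows. As one possible way to restrict
the match, this example assumes the <TT><FONT FACE="Courier">^</FONT></TT>
and <TT><FONT FACE="Courier">$</FONT></TT> anchors from the list of special
characters; it is an illustration, not the finished version of the listing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The three input lines from the text, in single quotes so that
# Perl does not try to interpolate $catacomb.
my @inputs = ('$catacomb', 'OBJ::$catacomb', '#$catacomb#');

my (@loose, @anchored);
foreach my $line (@inputs) {
    # Unanchored: succeeds if a legal name appears anywhere in the line.
    push @loose, $line if $line =~ /\$[_a-zA-Z][_0-9a-zA-Z]*/;

    # Anchored: the name must span the whole line.
    push @anchored, $line if $line =~ /^\$[_a-zA-Z][_0-9a-zA-Z]*$/;
}

print "Unanchored pattern matched ", scalar(@loose), " lines\n";
print "Anchored pattern matched ", scalar(@anchored), " line(s)\n";
```

<P>
The unanchored pattern accepts all three lines, while the anchored
pattern accepts only the first.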
<P>
A literal space in a pattern does not match tabs, newlines, and so on. Here are
the special characters to use in pattern matching to signify these
characters:<P>
<CENTER>
<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR VALIGN=TOP><TD WIDTH=92><TT><FONT FACE="Courier">\t</FONT></TT></TD>
<TD WIDTH=134>Tab</TD></TR>
<TR VALIGN=TOP><TD WIDTH=92><TT><FONT FACE="Courier">\n</FONT></TT></TD>
<TD WIDTH=134>Newline</TD></TR>
<TR VALIGN=TOP><TD WIDTH=92><TT><FONT FACE="Courier">\r</FONT></TT></TD>
<TD WIDTH=134>Carriage return</TD></TR>
<TR VALIGN=TOP><TD WIDTH=92><TT><FONT FACE="Courier">\f</FONT></TT></TD>
<TD WIDTH=134>Form feed.</TD></TR>
<TR VALIGN=TOP><TD WIDTH=92><TT><FONT FACE="Courier">\\</FONT></TT></TD>
<TD WIDTH=134>Backslash (\)</TD></TR>
<TR VALIGN=TOP><TD WIDTH=92><TT><FONT FACE="Courier">\Q</FONT></TT> and <TT><FONT FACE="Courier">\E</FONT></TT>
</TD><TD WIDTH=134>Start and end of literal (quoted) text</TD></TR>
</TABLE></CENTER>
<P>
<P>
In general, you can escape any special character in a pattern
with the backslash (<TT><FONT FACE="Courier">\</FONT></TT>). The
backslash itself is escaped with another backslash. The <TT><FONT FACE="Courier">\Q</FONT></TT>
and <TT><FONT FACE="Courier">\E</FONT></TT> characters are used
in Perl to delimit the interpretation of any special characters.
When the Perl interpreter sees <TT><FONT FACE="Courier">\Q</FONT></TT>,
every character following <TT><FONT FACE="Courier">\Q</FONT></TT>
is not interpreted and is used literally until the pattern terminates
or Perl sees <TT><FONT FACE="Courier">\E</FONT></TT>. Here are
a few examples:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/\Q^Section$/ # match the string "^Section$" literally.<BR>
/^Section$/ # match a line with the solitary word Section in it.<BR>
/\Q^Section\E$/ # match a line which ends with ^Section</FONT></TT>
</BLOCKQUOTE>
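<P>
A common use of <TT><FONT FACE="Courier">\Q</FONT></TT> and
<TT><FONT FACE="Courier">\E</FONT></TT> is to search for text held in a
variable without treating its special characters as operators. The
following is a minimal sketch; the variable names and input lines are
hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical text to search for; it contains the special character ^.
my $literal = '^Section';

# Hypothetical input lines, chosen only for illustration.
my @lines = ('^Section', 'Section', 'See ^Section');

my (@quoted_hits, @plain_hits);
foreach my $line (@lines) {
    # \Q...\E makes the ^ in $literal match literally; the $ after \E
    # is still an anchor for the end of the line.
    push @quoted_hits, $line if $line =~ /\Q$literal\E$/;

    # Without \Q, both ^ and $ act as anchors.
    push @plain_hits, $line if $line =~ /^Section$/;
}

print "With \\Q and \\E: ", join(', ', @quoted_hits), "\n";
print "Without: ", join(', ', @plain_hits), "\n";
```

<P>
The quoted pattern finds the two lines that end with the literal text
<TT><FONT FACE="Courier">^Section</FONT></TT>, and the plain pattern finds
only the line that consists of the word
<TT><FONT FACE="Courier">Section</FONT></TT>.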
<P>
To further clarify where the variable begins and ends, you can
use these anchors:<P>
<CENTER>
<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR VALIGN=TOP><TD WIDTH=42><TT><FONT FACE="Courier">\A</FONT></TT></TD>
<TD WIDTH=235>Match at beginning of string only</TD></TR>
<TR VALIGN=TOP><TD WIDTH=42><TT><FONT FACE="Courier">\Z</FONT></TT></TD>
<TD WIDTH=235>Match at end of string, or just before a trailing newline</TD></TR>
<TR VALIGN=TOP><TD WIDTH=42><TT><FONT FACE="Courier">\b</FONT></TT></TD>
<TD WIDTH=235>Match on word boundary</TD></TR>
<TR VALIGN=TOP><TD WIDTH=42><TT><FONT FACE="Courier">\B</FONT></TT></TD>
<TD WIDTH=235>Match anywhere except on a word boundary</TD></TR>