/[0123456789]/</FONT></TT>

</BLOCKQUOTE>

<P>

The <TT><FONT FACE="Courier">[]</FONT></TT> operator can be used

with other items in the pattern. Consider these sample statements,

the first two of which do the same thing:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/a[0123456789]/  # matches a, followed by any digit<BR>

/a[0-9]/  # matches a, followed by any digit<BR>

/[a-zA-Z]/  # matches any letter of the alphabet</FONT></TT>

</BLOCKQUOTE>

<P>

The range <TT><FONT FACE="Courier">[a-z]</FONT></TT> matches any

lowercase letter, and the range <TT><FONT FACE="Courier">[A-Z]</FONT></TT>

matches any uppercase letter. The following pattern matches <TT><FONT FACE="Courier">aA</FONT></TT>,

<TT><FONT FACE="Courier">bX</FONT></TT>, and so on:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/[a-z][A-Z]/</FONT></TT>

</BLOCKQUOTE>
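<P>

As a minimal sketch of how two ranges combine, the following script tests a few sample strings against the pattern. The helper name <TT><FONT FACE="Courier">has_lower_then_upper</FONT></TT> and the sample strings are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when $text contains a lowercase letter
# immediately followed by an uppercase letter, and 0 otherwise.
sub has_lower_then_upper {
    my ($text) = @_;
    return $text =~ /[a-z][A-Z]/ ? 1 : 0;
}

# "aA", "bX", and "fooBar" match; "ab" and "A1" do not.
foreach my $sample ("aA", "bX", "ab", "A1", "fooBar") {
    printf "%-7s %s\n", $sample,
        has_lower_then_upper($sample) ? "matches" : "no match";
}
```

<P>

Note that the pattern is not anchored, so it succeeds if the two characters appear anywhere in the string.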

<P>

To match three letters in a row, it would be very cumbersome

to write something like this:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/[a-zA-Z][a-zA-Z][a-zA-Z]/</FONT></TT>

</BLOCKQUOTE>

<P>

This is where the special characters in Perl pattern searching

come into play. 

<H3><A NAME="SpecialCharactersinPerlPatternSearc">Special Characters

in Perl Pattern Searches</A></H3>

<P>

Here is a list of all the special characters in search strings

(I'll go into the detail of how they work later):

<UL>

<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">.</FONT></TT>

character matches any single character except a newline.

<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">+</FONT></TT>

character matches one or more occurrences of the preceding character.

<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">?</FONT></TT>

character matches zero or one occurrence of the preceding character.

<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">*</FONT></TT>

character matches zero or more occurrences of the preceding character.

<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">-</FONT></TT>

character is used to specify ranges in characters.

<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">[]</FONT></TT>

characters define a class of characters.

<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">^</FONT></TT>

character matches the beginning of a line.

<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">$</FONT></TT>

character matches the end of a line.

<LI><FONT COLOR=#000000>The </FONT><TT><FONT FACE="Courier">{}</FONT></TT>

characters specify the number of occurrences of the preceding character.

<LI><FONT COLOR=#000000>The</FONT> <TT><FONT FACE="Courier">|</FONT></TT>

character is the <TT><FONT FACE="Courier">OR</FONT></TT> operator

for more than one pattern.

</UL>
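<P>

Two items in this list, <TT><FONT FACE="Courier">{}</FONT></TT> and <TT><FONT FACE="Courier">|</FONT></TT>, can be sketched briefly as follows. The helper names <TT><FONT FACE="Courier">has_three_letters</FONT></TT> and <TT><FONT FACE="Courier">mentions_pet</FONT></TT>, and the words being searched for, are assumptions chosen only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: {3} asks for exactly three repetitions of the
# preceding item, so this is true when three letters appear in a row.
sub has_three_letters {
    my ($text) = @_;
    return $text =~ /[a-zA-Z]{3}/ ? 1 : 0;
}

# Hypothetical helper: | separates alternatives, so this is true when
# the text contains either "cat" or "dog".
sub mentions_pet {
    my ($text) = @_;
    return $text =~ /cat|dog/ ? 1 : 0;
}

foreach my $sample ("abc", "ab1cd", "hotdog", "bird") {
    print "$sample: three letters=", has_three_letters($sample),
          " pet=", mentions_pet($sample), "\n";
}
```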

<P>

The plus (<TT><FONT FACE="Courier">+</FONT></TT>) character specifies

&quot;one or more of the preceding character.&quot; Patterns

containing <TT><FONT FACE="Courier">+</FONT></TT> always try to

match as many characters as they can. For example, the pattern <TT><FONT FACE="Courier">/ka+/</FONT></TT>

matches any of these strings:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">kamran&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;#

returns &quot;ka&quot;<BR>

kaamran&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# returns &quot;kaa&quot;

<BR>

kaaaamran&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# returns &quot;kaaaa&quot;</FONT></TT>

</BLOCKQUOTE>
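<P>

The greedy behavior can be checked with a short script. This is only a sketch: the helper name <TT><FONT FACE="Courier">matched_text</FONT></TT> is an assumption, and the parentheses are added to the pattern so that the matched text can be read back from <TT><FONT FACE="Courier">$1</FONT></TT>.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text matched by /ka+/, or undef
# when there is no match. The parentheses capture the match in $1.
sub matched_text {
    my ($text) = @_;
    return $text =~ /(ka+)/ ? $1 : undef;
}

# The + takes every consecutive "a" it can find after the "k".
foreach my $name (qw(kamran kaamran kaaaamran)) {
    print "$name: ", matched_text($name), "\n";
}
```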

<P>

Another way to use the <TT><FONT FACE="Courier">+</FONT></TT>

operator is for matching more than one space. For example, Listing

7.7 takes an input line and splits the words into an array. The

array generated by this code will not include any empty items

caused by consecutive spaces. The match

<TT><FONT FACE="Courier">/ +/</FONT></TT> specifies &quot;one

or more space(s).&quot;

<HR>

<BLOCKQUOTE>

<B>Listing 7.7. Using the pattern matching </B><TT><B><FONT FACE="Courier">+</FONT></B></TT><B>

operator.<BR>

</B>

</BLOCKQUOTE>

<BLOCKQUOTE>

<TT><FONT FACE="Courier">#!/usr/bin/perl<BR>

$input = &lt;STDIN&gt;;<BR>

chomp ($input);&nbsp;&nbsp;# remove the trailing newline<BR>

@words = split (/ +/, $input);&nbsp;&nbsp;# split on runs of one or more spaces<BR>

foreach $i (@words) {<BR>

&nbsp;&nbsp;&nbsp;&nbsp;print $i . &quot;\n&quot;;<BR>

}</FONT></TT>

</BLOCKQUOTE>

<HR>

<P>

If you do not use the <TT><FONT FACE="Courier">+</FONT></TT> sign

to signify more than one space in the pattern, you'll wind up

with an empty array item for each space that immediately follows

another space. The pattern <TT><FONT FACE="Courier">/ /</FONT></TT>

specifies the start of a new word as soon as it sees a space.

If there are two spaces together, the second space triggers

the start of another new word, leaving an empty word between them.

By using the <TT><FONT FACE="Courier">+</FONT></TT>

sign, you are saying that &quot;one or more spaces together&quot;

mark the start of a new word.<P>
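<P>

The difference between the two patterns can be seen in a small sketch. The helper names and the sample input are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers that differ only in the split pattern.
sub split_on_single_space {
    my ($line) = @_;
    return split(/ /, $line);
}

sub split_on_space_runs {
    my ($line) = @_;
    return split(/ +/, $line);
}

my $input = "one  two three";    # two spaces after "one"

my @single = split_on_single_space($input);
my @runs   = split_on_space_runs($input);

# With / / the list is "one", "", "two", "three".
print scalar(@single), " items with / /\n";

# With / +/ the list is "one", "two", "three".
print scalar(@runs), " items with / +/\n";
```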

<CENTER>

<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>

<TR VALIGN=TOP><TD ><B>Tip</B></TD></TR>

<TR VALIGN=TOP><TD >

<BLOCKQUOTE>

If you are going to repeatedly search one scalar variable, call the <TT><FONT FACE="Courier">study()</FONT></TT> function on the scalar. The syntax is <TT><FONT FACE="Courier">study ($scalar);</FONT></TT>. Only one variable can be used with <TT><FONT FACE="Courier">study()</FONT></TT> at one time. Note that in Perl 5.16 and later, <TT><FONT FACE="Courier">study()</FONT></TT> does nothing, so this tip applies only to older interpreters.

</BLOCKQUOTE>



</TD></TR>

</TABLE></CENTER>

<P>

<P>

The asterisk (<TT><FONT FACE="Courier">*</FONT></TT>) special

character matches zero or more occurrences of the preceding character.

The asterisk can also be used with the <TT><FONT FACE="Courier">[]</FONT></TT>

classes:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/9*/&nbsp;&nbsp;&nbsp;&nbsp;# matches

an empty word, 9, 99, 999, ... and so on<BR>

/79*/&nbsp;&nbsp;&nbsp;# matches 7, 79, 799, 7999, ... and so

on<BR>

/ab*/&nbsp;&nbsp;&nbsp;# matches a, ab, abb, abbb, ... and so

on</FONT></TT>

</BLOCKQUOTE>

<P>

Because the asterisk matches zero or more occurrences, the pattern

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/[0-9]*/</FONT></TT>

</BLOCKQUOTE>

<P>

will match a number or an empty string, so it succeeds even on input

that contains no digits at all! So do not confuse the asterisk

with the plus operator. Consider this statement:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">@words = split (/[\t\n ]*/, $list);</FONT></TT>

</BLOCKQUOTE>

<P>

This matches zero or more occurrences of the space, newline, or

tab character. Because zero occurrences count as a match, the pattern

also matches the empty string between every pair of characters.

You'll wind up with an array of strings,

each of them one character long, holding all the non-whitespace

characters in the input line.
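<P>

A short sketch makes the difference visible. The helper names <TT><FONT FACE="Courier">split_on_star</FONT></TT> and <TT><FONT FACE="Courier">split_on_plus</FONT></TT> and the sample string are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the * allows an empty match, so the string is
# split between every pair of characters.
sub split_on_star {
    my ($list) = @_;
    return split(/[\t\n ]*/, $list);
}

# Hypothetical helper: the + needs at least one whitespace character,
# so the string is split only where whitespace actually occurs.
sub split_on_plus {
    my ($list) = @_;
    return split(/[\t\n ]+/, $list);
}

print join(":", split_on_star("ab cd")), "\n";    # a:b:c:d
print join(":", split_on_plus("ab cd")), "\n";    # ab:cd
```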

<P>

The <TT><FONT FACE="Courier">?</FONT></TT> character matches zero

or one occurrence of the preceding character. For example, the

following pattern will match <TT><FONT FACE="Courier">Apple</FONT></TT>

or <TT><FONT FACE="Courier">Aple</FONT></TT>, but not <TT><FONT FACE="Courier">Appple</FONT></TT>:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/App?le/</FONT></TT>

</BLOCKQUOTE>
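<P>

The following sketch uses the pattern <TT><FONT FACE="Courier">/App?le/</FONT></TT>, in which only the second <TT><FONT FACE="Courier">p</FONT></TT> is optional. The helper name <TT><FONT FACE="Courier">is_apple_variant</FONT></TT> is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: "Ap" is required, the next "p" may appear zero
# or one time, and "le" must follow immediately.
sub is_apple_variant {
    my ($word) = @_;
    return $word =~ /App?le/ ? 1 : 0;
}

# "Apple" and "Aple" match; "Appple" has one "p" too many.
foreach my $word (qw(Apple Aple Appple)) {
    print "$word: ", is_apple_variant($word) ? "matches" : "no match", "\n";
}
```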

<P>

Let's look at a sample program that searches for the use of scalars, hashes,

arrays, and possibly file handles. The code in Listing 7.8

will be enhanced in the next two sections. For the moment, let's

use the code in Listing 7.8 to see how the asterisk operator works

in pattern matches.

<HR>

<BLOCKQUOTE>

<B>Listing 7.8. Using the asterisk operator.<BR>

</B>

</BLOCKQUOTE>

<BLOCKQUOTE>

<TT><FONT FACE="Courier">&nbsp;1 #!/usr/bin/perl<BR>

&nbsp;2 # We will finish this program in the next section.<BR>

&nbsp;3 $scalars =&nbsp;&nbsp;0;

<BR>

&nbsp;4 $hashes =&nbsp;&nbsp;0;<BR>

&nbsp;5 $arrays =&nbsp;&nbsp;0;

<BR>

&nbsp;6 $handles =&nbsp;&nbsp;0;<BR>

&nbsp;7<BR>

&nbsp;8 while (&lt;STDIN&gt;) {<BR>

&nbsp;9&nbsp;&nbsp;&nbsp;&nbsp;

@words = split (/[\(\)\t ]+/);<BR>

10&nbsp;&nbsp;&nbsp;&nbsp; foreach $token (@words) {<BR>

11&nbsp;&nbsp;&nbsp;&nbsp; if ($token =~ /\$[_a-zA-Z][_0-9a-zA-Z]*/)

{<BR>

12&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;#

print (&quot;$token is a legal scalar variable\n&quot;);<BR>

13&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$scalars++;

<BR>

14&nbsp;&nbsp;&nbsp;&nbsp; } elsif ($token =~ /@[_a-zA-Z][_0-9a-zA-Z]*/)

{<BR>

15&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;#

print (&quot;$token is a legal array variable\n&quot;);<BR>

16&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$arrays++;

<BR>

17&nbsp;&nbsp;&nbsp;&nbsp; } elsif ($token =~ /%[_a-zA-Z][_0-9a-zA-Z]*/)

{<BR>

18&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;#

print (&quot;$token is a legal hash variable\n&quot;);<BR>

19&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$hashes++;

<BR>

20&nbsp;&nbsp;&nbsp;&nbsp; } elsif ($token =~ /\&lt;[A-Z][_0-9A-Z]*\&gt;/)

{<BR>

21&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;#

print (&quot;$token is probably a file handle\n&quot;);<BR>

22&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$handles++;

<BR>

23&nbsp;&nbsp;&nbsp;&nbsp; }<BR>

24&nbsp;&nbsp;&nbsp;&nbsp;}<BR>

25 }<BR>

26<BR>

27 print &quot; This file used scalars $scalars times\n&quot;;

<BR>

28 print &quot; This file used arrays&nbsp;&nbsp;$arrays&nbsp;&nbsp;times\n&quot;;

<BR>

29 print &quot; This file used hashes $hashes times\n&quot;;<BR>

30 print &quot; This file used handles $handles times\n&quot;;</FONT></TT>

</BLOCKQUOTE>

<HR>

<P>

Line 9 splits each incoming line into words, and line 10 loops over

them. Note how the pattern in line 9 splits words at spaces, tabs, and

parentheses. At line 11, we are looking for a word that contains

a <TT><FONT FACE="Courier">$</FONT></TT> followed by a letter

or underscore as the first character of the name, and then

any number of alphanumeric characters or underscores.

<P>

At lines 14 and 17, the same pattern is applied, except that

an at (<TT><FONT FACE="Courier">@</FONT></TT>) sign and a percent

(<TT><FONT FACE="Courier">%</FONT></TT>) sign are looked for instead

of a dollar (<TT><FONT FACE="Courier">$</FONT></TT>) sign in order

to search for arrays and hashes, respectively. At line 20, the

file handle is assumed to be a word in all caps enclosed in angle brackets,

starting with a letter, but possibly with digits and underscores after it.

<P>

The previous listing accepts a name if the pattern matches anywhere

in a word. However, we want the search to be limited to word boundaries.

For example, right now the script cannot distinguish between the

following three lines of input because they all match the pattern <TT><FONT FACE="Courier">/\$[_a-zA-Z][_0-9a-zA-Z]*/</FONT></TT>

somewhere in them:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">$catacomb<BR>

OBJ::$catacomb<BR>

#$catacomb#</FONT></TT>

</BLOCKQUOTE>
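<P>

A minimal sketch confirms that an unanchored pattern accepts all three lines. The helper name <TT><FONT FACE="Courier">mentions_scalar</FONT></TT> is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when a scalar-like name appears
# anywhere in $line, because the pattern has no anchors.
sub mentions_scalar {
    my ($line) = @_;
    return $line =~ /\$[_a-zA-Z][_0-9a-zA-Z]*/ ? 1 : 0;
}

# Single quotes keep Perl from interpolating $catacomb in the samples.
foreach my $line ('$catacomb', 'OBJ::$catacomb', '#$catacomb#') {
    print "$line: ", mentions_scalar($line) ? "matches" : "no match", "\n";
}
```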

<P>

A space in a pattern does not match tabs, newlines, and so on. Here are

the special characters to use in pattern matching to signify these

characters:<P>

<CENTER>

<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>

<TR VALIGN=TOP><TD WIDTH=92><TT><FONT FACE="Courier">\t</FONT></TT></TD>

<TD WIDTH=134>Tab</TD></TR>

<TR VALIGN=TOP><TD WIDTH=92><TT><FONT FACE="Courier">\n</FONT></TT></TD>

<TD WIDTH=134>Newline</TD></TR>

<TR VALIGN=TOP><TD WIDTH=92><TT><FONT FACE="Courier">\r</FONT></TT></TD>

<TD WIDTH=134>Carriage return</TD></TR>

<TR VALIGN=TOP><TD WIDTH=92><TT><FONT FACE="Courier">\f</FONT></TT></TD>

<TD WIDTH=134>Form feed</TD></TR>

<TR VALIGN=TOP><TD WIDTH=92><TT><FONT FACE="Courier">\\</FONT></TT></TD>

<TD WIDTH=134>Backslash (\)</TD></TR>

<TR VALIGN=TOP><TD WIDTH=92><TT><FONT FACE="Courier">\Q</FONT></TT> and <TT><FONT FACE="Courier">\E</FONT></TT>

</TD><TD WIDTH=134>Start and end of literal (quoted) text</TD></TR>

</TABLE></CENTER>

<P>

<P>

In general, you can escape any special character in a pattern

with the backslash (<TT><FONT FACE="Courier">\</FONT></TT>). The

backslash itself is escaped with another backslash. The <TT><FONT FACE="Courier">\Q</FONT></TT>

and <TT><FONT FACE="Courier">\E</FONT></TT> characters are used

in Perl to delimit the interpretation of any special characters.

When the Perl interpreter sees <TT><FONT FACE="Courier">\Q</FONT></TT>,

every character following <TT><FONT FACE="Courier">\Q</FONT></TT>

is not interpreted and is used literally until the pattern terminates

or Perl sees <TT><FONT FACE="Courier">\E</FONT></TT>. Here are

a few examples:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/\Q^Section$/ # match the string &quot;^Section$&quot;

literally.<BR>

/^Section$/&nbsp;&nbsp;&nbsp;# match a line with the solitary

word Section in it.<BR>

/\Q^Section\E$/ # match a line which ends with ^Section </FONT></TT>

</BLOCKQUOTE>
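<P>

A common use of <TT><FONT FACE="Courier">\Q</FONT></TT> and <TT><FONT FACE="Courier">\E</FONT></TT> is to quote the contents of a variable that is interpolated into a pattern. The following is a sketch under that assumption; the helper names and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $text contains $literal exactly as
# written, with any special characters in $literal taken literally.
sub contains_literal {
    my ($text, $literal) = @_;
    return $text =~ /\Q$literal\E/ ? 1 : 0;
}

# Hypothetical helper: the $ after \E is outside the quoted region,
# so it still acts as the end-of-line anchor.
sub ends_with_literal {
    my ($text, $literal) = @_;
    return $text =~ /\Q$literal\E$/ ? 1 : 0;
}

my $literal = '^Section';
foreach my $text ('see ^Section', '^Section here', 'Section') {
    print "$text: contains=", contains_literal($text, $literal),
          " ends=", ends_with_literal($text, $literal), "\n";
}
```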

<P>

To further clarify where the variable begins and ends, you can

use these anchors:<P>

<CENTER>

<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>

<TR VALIGN=TOP><TD WIDTH=42><TT><FONT FACE="Courier">\A</FONT></TT></TD>

<TD WIDTH=235>Match at beginning of string only</TD></TR>

<TR VALIGN=TOP><TD WIDTH=42><TT><FONT FACE="Courier">\Z</FONT></TT></TD>

<TD WIDTH=235>Match at end of string, or just before a final newline</TD></TR>

<TR VALIGN=TOP><TD WIDTH=42><TT><FONT FACE="Courier">\b</FONT></TT></TD>

<TD WIDTH=235>Match on word boundary</TD></TR>

<TR VALIGN=TOP><TD WIDTH=42><TT><FONT FACE="Courier">\B</FONT></TT></TD>

<TD WIDTH=235>Match inside word</TD></TR>