a pair of characters, the first character is not part of a word
and the second character is part of the word. This means that
the second character is the beginning of a word. Again, a word
boundary exists between the first and second characters matched
by the pattern. Therefore, you are at the start of a word.
<H3><A NAME="ThequotemetaFunction">The <TT><FONT SIZE=4 FACE="Courier">quotemeta</FONT></TT><FONT SIZE=4>
Function</FONT></A></H3>
<P>
The <TT><FONT FACE="Courier">quotemeta</FONT></TT> function puts
a backslash in front of any non-word character in a given string.
Here's the syntax for <TT><FONT FACE="Courier">quotemeta</FONT></TT>:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">$newstring = quotemeta($oldstring);</FONT></TT>
</BLOCKQUOTE>
<P>
The action of the <TT><FONT FACE="Courier">quotemeta</FONT></TT>
function can best be described using a regular-expression substitution:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">$string =~ s/(\W)/\\$1/g;</FONT></TT>
</BLOCKQUOTE>
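<P>
As a rough illustration of why this is useful, the sketch below escapes
a made-up file name and then interpolates the result into a pattern.
The sample strings and the <TT><FONT FACE="Courier">$literal</FONT></TT>
and <TT><FONT FACE="Courier">$lookalike</FONT></TT> variables are assumptions
for this example only.

```perl
use strict;
use warnings;

# Hypothetical input containing regex metacharacters.
my $oldstring = 'ch07.txt (draft)';
my $newstring = quotemeta($oldstring);
print "$newstring\n";

# Interpolated into a pattern, the escaped copy matches only the literal text.
my $literal   = ('see ch07.txt (draft) today' =~ /$newstring/) ? 1 : 0;
my $lookalike = ('see ch07Xtxt (draft) today' =~ /$newstring/) ? 1 : 0;
print "literal=$literal lookalike=$lookalike\n";
```

<P>
Without the escaping, the period would match any character and the
parentheses would be treated as pattern memory, so the second string
would match as well.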
<H3><A NAME="SpecifyingtheNumberofMatches">Specifying the Number
of Matches</A></H3>
<P>
Sometimes matching zero or more times, one or more times, or at most
once is not precise enough for a particular search. What if you wanted
to match from two to four times? In this case you can use the <TT><FONT FACE="Courier">{
}</FONT></TT> quantifier in the pattern. For example, the following
pattern matches the text
<TT><FONT FACE="Courier">ch</FONT></TT> followed by two or
three digits followed by <TT><FONT FACE="Courier">.txt</FONT></TT>:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/ch[0-9]{2,3}\.txt/</FONT></TT>
</BLOCKQUOTE>
<P>
For exactly three digits after the <TT><FONT FACE="Courier">ch</FONT></TT>
text, you can use this:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/ch[0-9]{3}\.txt/</FONT></TT>
</BLOCKQUOTE>
<P>
For three or more digits after the <TT><FONT FACE="Courier">ch</FONT></TT>
text, you can use this:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/ch[0-9]{3,}\.txt/</FONT></TT>
</BLOCKQUOTE>
<P>
To match any three characters following the <TT><FONT FACE="Courier">ch</FONT></TT>
text, you can use this:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/ch.{3}\.txt/</FONT></TT>
</BLOCKQUOTE>
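<P>
The sketch below tries the three digit-count forms against a handful of
made-up file names. The names, the array variables, and the added
<TT><FONT FACE="Courier">^</FONT></TT> and <TT><FONT FACE="Courier">$</FONT></TT>
anchors (used so that the whole name must fit) are assumptions for
illustration.

```perl
use strict;
use warnings;

# Hypothetical file names used only for illustration.
my @names = ('ch1.txt', 'ch12.txt', 'ch123.txt', 'ch1234.txt', 'chabc.txt');

my (@two_or_three, @exactly_three, @three_or_more);
for my $name (@names) {
    # ^ and $ anchor the match so the whole name must fit the pattern.
    push @two_or_three,  $name if $name =~ /^ch[0-9]{2,3}\.txt$/;
    push @exactly_three, $name if $name =~ /^ch[0-9]{3}\.txt$/;
    push @three_or_more, $name if $name =~ /^ch[0-9]{3,}\.txt$/;
}

print "two or three digits: @two_or_three\n";
print "exactly three digits: @exactly_three\n";
print "three or more digits: @three_or_more\n";
```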
<H3><A NAME="SpecifyingMoreThanOneChoice">Specifying More Than
One Choice</A></H3>
<P>
Perl enables you to specify more than one choice when attempting
to match a pattern. The pipe symbol (<TT><FONT FACE="Courier">|</FONT></TT>)
works like an <TT><FONT FACE="Courier">OR</FONT></TT> operator,
enabling you to specify two or more patterns to match. For example,
the pattern
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/houston|rockets/</FONT></TT>
</BLOCKQUOTE>
<P>
matches the string <TT><FONT FACE="Courier">houston</FONT></TT>
or the string <TT><FONT FACE="Courier">rockets</FONT></TT>, whichever
comes first. You can use special characters with the patterns.
For example, the pattern <TT><FONT FACE="Courier">/[a-z]+|[0-9]+/</FONT></TT>
matches one or more lowercase letters or one or more digits. The
match for a valid integer in Perl is defined as this:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/\b\d+\b|\b0[xX][\da-fA-F]+\b/</FONT></TT>
</BLOCKQUOTE>
<P>
There are two alternatives to check for here. The first one is
<TT><FONT FACE="Courier">\b\d+\b</FONT></TT> (that is, check for
one or more digits to cover both octal and decimal numbers). The
second, <TT><FONT FACE="Courier">\b0[xX][\da-fA-F]+\b</FONT></TT>,
looks for <TT><FONT FACE="Courier">0x</FONT></TT> or <TT><FONT FACE="Courier">0X</FONT></TT>
followed by hex digits. Any other pattern is disregarded. The
delimiting <TT><FONT FACE="Courier">\b</FONT></TT> anchors limit
the match to word boundaries.
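<P>
To see both alternatives at work, the following sketch pulls every
integer out of a made-up line of text. The sample line and the
<TT><FONT FACE="Courier">@found</FONT></TT> array are assumptions for
this example; the pattern is the one described above, with the
<TT><FONT FACE="Courier">g</FONT></TT> option added to collect all matches.

```perl
use strict;
use warnings;

# Hypothetical line of text mixing decimal, octal, and hex integers.
my $line = 'sizes: 42 0x1F 017 and 0XfF, not abc';

# In list context with the g option, the match returns every matched string.
my @found = $line =~ /\b\d+\b|\b0[xX][\da-fA-F]+\b/g;

print "integers: @found\n";
```

<P>
For <TT><FONT FACE="Courier">0x1F</FONT></TT> the first alternative fails,
because no word boundary exists between the leading
<TT><FONT FACE="Courier">0</FONT></TT> and the
<TT><FONT FACE="Courier">x</FONT></TT>, so the second alternative is tried.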
<H3><A NAME="SearchingaStringforMoreThanOnePat">Searching a String
for More Than One Pattern to Match</A></H3>
<P>
Sometimes it's necessary to find every location in a string at which
the same pattern matches. You saw earlier in
the example for using <TT><FONT FACE="Courier">substr</FONT></TT>
how we kept the index around between successive searches on one
string. Perl offers an alternative to this: the <TT><FONT FACE="Courier">pos()</FONT></TT>
function. The <TT><FONT FACE="Courier">pos</FONT></TT> function
returns the offset at which the last global (<TT><FONT FACE="Courier">g</FONT></TT>)
match on a string ended, which is also where the next
<TT><FONT FACE="Courier">g</FONT></TT> match on that string will start.
The syntax for the <TT><FONT FACE="Courier">pos</FONT></TT>
function is
<BLOCKQUOTE>
<TT><FONT FACE="Courier">$offset = pos($string);</FONT></TT>
</BLOCKQUOTE>
<P>
where <TT><FONT FACE="Courier">$string</FONT></TT> is the string
being matched. The returned <TT><FONT FACE="Courier">$offset</FONT></TT>
is the number of characters already matched or skipped, that is, the
zero-based offset of the character just after the last match.
<P>
Listing 7.11 presents a simple script to search for the letter
<TT><FONT FACE="Courier">n</FONT></TT> in <TT><FONT FACE="Courier">Bananarama</FONT></TT>.
<HR>
<BLOCKQUOTE>
<B>Listing 7.11. Using the </B><TT><B><FONT FACE="Courier">pos</FONT></B></TT><B>
function.<BR>
</B>
</BLOCKQUOTE>
<BLOCKQUOTE>
<TT><FONT FACE="Courier">#!/usr/bin/perl<BR>
$string = "Bananarama";<BR>
while ($string =~ /n/g) {<BR>
$offset = pos($string);<BR>
print("Found an n at $offset\n");<BR>
}</FONT></TT>
</BLOCKQUOTE>
<HR>
<P>
Here's the output for this program. Each offset is the position just
past the matched <TT><FONT FACE="Courier">n</FONT></TT>, so the letters
at offsets 2 and 4 are reported as 3 and 5:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">Found an n at 3<BR>
Found an n at 5</FONT></TT>
</BLOCKQUOTE>
<P>
The search does not have to start at offset 0. Like the <TT><FONT FACE="Courier">substr()</FONT></TT>
function, <TT><FONT FACE="Courier">pos()</FONT></TT> can be used
on the left side of the equal sign. To start a search at offset 5
(the sixth character), simply type this line before you process the string:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">pos($string) = 5;</FONT></TT>
</BLOCKQUOTE>
<P>
To restart searching from the beginning, reset the value of <TT><FONT FACE="Courier">pos</FONT></TT>
to <TT><FONT FACE="Courier">0</FONT></TT>.
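<P>
The sketch below puts both uses of <TT><FONT FACE="Courier">pos()</FONT></TT>
together: reading it after each match and assigning to it to choose where
the next search starts. The second loop, which looks for the letter
<TT><FONT FACE="Courier">a</FONT></TT>, and the two array names are
assumptions added for illustration.

```perl
use strict;
use warnings;

my $string = "Bananarama";

# Collect the offset reported after each match of the letter n.
my @n_offsets;
while ($string =~ /n/g) {
    push @n_offsets, pos($string);
}

# The failed final match reset pos(); assigning to it picks a new start.
pos($string) = 5;
my @a_offsets;
while ($string =~ /a/g) {
    push @a_offsets, pos($string);
}

print "n: @n_offsets\n";
print "a from offset 5: @a_offsets\n";
```

<P>
The position belongs to the string rather than to the pattern, which is
why the second loop, using a different pattern, still starts at offset 5.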
<H2><A NAME="ReusingPortionsofPatterns"><FONT SIZE=5 COLOR=#FF0000>Reusing
Portions of Patterns</FONT></A></H2>
<P>
There will be times when you want to write patterns that address
groups of numbers. For example, a section of comma-delimited data
from the output of a spreadsheet is of this form:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">digits,digits,digits,digits</FONT></TT>
</BLOCKQUOTE>
<P>
A bit repetitive, isn't it? To extract this tidbit of information
from the middle of a document, you could use something like this:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/[\d]+[,.][\d]+[,.][\d]+[,.][\d]+/</FONT></TT>
</BLOCKQUOTE>
<P>
What if there were 10 columns? The pattern would be long, and
you'd be prone to make mistakes.
<P>
Perl provides pattern memory to let you reuse part of a match. Every
portion of a pattern that is enclosed in
parentheses is stored in memory in the order it is declared. To retrieve
a sequence from memory, use the special character <TT><FONT FACE="Courier">\<I>n</I></FONT></TT>,
where <TT><I><FONT FACE="Courier">n</FONT></I></TT> is an integer
representing the <I>n</I>th sequence stored in memory. Note that
<TT><FONT FACE="Courier">\<I>n</I></FONT></TT> matches the exact text that
the <I>n</I>th parenthesized portion matched, not the subpattern itself.
<P>
For example, you can write the previous lines using these two
repetitive patterns:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">([\d]+)<BR>
([,.])</FONT></TT>
</BLOCKQUOTE>
<P>
The complete pattern would look like
this:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/([\d]+)([,.])\1\2\1\2\1/</FONT></TT>
</BLOCKQUOTE>
<P>
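The text matched by <TT><FONT FACE="Courier">[\d]+</FONT></TT>
is stored in memory. When the Perl interpreter sees the escape
sequence <TT><FONT FACE="Courier">\1</FONT></TT>, it matches the
same text that the first parenthesized portion matched. When it sees <TT><FONT FACE="Courier">\2</FONT></TT>,
it does the same for the second portion. Because a back reference repeats
the matched text rather than the subpattern, this pattern matches a row
such as <TT><FONT FACE="Courier">42,42,42,42</FONT></TT>, in which every
column holds the same digits and the same separator. Pattern sequences are stored in
memory from left to right. As another example, the following matches
a phone number in the United States, which is of the form ###-###-####,
where the # is a digit: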
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/\d{3}(\-)\d{3}\1\d{4}/</FONT></TT>
</BLOCKQUOTE>
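<P>
The following sketch checks a few made-up strings against two
back-reference patterns. The sample data, the
<TT><FONT FACE="Courier">%result</FONT></TT> hash, the added anchors, and
the <TT><FONT FACE="Courier">[-.]</FONT></TT> separator class in the phone
pattern are assumptions for illustration; the class is there only to show
that <TT><FONT FACE="Courier">\1</FONT></TT> forces the second separator to
equal the first.

```perl
use strict;
use warnings;

# Four identical columns: \1 and \2 repeat the captured text.
my $same_columns = qr/^([\d]+)([,.])\1\2\1\2\1$/;

# Phone number that accepts - or . as long as both separators agree.
my $phone = qr/^\d{3}([-.])\d{3}\1\d{4}$/;

my %result = (
    repeated_columns  => ('42,42,42,42'  =~ $same_columns) ? 1 : 0,
    different_columns => ('1,2,3,4'      =~ $same_columns) ? 1 : 0,
    dashes            => ('555-123-4567' =~ $phone)        ? 1 : 0,
    dots              => ('555.123.4567' =~ $phone)        ? 1 : 0,
    mixed             => ('555-123.4567' =~ $phone)        ? 1 : 0,
);

print "$_: $result{$_}\n" for sort keys %result;
```

<P>
When the columns hold different values, a repeated subpattern such as
<TT><FONT FACE="Courier">/[\d]+([,.][\d]+){3}/</FONT></TT> is the better fit,
because a quantifier repeats the subpattern rather than the matched text.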
<P>
The <TT><FONT FACE="Courier">\<I>n</I></FONT></TT> form is valid only
inside the pattern itself. After a successful match, you can read the
same values from the
special variables of the form <TT><FONT FACE="Courier">$n</FONT></TT>,
which keep their contents until the next successful pattern match.
The <TT><FONT FACE="Courier">$n</FONT></TT> variables contain
the text matched by each parenthesized portion of the pattern.
The special variable <TT><FONT FACE="Courier">$&</FONT></TT>
contains the entire matched text.
<P>
To get the data matched in the four columns shown earlier
into separate variables, you can use something like this excerpt
in a program:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">if (/(\d+)[,.](\d+)[,.](\d+)[,.](\d+)/) {<BR>
$matchedPart = $&;<BR>
$col_1 = $1;<BR>
$col_2 = $2;<BR>
$col_3 = $3;<BR>
$col_4 = $4;<BR>
}</FONT></TT>
</BLOCKQUOTE>
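<P>
A self-contained version of this idea might look like the sketch below.
The sample line is made up, and the pattern is written with one pair of
parentheses per column so that each of
<TT><FONT FACE="Courier">$1</FONT></TT> through
<TT><FONT FACE="Courier">$4</FONT></TT> has something to hold.

```perl
use strict;
use warnings;

# Hypothetical line with four comma-delimited columns in the middle.
my $line = 'totals: 10,20,30,40 (units)';
my ($matched_part, $col_1, $col_2, $col_3, $col_4);

if ($line =~ /(\d+)[,.](\d+)[,.](\d+)[,.](\d+)/) {
    # Copy the values right away; the next successful match overwrites them.
    $matched_part = $&;
    ($col_1, $col_2, $col_3, $col_4) = ($1, $2, $3, $4);
}

print "matched: $matched_part\n";
print "columns: $col_1 $col_2 $col_3 $col_4\n";
```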
<P>
The order of precedence when using <TT><FONT FACE="Courier">()</FONT></TT>
is higher than that of other pattern-matching characters. Here
is the order of precedence from high to low:<P>
<CENTER>
<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR VALIGN=TOP><TD WIDTH=167><TT><FONT FACE="Courier">()</FONT></TT></TD>
<TD WIDTH=171>Pattern memory</TD></TR>
<TR VALIGN=TOP><TD WIDTH=167><TT><FONT FACE="Courier">+ * ? {}</FONT></TT>
</TD><TD WIDTH=171>Number of occurrences</TD></TR>
<TR VALIGN=TOP><TD WIDTH=167><TT><FONT FACE="Courier">^ $ \b \B \W \w</FONT></TT>
</TD><TD WIDTH=171>Pattern anchors and character sequences</TD></TR>
<TR VALIGN=TOP><TD WIDTH=167><TT><FONT FACE="Courier">|</FONT></TT></TD>
<TD WIDTH=171>The <TT><FONT FACE="Courier">OR</FONT></TT> operator
</TD></TR>
</TABLE></CENTER>
<P>
<P>
The pattern-memory special characters <TT><FONT FACE="Courier">()</FONT></TT>
serve as delimiters for the <TT><FONT FACE="Courier">OR</FONT></TT>
operator. The side effect of this delimiting is that the parenthesized
part of the pattern is mapped into a <TT><FONT FACE="Courier">$n</FONT></TT>
register. For example, in the following line, the <TT><FONT FACE="Courier">\1</FONT></TT>
refers to (<TT><FONT FACE="Courier">b|d</FONT></TT>), not the
(<TT><FONT FACE="Courier">a|o</FONT></TT>) matching pattern:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/(b|d)(a|o)(rk).*\1\2\3/</FONT></TT>
</BLOCKQUOTE>
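<P>
The sketch below shows both effects of the parentheses: limiting the reach
of the <TT><FONT FACE="Courier">OR</FONT></TT> operator and storing text
for later reuse. The test strings and variable names are assumptions for
illustration.

```perl
use strict;
use warnings;

# Without parentheses, | splits the whole pattern: "^houston" OR "rockets$".
my $loose = qr/^houston|rockets$/;
# With parentheses, the anchors apply to both alternatives.
my $tight = qr/^(houston|rockets)$/;

my $loose_hit = ('houston astros' =~ $loose) ? 1 : 0;
my $tight_hit = ('houston astros' =~ $tight) ? 1 : 0;

# \1, \2 and \3 repeat whatever the three groups captured, in order.
my $echo = qr/(b|d)(a|o)(rk).*\1\2\3/;
my $same_word  = ('bark, then bark' =~ $echo) ? 1 : 0;
my $other_word = ('bark, then dork' =~ $echo) ? 1 : 0;

print "loose=$loose_hit tight=$tight_hit\n";
print "same_word=$same_word other_word=$other_word\n";
```

<P>
Because the <TT><FONT FACE="Courier">OR</FONT></TT> operator has the lowest
precedence, the unparenthesized pattern accepts any string that merely
starts with <TT><FONT FACE="Courier">houston</FONT></TT>.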
<H2><A NAME="PatternMatchingOptions"><FONT SIZE=5 COLOR=#FF0000>Pattern-Matching
Options</FONT></A></H2>
<P>
There are several pattern-matching options in Perl to control
how strings are matched. You saw these options earlier when I
introduced the syntax for pattern matching. Here are the options:
<P>
<CENTER>
<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR VALIGN=TOP><TD WIDTH=115><CENTER><TT><FONT FACE="Courier">g</FONT></TT></CENTER>
</TD><TD WIDTH=243>Match all possible patterns</TD></TR>
<TR VALIGN=TOP><TD WIDTH=115><CENTER><TT><FONT FACE="Courier">i</FONT></TT></CENTER>
</TD><TD WIDTH=243>Ignore case when matching strings</TD></TR>
<TR VALIGN=TOP><TD WIDTH=115><CENTER><TT><FONT FACE="Courier">m</FONT></TT></CENTER>
</TD><TD WIDTH=243>Treat string as multiple lines</TD></TR>
<TR VALIGN=TOP><TD WIDTH=115><CENTER><TT><FONT FACE="Courier">o</FONT></TT></CENTER>
</TD><TD WIDTH=243>Only evaluate once</TD></TR>
<TR VALIGN=TOP><TD WIDTH=115><CENTER><TT><FONT FACE="Courier">s</FONT></TT></CENTER>
</TD><TD WIDTH=243>Treat string as single line</TD></TR>
<TR VALIGN=TOP><TD WIDTH=115><CENTER><TT><FONT FACE="Courier">x</FONT></TT></CENTER>
</TD><TD WIDTH=243>Ignore white space in pattern</TD></TR>
</TABLE></CENTER>
<P>
<P>
All these pattern options must be specified immediately after
the closing delimiter of the pattern. For example, the following pattern uses the <TT><FONT FACE="Courier">i</FONT></TT>
option to ignore case:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/first*name/i</FONT></TT>
</BLOCKQUOTE>
<P>
More than one option can be specified at one time and can be specified
in any order.
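<P>
As a small sketch of combining options, the code below applies
<TT><FONT FACE="Courier">gi</FONT></TT> and <TT><FONT FACE="Courier">ig</FONT></TT>
to the same made-up text and then adds <TT><FONT FACE="Courier">m</FONT></TT>
and <TT><FONT FACE="Courier">x</FONT></TT>. The sample text, the patterns,
and the array names are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical two-line input.
my $text = "First name: Ada\nfirst NAME: Alan";

# The order of the options makes no difference.
my @gi = $text =~ /first name/gi;
my @ig = $text =~ /first name/ig;

# x ignores the spaces in the pattern; m lets ^ and $ match at each line.
my @names = $text =~ /^ first \s name : \s (\w+) $/gimx;

print "gi: @gi\n";
print "ig: @ig\n";
print "names: @names\n";
```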
<P>
The <TT><FONT FACE="Courier">g</FONT></TT> option tells the
Perl interpreter to match all the possible patterns in a string.
For example, if the string <TT><FONT FACE="Courier">bananarama</FONT></TT>
is searched using the following pattern:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">/.a/g</FONT></TT>
</BLOCKQUOTE>
<P>
it will match <TT><FONT FACE="Courier">ba</FONT></TT>, <TT><FONT FACE="Courier">na</FONT></TT>,
<TT><FONT FACE="Courier">na</FONT></TT>, <TT><FONT FACE="Courier">ra</FONT></TT>,
and <TT><FONT FACE="Courier">ma</FONT></TT>. You can assign the
return of all these matches to an array. Here's an