a pair of characters, the first character is not part of a word

and the second character is part of the word. This means that

the second character is the beginning of a word. Again, a word

boundary exists between the first and second characters matched

by the pattern. Therefore, you are at the start of a word.

<H3><A NAME="ThequotemetaFunction">The <TT><FONT SIZE=4 FACE="Courier">quotemeta</FONT></TT><FONT SIZE=4>

Function</FONT></A></H3>

<P>

The <TT><FONT FACE="Courier">quotemeta</FONT></TT> function puts

a backslash in front of any non-word character in a given string.

Here's the syntax for <TT><FONT FACE="Courier">quotemeta</FONT></TT>:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">$newstring = quotemeta($oldstring);</FONT></TT>

</BLOCKQUOTE>

<P>

The action of the <TT><FONT FACE="Courier">quotemeta</FONT></TT>

function can best be described using a regular-expression substitution:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">$string =~ s/(\W)/\\$1/g;</FONT></TT>

</BLOCKQUOTE>
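<P>

As a minimal sketch of why this is useful, the following escapes a string so that it can be
interpolated into a pattern and matched literally. The sample strings are hypothetical and
chosen only for illustration.

```perl
use strict;
use warnings;

# A string containing regex metacharacters (hypothetical sample).
my $raw = 'ch01.txt (draft)';

# quotemeta backslashes every character that is not a letter, digit, or underscore.
my $quoted = quotemeta($raw);
print "$quoted\n";    # prints ch01\.txt\ \(draft\)

# The escaped string can now be used inside a pattern and is matched literally.
my $matches_literal = ('see ch01.txt (draft)' =~ /$quoted/) ? 1 : 0;
my $matches_other   = ('see ch01Xtxt (draft)' =~ /$quoted/) ? 1 : 0;
print "literal: $matches_literal, other: $matches_other\n";    # literal: 1, other: 0
```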

<H3><A NAME="SpecifyingtheNumberofMatches">Specifying the Number

of Matches</A></H3>

<P>

Sometimes matching exactly once, zero or more times, or one or more times is not

specific enough for a particular search. What if you wanted to match from two

to four times? In this case you can use the <TT><FONT FACE="Courier">{}</FONT></TT>

quantifier in the pattern. For example, the following pattern matches

any name that begins with <TT><FONT FACE="Courier">ch</FONT></TT> followed by two or

three digits followed by <TT><FONT FACE="Courier">.txt</FONT></TT>:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/ch[0-9]{2,3}\.txt/</FONT></TT>

</BLOCKQUOTE>

<P>

For exactly three digits after the <TT><FONT FACE="Courier">ch</FONT></TT>

text, you can use this:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/ch[0-9]{3}\.txt/</FONT></TT>

</BLOCKQUOTE>

<P>

For three or more digits after the <TT><FONT FACE="Courier">ch</FONT></TT>

text, you can use this:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/ch[0-9]{3,}\.txt/</FONT></TT>

</BLOCKQUOTE>

<P>

To match any three characters following the <TT><FONT FACE="Courier">ch</FONT></TT>

text, you can use this:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/ch.{3}\.txt/</FONT></TT>

</BLOCKQUOTE>
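<P>

The difference between the three digit quantifiers can be seen by running them against a few
file names. This is only a sketch: the file names are hypothetical, the dot is escaped so that
it matches a literal period, and <TT><FONT FACE="Courier">^</FONT></TT> and
<TT><FONT FACE="Courier">$</FONT></TT> anchors are added so that the whole name must match.

```perl
use strict;
use warnings;

# Hypothetical file names used only for illustration.
my @names = ('ch1.txt', 'ch12.txt', 'ch123.txt', 'ch1234.txt');

# {2,3} = two or three, {3} = exactly three, {3,} = three or more.
my @two_or_three  = grep { /^ch[0-9]{2,3}\.txt$/ } @names;
my @exactly_three = grep { /^ch[0-9]{3}\.txt$/ }   @names;
my @three_or_more = grep { /^ch[0-9]{3,}\.txt$/ }  @names;

print "2-3 digits: @two_or_three\n";     # ch12.txt ch123.txt
print "3 digits:   @exactly_three\n";    # ch123.txt
print "3+ digits:  @three_or_more\n";    # ch123.txt ch1234.txt
```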

<H3><A NAME="SpecifyingMoreThanOneChoice">Specifying More Than

One Choice</A></H3>

<P>

Perl enables you to specify more than one choice when attempting

to match a pattern. The pipe symbol (<TT><FONT FACE="Courier">|</FONT></TT>)

works like an <TT><FONT FACE="Courier">OR</FONT></TT> operator,

enabling you to specify two or more patterns to match. For example,

the pattern

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/houston|rockets/</FONT></TT>

</BLOCKQUOTE>

<P>

matches the string <TT><FONT FACE="Courier">houston</FONT></TT>

or the string <TT><FONT FACE="Courier">rockets</FONT></TT>, whichever

comes first. You can use special characters with the patterns.

For example, the pattern <TT><FONT FACE="Courier">/[a-z]+|[0-9]+/</FONT></TT>

matches one or more lowercase letters or one or more digits. The

match for a valid integer in Perl is defined as this:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/\b\d+\b|\b0[xX][\da-fA-F]+\b/</FONT></TT>

</BLOCKQUOTE>

<P>

There are two alternatives to check for here. The first one is

<TT><FONT FACE="Courier">\b\d+\b</FONT></TT> (that is, check for

one or more digits to cover both octal and decimal digits). The

second, <TT><FONT FACE="Courier">\b0[xX][\da-fA-F]+\b</FONT></TT>,

looks for <TT><FONT FACE="Courier">0x</FONT></TT> or <TT><FONT FACE="Courier">0X</FONT></TT>

followed by hex digits. Any other pattern is disregarded. The

delimiting <TT><FONT FACE="Courier">\b</FONT></TT> anchors limit

each alternative to word boundaries.
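<P>

A short sketch of the integer pattern in use follows. The helper name
<TT><FONT FACE="Courier">has_integer</FONT></TT> and the sample strings are hypothetical and
introduced only for this illustration.

```perl
use strict;
use warnings;

# Returns 1 if the text contains a decimal/octal or hexadecimal integer, else 0.
# has_integer is a hypothetical helper name, not part of Perl.
sub has_integer {
    my ($text) = @_;
    return $text =~ /\b\d+\b|\b0[xX][\da-fA-F]+\b/ ? 1 : 0;
}

for my $sample ('count 42', 'mask 0x1F', 'no digits here') {
    printf "%-16s => %d\n", $sample, has_integer($sample);
}
```

For <TT><FONT FACE="Courier">mask 0x1F</FONT></TT> the first alternative fails, because no
word boundary follows the <TT><FONT FACE="Courier">0</FONT></TT>, and the second alternative
matches instead.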

<H3><A NAME="SearchingaStringforMoreThanOnePat">Searching a String

for More Than One Pattern to Match</A></H3>

<P>

Sometimes it's necessary to find every location in a string at which

the same pattern matches. You saw earlier in

the example for using <TT><FONT FACE="Courier">substr</FONT></TT>

how we kept the index around between successive searches on one

string. Perl offers an alternative: the <TT><FONT FACE="Courier">pos()</FONT></TT>

function. The <TT><FONT FACE="Courier">pos</FONT></TT> function

returns the offset at which the last global (<TT><FONT FACE="Courier">g</FONT></TT>)

match on a string ended, which is also where the next

<TT><FONT FACE="Courier">g</FONT></TT> match on that string will start. The syntax for the <TT><FONT FACE="Courier">pos</FONT></TT>

function is

<BLOCKQUOTE>

<TT><FONT FACE="Courier">$offset = pos($string);</FONT></TT>

</BLOCKQUOTE>

<P>

where <TT><FONT FACE="Courier">$string</FONT></TT> is the string

whose pattern is being matched. The returned <TT><FONT FACE="Courier">$offset</FONT></TT>

is the number of characters already matched or skipped. 

<P>

Listing 7.11 presents a simple script to search for the letter

<TT><FONT FACE="Courier">n</FONT></TT> in <TT><FONT FACE="Courier">Bananarama</FONT></TT>.

<HR>

<BLOCKQUOTE>

<B>Listing 7.11. Using the </B><TT><B><FONT FACE="Courier">pos</FONT></B></TT><B>

function.<BR>

</B>

</BLOCKQUOTE>

<BLOCKQUOTE>

<TT><FONT FACE="Courier">#!/usr/bin/perl<BR>

$string = &quot;Bananarama&quot;;<BR>

# Each successful /g match resumes where the previous one ended.<BR>

while ($string =~ /n/g) {<BR>

&nbsp;&nbsp;&nbsp;&nbsp;$offset = pos($string);&nbsp;&nbsp;# offset just past the match<BR>

&nbsp;&nbsp;&nbsp;&nbsp;print(&quot;Found an n at $offset\n&quot;);<BR>

}</FONT></TT>

</BLOCKQUOTE>

<HR>

<P>

Here's the output for this program:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">Found an n at 3<BR>

Found an n at 5</FONT></TT>

</BLOCKQUOTE>

<P>

A search does not have to start at offset 0. Like the <TT><FONT FACE="Courier">substr()</FONT></TT>

function, <TT><FONT FACE="Courier">pos()</FONT></TT> can be used

on the left side of the equal sign. To start the next search at offset 5

(that is, at the sixth character), simply type this line before you process the string:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">pos($string) = 5;</FONT></TT>

</BLOCKQUOTE>

<P>

To restart searching from the beginning, reset the value of <TT><FONT FACE="Courier">pos</FONT></TT>

to <TT><FONT FACE="Courier">0</FONT></TT>.
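<P>

As a sketch of assigning to <TT><FONT FACE="Courier">pos()</FONT></TT>, the following skips
the first five characters of the string and then collects the offset reported after each match
of the letter <TT><FONT FACE="Courier">a</FONT></TT>. The
<TT><FONT FACE="Courier">@offsets</FONT></TT> array is introduced only for this illustration.

```perl
use strict;
use warnings;

my $string = "Bananarama";
my @offsets;

# Start the next /g search at offset 5, skipping the first five characters.
pos($string) = 5;

while ($string =~ /a/g) {
    push @offsets, pos($string);    # offset just past each matched "a"
}

print "Offsets after each match: @offsets\n";    # 6 8 10
```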

<H2><A NAME="ReusingPortionsofPatterns"><FONT SIZE=5 COLOR=#FF0000>Reusing

Portions of Patterns</FONT></A></H2>

<P>

There will be times when you want to write patterns that address

groups of numbers. For example, a section of comma-delimited data

from the output of a spreadsheet is of this form:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">digits,digits,digits,digits</FONT></TT>

</BLOCKQUOTE>

<P>

A bit repetitive, isn't it? To extract this tidbit of information

from the middle of a document, you could use something like this:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/[\d]+[,.][\d]+[,.][\d]+[,.][\d]+/</FONT></TT>

</BLOCKQUOTE>

<P>

What if there were 10 columns? The pattern would be long, and

you'd be prone to make mistakes.

<P>

Perl provides backreferences to allow repetitions of a known

sequence. Every part of a pattern that is enclosed in

parentheses has the text it matched stored in memory, in the order in which the parentheses open. To retrieve

a sequence from memory, use the escape sequence <TT><FONT FACE="Courier">\<I>n</I></FONT></TT>,

where <TT><I><FONT FACE="Courier">n</FONT></I></TT> is an integer

representing the <I>n</I>th parenthesized group stored in memory.

<P>

For example, you can write the previous lines using these two

repetitive patterns:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">([\d]+)<BR>

([,.])</FONT></TT>

</BLOCKQUOTE>

<P>

The string that is used for matching the pattern would look like

this:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/([\d]+)([,.])\1\2\1\2\1/</FONT></TT>

</BLOCKQUOTE>

<P>

The text matched by <TT><FONT FACE="Courier">[\d]+</FONT></TT>

is stored in memory. When the Perl interpreter sees the escape

sequence <TT><FONT FACE="Courier">\1</FONT></TT>, it matches the

same text that the first group matched. When it sees <TT><FONT FACE="Courier">\2</FONT></TT>,

it matches the text of the second group. Because a backreference repeats the

matched text rather than the pattern, this pattern matches only when every column

holds the same digits and every separator is the same character. Groups are numbered

from left to right. As another example, the following matches

a phone number in the United States, which is of the form ###-###-####,

where the # is a digit:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/\d{3}(\-)\d{3}\1\d{4}/</FONT></TT>

</BLOCKQUOTE>
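<P>

The effect of <TT><FONT FACE="Courier">\1</FONT></TT> is easier to see when the group can
match more than one character. The sketch below is a variation on the phone-number pattern,
not the pattern itself: it assumes that either a hyphen or a period is allowed as the
separator, and it adds <TT><FONT FACE="Courier">^</FONT></TT> and
<TT><FONT FACE="Courier">$</FONT></TT> anchors. The phone numbers are made up.

```perl
use strict;
use warnings;

# \1 matches the same text that the first group captured, so both
# separators must be identical (both hyphens or both periods).
my $phone = qr/^\d{3}([-.])\d{3}\1\d{4}$/;

for my $candidate ('713-555-0100', '713.555.0100', '713-555.0100') {
    my $verdict = $candidate =~ $phone ? 'match' : 'no match';
    print "$candidate: $verdict\n";
}
```

The first two candidates match. The third does not, because its second separator differs from
the text captured by the first group.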

<P>

The <TT><FONT FACE="Courier">\<I>n</I></FONT></TT> escape sequences are valid only inside

the pattern itself. After a successful match, you can still get at the remembered text,

at least until the next successful pattern match, by examining the

special variables of the form <TT><FONT FACE="Courier">$n</FONT></TT>.

The <TT><FONT FACE="Courier">$n</FONT></TT> variables contain

the text matched by each pair of parentheses in the most recent match.

The special variable <TT><FONT FACE="Courier">$&amp;</FONT></TT>

contains the entire matched text.

<P>

In the previous snippet of code, to get the data matched in columns

into separate variables, you can use something like this excerpt

in a program:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">if (/(\d+)[,.](\d+)[,.](\d+)[,.](\d+)/) {<BR>

$matchedPart = $&amp;;<BR>

$col_1 = $1;<BR>

$col_2 = $2;<BR>

$col_3 = $3;<BR>

$col_4 = $4;<BR>

}</FONT></TT>

</BLOCKQUOTE>
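<P>

A self-contained sketch of the same idea follows. The input line and the
<TT><FONT FACE="Courier">@cols</FONT></TT> array are hypothetical. The captures are copied
immediately because a later successful match overwrites them.

```perl
use strict;
use warnings;

# Hypothetical line of spreadsheet output embedded in other text.
my $line = 'totals: 10,20,30,40 (units)';
my @cols;

if ($line =~ /(\d+),(\d+),(\d+),(\d+)/) {
    # Copy the captures at once; the next successful match replaces $1..$4.
    @cols = ($1, $2, $3, $4);
    print "Matched text: $&\n";    # 10,20,30,40
    print "Columns: @cols\n";      # 10 20 30 40
}
```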

<P>

The order of precedence when using <TT><FONT FACE="Courier">()</FONT></TT>

is higher than that of other pattern-matching characters. Here

is the order of precedence from high to low:<P>

<CENTER>

<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>

<TR VALIGN=TOP><TD WIDTH=167><TT><FONT FACE="Courier">()</FONT></TT></TD>

<TD WIDTH=171>Pattern memory</TD></TR>

<TR VALIGN=TOP><TD WIDTH=167><TT><FONT FACE="Courier">+ * ? {}</FONT></TT>

</TD><TD WIDTH=171>Number of occurrences</TD></TR>

<TR VALIGN=TOP><TD WIDTH=167><TT><FONT FACE="Courier">^ $ \b \B \W \w</FONT></TT>

</TD><TD WIDTH=171>Pattern anchors and character sequences</TD></TR>

<TR VALIGN=TOP><TD WIDTH=167><TT><FONT FACE="Courier">|</FONT></TT></TD>

<TD WIDTH=171>The <TT><FONT FACE="Courier">OR</FONT></TT> operator

</TD></TR>

</TABLE></CENTER>

<P>

<P>

The pattern-memory special characters <TT><FONT FACE="Courier">()</FONT></TT>

serve as delimiters for the <TT><FONT FACE="Courier">OR</FONT></TT>

operator. The side effect of this delimiting is that the parenthesized

part of the pattern is mapped into a <TT><FONT FACE="Courier">$n</FONT></TT>

register. For example, in the following line, the <TT><FONT FACE="Courier">\1</FONT></TT>

refers to (<TT><FONT FACE="Courier">b|d</FONT></TT>), not the

(<TT><FONT FACE="Courier">a|o</FONT></TT>) matching pattern:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/(b|d)(a|o)(rk).*\1\2\3/</FONT></TT>

</BLOCKQUOTE>
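<P>

To see how the numbered groups behave in this pattern, the following sketch runs it against
three made-up strings. Each backreference must repeat the exact text that its group captured.

```perl
use strict;
use warnings;

# \1 repeats the text of (b|d), \2 the text of (a|o), \3 the text of (rk).
my $pattern = qr/(b|d)(a|o)(rk).*\1\2\3/;

for my $text ('bark at the bark', 'bark at the dork', 'dork meets dork') {
    if ($text =~ $pattern) {
        print "'$text' matches, repeated word is $1$2$3\n";
    }
    else {
        print "'$text' does not match\n";
    }
}
```

Only the first and third strings match, because the second never repeats the same word.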

<H2><A NAME="PatternMatchingOptions"><FONT SIZE=5 COLOR=#FF0000>Pattern-Matching

Options</FONT></A></H2>

<P>

There are several pattern-matching options in Perl to control

how strings are matched. You saw these options earlier when I

introduced the syntax for pattern matching. Here are the options:

<P>

<CENTER>

<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>

<TR VALIGN=TOP><TD WIDTH=115><CENTER><TT><FONT FACE="Courier">g</FONT></TT></CENTER>

</TD><TD WIDTH=243>Match all possible patterns</TD></TR>

<TR VALIGN=TOP><TD WIDTH=115><CENTER><TT><FONT FACE="Courier">i</FONT></TT></CENTER>

</TD><TD WIDTH=243>Ignore case when matching strings</TD></TR>

<TR VALIGN=TOP><TD WIDTH=115><CENTER><TT><FONT FACE="Courier">m</FONT></TT></CENTER>

</TD><TD WIDTH=243>Treat string as multiple lines</TD></TR>

<TR VALIGN=TOP><TD WIDTH=115><CENTER><TT><FONT FACE="Courier">o</FONT></TT></CENTER>

</TD><TD WIDTH=243>Only evaluate once</TD></TR>

<TR VALIGN=TOP><TD WIDTH=115><CENTER><TT><FONT FACE="Courier">s</FONT></TT></CENTER>

</TD><TD WIDTH=243>Treat string as single line</TD></TR>

<TR VALIGN=TOP><TD WIDTH=115><CENTER><TT><FONT FACE="Courier">x</FONT></TT></CENTER>

</TD><TD WIDTH=243>Ignore white space in pattern</TD></TR>

</TABLE></CENTER>

<P>

<P>

All these pattern options must be specified immediately after

the closing delimiter of the pattern. For example, the following pattern uses the <TT><FONT FACE="Courier">i</FONT></TT>

option to ignore case:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/first.*name/i</FONT></TT>

</BLOCKQUOTE>

<P>

More than one option can be specified at one time and can be specified

in any order.
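<P>

As a sketch of combining options, the following uses <TT><FONT FACE="Courier">g</FONT></TT>,
<TT><FONT FACE="Courier">i</FONT></TT>, <TT><FONT FACE="Courier">m</FONT></TT>, and
<TT><FONT FACE="Courier">x</FONT></TT> together on a made-up two-line string. In list context,
a <TT><FONT FACE="Courier">g</FONT></TT> match with one pair of parentheses returns the text
captured by every match.

```perl
use strict;
use warnings;

# Hypothetical two-line input.
my $text = "First Name: Ada\nLAST NAME: Lovelace";

# g: find every match      i: ignore case
# m: ^ matches at the start of each line
# x: white space inside the pattern is ignored, so \s+ is written explicitly
my @labels = $text =~ /^ (\w+) \s+ name/gimx;

print "Labels: @labels\n";    # First LAST
```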

<P>

The <TT><FONT FACE="Courier">g</FONT></TT> option tells the

Perl interpreter to find every match of the pattern in a string.

For example, if the string <TT><FONT FACE="Courier">bananarama</FONT></TT>

is searched using the following pattern:

<BLOCKQUOTE>

<TT><FONT FACE="Courier">/.a/g</FONT></TT>

</BLOCKQUOTE>

<P>

it will match <TT><FONT FACE="Courier">ba</FONT></TT>, <TT><FONT FACE="Courier">na</FONT></TT>,

<TT><FONT FACE="Courier">na</FONT></TT>, <TT><FONT FACE="Courier">ra</FONT></TT>,

and <TT><FONT FACE="Courier">ma</FONT></TT>. You can assign the

return of all these matches to an array. Here's an