<BLOCKQUOTE>

The <TT>\B</TT> meta-sequence will

match everywhere except at a word boundary.

</BLOCKQUOTE>

<BLOCKQUOTE>

<B>Quoting Meta-Characters:</B> You can match meta-characters

literally by enclosing them in a <TT>\Q..\E</TT>

sequence. This will let you avoid using the backslash character

to escape all meta-characters, and your code will be easier to

read.

</BLOCKQUOTE>

<BLOCKQUOTE>

<B>Extended Syntax:</B> The <TT>(?...)</TT> sequence lets you use an extended

version of the regular expression syntax. The different options

are discussed in the section, &quot;Example: Extension Syntax,&quot;

later in this chapter.

</BLOCKQUOTE>

<BLOCKQUOTE>

<B>Combinations:</B> Any of the preceding components can be combined

with any other to create simple or complex patterns.

</BLOCKQUOTE>
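<P>
As a small illustration of two of the components listed above, the following sketch tries <TT>\B</TT> and <TT>\Q..\E</TT> on a few sample strings. The strings and variable names are made up for this example.

```perl
use strict;
use warnings;

# \B succeeds only where there is NO word boundary, so /\Bcat/ needs a
# word character immediately before "cat".
my $inside = ("concatenate" =~ m/\Bcat/) ? 1 : 0;   # "cat" follows "n"
my $start  = ("cat nap"     =~ m/\Bcat/) ? 1 : 0;   # "cat" begins the string

# \Q..\E quotes meta-characters, so the dot and the plus sign are
# matched literally instead of acting as pattern components.
my $literal = ("price: 1.5+tax" =~ m/\Q1.5+\E/) ? 1 : 0;
my $other   = ("price: 125+tax" =~ m/\Q1.5+\E/) ? 1 : 0;

print "inside=$inside start=$start literal=$literal other=$other\n";
# prints "inside=1 start=0 literal=1 other=0"
```

<P>
Without the <TT>\Q..\E</TT> quoting, the dot would match any character, so the last test would behave differently.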

<P>

The power of patterns is that you don't always know in advance

the value of the string that you will be searching. If you need

to match the first word in a string that was read in from a file,

you probably have no idea how long it might be; therefore, you

need to build a pattern. You might start with the <TT>\w</TT>

symbolic character class, which will match any single alphanumeric

or underscore character. So, assuming that the string is in the

<TT>$_</TT> variable, you can match

a one-character word like this:

<BLOCKQUOTE>

<PRE>

m/\w/;

</PRE>

</BLOCKQUOTE>

<P>

If you need to match both a one-character word and a two-character

word, you can do this:

<BLOCKQUOTE>

<PRE>

m/\w|\w\w/;

</PRE>

</BLOCKQUOTE>

<P>

This pattern says to match a single word character or two consecutive

word characters. You could continue to add alternation components

to match the different lengths of words that you might expect

to see, but there is a better way.

<P>

You can use the <TT>+</TT> quantifier

to say that the match should succeed only if the component is

matched one or more times. It is used this way:

<BLOCKQUOTE>

<PRE>

m/\w+/;

</PRE>

</BLOCKQUOTE>

<P>

If the value of <TT>$_</TT> was <TT>&quot;AAA BBB&quot;</TT>, then <TT>m/\w+/;</TT>

would match the <TT>&quot;AAA&quot;</TT>

in the string. If <TT>$_</TT> was

blank, full of white space, or full of other non-word characters,

the match would fail and a false value would be returned.

<P>

The preceding pattern will let you determine if <TT>$_</TT>

contains a word but does not let you know what the word is. In

order to accomplish that, you need to enclose the matching components

inside parentheses. For example:

<BLOCKQUOTE>

<PRE>

m/(\w+)/;

</PRE>

</BLOCKQUOTE>

<P>

By doing this, you force Perl to store the matched string into

the $1 variable. The $1 variable can be considered as pattern

memory.
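<P>
The following minimal sketch puts these pieces together. The variable names <TT>$first_word</TT> and <TT>$blank_matched</TT> are invented for this illustration.

```perl
use strict;
use warnings;

$_ = "AAA BBB";

# The parentheses tell Perl to remember what \w+ matched in $1.
my $first_word = "";
if (m/(\w+)/) {
    $first_word = $1;
}

# A string with no word characters makes the match fail (false value).
my $blank_matched = ("   " =~ m/(\w+)/) ? "yes" : "no";

print "first word: $first_word\n";        # prints "first word: AAA"
print "blank matched: $blank_matched\n";  # prints "blank matched: no"
```

<P>
Testing the result of the match before reading <TT>$1</TT> is a sensible habit, because <TT>$1</TT> is only meaningful after a successful match.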

<P>

This introduction to pattern components describes most of the

details you need to know in order to create your own patterns

or regular expressions. However, some of the components deserve

a bit more study. The next few sections look at character classes,

quantifiers, pattern memory, pattern precedence, and the extension

syntax. Then the rest of the chapter is devoted to showing specific

examples of when to use the different components.

<H3><A NAME="ExampleCharacterClasses">

Example: Character Classes</A></H3>

<P>

A character class defines a type of character. The character class

<TT>[0123456789]</TT> defines the

class of decimal digits, and <TT>[0-9a-f]</TT>

defines the class of lowercase hexadecimal digits. Notice that you can use

a dash to define a range of consecutive characters. Character

classes let you match any of a range of characters; you don't

know in advance which character will be matched. This capability

to match non-specific characters is what meta-characters are all

about.
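<P>
As a rough sketch of the hexadecimal class at work, the code below counts the characters in a sample string that belong to <TT>[0-9a-f]</TT>. It assumes the <TT>g</TT> option (find every match) and the list-assignment counting idiom, neither of which has been introduced at this point in the chapter; the sample string is made up.

```perl
use strict;
use warnings;

my $text = "cafe 42 xyz";

# In list context, m//g returns every match; assigning that list to ()
# and then to a scalar yields the number of matches.
my $count = () = $text =~ m/[0-9a-f]/g;

print "hex digits found: $count\n";   # prints "hex digits found: 6"
```

<P>
The letters <TT>x</TT>, <TT>y</TT>, and <TT>z</TT> and the spaces fall outside the class, so only <TT>c</TT>, <TT>a</TT>, <TT>f</TT>, <TT>e</TT>, <TT>4</TT>, and <TT>2</TT> are counted.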

<P>

You can use variable interpolation inside the character class,

but you must be careful when doing so. For example,

<BLOCKQUOTE>

<PRE>

$_ = &quot;AAABBBCCC&quot;;

$charList = &quot;ADE&quot;;

print &quot;matched&quot; if m/[$charList]/;

</PRE>

</BLOCKQUOTE>

<P>

will display

<BLOCKQUOTE>

<PRE>

matched

</PRE>

</BLOCKQUOTE>

<P>

This is because the variable interpolation results in a character

class of <TT>[ADE]</TT>. If you use

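the variable as one-half of a character range, you need to ensure

that the resulting range is valid: its starting character must not come

after its ending character, which usually means that you should not mix

letters and digits. For example,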

<BLOCKQUOTE>

<PRE>

$_ = &quot;AAABBBCCC&quot;;

$charList = &quot;ADE&quot;;

print &quot;matched&quot; if m/[$charList-9]/;

</PRE>

</BLOCKQUOTE>

<P>

will result in the following error message when executed:

<BLOCKQUOTE>

<PRE>

/[ADE-9]/: invalid [] range in regexp at test.pl line 4.

</PRE>

</BLOCKQUOTE>

<P>

At times, it's necessary to match on any character except for

a given character list. This is done by complementing the character

class with the caret. For example,

<BLOCKQUOTE>

<PRE>

$_ = &quot;AAABBBCCC&quot;;

print &quot;matched&quot; if m/[^ABC]/;

</PRE>

</BLOCKQUOTE>

<P>

will display nothing. This match returns true only if a character

besides <TT>A</TT>, <TT>B</TT>,

or <TT>C</TT> is in the searched string.

If you complement a list with just the letter <TT>A</TT>,

<BLOCKQUOTE>

<PRE>

$_ = &quot;AAABBBCCC&quot;;

print &quot;matched&quot; if m/[^A]/;

</PRE>

</BLOCKQUOTE>

<P>

then the string <TT>&quot;matched&quot;</TT>

will be displayed because <TT>B</TT>

and <TT>C</TT> are part of the string; in

other words, it contains characters besides the letter <TT>A</TT>.

<P>

Perl has shortcuts for some character classes that are frequently

used. Here is a list of what I call symbolic character classes:

<BR>



<p>

<CENTER>

<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>

<TR><TD WIDTH=67><CENTER><TT><B><FONT FACE="Courier">\w</FONT></B></TT></CENTER>

</TD><TD WIDTH=523>This symbol matches any alphanumeric character or the underscore character. It is equivalent to the character class <TT>[a-zA-Z0-9_]</TT>.

</TD></TR>

<TR><TD WIDTH=67><CENTER><TT><B><FONT FACE="Courier">\W</FONT></B></TT></CENTER>

</TD><TD WIDTH=523>This symbol matches every character that the <TT>\w</TT> symbol does not. In other words, it is the complement of <TT>\w</TT>. It is equivalent to <TT>[^a-zA-Z0-9_]</TT>.

</TD></TR>

<TR><TD WIDTH=67><CENTER><TT><B><FONT FACE="Courier">\s</FONT></B></TT></CENTER>

</TD><TD WIDTH=523>This symbol matches any white-space character, such as a space, tab, or newline. For ASCII text it is equivalent to <TT>[ \t\n\r\f]</TT> (newer versions of Perl also include the vertical tab).

</TD></TR>

<TR><TD WIDTH=67><CENTER><TT><B><FONT FACE="Courier">\S</FONT></B></TT></CENTER>

</TD><TD WIDTH=523>This symbol matches any non-whitespace character. It is the complement of <TT>\s</TT>; for ASCII text it is equivalent to <TT>[^ \t\n\r\f]</TT>.

</TD></TR>

<TR><TD WIDTH=67><CENTER><TT><B><FONT FACE="Courier">\d</FONT></B></TT></CENTER>

</TD><TD WIDTH=523>This symbol matches any digit. It is equivalent to <TT>[0-9]</TT>.

</TD></TR>

<TR><TD WIDTH=67><CENTER><TT><B><FONT FACE="Courier">\D</FONT></B></TT></CENTER>

</TD><TD WIDTH=523>This symbol matches any non-digit character. It is equivalent to <TT>[^0-9]</TT>.

</TD></TR>

</TABLE>

</CENTER>
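<P>
To get a feel for the symbolic character classes, the sketch below counts what <TT>\w</TT>, <TT>\d</TT>, and <TT>\s</TT> find in a sample string. The string is invented for this example, and the code assumes the <TT>g</TT> option, which returns every match when the pattern is used in list context.

```perl
use strict;
use warnings;

my $sample = "Order_7 shipped\ton 10-20-30";

my @words  = $sample =~ m/\w+/g;   # runs of word characters
my @digits = $sample =~ m/\d+/g;   # runs of digits
my @spaces = $sample =~ m/\s/g;    # individual white-space characters

print scalar(@words), " words, ", scalar(@digits), " numbers, ",
      scalar(@spaces), " white-space characters\n";
# prints "6 words, 4 numbers, 3 white-space characters"
```

<P>
Notice that the underscore keeps <TT>Order_7</TT> together as one run of word characters, while the dashes split <TT>10-20-30</TT> into three.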

<P>

<P>

You can use these symbols inside other character classes, but

not as endpoints of a range. For example, you can do the following:

<BLOCKQUOTE>

<PRE>

$_ = &quot;\tAAA&quot;;

print &quot;matched&quot; if m/[\d\s]/;

</PRE>

</BLOCKQUOTE>

<P>

which will display

<BLOCKQUOTE>

<PRE>

matched

</PRE>

</BLOCKQUOTE>

<P>

because the value of <TT>$_</TT> includes

the tab character.<BR>

<p>

<CENTER>

<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>

<TR><TD><B>Tip</B></TD></TR>

<TR><TD>

<BLOCKQUOTE>

Most meta-characters that appear inside the square brackets that define a character class are used in their literal sense; they lose their meta-meaning. The exceptions are the backslash, a caret in the first position, a dash between two characters, and the closing bracket. This may be a little confusing at first. In fact, I have a tendency to forget this when evaluating patterns.</BLOCKQUOTE>



</TD></TR>

</TABLE>

</CENTER>
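<P>
A short sketch of the tip above: the dot and the plus sign are ordinary characters inside square brackets, but meta-characters outside them. The test strings are made up for illustration.

```perl
use strict;
use warnings;

# Inside brackets, "." and "+" stand for themselves.
my $dot_or_plus = ("a+b" =~ m/[.+]/) ? 1 : 0;   # the literal "+" matches
my $no_match    = ("abc" =~ m/[.+]/) ? 1 : 0;   # no "." or "+" present

# Outside brackets, "." matches any single character except a newline.
my $any_char    = ("abc" =~ m/./) ? 1 : 0;

print "$dot_or_plus $no_match $any_char\n";   # prints "1 0 1"
```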

<P>

<p>

<CENTER>

<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>

<TR><TD><B>Note</B></TD></TR>

<TR><TD>

<BLOCKQUOTE>

I think that most of the confusion regarding regular expressions lies in the fact that each character of a pattern might have several possible meanings. The caret could be an anchor, it could be a literal caret, or it could be used to complement a character 
class. Therefore, it is vital that you decide which context any given pattern character or symbol is in before assigning a meaning to it.</BLOCKQUOTE>



</TD></TR>

</TABLE>

</CENTER>

<P>

<H3><A NAME="ExampleQuantifiers">

Example: Quantifiers</A></H3>

<P>

Perl provides several different quantifiers that let you specify

how many times a given component must be present before the match

is true. They are used when you don't know in advance how many

characters need to be matched. Table 10.6 lists the different

quantifiers that can be used.<BR>

<P>

<CENTER><B>Table 10.6&nbsp;&nbsp;The Six Types of Quantifiers</B></CENTER>

<p>

<CENTER>

<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>

<TR><TD WIDTH=91><CENTER><I>Quantifier</I></CENTER></TD><TD WIDTH=492><I>Description</I>

</TD></TR>

<TR><TD WIDTH=91><CENTER>*</CENTER></TD><TD WIDTH=492>The component must be present zero or more times.

</TD></TR>

<TR><TD WIDTH=91><CENTER>+</CENTER></TD><TD WIDTH=492>The component must be present one or more times.

</TD></TR>

<TR><TD WIDTH=91><CENTER>?</CENTER></TD><TD WIDTH=492>The component must be present zero or one times.

</TD></TR>

<TR><TD WIDTH=91><CENTER>{n}</CENTER></TD><TD WIDTH=492>The component must be present n times.

</TD></TR>

<TR><TD WIDTH=91><CENTER>{n,}</CENTER></TD><TD WIDTH=492>The component must be present at least n times.

</TD></TR>

<TR><TD WIDTH=91><CENTER>{n,m}</CENTER></TD><TD WIDTH=492>The component must be present at least n times and no more than m times.

</TD></TR>

</TABLE>

</CENTER>
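<P>
The following sketch exercises each quantifier from the table once. It assumes the <TT>^</TT> and <TT>$</TT> anchors, so that each pattern must account for the whole test string, and the <TT>qr//</TT> operator for storing patterns in a list; the test strings are invented for this example.

```perl
use strict;
use warnings;

# Each entry pairs a pattern with a string to test it against.
my @tests = (
    [ qr/^ab*c$/,     'ac'     ],   # * allows zero "b" characters
    [ qr/^ab+c$/,     'ac'     ],   # + needs at least one "b"
    [ qr/^ab?c$/,     'abbc'   ],   # ? allows at most one "b"
    [ qr/^ab{2}c$/,   'abbc'   ],   # exactly two
    [ qr/^ab{2,}c$/,  'abbbbc' ],   # two or more
    [ qr/^ab{1,2}c$/, 'abbbc'  ],   # one or two, but three are present
);

my @results;
for my $test (@tests) {
    my ($pattern, $string) = @$test;
    push @results, ($string =~ $pattern) ? 1 : 0;
}

print "@results\n";   # prints "1 0 0 1 1 0"
```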

<P>

<P>

If you need to match a word whose length is unknown, you need

to use the <TT>+</TT> quantifier.

The <TT>*</TT> quantifier is the wrong choice here because it would

also accept a zero-length word, which makes no sense. So, the match statement might

look like this:

<BLOCKQUOTE>

<PRE>

m/\w+/;

</PRE>

</BLOCKQUOTE>

<P>

This pattern will match <TT>&quot;QQQ&quot;</TT>

and <TT>&quot;AAAAA&quot;</TT> but

not <TT>&quot;&quot;</TT>. It will also succeed on <TT>&quot; BBB&quot;</TT>,

because an unanchored pattern can begin matching anywhere in the string,

but the leading white space is not part of the matched text. To make

the pattern itself account for leading white space, which may or may not

be at the beginning of a string, you

need to use the asterisk (<TT>*</TT>)

quantifier in conjunction with the <TT>\s</TT>

symbolic character class in the following way:

<BLOCKQUOTE>

<PRE>

m/\s*\w+/;

</PRE>

</BLOCKQUOTE>
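<P>
As a sketch of how optional leading white space might be handled in practice, the code below ties the pattern to the start of each string with the <TT>^</TT> anchor (an assumption of this example) and captures the first word. The sample strings and the <TT>@found</TT> array are made up for illustration.

```perl
use strict;
use warnings;

my @found;
for my $line ("QQQ", "   BBB", "") {
    # Skip any leading white space, then remember the first word.
    if ($line =~ m/^\s*(\w+)/) {
        push @found, $1;
    }
    else {
        push @found, "(none)";
    }
}

print join(", ", @found), "\n";   # prints "QQQ, BBB, (none)"
```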

<p>

<CENTER>

<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>

<TR><TD><B>Tip</B></TD></TR>

<TR><TD>

<BLOCKQUOTE>

Be careful when using the <TT>*</TT> quantifier because it can match an empty string, which might not be your intention. The pattern <TT>/b*/</TT> will match any string, even one without any <TT>b</TT> characters.

</BLOCKQUOTE>



</TD></TR>

</TABLE>

</CENTER>
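<P>
The tip above can be checked with a short sketch. The string <TT>&quot;xyz&quot;</TT> contains no <TT>b</TT> characters, yet <TT>/b*/</TT> still succeeds by matching zero of them; the variable names are invented for this example.

```perl
use strict;
use warnings;

my $text = "xyz";

my $star = ($text =~ m/b*/) ? 1 : 0;   # succeeds with an empty match
my $plus = ($text =~ m/b+/) ? 1 : 0;   # needs at least one "b", so it fails

# Capture what b* matched to show that it is the empty string.
my $matched_length = -1;
if ($text =~ m/(b*)/) {
    $matched_length = length $1;
}

print "star=$star plus=$plus length=$matched_length\n";
# prints "star=1 plus=0 length=0"
```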

<P>

<P>

At times, you may need to match an exact number of components.

The following match statement will be true only if the component in

parentheses, a word followed by optional white space, can be matched

five times in a row in the <TT>$_</TT> variable:

<BLOCKQUOTE>

<PRE>

$_ = &quot;AA AB AC AD AE&quot;;

m/(\w+\s*){5}/;

</PRE>

</BLOCKQUOTE>

<P>

In this example, we are matching at least one word character followed