$scalar = The tree has many leaves</PRE></BLOCKQUOTE>
<P>The first print line does not get executed because the complementary binding operator returns false.
<H2><A NAME="HowtoCreatePatterns"><FONT SIZE=5 COLOR=#FF0000>How to Create Patterns</FONT></A></H2>
<P>So far in this chapter, you've read about the different operators used with regular expressions, and you've seen how to match simple sequences of characters. Now we'll look at the wide array of meta-characters that are used to harness the full power of regular expressions. <I>Meta-characters</I> are characters that have an additional meaning above and beyond their literal meaning. For example, the period character can have two meanings in a pattern. First, it can be used to match a period character in the searched string; this is its <I>literal meaning</I>. Second, it can be used to match <I>any</I> character in the searched string except for the newline character; this is its <I>meta-meaning</I>.
<P>When creating patterns, the meta-meaning always will be the default. If you really intend to match the literal character, you need to prefix the meta-character with a backslash. You might recall that the backslash is used to create an escape sequence.
<P>For more information about escape sequences, see <A HREF="ch2.htm" >Chapter 2</A>, "Example: Double Quoted Strings."
<P>Patterns can have many different components. These components all combine to provide you with the power to match any type of string. The following list of components will give you a good idea of the variety of ways that patterns can be created. The section "Pattern Examples" later in this chapter shows many examples of these rules in action.
<BLOCKQUOTE><B>Variable Interpolation:</B> Any variable is interpolated, and the essentially new pattern then is evaluated as a regular expression. Remember that only one level of interpolation is done. 
This means that if the value of the variable includes, for example, <TT>$scalar</TT> as a string value, then <TT>$scalar</TT> will not be interpolated. In addition, back-quotes do not interpolate within double-quotes, and single-quotes do not stop interpolation of variables when used within double-quotes.</BLOCKQUOTE>
<BLOCKQUOTE><B>Self-Matching Characters:</B> Any character will match itself unless it is a meta-character or one of <TT>$</TT>, <TT>@</TT>, and <TT>&</TT>. The meta-characters are listed in Table 10.5, and the other characters are used to begin variable names and function calls. You can use the backslash character to force Perl to match the literal meaning of any character. For example, <TT>m/a/</TT> will return true if the letter <TT>a</TT> is in the <TT>$_</TT> variable. And <TT>m/\$/</TT> will return true if the character <TT>$</TT> is in the <TT>$_</TT> variable.<BR></BLOCKQUOTE>
<P><CENTER><B>Table 10.5 Regular Expression Meta-Characters, Meta-Brackets, and Meta-Sequences</B></CENTER>
<p><CENTER><TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR><TD WIDTH=133><CENTER><I>Meta-Character</I></CENTER></TD><TD WIDTH=457><I>Description</I></TD></TR>
<TR><TD WIDTH=133><CENTER>^</CENTER></TD><TD WIDTH=457>This meta-character, the caret, matches the beginning of a string or, if the <TT>/m</TT> option is used, the beginning of a line. It is one of two pattern anchors; the other anchor is the <TT>$</TT>.</TD></TR>
<TR><TD WIDTH=133><CENTER>.</CENTER></TD><TD WIDTH=457>This meta-character will match any character except for the newline unless the <TT>/s</TT> option is specified. If the <TT>/s</TT> option is specified, then the newline also will be matched.</TD></TR>
<TR><TD WIDTH=133><CENTER>$</CENTER></TD><TD WIDTH=457>This meta-character matches the end of a string or, if the <TT>/m</TT> option is used, the end of a line. 
It is one of two pattern anchors; the other anchor is the <TT>^</TT>.</TD></TR>
<TR><TD WIDTH=133><CENTER>|</CENTER></TD><TD WIDTH=457>This meta-character, called <I>alternation</I>, lets you specify alternative values, any one of which can cause the match to succeed. For instance, <TT>m/a|b/</TT> means that the <TT>$_</TT> variable must contain the <TT>"a"</TT> or <TT>"b"</TT> character for the match to succeed.</TD></TR>
<TR><TD WIDTH=133><CENTER>*</CENTER></TD><TD WIDTH=457>This meta-character indicates that the "thing" immediately to the left should be matched 0 or more times in order to be evaluated as true.</TD></TR>
<TR><TD WIDTH=133><CENTER>+</CENTER></TD><TD WIDTH=457>This meta-character indicates that the "thing" immediately to the left should be matched 1 or more times in order to be evaluated as true.</TD></TR>
<TR><TD WIDTH=133><CENTER>?</CENTER></TD><TD WIDTH=457>This meta-character indicates that the "thing" immediately to the left should be matched 0 or 1 times in order to be evaluated as true. When it follows the <TT>+</TT>, <TT>*</TT>, <TT>?</TT>, or <TT>{n,m}</TT> quantifiers, it makes that quantifier non-greedy, so that it matches the smallest possible string.</TD></TR>
<TR><TD WIDTH=133><CENTER><I>Meta-Brackets</I></CENTER></TD><TD WIDTH=457><I>Description</I></TD></TR>
<TR><TD WIDTH=133><CENTER>()</CENTER></TD><TD WIDTH=457>The parentheses let you affect the order of pattern evaluation and act as a form of pattern memory. See the section "Pattern Memory" later in this chapter for more information.</TD></TR>
<TR><TD WIDTH=133><CENTER>(?...)</CENTER></TD><TD WIDTH=457>If a question mark immediately follows the left parenthesis, it indicates that an extended mode component is being specified. See the section, "Example: Extension Syntax," later in this chapter for more information.</TD></TR>
<TR><TD WIDTH=133><CENTER>{n,m}</CENTER></TD><TD WIDTH=457>The curly braces specify how many times the "thing" immediately to the left should be matched. <TT>{n}</TT> means that it should be matched exactly n times. <TT>{n,}</TT> means it must be matched at least n times. 
<TT>{n,m}</TT> means that it must be matched at least n times and not more than m times.</TD></TR>
<TR><TD WIDTH=133><CENTER>[]</CENTER></TD><TD WIDTH=457>The square brackets let you create a character class. For instance, <TT>m/[abc]/</TT> will evaluate to true if any of <TT>"a"</TT>, <TT>"b"</TT>, or <TT>"c"</TT> is contained in <TT>$_</TT>. For single characters, the square brackets are a more readable alternative to the alternation meta-character.</TD></TR>
<TR><TD WIDTH=133><CENTER><I>Meta-Sequences</I></CENTER></TD><TD WIDTH=457><I>Description</I></TD></TR>
<TR><TD WIDTH=133><CENTER>\</CENTER></TD><TD WIDTH=457>This meta-character "escapes" the following character. This means that any special meaning normally attached to that character is ignored. For instance, if you need to include a dollar sign in a pattern, you must use <TT>\$</TT> to avoid Perl's variable interpolation. Use <TT>\\</TT> to specify the backslash character in your pattern.</TD></TR>
<TR><TD WIDTH=133><CENTER>\0nnn</CENTER></TD><TD WIDTH=457>Any octal byte.</TD></TR>
<TR><TD WIDTH=133><CENTER>\a</CENTER></TD><TD WIDTH=457>Alarm.</TD></TR>
<TR><TD WIDTH=133><CENTER>\A</CENTER></TD><TD WIDTH=457>This meta-sequence represents the beginning of the string. Its meaning is not affected by the <TT>/m</TT> option.</TD></TR>
<TR><TD WIDTH=133><CENTER>\b</CENTER></TD><TD WIDTH=457>This meta-sequence represents the backspace character inside a character class; otherwise, it represents a <I>word boundary</I>. A word boundary is the spot between word (<TT>\w</TT>) and non-word (<TT>\W</TT>) characters. 
Perl treats the imaginary characters off the ends of the string as matching the <TT>\W</TT> meta-sequence.</TD></TR>
<TR><TD WIDTH=133><CENTER>\B</CENTER></TD><TD WIDTH=457>Match a non-word boundary.</TD></TR>
<TR><TD WIDTH=133><CENTER>\cn</CENTER></TD><TD WIDTH=457>Any control character.</TD></TR>
<TR><TD WIDTH=133><CENTER>\d</CENTER></TD><TD WIDTH=457>Match a single digit character.</TD></TR>
<TR><TD WIDTH=133><CENTER>\D</CENTER></TD><TD WIDTH=457>Match a single non-digit character.</TD></TR>
<TR><TD WIDTH=133><CENTER>\e</CENTER></TD><TD WIDTH=457>Escape.</TD></TR>
<TR><TD WIDTH=133><CENTER>\E</CENTER></TD><TD WIDTH=457>Terminate the <TT>\L</TT> or <TT>\U</TT> sequence.</TD></TR>
<TR><TD WIDTH=133><CENTER>\f</CENTER></TD><TD WIDTH=457>Form Feed.</TD></TR>
<TR><TD WIDTH=133><CENTER>\G</CENTER></TD><TD WIDTH=457>Match only where the previous <TT>m//g</TT> left off.</TD></TR>
<TR><TD WIDTH=133><CENTER>\l</CENTER></TD><TD WIDTH=457>Change the next character to lowercase.</TD></TR>
<TR><TD WIDTH=133><CENTER>\L</CENTER></TD><TD WIDTH=457>Change the following characters to lowercase until a <TT>\E</TT> sequence is encountered.</TD></TR>
<TR><TD WIDTH=133><CENTER>\n</CENTER></TD><TD WIDTH=457>Newline.</TD></TR>
<TR><TD WIDTH=133><CENTER>\Q</CENTER></TD><TD WIDTH=457>Quote regular expression meta-characters literally until the <TT>\E</TT> sequence is encountered.</TD></TR>
<TR><TD WIDTH=133><CENTER>\r</CENTER></TD><TD WIDTH=457>Carriage Return.</TD></TR>
<TR><TD WIDTH=133><CENTER>\s</CENTER></TD><TD WIDTH=457>Match a single whitespace character.</TD></TR>
<TR><TD WIDTH=133><CENTER>\S</CENTER></TD><TD WIDTH=457>Match a single non-whitespace character.</TD></TR>
<TR><TD WIDTH=133><CENTER>\t</CENTER></TD><TD WIDTH=457>Tab.</TD></TR>
<TR><TD WIDTH=133><CENTER>\u</CENTER></TD><TD WIDTH=457>Change the next character to uppercase.</TD></TR>
<TR><TD WIDTH=133><CENTER>\U</CENTER></TD><TD WIDTH=457>Change the following characters to uppercase until a <TT>\E</TT> sequence is 
encountered.</TD></TR>
<TR><TD WIDTH=133><CENTER>\v</CENTER></TD><TD WIDTH=457>Vertical Tab.</TD></TR>
<TR><TD WIDTH=133><CENTER>\w</CENTER></TD><TD WIDTH=457>Match a single word character. Word characters are the alphanumeric and underscore characters.</TD></TR>
<TR><TD WIDTH=133><CENTER>\W</CENTER></TD><TD WIDTH=457>Match a single non-word character.</TD></TR>
<TR><TD WIDTH=133><CENTER>\xnn</CENTER></TD><TD WIDTH=457>Any hexadecimal byte.</TD></TR>
<TR><TD WIDTH=133><CENTER>\Z</CENTER></TD><TD WIDTH=457>This meta-sequence represents the end of the string (or the spot just before a newline at the end of the string). Its meaning is not affected by the <TT>/m</TT> option.</TD></TR>
<TR><TD WIDTH=133><CENTER>\$</CENTER></TD><TD WIDTH=457>Dollar Sign.</TD></TR>
<TR><TD WIDTH=133><CENTER>\@</CENTER></TD><TD WIDTH=457>At Sign.</TD></TR>
</TABLE></CENTER><P>
<BLOCKQUOTE><B>Character Sequences:</B> A sequence of characters will match the identical sequence in the searched string. The characters need to be in the same order in both the pattern and the searched string for the match to be true. For example, <TT>m/abc/;</TT> will match <TT>"abc"</TT> but not <TT>"cab"</TT> or <TT>"bca"</TT>. If any character in the sequence is a meta-character, you need to use the backslash to match its literal value.</BLOCKQUOTE>
<BLOCKQUOTE><B>Alternation:</B> The <I>alternation</I> meta-character (<TT>|</TT>) will let you match more than one possible string. For example, <TT>m/a|b/;</TT> will match if either the <TT>"a"</TT> character or the <TT>"b"</TT> character is in the searched string. You can use sequences of more than one character with alternation. 
For example, <TT>m/dog|cat/;</TT> will match if either of the strings <TT>"dog"</TT> or <TT>"cat"</TT> is in the searched string.<BR></BLOCKQUOTE>
<p><CENTER><TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR><TD><B>Tip</B> </TD></TR>
<TR><TD><BLOCKQUOTE>Some programmers like to enclose the alternation sequence inside parentheses to help indicate where the sequence begins and ends.</BLOCKQUOTE><BLOCKQUOTE><TT>m/(dog|cat)/;</TT></BLOCKQUOTE><BLOCKQUOTE>However, this will affect something called <I>pattern memory</I>, which you'll be learning about in the section, "Example: Pattern Memory," later in the chapter.</BLOCKQUOTE></TD></TR>
</TABLE></CENTER><P>
<BLOCKQUOTE><B>Character Classes:</B> The square brackets are used to create character classes. A <I>character class</I> is used to match a specific type of character. For example, you can match any decimal digit using <TT>m/[0123456789]/;</TT>. This will match a single character in the range of zero to nine. You can find more information about character classes in the section, "Example: Character Classes," later in this chapter.</BLOCKQUOTE>
<BLOCKQUOTE><B>Symbolic Character Classes:</B> There are several character classes that are used so frequently that they have a symbolic representation. The period meta-character stands for a special character class that matches all characters except for the newline. The rest are <TT>\d</TT>, <TT>\D</TT>, <TT>\s</TT>, <TT>\S</TT>, <TT>\w</TT>, and <TT>\W</TT>. These are mentioned in Table 10.5 earlier and are discussed in the section, "Example: Character Classes," later in this chapter.</BLOCKQUOTE>
<BLOCKQUOTE><B>Anchors:</B> The caret (<TT>^</TT>) and the dollar sign (<TT>$</TT>) meta-characters are used to anchor a pattern to the beginning and the end of the searched string. The caret is always the first character in the pattern when used as an anchor. For example, <TT>m/^one/;</TT> will match only if the searched string starts with the character sequence <TT>one</TT>. 
The dollar sign is always the last character in the pattern when used as an anchor. For example, <TT>m/(last|end)$/;</TT> will match only if the searched string ends with either the character sequence <TT>last</TT> or the character sequence <TT>end</TT>. The <TT>\A</TT> and <TT>\Z</TT> meta-sequences also are used as pattern anchors for the beginning and end of strings.</BLOCKQUOTE>
<BLOCKQUOTE><B>Quantifiers:</B> There are several meta-characters that are devoted to controlling how many times a component is matched. For example, <TT>m/a{5}/;</TT> means that five consecutive <TT>a</TT> characters must be found before a true result can be returned. The <TT>*</TT>, <TT>+</TT>, and <TT>?</TT> meta-characters and the curly braces are all used as quantifiers. See the section, "Example: Quantifiers," later in this chapter for more information.</BLOCKQUOTE>
<BLOCKQUOTE><B>Pattern Memory:</B> Parentheses are used to store matched values into buffers for later recall. I like to think of this as a form of pattern memory. Some programmers call them back-references. After you use <TT>m/(fish|fowl)/;</TT> to match a string and a match is found, the variable <TT>$1</TT> will hold either <TT>fish</TT> or <TT>fowl</TT> depending on which sequence was matched. See the section, "Example: Pattern Memory," later in this chapter for more information.</BLOCKQUOTE>
<BLOCKQUOTE><B>Word Boundaries:</B> The <TT>\b</TT> meta-sequence will match the spot between a space and the first character of a word or between the last character of a word and the space. The <TT>\b</TT> will match at the beginning or end of a string if there are no leading or trailing spaces. For example, <TT>m/\bfoo/;</TT> will match <TT>foo</TT> even without spaces surrounding the word. It also will match <TT>$foo</TT> because the dollar sign is not considered a word character. The statement <TT>m/foo\b/;</TT> will match <TT>foo</TT> but not <TT>foobar</TT>, and the statement <TT>m/\bwiz/;</TT> will match <TT>wizard</TT> but not <TT>geewiz</TT>. 
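<P>As a rough illustration, the word-boundary cases just described can be checked with a short script. This is only a sketch: the <TT>check</TT> helper and the <TT>@tests</TT> list are names invented for this illustration, and the strings are the ones used in the text.</P>
<PRE>
#!/usr/bin/perl
use strict;
use warnings;

# Each entry pairs a pattern (as a string) with a string to search.
my @tests = (
    [ '\bfoo', 'foo'    ],
    [ '\bfoo', '$foo'   ],
    [ 'foo\b', 'foo'    ],
    [ 'foo\b', 'foobar' ],
    [ '\bwiz', 'wizard' ],
    [ '\bwiz', 'geewiz' ],
);

# Hypothetical helper: compile the pattern text and report the result.
sub check {
    my ($source, $string) = @_;
    my $pattern = qr/$source/;
    return $string =~ $pattern ? 'match' : 'no match';
}

for my $test (@tests) {
    my ($source, $string) = @$test;
    printf "%-8s %-8s %s\n", $source, $string, check($source, $string);
}
</PRE>
<P>Run as written, the script should report a match for <TT>foo</TT>, <TT>$foo</TT>, and <TT>wizard</TT>, and no match for <TT>foobar</TT> and <TT>geewiz</TT>.</P>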
See the section, "Example: Character Classes," later in this chapter for more information about word boundaries.</BLOCKQUOTE>
<BLOCKQUOTE>The <TT>\B</TT> meta-sequence will match everywhere except at a word boundary.</BLOCKQUOTE>
<BLOCKQUOTE><B>Quoting Meta-Characters:</B> You can match meta-characters literally by enclosing them in a <TT>\Q..\E</TT> sequence. This will let you avoid using the backslash character to escape all meta-characters, and your code will be easier to read.</BLOCKQUOTE>
<BLOCKQUOTE><B>Extended Syntax:</B> The (?...) sequence lets you use an extended version of the regular expression syntax. The different options are discussed in the section, "Example: Extension Syntax," later in this chapter.</BLOCKQUOTE>
<BLOCKQUOTE><B>Combinations:</B> Any of the preceding components can be combined with any other to create simple or complex patterns.</BLOCKQUOTE>
<P>The power of patterns is that you don't always know in advance the value of the string that you will be searching. If you need to match the first word in a string that was read in from a file, you probably have no idea how long it might be; therefore, you need to build a pattern. You might start with the <TT>\w</TT> symbolic character class, which will match any single alphanumeric or underscore character. So, assuming that the string is in the <TT>$_</TT> variable, you can match a one-character word like this:<BLOCKQUOTE><PRE>m/\w/;</PRE></BLOCKQUOTE>
<P>If you need to match both a one-character word and a two-character word, you can do this:<BLOCKQUOTE><PRE>m/\w|\w\w/;</PRE></BLOCKQUOTE>
<P>This pattern says to match a single word character or two consecutive word characters. You could continue to add alternation components to match the different lengths of words that you might expect to see, but there is a better way.
<P>You can use the <TT>+</TT> quantifier to say that the match should succeed only if the component is matched one or more times. 
It is used this way:<BLOCKQUOTE><PRE>m/\w+/;</PRE></BLOCKQUOTE><P>If the value of <TT>$_</TT> was <TT>"AAA