

By Tom Christiansen and Nathan Torkington. ISBN 1-56592-243-3. First Edition, published August 1998.
<HTML><HEAD><META NAME="DC.title" CONTENT="Perl Cookbook"><META NAME="DC.creator" CONTENT="Tom Christiansen &amp; Nathan Torkington"><META NAME="DC.publisher" CONTENT="O'Reilly &amp; Associates, Inc."><META NAME="DC.date" CONTENT="1999-07-02T01:33:04Z"><META NAME="DC.type" CONTENT="Text.Monograph"><META NAME="DC.format" CONTENT="text/html" SCHEME="MIME"><META NAME="DC.source" CONTENT="1-56592-243-3" SCHEME="ISBN"><META NAME="DC.language" CONTENT="en-US"><META NAME="generator" CONTENT="Jade 1.1/O'Reilly DocBook 3.0 to HTML 4.0"><LINK REV="made" HREF="mailto:online-books@oreilly.com" TITLE="Online Books Comments"><LINK REL="up" HREF="index.htm" TITLE="Perl Cookbook"><LINK REL="prev" HREF="ch05_17.htm" TITLE="5.16. Program: dutree"><LINK REL="next" HREF="ch06_02.htm" TITLE="6.1. Copying and Substituting Simultaneously"></HEAD><BODY BGCOLOR="#FFFFFF"><img alt="Book Home" border="0" src="gifs/smbanner.gif" usemap="#banner-map" /><map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Perl Cookbook"><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map><div class="navbar"><p><TABLE WIDTH="684" BORDER="0" CELLSPACING="0" CELLPADDING="0"><TR><TD ALIGN="LEFT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch05_17.htm" TITLE="5.16. Program: dutree"><IMG SRC="../gifs/txtpreva.gif" ALT="Previous: 5.16. Program: dutree" BORDER="0"></A></TD><TD ALIGN="CENTER" VALIGN="TOP" WIDTH="228"><B><FONT FACE="ARIEL,HELVETICA,HELV,SANSERIF" SIZE="-1"></FONT></B></TD><TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch06_02.htm" TITLE="6.1. Copying and Substituting Simultaneously"><IMG SRC="../gifs/txtnexta.gif" ALT="Next: 6.1. Copying and Substituting Simultaneously" BORDER="0"></A></TD></TR></TABLE></DIV><DIV CLASS="chapter"><H1 CLASS="chapter"><A CLASS="title" NAME="ch06-32612">6. Pattern Matching</A></H1><DIV CLASS="htmltoc"><P><B>Contents:</B><BR><A CLASS="sect1" HREF="#ch06-35940" TITLE="6.0.
Introduction">Introduction</A><BR><A CLASS="sect1" HREF="ch06_02.htm" TITLE="6.1. Copying and Substituting Simultaneously">Copying and Substituting Simultaneously</A><BR><A CLASS="sect1" HREF="ch06_03.htm" TITLE="6.2. Matching Letters">Matching Letters</A><BR><A CLASS="sect1" HREF="ch06_04.htm" TITLE="6.3. Matching Words">Matching Words</A><BR><A CLASS="sect1" HREF="ch06_05.htm" TITLE="6.4. Commenting Regular Expressions">Commenting Regular Expressions</A><BR><A CLASS="sect1" HREF="ch06_06.htm" TITLE="6.5. Finding the Nth Occurrence of a Match">Finding the N<SUP CLASS="superscript">th</SUP> Occurrence of a Match</A><BR><A CLASS="sect1" HREF="ch06_07.htm" TITLE="6.6. Matching Multiple Lines">Matching Multiple Lines</A><BR><A CLASS="sect1" HREF="ch06_08.htm" TITLE="6.7. Reading Records with a Pattern Separator">Reading Records with a Pattern Separator</A><BR><A CLASS="sect1" HREF="ch06_09.htm" TITLE="6.8. Extracting a Range of Lines">Extracting a Range of Lines</A><BR><A CLASS="sect1" HREF="ch06_10.htm" TITLE="6.9. Matching Shell Globs as Regular Expressions">Matching Shell Globs as Regular Expressions</A><BR><A CLASS="sect1" HREF="ch06_11.htm" TITLE="6.10. Speeding Up Interpolated Matches">Speeding Up Interpolated Matches</A><BR><A CLASS="sect1" HREF="ch06_12.htm" TITLE="6.11. Testing for a Valid Pattern">Testing for a Valid Pattern</A><BR><A CLASS="sect1" HREF="ch06_13.htm" TITLE="6.12. Honoring Locale Settings in Regular Expressions">Honoring Locale Settings in Regular Expressions</A><BR><A CLASS="sect1" HREF="ch06_14.htm" TITLE="6.13. Approximate Matching">Approximate Matching</A><BR><A CLASS="sect1" HREF="ch06_15.htm" TITLE="6.14. Matching from Where the Last Pattern Left Off">Matching from Where the Last Pattern Left Off</A><BR><A CLASS="sect1" HREF="ch06_16.htm" TITLE="6.15. Greedy and Non-Greedy Matches">Greedy and Non-Greedy Matches</A><BR><A CLASS="sect1" HREF="ch06_17.htm" TITLE="6.16. Detecting Duplicate Words">Detecting Duplicate Words</A><BR><A CLASS="sect1" HREF="ch06_18.htm" TITLE="6.17.
Expressing AND, OR, and NOT in a Single Pattern">Expressing AND, OR, and NOT in a Single Pattern</A><BR><A CLASS="sect1" HREF="ch06_19.htm" TITLE="6.18. Matching Multiple-Byte Characters">Matching Multiple-Byte Characters</A><BR><A CLASS="sect1" HREF="ch06_20.htm" TITLE="6.19. Matching a Valid Mail Address">Matching a Valid Mail Address</A><BR><A CLASS="sect1" HREF="ch06_21.htm" TITLE="6.20. Matching Abbreviations">Matching Abbreviations</A><BR><A CLASS="sect1" HREF="ch06_22.htm" TITLE="6.21. Program: urlify">Program: urlify</A><BR><A CLASS="sect1" HREF="ch06_23.htm" TITLE="6.22. Program: tcgrep">Program: tcgrep</A><BR><A CLASS="sect1" HREF="ch06_24.htm" TITLE="6.23. Regular Expression Grabbag">Regular Expression Grabbag</A></P><P></P></DIV><DIV CLASS="epigraph" ALIGN="right"><P CLASS="para" ALIGN="right"><I>[Art is] pattern informed by sensibility.</I></P><P CLASS="attribution" ALIGN="right">-&nbsp;Sir Herbert Read <CITE CLASS="citetitle">The Meaning of Art</CITE></P></DIV><DIV CLASS="sect1"><H2 CLASS="sect1"><A CLASS="title" NAME="ch06-35940">6.0. Introduction</A></H2><P CLASS="para"><A CLASS="indexterm" NAME="ch06-idx-1000007453-0"></A>Although most modern programming languages offer primitive pattern matching tools, usually through an extra library, Perl's patterns are integrated directly into the language core. Perl's patterns boast features not found in pattern matching in other languages, features that encourage a whole different way of looking at data. Just as chess players see patterns in the board positions that their pieces control, Perl adepts look at data in terms of patterns.
These patterns, expressed in the punctuation-intensive language of regular expressions,[<A CLASS="footnote" HREF="#ch06-pgfId-1000006582">1</A>] provide access to powerful algorithms normally available only to computer science scholars.</P><BLOCKQUOTE CLASS="footnote"><DIV CLASS="footnote"><P CLASS="para"><A CLASS="footnote" NAME="ch06-pgfId-1000006582">[1]</A> To be honest, <EM CLASS="emphasis">regular expressions</EM> in the classic sense of the word do not by definition contain backreferences, the way Perl's patterns do.</P></DIV></BLOCKQUOTE><P CLASS="para">"If this pattern matching thing is so powerful and so fantastic," you may be saying, "why don't you have a hundred different recipes on regular expressions in this chapter?" Regular expressions are the natural solution to many problems involving numbers, strings, dates, web documents, mail addresses, and almost everything else in this book; we used pattern matching over 100 times in other chapters. This chapter mostly presents recipes in which pattern matching forms part of the questions, not just part of the answers.</P><P CLASS="para">Perl's extensive and ingrained support for regular expressions means that you not only have features available that you won't find in any other language, but you have new ways of using them, too. Programmers new to Perl often look for functions like these:</P><PRE CLASS="programlisting">match( $string, $pattern );
subst( $string, $pattern, $replacement );</PRE><P CLASS="para">But matching and substituting are such common tasks that they merit their own syntax:</P><PRE CLASS="programlisting">$meadow =~ m/sheep/;   # True if $meadow contains &quot;sheep&quot;
$meadow !~ m/sheep/;   # True if $meadow doesn't contain &quot;sheep&quot;
$meadow =~ s/old/new/; # Replace the first &quot;old&quot; with &quot;new&quot; in $meadow</PRE><P CLASS="para">Pattern matching isn't like direct string comparison, even at its simplest. It's more like string searching with mutant wildcards on steroids.
Without anchors, the position where the match occurs can float freely throughout the string. Any of the following lines would also be matched by the expression <CODE CLASS="literal">$meadow</CODE> <CODE CLASS="literal">=~</CODE> <CODE CLASS="literal">/ovine/</CODE>, giving false positives when looking for lost sheep:</P><PRE CLASS="programlisting">Fine bovines demand fine toreadors.
Muskoxen are a polar ovibovine species.
Grooviness went out of fashion decades ago.</PRE><P CLASS="para">Sometimes they're right in front of you but they still don't match:</P><PRE CLASS="programlisting">Ovines are found typically in oviaries.</PRE><P CLASS="para">The problem is that while you are probably thinking in some human language, the pattern matching engine most assuredly is not. When the engine is presented with the pattern <CODE CLASS="literal">/ovine/</CODE> and a string to match this against, it searches the string for an <CODE CLASS="literal">&quot;o&quot;</CODE> that is immediately followed by a <CODE CLASS="literal">&quot;v&quot;</CODE>, then by an <CODE CLASS="literal">&quot;i&quot;</CODE>, then by an <CODE CLASS="literal">&quot;n&quot;</CODE>, and then finally by an <CODE CLASS="literal">&quot;e&quot;</CODE>. What comes before or after that sequence doesn't matter.</P><P CLASS="para">As you find your patterns matching some strings you don't want them to match and not matching other strings that you do want them to match, you start embellishing. If you're really looking for nothing but sheep, you probably want to match more like this:</P><PRE CLASS="programlisting">if ($meadow =~ /\bovines?\b/i) { print &quot;Here be sheep!&quot; }</PRE><P CLASS="para">Don't be tricked by the phantom cow lurking in that string. That's not a bovine. It's an ovine with a <CODE CLASS="literal">\b</CODE> in front, which matches at a word boundary only. The <CODE CLASS="literal">s?</CODE> indicates an optional <CODE CLASS="literal">&quot;s&quot;</CODE> so we can find one or more ovines.
The trailing <CODE CLASS="literal">/i</CODE> makes the whole pattern match case-insensitive.</P><P CLASS="para">As you see, some characters or sequences of characters have special meaning to the pattern-matching engine. These metacharacters let you <EM CLASS="emphasis">anchor</EM> the pattern to the start or end of the string, give alternatives for parts of a pattern, allow repetition and wildcarding, and remember part of the matching substring for use later in the pattern or in subsequent code.</P><P CLASS="para">Learning the syntax of pattern matching isn't as daunting as it might appear. Sure, there are a lot of symbols, but each has a reason for existing. Regular expressions aren't random jumbles of punctuation; they're carefully thought-out jumbles of punctuation! If you forget one, you can always look it up. Summary tables are included in <A CLASS="citetitle" HREF="../prog/index.htm" TITLE="Programming Perl"><CITE CLASS="citetitle">Programming Perl</CITE></A>, <EM CLASS="emphasis">Mastering Regular Expressions</EM>, and the <I CLASS="filename">perlre</I>(1) and <I CLASS="filename">perlop</I>(1) manpages included with every Perl installation.</P><DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch06-chap06_the_0">The Tricky Bits</A></H3><P CLASS="para">Much trickier than the syntax of regular expressions are their sneaky semantics. The three aspects of pattern-matching behavior that seem to cause folks the most trouble are greed, eagerness, and backtracking (and also how these three interact with each other).</P><P CLASS="para"><A CLASS="indexterm" NAME="ch06-idx-1000007459-0"></A><A CLASS="indexterm" NAME="ch06-idx-1000007459-1"></A><A CLASS="indexterm" NAME="ch06-idx-1000007459-2"></A><A CLASS="indexterm" NAME="ch06-idx-1000007459-3"></A><A CLASS="indexterm" NAME="ch06-idx-1000007459-4"></A>Greed is the principle that if a quantifier (like <CODE CLASS="literal">*</CODE>) can match a varying number of times, it will prefer to match as long a substring as it can.
This is explained in <A CLASS="xref" HREF="ch06_16.htm" TITLE="Greedy and Non-Greedy Matches">Recipe 6.15</A>.</P><P CLASS="para">Eagerness is the notion that the leftmost match wins. The engine is very eager to return you a match as quickly as possible, sometimes even before you are expecting it. Consider the match <CODE CLASS="literal">&quot;Fred&quot;</CODE> <CODE CLASS="literal">=~</CODE> <CODE CLASS="literal">/x*/</CODE>. If asked to explain this in plain language, you might say "Does the string <CODE CLASS="literal">&quot;Fred&quot;</CODE> contain any <CODE CLASS="literal">x</CODE>'s?" If so, you might be surprised to learn that it seems to. That's because <CODE CLASS="literal">/x*/</CODE> doesn't truly mean "any <CODE CLASS="literal">x</CODE>'s", unless your idea of "any" includes nothing at all. Formally, it means <EM CLASS="emphasis">zero or more</EM> of them, and in this case, zero sufficed for the eager matcher, which matched the empty string at the very start of <CODE CLASS="literal">&quot;Fred&quot;</CODE>.</P><P CLASS="para">A more illustrative example of eagerness would be the following:</P><PRE CLASS="programlisting">$string = &quot;good food&quot;;
$string =~ s/o*/e/;</PRE><P CLASS="para">Can you guess which of the following is in <CODE CLASS="literal">$string</CODE> after that substitution?</P><PRE CLASS="programlisting"><CODE CLASS="userinput"><B><CODE CLASS="replaceable"><I>good food</I></CODE></B></CODE>
<CODE CLASS="userinput"><B><CODE CLASS="replaceable"
