ch06_01.htm
来自「By Tom Christiansen and Nathan Torkingto」· HTM 代码 · 共 1,459 行 · 第 1/3 页
HTM
1,459 行
><I>geod food</I></CODE></B></CODE><CODECLASS="userinput"><B><CODECLASS="replaceable"><I>geed food</I></CODE></B></CODE><CODECLASS="userinput"><B><CODECLASS="replaceable"><I>geed feed</I></CODE></B></CODE><CODECLASS="userinput"><B><CODECLASS="replaceable"><I>ged food</I></CODE></B></CODE><CODECLASS="userinput"><B><CODECLASS="replaceable"><I>ged fed</I></CODE></B></CODE><CODECLASS="userinput"><B><CODECLASS="replaceable"><I>egood food</I></CODE></B></CODE></PRE><PCLASS="para">The answer is the last one because the earliest point at which zero or more occurrences of <CODECLASS="literal">"o"</CODE> could be found was right at the beginning of the string. Surprised? Regular expressions can do that to you.</P><PCLASS="para">Can you guess what adding <CODECLASS="literal">/g</CODE><ACLASS="indexterm"NAME="ch06-idx-1000007466-0"></A> modifier to make the substitution global will do? Think of it this way: that string has many places where zero or more instances of <CODECLASS="literal">"o"</CODE> occur - eight, to be precise. The answer is <CODECLASS="literal">"egeede</CODE> <CODECLASS="literal">efeede"</CODE>.</P><PCLASS="para">Here's another example of where greed takes a back seat to eagerness:</P><PRECLASS="programlisting">% echo ababacaca | perl -ne 'print "$&\n" if /(a|ba|b)+(a|ac)+/'<CODECLASS="userinput"><B><CODECLASS="replaceable"><I>ababa</I></CODE></B></CODE></PRE><PCLASS="para">That's because Perl uses what's called a traditional NFA,[<ACLASS="footnote"HREF="#ch06-pgfId-1000000612">2</A>] a <ACLASS="indexterm"NAME="ch06-idx-1000007467-0"></A><ACLASS="indexterm"NAME="ch06-idx-1000007467-1"></A>non-deterministic finite automaton. This kind of matching engine is not guaranteed to return the longest <EMCLASS="emphasis">overall</EM> match, just the longest, leftmost match. You might think of Perl's greed as being left-to-right directed, not globally greedy.</P><BLOCKQUOTECLASS="footnote"><DIVCLASS="footnote"><PCLASS="para"><ACLASS="footnote"NAME="ch06-pgfId-1000000612">[2]</A> As opposed to a POSIX-style NFA. See <EMCLASS="emphasis">Mastering Regular Expressions</EM> for the differences.</P></DIV></BLOCKQUOTE><PCLASS="para">But it doesn't have to be that way. Here's an example using <EMCLASS="emphasis">awk</EM>, a language that Perl borrows a lot from:</P><PRECLASS="programlisting">% echo ababacaca | awk 'match($0,/(a|ba|b)+(a|ac)+/) { print substr($0, RSTART, RLENGTH) }'<CODECLASS="userinput"><B><CODECLASS="replaceable"><I>ababacaca</I></CODE></B></CODE></PRE><PCLASS="para">Choosing how to implement pattern matching depends mainly on two factors: are the expressions nonregular (do they use backreferences), and what needs to be returned (yes/no, range of whole match, ranges of subexpressions). Tools like <EMCLASS="emphasis">awk</EM>, <EMCLASS="emphasis">egrep</EM>, and <EMCLASS="emphasis">lex</EM> use regular expressions and only need a yes/no answer or the range of the whole match. This is exactly what <ACLASS="indexterm"NAME="ch06-idx-1000008226-0"></A><ACLASS="indexterm"NAME="ch06-idx-1000008226-1"></A>DFAs can support, and because DFAs are faster and simpler, these tools have traditionally used DFA implementations. Pattern matching within programs and libraries, such as <EMCLASS="emphasis">ed</EM>, <EMCLASS="emphasis">regex</EM>, and <EMCLASS="emphasis">perl</EM>, is another kettle of fish; typically, we need to support nonregular expressions and we need to know what parts of the string were matched by various parts of the pattern. This is a much harder problem with potentially exponential run times. The natural algorithm for this problem is an NFA, and therein lies both a problem and an opportunity. The problem is that NFAs are slow. The opportunity is that significant performance gains can be made by rewriting the patterns to exploit how the particular NFA implementation runs. This is a major part of Jeffrey Friedl's book, <EMCLASS="emphasis">Mastering Regular Expressions</EM>.<ACLASS="indexterm"NAME="ch06-idx-1000007479-0"></A><ACLASS="indexterm"NAME="ch06-idx-1000007479-1"></A><ACLASS="indexterm"NAME="ch06-idx-1000007479-2"></A><ACLASS="indexterm"NAME="ch06-idx-1000007479-3"></A><ACLASS="indexterm"NAME="ch06-idx-1000007479-4"></A></P><PCLASS="para"><ACLASS="indexterm"NAME="ch06-idx-1000007477-0"></A><ACLASS="indexterm"NAME="ch06-idx-1000007477-1"></A>The last and most powerful of the three tricky bits in pattern matching is backtracking. For a pattern to match, the entire regular expression must match, not just part of it. So if the beginning of a pattern containing a quantifier succeeds in a way that causes later parts in the pattern to fail, the matching engine backs up and tries to find another match for the beginning part - that's why it's called backtracking. Essentially, it means that the engine is going to try different possibilities, systematically investigating alternate matches until it finds one that works. In some pattern matching implementations, you keep backtracking in case other submatches make the overall match longer. Perl's matcher doesn't do that; as soon as one possibility works, it uses that - until and unless something later on in the pattern fails, forcing a backtrack to retry another possible way of matching. This is discussed in <ACLASS="xref"HREF="ch06_17.htm"TITLE="Detecting Duplicate Words">Recipe 6.16</A>.</P></DIV><DIVCLASS="sect2"><H3CLASS="sect2"><ACLASS="title"NAME="ch06-chap06_pattern_matching_0">Pattern-Matching Modifiers</A></H3><PCLASS="para"><ACLASS="indexterm"NAME="ch06-idx-1000010775-0"></A>Pattern-matching modifiers are a lot easier to list and learn than the different metacharacters. Here's a brief summary of them:</P><TABLECLASS="informaltable"BORDER="1"CELLPADDING="3"><TBODYCLASS="tbody"><TRCLASS="row"VALIGN="TOP"><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para"><CODECLASS="literal">/i</CODE></P></TD><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para">Ignore alphabetic case (locale-aware)</P></TD></TR><TRCLASS="row"VALIGN="TOP"><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para"><CODECLASS="literal">/x</CODE></P></TD><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para">Ignore most whitespace in pattern and permit comments</P></TD></TR><TRCLASS="row"VALIGN="TOP"><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para"><CODECLASS="literal">/g</CODE></P></TD><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para">Global - match/substitute as often as possible</P></TD></TR><TRCLASS="row"VALIGN="TOP"><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para"><CODECLASS="literal">/gc</CODE></P></TD><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para">Don't reset search position on failed match</P></TD></TR><TRCLASS="row"VALIGN="TOP"><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para"><CODECLASS="literal">/s</CODE></P></TD><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para">Let <CODECLASS="literal">. </CODE>match newline; also, ignore deprecated <CODECLASS="literal">$*</CODE></P></TD></TR><TRCLASS="row"VALIGN="TOP"><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para"><CODECLASS="literal">/m</CODE></P></TD><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para">Let <CODECLASS="literal">^ </CODE>and <CODECLASS="literal">$</CODE> match next to embedded <CODECLASS="literal">\n</CODE></P></TD></TR><TRCLASS="row"VALIGN="TOP"><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para"><CODECLASS="literal">/o</CODE></P></TD><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para">Compile pattern once only</P></TD></TR><TRCLASS="row"VALIGN="TOP"><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para"><CODECLASS="literal">/e</CODE></P></TD><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para">Righthand side of a <CODECLASS="literal">s/// </CODE>is code to <CODECLASS="literal">eval</CODE></P></TD></TR><TRCLASS="row"VALIGN="TOP"><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para"><CODECLASS="literal">/ee</CODE></P></TD><TDCLASS="entry"ROWSPAN="1"COLSPAN="1"><PCLASS="para">Righthand side of a <CODECLASS="literal">s/// </CODE>is a string to <CODECLASS="literal">eval</CODE>, then run as code, and its return value <CODECLASS="literal">eval</CODE>'led again.</P></TD></TR></TBODY
⌨️ 快捷键说明
复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?