⭐ 欢迎来到虫虫下载站! | 📦 资源下载 📁 资源专辑 ℹ️ 关于我们
⭐ 虫虫下载站

📄 chap34.htm

📁 介绍算法的书籍
💻 HTM
📖 第 1 页 / 共 5 页
字号:





<h1><a name="09d3_1bf5">34.4 The Knuth-Morris-Pratt algorithm<a name="09d3_1bf5"></h1><P>
<a name="09d3_1bf2"><a name="09d3_1bf3"><a name="09d3_1bf4">We now present a linear-time string-matching algorithm due to Knuth, Morris, and Pratt. Their algorithm achieves a <img src="../images/bound.gif">(<I>n + m</I>) running time by avoiding the computation of the transition function <img src="../images/delta12.gif"><I> </I>altogether, and it does the pattern matching using just an auxiliary function <img src="../images/piuc.gif"><I></I>[1 . . <I>m</I>] precomputed from the pattern in time <I>O</I>(<I>m</I>). The array <img src="../images/piuc.gif"><I> </I>allows the transition function <img src="../images/delta12.gif"><I> </I>to be computed efficiently (in an amortized sense) &quot;on the fly&quot; as needed. Roughly speaking, for any state <I>q</I> = 0, 1, . . . <I>, m,</I>and any character <I>a</I> <img src="../images/memof12.gif"> <img src="../images/sum14.gif">, the value <img src="../images/piuc.gif"><I></I>[<I>q</I>] contains the information that is independent of <I>a</I> and is needed to compute <img src="../images/delta12.gif"><I> </I>(<I>q, a</I>). (This remark will be clarified shortly.) Since the array <img src="../images/piuc.gif"><I> </I>has only <I>m</I> entries, whereas <img src="../images/delta12.gif"><I> </I>has <I>O</I>(<I>m</I> <img src="../images/sglvrt.gif"><img src="../images/sum14.gif"><img src="../images/sglvrt.gif">) entries, we save a factor of <img src="../images/sum14.gif"> in the preprocessing by computing <img src="../images/piuc.gif"><I> </I>rather than <img src="../images/delta12.gif"><I>.</I><P>





<h2>The prefix function for a pattern</h2><P>
<a name="09d4_1bf5">The prefix function for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid testing useless shifts in the naive pattern-matching algorithm or to avoid the precomputation of <img src="../images/delta12.gif"><I></I> for a string-matching automaton.<P>
Consider the operation of the naive string matcher. Figure 34.9(a) shows a particular shift <I>s</I> of a template containing the pattern <I>P</I> = ababaca against a text <I>T</I>. For this example, <I>q</I> = 5 of the characters have matched successfully, but the 6th pattern character fails to match the corresponding text character. The information that <I>q</I> characters have matched successfully determines the corresponding text characters. Knowing these <I>q</I> text characters allows us to determine immediately that certain shifts are invalid. In the example of the figure, the shift <I>s</I> + 1 is necessarily invalid, since the first pattern character, an a, would be aligned with a text character that is known to match with the second pattern character, a b. The shift <I>s</I> + 2 shown in part (b) of the figure, however, aligns the first three pattern characters with three text characters that must necessarily match. In general, it is useful to know the answer to the following question:<P>
Given that pattern characters <I>P</I>[1 . . <I>q</I>] match text characters <I>T</I>[<I>s</I> + 1 . . <I>s</I> + <I>q</I>], what is the least shift <I>s</I><I>'</I> &gt; <I>s</I> such that<P>
<pre><I>P</I>[1 . . <I>k</I>] = T[s' 1 . . s' <I>k</I>],</sub></sup></pre><P>
<h4><a name="09d4_1bf8">(34.5)<a name="09d4_1bf8"></sub></sup></h4><P>
where <I>s</I>' <I>+ k</I> = <I>s + q</I>?<P>
Such a shift <I>s</I>' is the first shift greater than <I>s</I> that is not necessarily invalid due to our knowledge of <I>T</I>[<I>s</I> + 1 . . <I>s</I> + <I>q</I>]. In the best case, we have that <I>s</I><I>'</I> = <I>s</I> + <I>q</I>, and shifts <I>s</I> + 1, <I>s</I> + 2, . . . , <I>s</I> + <I>q</I> - 1 are all immediately ruled out. In any case, at the new shift <I>s</I>'<I> </I>we don't need to compare the first <I>k</I> characters of <I>P</I> with the corresponding characters of <I>T</I>, since we are guaranteed that they match by equation (34.5).<P>
<img src="870_a.gif"><P>
<h4><a name="09d4_1bf9">Figure 34.9 The prefix function <img src="../images/piuc.gif">. (a) The pattern P = ababaca is aligned with a text T so that the first q = 5 characters match. Matching characters, shown shaded, are connected by vertical lines. (b) Using only our knowledge of the 5 matched characters, we can deduce that a shift of s + 1 is invalid, but that a shift of s' = s + 2 is consistent with everything we know about the text and therefore is potentially valid. (c) The useful information for such deductions can be precomputed by comparing the pattern with itself. Here, we see that the longest prefix of P that is also a suffix of P<SUB>5</SUB> is P<SUB>3</SUB>. This infomation is precomputed and represented in the array <img src="../images/piuc.gif">, so that <img src="../images/piuc.gif">[5] = 3. Given that q characters have matched successfully at shift s, the next potentially valid shift is at s' = s + (q - <img src="../images/piuc.gif">[q]).<a name="09d4_1bf9"></sub></sup></h4><P>
The necessary information can be precomputed by comparing the pattern against itself, as illustrated in Figure 34.9(<I>c</I>). Since <I>T</I>[<I>s</I>'<I> + 1 . . s'</I> + <I>k</I>] is part of the known portion of the text, it is a suffix of the string <I>P<SUB>q</I></SUB>. Equation (34.5) can therefore be interpreted as asking for the largest <I>k</I> &lt; <I>q </I>such that <img src="871_a.gif"><FONT FACE="Courier New" SIZE=2>.</FONT>Then, <I>s</I><I>'</I> = <I>s</I> + (<I>q - k</I>) is the next potentially valid shift. It turns out to be convenient to store the number <I>k</I> of matching characters at the new shift <B>s<I></B>'</I>, rather than storing, say, <I>s</I><I>'</I> - <I>s</I>. This information can be used to speed up both the naive string-matching algorithm and the finite-automaton matcher.<P>
We formalize the precomputation required as follows. Given a pattern <I>P</I>[1 . . <I>m</I>], the <I><B>prefix function</I></B> for the pattern <I>P</I> is the function <img src="../images/piuc.gif"> : {1, 2, . . . , <I>m</I>}<img src="../images/arrow12.gif">{0, 1, . . . , <I>m</I> - 1 } such that<P>
<img src="871_b.gif"><P>
That is, <img src="../images/piuc.gif">[<I>q</I>] is the length of the longest prefix of <I>P</I> that is a proper suffix of <I>P<SUB>q</I></SUB>. As another example, Figure 34.10(a) gives the complete prefix function <img src="../images/piuc.gif"> for the pattern <FONT FACE="Courier New" SIZE=2>ababababca</FONT>. <P>
The Knuth-Morris-Pratt matching algorithm is given in pseudocode below as the procedure KMP-<FONT FACE="Courier New" SIZE=2>MATCHER</FONT>. It is mostly modeled after <FONT FACE="Courier New" SIZE=2>FINITE</FONT>-<FONT FACE="Courier New" SIZE=2>AUTOMATON</FONT>-<FONT FACE="Courier New" SIZE=2>MATCHER</FONT>, as we shall see. KMP-<FONT FACE="Courier New" SIZE=2>MATCHER</FONT> calls the auxiliary procedure <FONT FACE="Courier New" SIZE=2>COMPUTE</FONT>-<FONT FACE="Courier New" SIZE=2>PREFIX</FONT>-<FONT FACE="Courier New" SIZE=2>FUNCTION</FONT> to compute <img src="../images/piuc.gif">.<P>
<pre><a name="09d4_1bf6">KMP-MATCHER(<I>T, P</I>)</sub></sup></pre><P>
<pre>1 <I>n</I> <img src="../images/arrlt12.gif"> <I>length</I>[<I>T</I>]</sub></sup></pre><P>
<pre>2 <I>m</I> <img src="../images/arrlt12.gif"> <I>length</I>[<I>P</I>]</sub></sup></pre><P>
<pre>3 <img src="../images/piuc.gif"> <img src="../images/arrlt12.gif"> COMPUTE-PREFIX-FUNCTION(<I>P</I>)</sub></sup></pre><P>
<pre>4 <I>q</I> <img src="../images/arrlt12.gif"> 0</sub></sup></pre><P>
<pre>5 <B>for</B> <I>i</I> <img src="../images/arrlt12.gif"> 1 <B>to</B> <I>n</I></sub></sup></pre><P>
<pre>6<B>      do while</B> <I>q</I> &gt; 0 and <I>P</I>[<I>q</I> + 1] <img src="../images/noteq.gif"> <I>T</I>[<I>i</I>]</sub></sup></pre><P>
<pre>7<B>             do</B> <I>q</I> <img src="../images/arrlt12.gif"> <img src="../images/piuc.gif">[<I>q</I>]</sub></sup></pre><P>
<pre>8<B>        if</B> <I>P</I>[<I>q</I> + 1] = <I>T</I>[<I>i</I>]</sub></sup></pre><P>
<pre>9<B>           then</B> <I>q</I> <img src="../images/arrlt12.gif"> <I>q</I> + 1</sub></sup></pre><P>
<pre>10<B>        if</B> <I>q</I> = <I>m</I></sub></sup></pre><P>
<pre>11<B>           then</B> print &quot;Pattern occurs with shift&quot; <I>i</I> - <I>m</I></sub></sup></pre><P>
<pre>12<I>                q</I> <img src="../images/arrlt12.gif"> <img src="../images/piuc.gif">[<I>q</I>]</sub></sup></pre><P>
<pre></sub></sup></pre><P>
<pre><a name="09d4_1bf7">COMPUTE-PREFIX-FUNCTION(<I>P</I>)</sub></sup></pre><P>
<pre>1 <I>m</I> <img src="../images/arrlt12.gif"> <I>length</I>[<I>P</I>]</sub></sup></pre><P>
<pre>2 <img src="../images/piuc.gif"> [1] <img src="../images/arrlt12.gif"> 0</sub></sup></pre><P>
<pre>3 <I>k</I> <img src="../images/arrlt12.gif"> 0</sub></sup></pre><P>
<pre>4 <B>for</B> <I>q</I> <img src="../images/arrlt12.gif"> 2 <B>to</B> <I>m</I></sub></sup></pre><P>
<pre>5 <B>     do while</B> <I>k</I> &gt; 0 and <I>P</I>[<I>k</I> + 1 <img src="../images/noteq.gif"> <I>P</I>[<I>q</I>]</sub></sup></pre><P>
<pre>6<B>            do</B> <I>k</I> <img src="../images/arrlt12.gif"> <img src="../images/piuc.gif">[<I>k</I>]</sub></sup></pre><P>
<pre>7<B>        if</B> <I>P</I>[<I>k</I> + 1] = <I>P</I>[<I>q</I>]</sub></sup></pre><P>
<pre>8<B>           then</B> <I>k</I> <img src="../images/arrlt12.gif"> <I>k</I> + 1</sub></sup></pre><P>
<pre>9        <img src="../images/piuc.gif">[<I>q</I>] <img src="../images/arrlt12.gif"> <I>k</I></sub></sup></pre><P>
<pre>10 <B>return</B> <img src="../images/piuc.gif"></sub></sup></pre><P>
We begin with an analysis of the running times of these procedures. Proving these procedures correct will be more complicated.<P>
<P>







<h2>Running-time analysis</h2><P>
<a name="09d5_1bf8"><a name="09d5_1bf9">The running time of <FONT FACE="Courier New" SIZE=2>COMPUTE</FONT>-<FONT FACE="Courier New" SIZE=2>PREFIX</FONT>-<FONT FACE="Courier New" SIZE=2>FUNCTION</FONT> is <I>O</I>(<I>m</I>), using an amortized analysis (see Chapter 18). We associate a potential of <I>k</I> with the current state <I>k</I> of the algorithm. This potential has an initial value of 0, by line 3. Lin

⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -