📄 chap34.htm
字号:
<P>
<h1><a name="09cc_1bd9">34.2 The Rabin-Karp algorithm<a name="09cc_1bd9"></h1><P>
<a name="09cc_1bd4"><a name="09cc_1bd5">Rabin and Karp have proposed a string-matching algorithm that performs well in practice and that also generalizes to other algorithms for related problems, such as two-dimensional pattern matching. The worst-case running time of the Rabin-Karp algorithm is <I>O</I>((<I>n</I> - <I>m</I> + 1)<I>m</I>), but it has a good average-case running time.<P>
This algorithm makes use of elementary number-theoretic notions such as the equivalence of two numbers modulo a third number. You may want to refer to Section 33.1 for the relevant definitions.<P>
For expository purposes, let us assume that <img src="../images/sum14.gif"> = {<FONT FACE="Courier New" SIZE=2>0</FONT>, <FONT FACE="Courier New" SIZE=2>1</FONT>, <FONT FACE="Courier New" SIZE=2>2</FONT>, . . . , <FONT FACE="Courier New" SIZE=2>9</FONT>}, so that each character is a decimal digit. (In the general case, we can assume that each character is a digit in radix-<I>d</I> notation, where <I>d</I> = <img src="../images/sglvrt.gif"><img src="../images/sum14.gif"><img src="../images/sglvrt.gif">.) We can then view a string of <I>k</I> consecutive characters as representing a length-<I>k</I> decimal number. The character string <FONT FACE="Courier New" SIZE=2>31415</FONT> thus corresponds to the decimal number 31,415. Given the dual interpretation of the input characters as both graphical symbols and digits, we find it convenient in this section to denote them as we would digits, in our standard text font.<P>
Given a pattern <I>P</I>[1 . . <I>m</I>], we let <I>p</I> denote its corresponding decimal value. In a similar manner, given a text <I>T</I>[1 . . <I>n</I>], we let <I>t<SUB>s</I></SUB> denote the decimal value of the length-<I>m</I> substring <I>T</I>[<I>s </I>+ 1 . . <I>s </I>+ <I>m</I>], for <I>s</I> = 0, 1, . . . , <I>n - m</I>. Certainly, <I>t<SUB>s</I></SUB> = <I>p</I> if and only if <I>T</I>[<I>s </I>+ 1 . . <I>s </I>+ <I>m</I>] = <I>P</I>[1 . . <I>m</I>]; thus, <I>s</I> is a valid shift if and only if <I>t</I><SUB>s</SUB> = <I>p</I>. If we could compute <I>p</I> in time <I>O</I>(<I>m</I>) and all of the <I>t<SUB>i</I></SUB> values in a total of <I>O</I>(<I>n</I>) time, then we could determine all valid shifts <I>s</I> in time <I>O</I>(<I>n</I>) by comparing <I>p</I> with each of the <I>t<SUB>s</I></SUB>'s. (For the moment, let's not worry about the possibility that <I>p</I> and the <I>t<SUB>s</I></SUB>'s might be very large numbers.)<P>
We can compute <I>p</I> in time <I>O</I>(<I>m</I>) using Horner's rule (see Section 32.1):<P>
<pre><I>p</I> = <I>P</I>[<I>m</I>] + 10 (<I>P</I>[<I>m </I>- 1] + 10(P[m - 2] + <I>. . .</I> + 10(<I>P</I>[2] + 10<I>P</I>[1])<I> . . . </I>)).</sub></sup></pre><P>
The value <I>t</I><SUB>0</SUB> can be similarly computed from <I>T</I>[1 . . <I>m</I>] in time O(<I>m</I>).<P>
To compute the remaining values <I>t</I><SUB>1</SUB>, <I>t</I><SUB>2</SUB>, . . . , <I>t<SUB>n-m</I></SUB> in time <I>O</I>(<I>n</I> - <I>m</I>), it suffices to observe that <I>t</I><SUB>s + 1</SUB> can be computed from <I>t</I><SUB>s</SUB> in constant time, since<P>
<pre><I>t<SUB>s</I> + 1 </SUB> = 10(<I>t<SUB>s</I></SUB> - 10<I><SUP>m</I> - 1</SUP><I>T</I>[<I>s </I>+ 1]) + <I>T</I>[<I>s </I>+ m <I>+ </I>1].</sub></sup></pre><P>
<h4><a name="09cc_1bda">(34.1)<a name="09cc_1bda"></sub></sup></h4><P>
For example, if <I>m</I>= 5 and <I>t<SUB>s</I></SUB> = 31415, then we wish to remove the high-order digit <I>T</I>[<I>s </I>+ 1] = 3 and bring in the new low-order digit (suppose it is <I>T</I>[<I>s </I>+ 5 + 1] = 2) to obtain<P>
<pre><I>t<SUB>s</I>+1 </SUB>= 10(31415 - 10000.3) + 2</sub></sup></pre><P>
<pre>= 14152 .</sub></sup></pre><P>
Subtracting 10<I><SUP>m</I></SUP>-1 <I>T</I>[<I>s</I>+1] removes the high-order digit from <I>t<SUB>s</I></SUB>, multiplying the result by 10 shifts the number left one position, and adding <I>T</I>[<I>s </I>+ <I>m </I>+ 1] brings in the appropriate low-order digit. If the constant 10<I><SUP><FONT FACE="Courier New" SIZE=2>m</I></SUP>-1</FONT><SUP> </SUP>is precomputed (which can be done in time <I>O</I>(1g <I>m</I>) using the techniques of Section 33.6, although for this application a straightforward <I>O</I>(<I>m</I>) method is quite adequate), then each execution of equation (34.1) takes a constant number of arithmetic operations. Thus, p and <I>t</I><SUB>0</SUB>, <I>t</I><SUB>1</SUB>, . . . , <I>t<SUB>n</I></SUB>-<I>m</I> can all be computed in time <I>O</I>(<I>n </I>+ <I>m</I>), and we can find all occurrences of the pattern <I>P</I>[1 . . <I>m</I>] in the text <I>T</I>[1 . . <I>n</I>] in time <I>O</I>(<I>n </I>+ <I>m</I>).<P>
The only difficulty with this procedure is that <I>p</I> and <I>t<SUB>s</I></SUB> may be too large to work with conveniently. If <I>P</I> contains <I>m</I> characters, then assuming that each arithmetic operation on <I>p</I> (which is <I>m</I> digits long) takes "constant time" is unreasonable. Fortunately, there is a simple cure for this problem, as shown in Figure 34.4 : compute <I>p</I> and the <I>t<SUB>s</I></SUB>'s modulo a suitable modulus <I>q</I>. Since the computation of <I>p</I>, <I>t</I><SUB>0</SUB>, and the recurrence (34.1) can all be performed modulo <I>q</I>, we see that <I>p</I> and all the <I>t<SUB>s</I></SUB>'s can be computed modulo <I>q</I> in time <I>O</I>(<I>n </I>+ <I>m</I>). The modulus <I>q</I> is typically chosen as a prime such that 10<I>q</I> just fits within one computer word, which allows all of the necessary computations to be performed with single-precision arithmetic.<P>
<img src="859_a.gif"><P>
<h4><a name="09cc_1bdb">Figure 34.4 The Rabin-Karp algorithm. Each character is a decimal digit, and we compute values modulo 13. (a) A text string. A window of length 5 is shown shaded. The numerical value of the shaded number is computed modulo 13, yielding the value 7. (b) The same text string with values computed modulo 13 for each possible position of a length-5 window. Assuming the pattern P = 31415, we look for windows whose value modulo 13 is 7, since 31415 <img src="../images/equiv10.gif"> 7 (mod 13). Two such windows are found, shown shaded in the figure. The first, beginning at text position 7, is indeed an occurrence of the pattern, while the second, beginning at text position 13, is a spurious hit. (c) Computing the value for a window in constant time, given the value for the previous window. The first window has value 31415. Dropping the high-order digit 3, shifting left (multiplying by 10), and then adding in the low-order digit 2 gives us the new value 14152. All computations are performed modulo 13, however, so the value for the first window is 7, and the value computed for the new window is 8.<a name="09cc_1bdb"></sub></sup></h4><P>
In general, with a <I>d</I>-ary alphabet {0, 1, . . . ,<I>d</I> - 1}, we choose <I>q</I> so that <I>d q </I>fits within a computer word and adjust the recurrence equation (34.1) to work modulo <I>q</I>, so that it becomes<P>
<pre><I>t<SUB>s</I>+1</SUB> = (<I>d</I>(<I>t<SUB>s</I></SUB> - <I>T</I>[<I>s</I> + 1]<I>h</I>) + <I>T</I>[<I>s</I> + <I>m</I> + 1]) mod <I>q </I>,</sub></sup></pre><P>
<h4><a name="09cc_1bdc">(34.2)<a name="09cc_1bdc"></sub></sup></h4><P>
where <I>h</I> <img src="../images/equiv10.gif"> <I>d<SUP>m</I>-1</SUP> (mod <I>q</I>) is the value of the digit "1" in the high-order position of an <I>m</I>-digit text window.<P>
<a name="09cc_1bd6">The ointment of working modulo <I>q</I> now contains a fly, however, since <I>t<SUB>s</I></SUB> <img src="../images/equiv10.gif"> <I>p</I> (mod <I>q</I>) does not imply that <I>t<SUB>s</I></SUB> = <I>p</I>. On the other hand, if <img src="860_a.gif"> (mod <I>q</I>), then we definitely have that <I>t<SUB>s</I></SUB> <img src="../images/noteq.gif"> <I>p</I>, so that shift <I>s</I> is invalid. We can thus use the test <I>t<SUB>s</I></SUB> <img src="../images/equiv10.gif"> <I>p</I> (mod <I>q</I>) as a fast heuristic test to rule out invalid shifts <I>s</I>. Any shift <I>s</I> for which <I>t<SUB>s</I></SUB> <img src="../images/equiv10.gif"> <I>p</I> (mod <I>q</I>) must be tested further to see if <I>s</I> is really valid or we just have a <I><B>spurious hit</I></B>. This testing can be done by explicitly checking the condition <I>P</I>[1 . . <I>m</I>] = <I>T</I>[<I>s</I> + 1 . . <I>s</I> + <I>m</I>]. If <I>q</I> is large enough, then we can hope that spurious hits occur infrequently enough that the cost of the extra checking is low.<P>
The following procedure makes these ideas precise. The inputs to the procedure are the text <I>T</I>, the pattern <I>P</I>, the radix <I>d</I> to use (which is typically taken to be |<img src="../images/sum14.gif">|), and the prime <I>q</I> to use.<P>
<pre><a name="09cc_1bd7">RABIN-KARP-MATCHER(<I>T, P, d, q</I>)</sub></sup></pre><P>
<pre>1 <I>n</I> <img src="../images/arrlt12.gif"> <I>length</I>[<I>T</I>]</sub></sup></pre><P>
<pre>2 <I>m</I><img src="../images/arrlt12.gif"> <I>length</I>[<I>P</I>]</sub></sup></pre><P>
<pre>3 <I>h</I> <img src="../images/arrlt12.gif"> <I>d<SUP>m</I></SUP>-<I>1<SUP> mod </I>q</sub></sup></pre><P>
<pre>4 <I>p</I> <img src="../images/arrlt12.gif"> 0</sub></sup></pre><P>
<pre>5 <I>t</I><SUB>0</SUB> <img src="../images/arrlt12.gif"> 0</sub></sup></pre><P>
<pre>6 <B>for</B> <I>i</I> <img src="../images/arrlt12.gif"> 1 <B>to</B> <I>m</I></sub></sup></pre><P>
<pre>7<B> do</B> <I>p</I> <img src="../images/arrlt12.gif"> (<I>dp</I> + <I>P</I>[<I>i</I>]) mod <I>q</I></sub></sup></pre><P>
<pre>8<I> t</I><SUB>0</SUB> <img src="../images/arrlt12.gif"> (<I>dt</I><SUB>0</SUB> + <I>T</I>[<I>i</I>]) mod <I>q</I></sub></sup></pre><P>
<pre>9 <B>for</B> <I>s</I> <img src="../images/arrlt12.gif"> 0 <B>to</B> <I>n</I> - <I>m</I></sub></sup></pre><P>
<pre>10 <B>do if</B> <I>p</I> = <I>t<SUB>s</I></sub></sup></pre><P>
<pre>11<B> then if</B> <I>P</I>[1 . . <I>m</I>] = <I>T</I>[<I>s</I> + 1 . . <I>s</I> + <I>m</I>]</sub></sup></pre><P>
<pre>12<B> then</B> "Pattern occurs with shift" <I>s</I></sub></sup></pre><P>
<pre>13<B> if</B> <I>s</I> < <I>n</I> - <I>m</I></sub></sup></pre><P>
<pre>14<B> then</B> <I>t<SUB>s</I>+1</SUB> <img src="../images/arrlt12.gif"> (<I>d</I>(<I>t<SUB>s</I></SUB> - <I>T</I>[<I>s</I> + 1]<I>h</I>) + <I>T</I>[<I>s</I> + <I>m</I> + 1]) mod <I>q</I></sub></sup></pre><P>
The procedure <FONT FACE="Courier New" SIZE=2>RABIN</FONT>-<FONT FACE="Courier New" SIZE=2>KARP</FONT>-<FONT FACE="Courier New" SIZE=2>MATCHER</FONT> works as follows. All characters are interpreted as radix-<I>d</I> digits. The subscripts on <I>t</I> are provided only for clarity; the program works correctly if all the subscripts are dropped. Line 3 initializes <I>h</I> to the value of the high-order digit position of an <I>m</I>-digit window. Lines 4-8 compute <I>p</I> as the value of <I>P</I>[1 . . <I>m</I>] mod <I>q </I>and <I>t</I><SUB>0</SUB> as the value of <I>T</I>[1 . . <I>m</I>] mod <I>q</I>. The <B>for</B> loop beginning on line 9 iterates through all possible shifts <I>s</I>. The loop has the following invariant: whenever line 10 is executed, <I>t<SUB>s</I></SUB> = <I>T</I>[<I>s</I> + 1 . . <I>s</I> + <I>m</I>] mod <I>q</I>. If <I>p</I> = <I>t<SUB>s</I></SUB> in line 10 (a "hit"), then we check to see if <I>P</I>[1 . . <I>m</I>] = <I>T</I>[<I>s</I> + 1 . . <I>s</I> + <I>m</I>] in line 11 to rule out the possibility of a spurious hit. Any valid shifts found are printed out on line 12. If <I>s</I> < <I>n</I> - <I>m</I> (checked in line 13), then the <B>for</B> loop is to be executed at least one more time, and so line 14 is first executed to ensure that the loop invariant holds when line 10 is again reached. Line 14 computes the value of <I>t<SUB>s+</I>1</SUB> mod <I>q</I> from the value of <I>t<SUB>s</I></SUB> mod <I>q</I> in constant time using equation (34.2) directly.<P>
<a name="09cc_1bd8">The running time of <FONT FACE="Courier New" SIZE=2>RABIN</FONT>-<FONT FACE="Courier New" SIZE=2>KARP</FONT>-<FONT FACE="Courier New" SIZE=2>MATCHER</FONT> is <img src="../images/bound.gif">((<I>n</I> - <I>m</I> + 1)<I>m</I>) in the worst case, since (like the naive string-matching algorithm) the Rabin-Karp algorithm explicitly verifies every valid shift. If <I>P</I> = a<I><SUP>m</I></SUP> and <I>T</I> = a<I><SUP>n</I></SUP>, then the verifications take time <img src="../images/bound.gif">((<I>n</I> - <I>m</I> + 1)<I>m</I>), since each of the <I>n</I> - <I>m</I> + 1 possible shifts is valid. (Note also that the computation of <I>d<SUP>m</I>-1</SUP> mod <I>q</I> on line 3 and the loop on lines 6-8 take time <I>O</I>(<I>m</I>) = <I>O</I>((<I>n</I> - <I>m</I> + 1 )<I>m</I>).)<P>
In many applications, we expect few valid shifts (perhaps <I>O</I>(1) of them), and so the expected running time of the algorithm is <I>O</I>(<I>n</I> + <I>m</I>) plus the time required to process spurious hits. We can base a heuristic analysis on the assumption that reducing values modulo <I>q</I> acts like a random mapping from <img src="../images/sum14.gif">* to <B>Z</B><I><SUB>q</I></SUB>. (See the discussion on the use of division for hashing in Section 12.3.1. It is difficult to formalize and prove such an assumption, although one viable approach is to assume that <I>q</I> is chosen randomly from integers of the appropriate size. We shall not pursue this formalization here.) We can then expect that the number of spurious hits is <I>O</I>(<I>n</I>/<I>q</I>), since the chance that an arbitrary <I>t<SUB>s</I></SUB> will be equivalent to <I>p</I>, modulo <I>q</I>, can be estimated as 1/<I>q</I>. The expected amount of time taken by the Rabin-Karp algorithm is then<P>
<pre><I>O</I>(<I>n</I>) + <I>O</I>(<I>m</I>(<I>v</I> +<I> n</I>/<I>q</I>)),</sub></sup></pre><P>
where <I>v</I> is the number of valid shifts. This running time is <I>O</I>(<I>n</I>) if we choose <I>q</I> <img src="../images/gteq.gif"> <I>m</I>. That is, if the expected number of valid shifts is small (<I>O</I>(1)) and the prime <I>q</I> is chosen to be larger than the length of the pattern, then we can expect the Rabin-Karp procedure to run in time <I>O</I>(<I>n</I> + <I>m</I>).<P>
<h2><a name="09cd_1bda">Exercises<a name="09cd_1bda"></h2><P>
<a name="09cd_1bdb">34.2-1<a name="09cd_1bdb"><P>
Working modulo <I>q</I> = 11, how many spurious hits does the Rabin-Karp matcher encounter in the text <I>T</I> = 3141592653589793 when looking for the pattern <I>P</I> = 26? <P>
<a name="09cd_1bdc">34.2-2<a name="09cd_1bdc"><P>
How would you extend the Rabin-Karp method to the problem of searching a text string for an occurrence of any one of a given set of <I>k</I> patterns?<P>
<a name="09cd_1bdd">34.2-3<a name="09cd_1bdd"><P>
Show how to extend the Rabin-Karp method to handle the problem of looking for a given <I>m</I> X <I>m</I> pattern in an <I>n</I> X <I>n</I> array of characters. (The pattern may be shifted vertically and horizontally, but it may not be rotated.)<P>
<a name="09cd_1bde">34.2-4<a name="09cd_1bde"><P>
<a name="09cd_1bd9">Alice has a copy of a long <I>n</I>-bit file <I>A</I> = <img src="../images/lftwdchv.gif"><I>a<SUB>n</I>-1</SUB>, <I>a<SUB>n</I>-2</SUB>, . . . , <I>a</I><SUB>0</SUB><img src="../images/wdrtchv.gif">, and Bob similarly has an <I>n</I>-bit file <I>B = </I><img src="../images/lftwdchv.gif">b<SUB>n<I>-1</SUB>, </I>b<SUB>n<I>-2</SUB>, . . . ,</I>b<I><SUB>0</I></SUB><img src="../images/wdrtchv.gif">. Alice and Bob wish to know if their files are identical. To avoid transmitting all of <I>A</I> or <I>B</I>, they use the following fast probabilistic check. Together, they select a prime <I>q</I> > 1000<I>n</I> and randomly select an integer <I>x</I> from {0, 1, . . . , <I>n</I> - 1}. Then, Alice evaluates<P>
<img src="862_a.gif"><P>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -