📄 chap34.htm
字号:
<HTML><HEAD>
<TITLE>Intro to Algorithms: CHAPTER 34: STRING MATCHING</TITLE></HEAD><BODY BGCOLOR="#FFFFFF">
<a href="chap35.htm"><img align=right src="../../images/next.gif" alt="Next Chapter" border=0></A>
<a href="toc.htm"><img align=right src="../../images/toc.gif" alt="Return to Table of Contents" border=0></A>
<a href="chap33.htm"><img align=right src="../../images/prev.gif" alt="Previous Chapter" border=0></A>
<h1><a name="09c8_1bcf">CHAPTER 34: STRING MATCHING<a name="09c8_1bcf"></h1><P>
<a name="09c8_1bbf"><a name="09c8_1bc0"><a name="09c8_1bc1">Finding all occurrences of a pattern in a text is a problem that arises frequently in text-editing programs. Typically, the text is a document being edited, and the pattern searched for is a particular word supplied by the user. Efficient algorithms for this problem can greatly aid the responsiveness of the text-editing program. String-matching algorithms are also used, for example, to search for particular patterns in DNA sequences.<P>
<a name="09c8_1bc2"><a name="09c8_1bc3"><a name="09c8_1bc4">We formalize the <I><B>string-matching problem</I></B> as follows. We assume that the text is an array <I>T</I>[1 . . <I>n</I>] of length <I>n</I> and that the pattern is an array <I>P</I>[1 . . <I>m</I>] of length <I>m</I>. We further assume that the elements of <I>P</I> and <I>T</I> are characters drawn from a finite alphabet <img src="../images/sum14.gif">. For example, we may have <img src="../images/sum14.gif"> = {<FONT FACE="Courier New" SIZE=2>0,</FONT> <FONT FACE="Courier New" SIZE=2>1</FONT>} or <img src="../images/sum14.gif"> = {<FONT FACE="Courier New" SIZE=2>a</FONT>, <FONT FACE="Courier New" SIZE=2>b</FONT>, . . . , <FONT FACE="Courier New" SIZE=2>z</FONT>}. The character arrays <I>P</I> and <I>T</I> are often called <I><B>strings</I></B> of characters.<P>
<a name="09c8_1bc5"><a name="09c8_1bc6"><a name="09c8_1bc7">We say that pattern <I>P</I> <I><B>occurs with shift</I></B> <I><B>s</I></B> in text <I>T</I> (or, equivalently, that pattern <I>P</I> <I><B>occurs beginning at position s </I>+ 1</B> in text <I>T</I>) if 0 <img src="../images/lteq12.gif"> <I>s</I> <img src="../images/lteq12.gif"> <I>n</I> - <I>m</I> and <I>T</I>[<I>s</I> + 1 . . <I>s</I> + <I>m</I>] = <I>P</I>[1 . . <I>m</I>] (that is, if <I>T</I>[<I>s</I> + <I>j</I>] = P[<I>j</I>], for 1 <img src="../images/lteq12.gif"> <I>j</I> <img src="../images/lteq12.gif"> <I>m</I>). If <I>P</I> occurs with shift <I>s</I> in <I>T</I>, then we call <I>s</I> a <I><B>valid shift</I></B>; otherwise, we call <I>s</I> an <I><B>invalid shift</I></B>. The string-matching problem is the problem of finding all valid shifts with which a given pattern <I>P</I> occurs in a given text <I>T</I>. Figure 34.1 illustrates these definitions.<P>
This chapter is organized as follows. In Section 34.1 we review the naive brute-force algorithm for the string-matching problem, which has worst-case running time <I>O</I>((<I>n</I> - <I>m</I> + 1)<I>m</I>). Section 34.2 presents an interesting string-matching algorithm, due to Rabin and Karp. This algorithm also has worst-case running time <I>O</I>((<I>n</I> - <I>m</I> + 1)<I>m</I>), but it works much better on average and in practice. It also generalizes nicely to other pattern-matching problems. Section 34.3 then describes a string-matching algorithm that begins by constructing a finite automaton specifically designed to search for occurrences of the given pattern <I>P</I> in a text. This algorithm runs in time <I>O</I>(<I>n</I> + <I>m</I> <img src="../images/sglvrt.gif"><img src="../images/sum14.gif"><img src="../images/sglvrt.gif">). The similar but much cleverer Knuth-Morris-Pratt (or KMP) algorithm is presented in Section 34.4; the KMP algorithm runs in time <I>O</I>(<I>n</I> + <I>m</I>). Finally, Section 34.5 describes an algorithm due to Boyer and Moore that is often the best practical choice, although its worst-case running time (like that of the Rabin-Karp algorithm) is no better than that of the naive string-matching algorithm.<P>
<img src="853_a.gif"><P>
<h4><a name="09c8_1bd0">Figure 34.1 The string-matching problem. The goal is to find all occurrences of the pattern P = abaa in the text T = abcabaabcabac. The pattern occurs only once in the text, at shift s = 3. The shift s = 3 is said to be a valid shift. Each character of the pattern is connected by a vertical line to the matching character in the text, and all matched characters are shown shaded.<a name="09c8_1bd0"></sub></sup></h4><P>
Notation and terminology<P>
<a name="09c8_1bc8"><a name="09c8_1bc9">We shall let <img src="../images/sum14.gif">* (read "sigma-star") denote the set of all finite-length strings formed using characters from the alphabet <img src="../images/sum14.gif">. In this chapter, we consider only finite-length strings. The zero-length <I><B>empty string</I></B>, denoted <img src="../images/epsilon.gif"><I></I>, also belongs to <img src="../images/sum14.gif">*. The length of a string <I>x</I> is denoted |<I>x</I>|. The <I><B>concatenation</I> </B>of two strings <I>x</I> and <I>y</I>, denoted <I>xy</I>, has length |<I>x</I>| + |<I>y</I>| and consists of the characters from <I>x</I> followed by the characters from <I>y</I>.<P>
<a name="09c8_1bca"><a name="09c8_1bcb"><a name="09c8_1bcc"><a name="09c8_1bcd">We say that a string <I>w</I> is a <I><B>prefix</I></B> of a string <I>x</I>, denoted <img src="854_a.gif"> for some string <I>y</I> <img src="../images/memof12.gif"> <img src="../images/sum14.gif">*. Note that if <img src="854_b.gif">, then |<I>w</I>| <img src="../images/lteq12.gif"> |<I>x</I>|. Similarly, we say that a string <I>w</I> is a <I><B>suffix</I></B> of a string <I>x, </I>denoted <img src="854_c.gif"> for some <I>y</I> <img src="../images/memof12.gif"> <img src="../images/sum14.gif">*. It follows from <img src="854_d.gif"> that |<I>w</I>| <img src="../images/lteq12.gif"> |<I>x</I>|. The empty string <img src="../images/epsilon.gif"><I></I> is both a suffix and a prefix of every string. For example, we have <FONT FACE="Courier New" SIZE=2>ab</FONT> <img src="854_e.gif"> <FONT FACE="Courier New" SIZE=2>abcca </FONT>and <FONT FACE="Courier New" SIZE=2>cca </FONT><img src="854_f.gif"> <FONT FACE="Courier New" SIZE=2>abcca</FONT>. It is useful to note that for any strings <I>x</I> and <I>y</I> and any character <I>a</I>, we have <img src="854_g.gif"> if and only if <img src="854_h.gif">. Also note that <img src="854_i.gif"> are transitive relations. The following lemma will be useful later.<P>
<a name="09c8_1bd1">Lemma 34.1<a name="09c8_1bd1"><P>
<a name="09c8_1bce">Suppose that <I>x</I>, <I>y</I>, and <I>z</I> are strings such that <img src="854_j.gif">. If |<I>x</I>| <img src="../images/lteq12.gif"> |<I>y</I>|, then <img src="854_k.gif">. If |<I>x</I>| <img src="../images/gteq.gif"> |<I>y</I>|, then <img src="854_l.gif">. If |<I>x</I>| = |<I>y</I>|, then <I>x</I> = <I>y</I>.<P>
<I><B>Proof </I></B>See Figure 34.2 for a graphical proof. <P>
For brevity of notation, we shall denote the <I>k</I>-character prefix <I>P</I>[1 . . <I>k</I>] of the pattern <I>P</I>[1 . . <I>m</I>] by <I>P<SUB>k</I></SUB>. Thus, <I>P</I><SUB>0</SUB> = <img src="../images/epsilon.gif"><I></I> and <I>P<SUB>m</I></SUB> = <I>P</I> = <I>P</I>[1 . . <I>m</I>]. Similarly, we denote the <I>k</I>-character prefix of the text <I>T</I> as <I>T<SUB>k</I></SUB>. Using this notation, we can state the string-matching problem as that of finding all shifts <I>s</I> in the range 0 <img src="../images/lteq12.gif"> <I>s</I> <img src="../images/lteq12.gif"> <I>n</I> - <I>m</I> such that <img src="854_m.gif">.<P>
In our pseudocode, we allow two equal-length strings to be compared for equality as a primitive operation. If the strings are compared from left to right and the comparison stops when a mismatch is discovered, we assume that the time taken by such a test is a linear function of the number of matching characters discovered. To be precise, the test "<I>x</I> = <I>y</I>" is assumed to take time <img src="../images/bound.gif">(<I>t </I>+ 1), where <I>t</I> is the length of the longest string <I>z</I> such that <img src="855_e.gif">.<P>
<img src="855_a.gif"><P>
<h4><a name="09c8_1bd2">Figure 34.2 A graphical proof of Lemma 34.1. We suppose that <img src="855_b.gif">. The three parts of the figure illustrate the three cases of the lemma. Vertical lines connect matching regions (shown shaded) of the strings. (a) If |x| <img src="../images/lteq12.gif"> |y|, then <img src="855_c.gif">. (b) If |x| <img src="../images/gteq.gif"> |y|, then <img src="855_d.gif">. (c) If |x| = |y|, then x = y.<a name="09c8_1bd2"></sub></sup></h4><P>
<h1><a name="09ca_1bd2">34.1 The naive string-matching algorithm<a name="09ca_1bd2"></h1><P>
<a name="09ca_1bcf"><a name="09ca_1bd0">The naive algorithm finds all valid shifts using a loop that checks the condition <I>P</I>[1 . . <I>m</I>] = <I>T</I>[<I>s + 1 . . s + m</I>] for each of the <I>n</I> - <I>m</I> + 1 possible values of <I>s</I>.<P>
<pre><a name="09ca_1bd1">NAIVE-STRING-MATCHER(<I>T, P</I>)</sub></sup></pre><P>
<pre>1 <I> n</I> <img src="../images/arrlt12.gif"> <I>length</I>[<I>T</I>]</sub></sup></pre><P>
<pre>2 <I> m</I> <img src="../images/arrlt12.gif"> length[<I>P</I>]</sub></sup></pre><P>
<pre>3<B> for</B> <I>s </I><img src="../images/arrlt12.gif"> 0 <B>to</B> <I>n</I> - <I>m</I></sub></sup></pre><P>
<pre>4 <B> do </B><I><B>if</I></B> <I>P</I>[1 . . <I>m</I>] = <I>T</I>[<I>s</I> + 1 . . <I>s</I> + <I>m</I>]</sub></sup></pre><P>
<pre>5<B> then </B>print "Pattern occurs with shift" <I>s</I></sub></sup></pre><P>
The naive string-matching procedure can be interpreted graphically as sliding a "template" containing the pattern over the text, noting for which shifts all of the characters on the template equal the corresponding characters in the text, as illustrated in Figure 34.3. The <B>for</B> loop beginning on line 3 considers each possible shift explicitly. The test on line 4 determines whether the current shift is valid or not; this test involves an implicit loop to check corresponding character positions until all positions match successfully or a mismatch is found. Line 5 prints out each valid shift <I>s</I>.<P>
<img src="856_a.gif"><P>
<h4><a name="09ca_1bd3">Figure 34.3 The operation of the naive string matcher for the pattern P = <FONT FACE="Courier New" SIZE=2>aab</FONT> and the text T = <FONT FACE="Courier New" SIZE=2>acaabc</FONT>. We can imagine the pattern P as a "template" that we slide next to the text. Parts (a)-(d) show the four successive alignments tried by the naive string matcher. In each part, vertical lines connect corresponding regions found to match (shown shaded), and a jagged line connects the first mismatched character found, if any. One occurrence of the pattern is found, at shift s = 2, shown in part (c).<a name="09ca_1bd3"></sub></sup></h4><P>
Procedure <FONT FACE="Courier New" SIZE=2>NAIVE-</FONT><FONT FACE="Courier New" SIZE=2>STRING</FONT>-<FONT FACE="Courier New" SIZE=2>MATCHER</FONT> takes time <img src="../images/bound.gif">((<I>n</I> - <I>m</I> + 1)<I>m</I>) in the worst case. For example, consider the text string <FONT FACE="Courier New" SIZE=2>a<I><SUP><FONT FACE="Courier New" SIZE=1>n</FONT></I></FONT></SUP> (a string of <I>n</I> <FONT FACE="Courier New" SIZE=2>a</FONT>'s) and the pattern <FONT FACE="Courier New" SIZE=2>a<I><SUP><FONT FACE="Courier New" SIZE=1>m</FONT></I></FONT></SUP>. For each of the <I>n</I> - <I>m</I> + 1 possible values of the shift <I>s</I>, the implicit loop on line 4 to compare corresponding characters must execute <I>m</I> times to validate the shift. The worst-case running time is thus <img src="../images/bound.gif">((<I>n</I> - <I>m</I> + 1)<I>m</I>), which is <img src="../images/bound.gif">(<I>n</I><SUP>2</SUP>) if <I>m</I> = <FONT FACE="Times New Roman" SIZE=2><img src="../images/hfbrdl12.gif"><I>n</I></FONT>/2<FONT FACE="Times New Roman" SIZE=2><img src="../images/hfbrdr12.gif"></FONT>.<P>
As we shall see, <FONT FACE="Courier New" SIZE=2>NAIVE-</FONT><FONT FACE="Courier New" SIZE=2>STRING</FONT>-<FONT FACE="Courier New" SIZE=2>MATCHER </FONT>is not an optimal procedure for this problem. Indeed, in this chapter we shall show an algorithm with a worst-case running time of <I>O</I>(<I>n </I>+ <I>m</I>). The naive string-matcher is inefficient because information gained about the text for one value of <I>s</I> is totally ignored in considering other values of <I>s</I>. Such information can be very valuable, however. For example, if <I>P</I> = <FONT FACE="Courier New" SIZE=2>aaab</FONT> and we find that <I>s</I> = 0 is valid, then none of the shifts 1, 2, or 3 are valid, since <I>T</I>[4] = b. In the following sections, we examine several ways to make effective use of this sort of information.<P>
<h2><a name="09cb_1bd4">Exercises<a name="09cb_1bd4"></h2><P>
<a name="09cb_1bd5">34.1-1<a name="09cb_1bd5"><P>
Show the comparisons the naive string matcher makes for the pattern <I>P</I> = <FONT FACE="Courier New" SIZE=2>0001</FONT> in the text <I>T</I> = <FONT FACE="Courier New" SIZE=2>000010001010001</FONT>.<P>
<a name="09cb_1bd6">34.1-2<a name="09cb_1bd6"><P>
Show that the worst-case time for the naive string matcher to find the <I>first</I> occurrence of a pattern in a text is <img src="../images/bound.gif">((<I>n</I> - <I>m</I> + 1)(<I>m</I> - 1)).<P>
<a name="09cb_1bd7">34.1-3<a name="09cb_1bd7"><P>
Suppose that all characters in the pattern <I>P</I> are different. Show how to accelerate N<FONT FACE="Courier New" SIZE=2>AIVE-</FONT><FONT FACE="Courier New" SIZE=2>STRING</FONT>-<FONT FACE="Courier New" SIZE=2>MATCHER</FONT> to run in time <I>O</I>(<I>n</I>) on an <I>n</I>-character text <I>T</I>.<P>
<a name="09cb_1bd8">34.1-4<a name="09cb_1bd8"><P>
Suppose that pattern <I>P</I> and text <I>T</I> are <I>randomly</I> chosen strings of length <I>m</I> and <I>n</I>, respectively, from the <I>d</I>-ary alphabet <img src="../images/sum14.gif"><I><SUB>d</I></SUB> = {0, 1, . . . , <I>d</I> - 1},<I> </I>where<I> d</I> <img src="../images/gteq.gif"> 2. Show that the <I>expected</I> number of character-to-character comparisons made by the implicit loop in line 4 of the naive algorithm is<P>
<img src="857_a.gif"><P>
(Assume that the naive algorithm stops comparing characters for a given shift once a mismatch is found or the entire pattern is matched.) Thus, for randomly chosen strings, the naive algorithm is quite efficient.<P>
<a name="09cb_1bd9">34.1-5<a name="09cb_1bd9"><P>
<a name="09cb_1bd2"><a name="09cb_1bd3">Suppose we allow the pattern <I>P</I> to contain occurrences of a <I><B>gap character </I></B><img src="857_b.gif"> that can match an <I>arbitrary</I> string of characters (even one of zero length). For example, the pattern <img src="857_c.gif"> occurs in the text <FONT FACE="Courier New" SIZE=2>cabccbacbacab</FONT> as<P>
<img src="857_d.gif"><P>
Note that the gap character may occur an arbitrary number of times in the pattern but is assumed not to occur at all in the text. Give a polynomial-time algorithm to determine if such a pattern <I>P</I> occurs in a given text <I>T</I>, and analyze the running time of your algorithm.<P>
<P>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -