📄 chap34.htm
字号:
and Bob similarly evaluates <I>B</I>(<I>x</I>). Prove that if <I>A</I> <img src="../images/noteq.gif"> <I>B</I>, there is at most one chance in 1000 that <I>A</I>(<I>x</I>) = <I>B</I>(<I>x</I>), whereas if the two files are the same, <I>A</I>(<I>x</I>) is necessarily the same as <I>B</I>(<I>x</I>). (<I>Hint</I>: See Exercise 33.4-4.)<P>
<P>
<P>
<h1><a name="09ce_1bdb">34.3 String matching with finite automata<a name="09ce_1bdb"></h1><P>
<a name="09ce_1bda">Many string-matching algorithms build a finite automaton that scans the text string <I>T</I> for all occurrences of the pattern <I>P</I>. This section presents a method for building such an automaton. These string-matching automata are very efficient: they examine each text character <I>exactly once</I>, taking constant time per text character. The time used--after the automaton is built--is therefore <img src="../images/bound.gif">(<I>n</I>). The time to build the automaton, however, can be large if <img src="../images/sum14.gif"> is large. Section 34.4 describes a clever way around this problem.<P>
We begin this section with the definition of a finite automaton. We then examine a special string-matching automaton and show how it can be used to find occurrences of a pattern in a text. This discussion includes details on how to simulate the behavior of a string-matching automaton on a given text. Finally, we shall show how to construct the string-matching automaton for a given input pattern.<P>
<h2>Finite automata</h2><P>
<a name="09cf_1bdb"><a name="09cf_1bdc"><a name="09cf_1bdd">A <I><B>finite automaton</I></B> <I>M</I> is a 5-tuple (<I>Q, q</I><SUB>0</SUB>, <I>A</I>, <img src="../images/sum14.gif">, <img src="../images/delta12.gif">), where<P>
<FONT FACE="Courier New" SIZE=2><img src="../images/dot12.gif"></FONT> <I>Q</I> is a finite set of <B>states</B>,<P>
<a name="09cf_1bde"><FONT FACE="Courier New" SIZE=2><img src="../images/dot12.gif"></FONT> <I>q</I><SUB>0</SUB> <img src="../images/memof12.gif"> <I>Q</I> is the <B>start state</B>,<P>
<a name="09cf_1bdf"><FONT FACE="Courier New" SIZE=2><img src="../images/dot12.gif"></FONT> <I>A</I> <img src="../images/rgtubar.gif"> <I>Q</I> is a distinguished set of <B>accepting states</B>,<P>
<a name="09cf_1be0"><a name="09cf_1be1"><FONT FACE="Courier New" SIZE=2><img src="../images/dot12.gif"></FONT> <img src="../images/sum14.gif"> is a finite <B>input alphabet</B>,<P>
<a name="09cf_1be2"><FONT FACE="Courier New" SIZE=2><img src="../images/dot12.gif"></FONT> <img src="../images/delta12.gif"> is a function from <I>Q</I> X <img src="../images/sum14.gif"> into <I>Q</I>, called the <B>transition function</B> of <I>M</I>.<P>
<img src="863_a.gif"><P>
<h4><a name="09cf_1be7">Figure 34.5 A simple two-state finite automaton with state set Q = {0, 1}, start state q<SUB>0</SUB> = 0, and input alphabet <img src="../images/sum14.gif"> = {a, b}. (a) A tabular representation of the transition function <img src="../images/delta12.gif">. (b) An equivalent state-transition diagram. State 1 is the only accepting state (shown blackened). Directed edges represent transitions. For example, the edge from state 1 to state 0 labeled b indicates <img src="../images/delta12.gif">(1, b) = 0. This automaton accepts those strings that end in an odd number of a's. More precisely, a string x is accepted if and only if x = yz, where y = <img src="../images/epsilon.gif"> or y ends with a b, and z = a<SUP>k</SUP><FONT FACE="Times New Roman" SIZE=2></FONT>, where k is odd. For example, the sequence of states this automaton enters for input abaaa (including the start state) is <img src="../images/lftwdchv.gif">0, 1, 0, 1, 0, 1<img src="../images/wdrtchv.gif">, and so it accepts this input. For input abbaa, the sequence of states is <img src="../images/lftwdchv.gif">0, 1, 0, 0, 1, 0<img src="../images/wdrtchv.gif">, and so it rejects this input.<a name="09cf_1be7"></sub></sup></h4><P>
<a name="09cf_1be3"><a name="09cf_1be4"><a name="09cf_1be5">The finite automaton begins in state <I>q</I><SUB>0</SUB> and reads the characters of its input string one at a time. If the automaton is in state <I>q</I> and reads input character <I>a</I>, it moves ("makes a transition") from state <I>q</I> to state <img src="../images/delta12.gif"> (<I>q, a</I>). Whenever its current state <I>q</I> is a member of <I>A</I>, the machine <I>M</I> is said to have <I><B>accepted</I></B> the string read so far. An input that is not accepted is said to be <I><B>rejected</I></B>. Figure 34.5 illustrates these definitions with a simple two-state automaton.<P>
<a name="09cf_1be6">A finite automaton <I>M</I> induces a function ø, called the <I><B>final-state function</I></B>, from <img src="../images/sum14.gif">* to <I>Q</I> such that ø(<I>w</I>) is the state <I>M</I> ends up in after scanning the string <I>w</I>. Thus, <I>M</I> accepts a string <I>w</I> if and only if ø(<I>w</I>) <img src="../images/memof12.gif"> <I>A</I>. The function ø is defined by the recursive relation<P>
<pre>ø(<img src="../images/epsilon.gif"><I>) = </I>q<SUB><I></SUB><img src="../images/omicr.gif"> </I><SUB>,</sub></sup></pre><P>
<pre>ø(<I>wa</I>) = <img src="../images/delta12.gif"><I>(ø(</I>w<I>), </I>a<I>) for </I>w<I> <img src="../images/memof12.gif"> <img src="../images/sum14.gif">*, </I>a<I> <img src="../images/memof12.gif"> <img src="../images/sum14.gif">.</I></sub></sup></pre><P>
<P>
<h2>String-matching automata</h2><P>
<a name="09d0_1be7"><a name="09d0_1be8">There is a string-matching automaton for every pattern <I>P</I>; this automaton must be constructed from the pattern in a preprocessing step before it can be used to search the text string. Figure 34.6 illustrates this construction for the pattern <I>P</I> = <FONT FACE="Courier New" SIZE=2>ababaca</FONT>. From now on, we shall assume that <I>P</I> is a given fixed pattern string; for brevity, we shall not indicate the dependence upon <I>P</I> in our notation.<P>
<img src="864_a.gif"><P>
<h4><a name="09d0_1bed">Figure 34.6 (a) A state-transition diagram for the string-matching automaton that accepts all strings ending in the string ababaca. State 0 is the start state, and state 7 (shown blackened) is the only accepting state. A directed edge from state i to state j labeled a represents <img src="../images/delta12.gif">(i,a) = j. The right-going edges forming the "spine" of the automaton, shown heavy in the figure, correspond to successful matches between pattern and input characters. The left-going edges correspond to failing matches. Some edges corresponding to failing matches are not shown; by convention, if a state i has no outgoing edge labeled a for some a <img src="../images/memof12.gif"> <img src="../images/sum14.gif">, then <img src="../images/delta12.gif">(i,a) = 0. (b) The corresponding transition function <img src="../images/delta12.gif">, and the pattern string P = <FONT FACE="Courier New" SIZE=2>ababaca</FONT>. The entries corresponding to successful matches between pattern and input characters are shown shaded. (c) The operation of the automaton on the text T = <FONT FACE="Courier New" SIZE=2>abababacaba</FONT>. Under each text character T[i] is given the state ø(T<SUB>i</SUB>) the automaton is in after processing the prefix T<SUB>i</SUB>. One occurrence of the pattern is found, ending in position 9.<a name="09d0_1bed"></sub></sup></h4><P>
<a name="09d0_1be9">In order to specify the string-matching automaton corresponding to a given pattern <I>P</I>[1 . . <I>m</I>], we first define an auxiliary function <img src="../images/sum14.gif"><I></I>, called the <I><B>suffix function</I></B> corresponding to <I>P</I>. The function <img src="../images/sum14.gif"><I> </I>is a mapping from <img src="../images/sum14.gif">* to {0, 1, . . . , <I>m</I>} such that <img src="../images/sum14.gif"><I></I>(<I>x</I>) is the length of the longest prefix of <I>P</I> that is a suffix of <I>x</I>:<P>
<img src="865_a.gif"><P>
The suffix function <img src="../images/sum14.gif"> is well defined since the empty string <I>P</I><SUB>0</SUB> = <img src="../images/epsilon.gif"><I></I> is a suffix of every string. As examples, for the pattern <I>P</I> = <FONT FACE="Courier New" SIZE=2>ab</FONT>, we have <img src="../images/sum14.gif"><I>(<img src="../images/epsilon.gif"></I>) = 0, <img src="../images/sum14.gif">(<FONT FACE="Courier New" SIZE=2>ccaca</FONT>) = 1, and <img src="../images/sum14.gif"><I></I>(<FONT FACE="Courier New" SIZE=2>ccab</FONT>) = 2. For a pattern <I>P</I> of length <I>m</I>, we have <img src="../images/sum14.gif"><I>(</I>x<I>) = </I><I>m</I> if and only if <img src="865_b.gif">. It follows from the definition of the suffix function that if <img src="865_c.gif">, then <img src="../images/sum14.gif"><I>(</I>x<I>) <img src="../images/lteq12.gif"> <img src="../images/sum14.gif"></I>(<I>y</I>).<P>
We define the string-matching automaton corresponding to a given pattern P[1 . . <I>m</I>] as follows.<P>
<FONT FACE="Courier New" SIZE=2><img src="../images/dot12.gif"></FONT> The state set <I>Q</I> is {0, 1, . . . , <I>m</I>}. The start state <I>q</I><SUB>0</SUB> is state 0, and state<I> m</I> is the only accepting state.<P>
<FONT FACE="Courier New" SIZE=2><img src="../images/dot12.gif"></FONT> The transition function <img src="../images/delta12.gif"><I></I> is defined by the following equation, for any state <I>q</I> and character <I>a</I>:<P>
<pre><img src="../images/delta12.gif"> (<I>q</I>, <I>a</I>) = <img src="../images/sum14.gif">(<I>P<SUB>q</SUB>a</I>) .</sub></sup></pre><P>
<h4><a name="09d0_1bee">(34.3)<a name="09d0_1bee"></sub></sup></h4><P>
Here is an intuitive rationale for defining <img src="../images/delta12.gif"> <I>(</I>q, a<I>) = <img src="../images/sum14.gif"></I>(<I>P<SUB>q </SUB>a</I>). The machine maintains as an invariant of its operation that<P>
<pre><img src="../images/phicap12.gif">(<I>T<SUB>i</I></SUB>) = <img src="../images/sum14.gif">(<I>T<SUB>i</I></SUB>) ;</sub></sup></pre><P>
<h4><a name="09d0_1bef">(34.4)<a name="09d0_1bef"></sub></sup></h4><P>
this result is proved as Theorem 34.4 below. In words, this means that after scanning the first<I> i</I> characters of the text string <I>T</I>, the machine is in state <img src="../images/phicap12.gif"><I></I>(<I>T<SUB>i</I></SUB>)<I> = q</I>, where <I>q</I> = <img src="../images/sum14.gif"><I> </I>(<I>T<SUB>i</SUB>)</I> is the length of the longest suffix of <I>T<SUB>i</I></SUB> that is also a prefix of the pattern <I>P</I>. If the next character scanned is <I>T</I>[<I>i</I> + 1] = <I>a</I>, then the machine should make a transition to state <img src="../images/sum14.gif"><I>(</I>T<SUB>i <I></SUB>+ 1) = <img src="../images/sum14.gif"></I>(<I>T<SUB>i</SUB>a</I>). The proof of the theorem shows that <img src="../images/sum14.gif"><I>(</I>T<SUB>i</SUB>a<I>) = <img src="../images/sum14.gif"></I>(<I>P<SUB>q</SUB>a</I>). That is, to compute the length of the longest suffix of <I>T<SUB>i</SUB>a</I> that is a prefix of <I>P</I>, we can compute the longest suffix of <I>P<SUB>q</SUB>a</I> that is a prefix of <I>P</I>. At each state, the machine only needs to know the length of the longest prefix of <I>P</I> that is a suffix of what has been read so far. Therefore, setting <img src="../images/delta12.gif">(q, a) = <I><img src="../images/sum14.gif">(P<SUB>q</SUB>a)</I> maintains the desired invariant (34.4). This informal argument will be made rigorous shortly.<P>
In the string-matching automaton of Figure 34.6, for example, we have <img src="../images/delta12.gif"><I></I>(5, <FONT FACE="Courier New" SIZE=2>b</FONT>) = 4. This follows from the fact that if the automaton reads a b in state <I>q</I> = 5, then <I>P<SUB>q</I></SUB><FONT FACE="Courier New" SIZE=1>b</FONT> = <FONT FACE="Courier New" SIZE=2>ababab</FONT>, and the longest prefix of <I>P</I> that is also a suffix of <FONT FACE="Courier New" SIZE=2>ababab</FONT> is <I>P</I><SUB>4</SUB> = <FONT FACE="Courier New" SIZE=2>abab</FONT>.<P>
To clarify the operation of a string-matching automaton, we now give a simple, efficient program for simulating the behavior of such an automaton (represented by its transition function <img src="../images/delta12.gif"><I></I>) in finding occurrences of a pattern <I>P</I> of length <I>m</I> in an input text <I>T</I>[1 . . <I>n</I>]. As for any string-matching automaton for a pattern of length <I>m</I>, the state set <I>Q</I> is {0,1, . . . , <I>m</I>}, the start state is 0, and the only accepting state is state <I>m</I>.<P>
<img src="866_a.gif"><P>
<h4><a name="09d0_1bf0">Figure 34.7 An illustration for the proof of Lemma 34.2. The figure shows that r <img src="../images/lteq12.gif"> <img src="../images/sum14.gif">(x) + 1, where r = <img src="../images/sum14.gif">(xa).<a name="09d0_1bf0"></sub></sup></h4><P>
<pre><a name="09d0_1bea">FINITE-AUTOMATON-MATCHER(<I>T, </I><img src="../images/delta12.gif">,<I> </I>m<I>)</I></sub></sup></pre><P>
<pre>1 <I>n</I> <img src="../images/arrlt12.gif"> <I>length</I>[<I>T</I>]</sub></sup></pre><P>
<pre>2 <I>q</I> <img src="../images/arrlt12.gif"> 0</sub></sup></pre><P>
<pre>3<B> for</B> <I>i</I> <img src="../images/arrlt12.gif"> 1 <B>to</B> <I>n</I></sub></sup></pre><P>
<pre>4 <B>do</B> <I>q</I> <img src="../images/arrlt12.gif"> <img src="../images/delta12.gif"><I> (</I>q, T<I>[</I>i<I>])</I></sub></sup></pre><P>
<pre>5<B> if</B> <I>q</I> = <I>m</I></sub></sup></pre><P>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -