Suppose we have a 100,000-character data file that we wish to store compactly. We observe that the characters in the file occur with the frequencies given by Figure 17.3. That is, only six different characters appear, and the character `a` occurs 45,000 times.
<a name="082f_1591"><a name="082f_1592"><a name="082f_1593"><a name="082f_1594">There are many ways to represent such a file of information. We consider the problem of designing a <I><B>binary character code</I></B> (or <I><B>code</I></B> for short) wherein each character is represented by a unique binary string. If we use a <I><B>fixed-length code</I></B>, we need 3 bits to represent six characters: <FONT FACE="Courier New" SIZE=2>a</FONT> = 000, <FONT FACE="Courier New" SIZE=2>b</FONT> = 001, . . . , <FONT FACE="Courier New" SIZE=2>f</FONT> = 101. This method requires 300,000 bits to code the entire file. Can we do better?<P>
```
                           a     b     c     d     e     f
----------------------------------------------------------
Frequency (in thousands)  45    13    12    16     9     5
Fixed-length codeword    000   001   010   011   100   101
Variable-length codeword   0   101   100   111  1101  1100
```

**Figure 17.3** A character-coding problem. A data file of 100,000 characters contains only the characters `a`-`f`, with the frequencies indicated. If each character is assigned a 3-bit codeword, the file can be encoded in 300,000 bits. Using the variable-length code shown, the file can be encoded in 224,000 bits.
<a name="082f_1595">A<I> <B>variable-length code</I></B> can do considerably better than a fixed-length code, by giving frequent characters short codewords and infrequent characters long codewords. Figure 17.3 shows such a code; here the 1-bit string 0 represents <FONT FACE="Courier New" SIZE=2>a</FONT>, and the 4-bit string 1100 represents <FONT FACE="Courier New" SIZE=2>f</FONT>. This code requires (45 <img src="../images/dot10.gif"> 1 + 13 <img src="../images/dot10.gif"> 3 + 12 <img src="../images/dot10.gif"> 3 + 16 <img src="../images/dot10.gif"> 3 + 9 <img src="../images/dot10.gif"> 4 + 5 <img src="../images/dot10.gif"> 4) <img src="../images/dot10.gif"> 1,000 = 224,000 bits to represent the file, a savings of approximately 25%. In fact, this is an optimal character code for this file, as we shall see.<P>





## Prefix codes
<a name="0830_1596">We consider here only codes in which no codeword is also a prefix of some other codeword. Such codes are called <I><B>prefix codes.</I></B><SUP>1</SUP> It is possible to show (although we won't do so here) that the optimal data compression achievable by a character code can always be achieved with a prefix code, so there is no loss of generality in restricting attention to prefix codes.<P>
¹ Perhaps "prefix-free codes" would be a better name, but the term "prefix codes" is standard in the literature.
Prefix codes are desirable because they simplify encoding (compression) and decoding. Encoding is always simple for any binary character code; we just concatenate the codewords representing each character of the file. For example, with the variable-length prefix code of Figure 17.3, we code the 3-character file `abc` as 0 · 101 · 100 = 0101100, where we use "·" to denote concatenation.
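Concretely, encoding is a table lookup plus concatenation; a minimal sketch (our illustration, using the variable-length code of Figure 17.3):

```python
code = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}

def encode(text):
    """Concatenate the codeword of each character in the file."""
    return ''.join(code[ch] for ch in text)

print(encode('abc'))   # '0101100', as in the text
```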
Decoding is also quite simple with a prefix code. Since no codeword is a prefix of any other, the codeword that begins an encoded file is unambiguous. We can simply identify the initial codeword, translate it back to the original character, remove it from the encoded file, and repeat the decoding process on the remainder of the encoded file. In our example, the string 001011101 parses uniquely as 0 · 0 · 101 · 1101, which decodes to `aabe`.
The decoding process needs a convenient representation for the prefix code so that the initial codeword can be easily picked off. A binary tree whose leaves are the given characters provides one such representation. We interpret the binary codeword for a character as the path from the root to that character, where 0 means "go to the left child" and 1 means "go to the right child." Figure 17.4 shows the trees for the two codes of our example. Note that these are not binary search trees, since the leaves need not appear in sorted order and internal nodes do not contain character keys.
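The sketch below (ours, not the book's) represents the tree of Figure 17.4(b) as nested pairs, a leaf being a character and an internal node a (left, right) pair, and decodes by walking from the root exactly as just described:

```python
# Tree for the variable-length code: 0 = go left, 1 = go right.
tree = ('a', (('c', 'b'), (('f', 'e'), 'd')))

def decode(bits):
    """Repeatedly walk root-to-leaf, emitting one character per codeword."""
    out, node = [], tree
    for b in bits:
        node = node[0] if b == '0' else node[1]
        if isinstance(node, str):   # reached a leaf: one codeword consumed
            out.append(node)
            node = tree             # restart at the root
    return ''.join(out)

print(decode('001011101'))  # 'aabe', as in the text
```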
<a name="0830_1597">An optimal code for a file is always represented by a <I>full</I> binary tree, in which every nonleaf node has two children (see Exercise 17.3-1). The fixed-length code in our example is not optimal since its tree, shown in Figure 17.4(a), is not a full binary tree: there are codewords beginning 10 . . . , but none beginning 11 . . . . Since we can now restrict our attention to full binary trees, we can say that if <I>C</I> is the alphabet from which the characters are drawn, then the tree for an optimal prefix code has exactly <I>|C| leaves, one for each letter of the alphabet, and exactly |</I>C| - 1 internal nodes.<P>
<img src="339_a.gif"><P>
<h4><a name="0830_1598">Figure 17.4 Trees corresponding to the coding schemes in Figure 17.3. Each leaf is labeled with a character and its frequency of occurrence. Each internal node is labeled with the sum of the weights of the leaves in its subtree. (a) The tree corresponding to the fixed-length code <FONT FACE="Courier New" SIZE=2>a<FONT FACE="Times New Roman" SIZE=2> = 000, . . . , <FONT FACE="Courier New" SIZE=2>f<FONT FACE="Times New Roman" SIZE=2> = 100. (b) The tree corresponding to the optimal prefix code <FONT FACE="Courier New" SIZE=2>a<FONT FACE="Times New Roman" SIZE=2> = 0, <FONT FACE="Courier New" SIZE=2>b<FONT FACE="Times New Roman" SIZE=2> = 101, . . . , <FONT FACE="Courier New" SIZE=2>f<FONT FACE="Times New Roman" SIZE=2> = 1100<a name="0830_1598"></FONT></FONT></FONT></FONT></FONT></FONT></FONT></FONT></FONT></FONT></sub></sup></h4><P>
Given a tree T corresponding to a prefix code, it is a simple matter to compute the number of bits required to encode a file. For each character c in the alphabet C, let f(c) denote the frequency of c in the file and let d_T(c) denote the depth of c's leaf in the tree. Note that d_T(c) is also the length of the codeword for character c. The number of bits required to encode a file is thus
<img src="339_b.gif"><P>
<h4><a name="0830_1599">(17.3)<a name="0830_1599"></sub></sup></h4><P>
which we define as the **cost** of the tree T.
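Under the same nested-pair tree representation as above (again our own sketch, not the book's), equation (17.3) can be evaluated by a single traversal that tracks each leaf's depth:

```python
freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}   # in thousands
tree = ('a', (('c', 'b'), (('f', 'e'), 'd')))                 # Figure 17.4(b)

def cost(node, depth=0):
    """B(T): sum of f(c) * d_T(c) over the leaves, per equation (17.3)."""
    if isinstance(node, str):                  # leaf: contributes f(c) * depth
        return freq[node] * depth
    return cost(node[0], depth + 1) + cost(node[1], depth + 1)

print(cost(tree) * 1000)   # 224000 bits, matching the earlier total
```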







## Constructing a Huffman code
Huffman invented a greedy algorithm that constructs an optimal prefix code called a **Huffman code**. The algorithm builds the tree T corresponding to the optimal code in a bottom-up manner. It begins with a set of |C| leaves and performs a sequence of |C| - 1 "merging" operations to create the final tree.
<a name="0831_1598">In the pseudocode that follows, we assume that <I>C</I> is a set of <I>n</I> characters and that each character <I>c</I> <img src="../images/memof12.gif"> <I>C</I> is an object with a defined frequency <I>f</I>[<I>c</I>]. A priority queue <I>Q</I>, keyed on <I>f</I>, is used to identify the two least-frequent objects to merge together. The result of the merger of two objects is a new object whose frequency is the sum of the frequencies of the two objects that were merged.<P>
<pre><a name="0831_1599">HUFFMAN(<I>C</I>)</sub></sup></pre><P>
<pre>1 <I>n</I> <img src="../images/arrlt12.gif"> |<I>C|</I></sub></sup></pre><P>
<pre><I>2 Q</I> <img src="../images/arrlt12.gif"> <I>C</I></sub></sup></pre><P>
<pre>3 <B>for</B><I> i</I> <img src="../images/arrlt12.gif"> 1 <B>to</B> <I>n - </I>1</sub></sup></pre><P>
<pre>4      <B>do</B> <I>z</I> <img src="../images/arrlt12.gif"> ALLOCATE-NODE()</sub></sup></pre><P>
<pre>5         <I>x</I> <img src="../images/arrlt12.gif"> <I>left</I>[<I>z</I>] <img src="../images/arrlt12.gif"> EXTRACT-MIN(<I>Q</I>)</sub></sup></pre><P>
<pre>6         y <img src="../images/arrlt12.gif"> <I>right</I>[<I>z</I>] <img src="../images/arrlt12.gif"> EXTRACT-MIN(<I>Q</I>)</sub></sup></pre><P>
<pre>7         <I>f</I>[<I>z</I>] <img src="../images/arrlt12.gif"> <I>f</I>[<I>x</I>] + <I>f</I>[<I>y</I>]</sub></sup></pre><P>
<pre>8         INSERT(<I>Q</I>, <I>z</I>)</sub></sup></pre><P>
<pre>9 <B>return</B><I> </I>EXTRACT-MIN(<I>Q</I>)</sub></sup></pre><P>
For our example, Huffman's algorithm proceeds as shown in Figure 17.5. Since there are 6 letters in the alphabet, the initial queue size is n = 6, and 5 merge steps are required to build the tree. The final tree represents the optimal prefix code. The codeword for a letter is the sequence of edge labels on the path from the root to the letter.
Line 2 initializes the priority queue Q with the characters in C. The **for** loop in lines 3-8 repeatedly extracts the two nodes x and y of lowest frequency from the queue, and replaces them in the queue with a new node z representing their merger. The frequency of z is computed as the sum of the frequencies of x and y in line 7. The node z has x as its left child and y as its right child. (This order is arbitrary; switching the left and right child of any node yields a different code of the same cost.) After n - 1 mergers, the one node left in the queue (the root of the code tree) is returned in line 9.
The analysis of the running time of Huffman's algorithm assumes that Q is implemented as a binary heap (see Chapter 7). For a set C of n characters, the initialization of Q in line 2 can be performed in O(n) time using the BUILD-HEAP procedure in Section 7.3. The **for** loop in lines 3-8 is executed exactly n - 1 times, and since each heap operation requires time O(lg n), the loop contributes O(n lg n) to the running time. Thus, the total running time of HUFFMAN on a set of n characters is O(n lg n).
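To make the pseudocode concrete, here is a short Python rendering of HUFFMAN (a sketch under our own conventions, not the book's code) that uses the standard-library heapq module as the priority queue; a running counter breaks frequency ties, which the pseudocode leaves unspecified:

```python
import heapq
from itertools import count

def huffman(freq):
    """Return the root of a Huffman code tree for the frequency map freq.
    A leaf is a character; an internal node is a (left, right) pair."""
    tie = count()   # tie-breaker so the heap never compares two trees
    q = [(f, next(tie), ch) for ch, f in freq.items()]   # line 2: Q <- C
    heapq.heapify(q)                       # O(n), like BUILD-HEAP
    for _ in range(len(freq) - 1):         # lines 3-8: n - 1 mergers
        fx, _, x = heapq.heappop(q)        # EXTRACT-MIN: least frequent
        fy, _, y = heapq.heappop(q)        # EXTRACT-MIN: second least
        heapq.heappush(q, (fx + fy, next(tie), (x, y)))  # merged node z
    return q[0][2]                         # line 9: only the root remains

def codewords(node, prefix=''):
    """Read the code off the tree: 0 for a left edge, 1 for a right edge."""
    if isinstance(node, str):
        return {node: prefix or '0'}       # guard for a one-character alphabet
    return {**codewords(node[0], prefix + '0'),
            **codewords(node[1], prefix + '1')}

freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
code = codewords(huffman(freq))
print(sum(freq[c] * len(code[c]) for c in freq) * 1000)   # 224000 bits
```

The left/right assignments may differ from Figure 17.3, but as noted above, swapping the children of any node yields a different code of the same cost.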







## Correctness of Huffman's algorithm
<a name="0832_159a">To prove that the greedy algorithm <FONT FACE="Courier New" SIZE=2>HUFFMAN</FONT> is correct, we show that the problem of determining an optimal prefix code exhibits the greedy-choice and optimal-substructure properties. The next lemma shows that the greedy-choice property holds.<P>
<img src="341_a.gif"><P>
<h4><a name="0832_159c">Figure 17.5 The steps of Huffman's algorithm for the frequencies given in Figure 17.3. Each part shows the contents of the queue sorted into increasing order by frequency. At each step, the two trees with lowest frequencies are merged. Leaves are shown as rectangles containing a character and its frequency. Internal nodes are shown as circles containing the sum of the frequencies of its children. An edge connecting an internal node with its children is labeled 0 if it is an edge to a left child and 1 if it is an edge to a right child. The codeword for a letter is the sequence of labels on the edges connecting the root to the leaf for that letter. (a) The initial set of n = 6 nodes, one for each letter. (b)-(e) Intermediate stages.(f) The final tree.<a name="0832_159c"></sub></sup></h4><P>
<img src="342_a.gif"><P>
<h4><a name="0832_159d">Figure 17.6 An illustration of the key step in the proof of Lemma 17.2. In the optimal tree T, leaves b and c are two of the deepest leaves and are siblings. Leaves x and y are the two leaves that Huffman's algorithm merges together first; they appear in arbitrary positions in T. Leaves b and x are swapped to obtain tree T'. Then, leaves c and y are swapped to obtain tree T\". Since each swap does not increase the cost, the resulting tree T\" is also an optimal tree.<a name="0832_159d"></sub></sup></h4><P>
<a name="0832_159e">Lemma 17.2<a name="0832_159e"><P>
<a name="0832_159b">Let <I>C</I> be an alphabet in which each character <I>c</I> <img src="../images/memof12.gif"> <I>C</I> has frequency <I>f</I>[<I>c</I>]. Let <I>x</I> and <I>y</I> be two characters in <I>C</I> having the lowest frequencies. Then there exists an optimal prefix code for <I>C</I> in which the codewords for <I>x </I>and <I>y </I>have the same length and differ only in the last bit.<P>
*Proof* The idea of the proof is to take the tree T representing an arbitrary optimal prefix code and modify it to make a tree representing another optimal prefix code such that the characters x and y appear as sibling leaves of maximum depth in the new tree. If we can do this, then their codewords will have the same length and differ only in the last bit.
Let b and c be two characters that are sibling leaves of maximum depth in T. Without loss of generality, we assume that f[b] ≤ f[c] and f[x] ≤ f[y]. Since f[x] and f[y] are the two lowest leaf frequencies, in order, and f[b] and f[c] are two arbitrary frequencies, in order, we have f[x] ≤ f[b] and f[y] ≤ f[c]. As shown in Figure 17.6, we exchange the positions in T of b and x to produce a tree T', and then we exchange the positions in T' of c and y to produce a tree T''. By equation (17.3), the difference in cost between T and T' is
<img src="342_b.gif"><P>
because both f[b] - f[x] and d_T(b) - d_T(x) are nonnegative. More specifically, f[b] - f[x] is nonnegative because x is a minimum-frequency leaf, and d_T(b) - d_T(x) is nonnegative because b is a leaf of maximum depth in T. Similarly, because exchanging y and c does not increase the cost, B(T') - B(T'') is nonnegative. Therefore, B(T'') ≤ B(T), and since T is optimal, B(T) ≤ B(T''), which implies B(T'') = B(T). Thus, T'' is an optimal tree in which x and y appear as sibling leaves of maximum depth, from which the lemma follows.
Lemma 17.2 implies that the process of building up an optimal tree by mergers can, without loss of generality, begin with the greedy choice of merging together those two characters of lowest frequency. Why is this a greedy choice? We can view the cost of a single merger as being the sum of the frequencies of the two items being merged. Exercise 17.3-3 shows that the total cost of the tree constructed is the sum of the costs of its mergers. Of all possible mergers at each step, HUFFMAN chooses the one that incurs the least cost.
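For the frequencies of Figure 17.3, this view of the cost is easy to check by hand: the five mergers performed in Figure 17.5 cost, in thousands,

$$
(5 + 9) + (12 + 13) + (14 + 16) + (25 + 30) + (45 + 55) = 14 + 25 + 30 + 55 + 100 = 224\,,
$$

which agrees with the 224,000-bit total computed from equation (17.3).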
The next lemma shows that the problem of constructing optimal prefix codes has the optimal-substructure property.
<a name="0832_159f">Lemma 17.3<a name="0832_159f"><P>
Let T be a full binary tree representing an optimal prefix code over an alphabet C, where frequency f[c] is defined for each character c ∈ C. Consider any two characters x and y that appear as sibling leaves in T, and let z be their parent. Then, considering z as a character with frequency f[z] = f[x] + f[y], the tree T' = T - {x, y} represents an optimal prefix code for the alphabet C' = C - {x, y} ∪ {z}.
*Proof* We first show that the cost B(T) of tree T can be expressed in terms of the cost B(T') of tree T' by considering the component costs in equation (17.3). For each c ∈ C - {x, y}, we have d_T(c) = d_{T'}(c), and hence f[c] d_T(c) = f[c] d_{T'}(c). Since d_T(x) = d_T(y) = d_{T'}(z) + 1, we have
$$
f[x]\, d_T(x) + f[y]\, d_T(y) = (f[x] + f[y])(d_{T'}(z) + 1)
$$
