📄 chap12.htm

📁 程序设计的艺术,此书是英文版的,经久不衰,非常的经典
💻 HTM
📖 第 1 页 / 共 5 页
字号:





<h2><a name="07cb_144d">12.3.1 The division method<a name="07cb_144d"></h2><P>
<a name="07cb_144b"><a name="07cb_144c">In the <I><B>division method</I></B> for creating hash functions, we map a key <I>k</I> into one of <I>m</I> slots by taking the remainder of <I>k</I> divided by <I>m</I>. That is, the hash function is<P>
<pre><I>h</I>(<I>k</I>) = <I>k</I> mod <I>m</I> .</sub></sup></pre><P>
For example, if the hash table has size <I>m</I> = 12 and the key is <I>k</I> = 100, then <I>h(k)</I> = 4. Since it requires only a single division operation, hashing by division is quite fast.<P>
When using the division method, we usually avoid certain values of <I>m. </I>For example, <I>m</I> should not be a power of 2, since if <I>m</I> = 2<I><SUP>p</I></SUP>, then <I>h(k) </I>is just the <I>p</I> lowest-order bits of <I>k.</I> Unless it is known a priori that the probability distribution on keys makes all low-order <I>p</I>-bit patterns equally likely, it is better to make the hash function depend on all the bits of the key. Powers of 10 should be avoided if the application deals with decimal numbers as keys, since then the hash function does not depend on all the decimal digits of <I>k</I>. Finally, it can be shown that when <I>m</I> = 2<I><SUP>p</I></SUP> - 1 and <I>k </I>is a character string interpreted in radix 2<I><SUP>p</I></SUP>, two strings that are identical except for a transposition of two adjacent characters will hash to the same value.<P>
Good values for <I>m</I> are primes not too close to exact powers of 2. For example, suppose we wish to allocate a hash table, with collisions resolved by chaining, to hold roughly <I>n</I> = 2000 character strings, where a character has 8 bits. We don't mind examining an average of 3 elements in an unsuccessful search, so we allocate a hash table of size <I>m</I> = 701. The number 701 is chosen because it is a prime near <img src="../images/alpha12.gif"> = 2000/3 but not near any power of 2. Treating each key <I>k</I> as an integer, our hash function would be<P>
<pre><I>h</I>(<I>k</I>) =<I> k</I> mod 701 .</sub></sup></pre><P>
As a precautionary measure, we could check how evenly this hash function distributes sets of keys among the slots, where the keys are chosen from "real" data.<P>
<P>







<h2><a name="07cc_144f">12.3.2 The multiplication method<a name="07cc_144f"></h2><P>
<a name="07cc_144d"><a name="07cc_144e">The <I><B>multiplication method</I></B> for creating hash functions operates in two steps. First, we multiply the key <I>k</I> by a constant <I>A</I> in the range 0 &lt; <I>A</I> &lt; 1 and extract the fractional part of <I>kA</I>. Then, we multiply this value by <I>m </I>and take the floor of the result. In short, the hash function is<P>
<pre><I>h</I>(<I>k</I>) = <img src="../images/hfbrdl12.gif"><I>m</I> (<I>k</I> <I>A</I> mod 1)<img src="../images/hfbrdr12.gif"> ,</sub></sup></pre><P>
where "<I>k A</I> mod 1" means the fractional part of <I>kA</I>, that is, <I>kA - </I><FONT FACE="Times New Roman" SIZE=2><img src="../images/hfbrdl12.gif"><I>kA</I><img src="../images/hfbrdr12.gif"></FONT>.<P>
An advantage of the multiplication method is that the value of <I>m</I> is not critical. We typically choose it to be a power of 2--<I>m</I> = 2<I><SUP>p</I></SUP> for someinteger<I> p</I>--since we can then easily implement the function on most computers as follows. Suppose that the word size of the machine is <I>w</I> bits and that <I>k</I> fits into a single word. Referring to Figure 12.4, we first multiply <I>k</I> by the <I>w</I>-bit integer <FONT FACE="Courier New" SIZE=2><img src="../images/hfbrdl12.gif"></FONT>A <img src="../images/dot10.gif"> 2<I><SUP>w</I></SUP><img src="../images/hfbrdr12.gif">. The result is a 2<I>w</I>-bit value <I>r</I><SUB>1</SUB> 2<I><SUP>w</I></SUP> + <I>r</I><SUB>0</SUB>, where <I>r</I><SUB>1</SUB> is the high-order word of the product and <I>r</I><SUB>0</SUB> is the low-order word of the product. The desired <I>p</I>-bit hash value consists of the <I>p</I> most significant bits of <I>r</I><SUB>0</SUB>.<P>
<img src="229_a.gif"><P>
<h4><a name="07cc_1450">Figure 12.4 The multiplication method of hashing. The w-bit representation of the key k is multiplied by the w-bit value <img src="../images/hfbrdl12.gif">A.2<SUP>w</SUP><img src="../images/hfbrdr12.gif"><FONT FACE="Times New Roman" SIZE=2>, </FONT>where 0 &lt; A &lt; 1 is a suitable constant. The p highest-order bits of the lower w-bit half of the product form the desired hash value h(k).<a name="07cc_1450"></sub></sup></h4><P>
Although this method works with any value of the constant <I>A</I>, it works better with some values than with others. The optimal choice depends on the characteristics of the data being hashed. Knuth [123] discusses the choice of <I>A</I> in some detail and suggests that<P>
<img src="229_b.gif"><P>
<h4><a name="07cc_1451">(12.2)<a name="07cc_1451"></sub></sup></h4><P>
is likely to work reasonably well.<P>
As an example, if we have <I>k</I> = 123456, <I>m</I> = 10000, and <I>A</I> as in equation (12.2), then<P>
<pre><I>h</I>(<I>k</I>)  =  <img src="../images/hfbrdl12.gif">10000 <img src="../images/dot10.gif"> (123456 <img src="../images/dot10.gif"> 0.61803 . . . mod 1)<img src="../images/hfbrdr12.gif"></sub></sup></pre><P>
<pre>=  <img src="../images/hfbrdl12.gif">10000 <img src="../images/dot10.gif"> (76300.0041151. . . mod 1)<img src="../images/hfbrdr12.gif"></sub></sup></pre><P>
<pre>=  <img src="../images/hfbrdl12.gif">10000 <img src="../images/dot10.gif"> 0.0041151 . . .<img src="../images/hfbrdr12.gif"></sub></sup></pre><P>
<pre>=  <img src="../images/hfbrdl12.gif">41.151 . . .<img src="../images/hfbrdr12.gif"></sub></sup></pre><P>
<pre>=  41 .</sub></sup></pre><P>
<P>







<h2><a name="07cd_1455">12.3.3 Universal hashing<a name="07cd_1455"></h2><P>
<a name="07cd_144f"><a name="07cd_1450"><a name="07cd_1451"><a name="07cd_1452">If a malicious adversary chooses the keys to be hashed, then he can choose <I>n</I> keys that all hash to the same slot, yielding an average retrieval time of <img src="../images/bound.gif"> (<I>n</I>). Any fixed hash function is vulnerable to this sort of worst-case behavior; the only effective way to improve the situation is to choose the hash function <I>randomly</I> in a way that is <I>independent</I> of the keys that are actually going to be stored. This approach, called <I><B>universal hashing</I></B>, yields good performance on the average, no matter what keys are chosen by the adversary.<P>
<a name="07cd_1453">The main idea behind universal hashing is to select the hash function at random at run time from a carefully designed class of functions. As in the case of quicksort, randomization guarantees that no single input will always evoke worst-case behavior. Because of the randomization, the algorithm can behave differently on each execution, even for the same input. This approach guarantees good average-case performance, no matter what keys are provided as input. Returning to the example of a compiler's symbol table, we find that the programmer<FONT FACE="CG Times (W1)" SIZE=2>'</FONT>s choice of identifiers cannot now cause consistently poor hashing performance. Poor performance occurs only if the compiler chooses a random hash function that causes the set of identifiers to hash poorly, but the probability of this occurring is small and is the same for any set of identifiers of the same size.<P>
Let <img src="230_a.gif"> be a finite collection of hash functions that map a given universe <I>U</I> of keys into the range {0,1, . . . , <I>m </I>- 1}. Such a collection is said to be <I><B>universal</I></B> if for each pair of distinct keys <I>x,y </I><img src="../images/memof12.gif"> U<I>, the number of hash functions <img src="230_b.gif"> for which </I>h<I>(</I>x<I>) = </I>h<I>(</I>y<I>) is precisely <img src="230_c.gif">. In other words, with a hash function randomly chosen from <img src="230_d.gif">, the chance of a collision between </I>x<I> and </I>y<I> when </I>x <I><img src="../images/noteq.gif"> y</I> is exactly 1/<I>m</I>, which is exactly the chance of a collision if <I>h</I>(<I>x</I>) and <I>h</I>(<I>y</I>) are randomly chosen from the set {0,1, . . . , <I>m</I> - 1}.<P>
The following theorem shows that a universal class of hash functions gives good average-case behavior.<P>
<a name="07cd_1456">Theorem 12.3<a name="07cd_1456"><P>
If <I>h</I> is chosen from a universal collection of hash functions and is used to hash <I>n</I> keys into a table of size <I>m</I>, where <I>n</I> <img src="../images/lteq12.gif"> <I>m</I>, the expected number of collisions involving a particular key <I>x</I> is less than 1.<P>
<I><B>Proof     </I></B>For each pair <I>y, z</I> of distinct keys, let <I>c<SUB>yz</I></SUB> be a random variable that is 1 if <I>h</I>(<I>y</I>) = <I>h</I>(<I>z</I>) (i.e., if <I>y</I> and <I>z</I> collide using <I>h</I>) and 0 otherwise. Since, by definition, a single pair of keys collides with probability 1/<I>m</I>, we have<P>
<pre>E[<I>c<SUB>yz</I></SUB>] = 1/<I>m</I> .</sub></sup></pre><P>
Let <I>C<SUB>x</I></SUB> be the total number of collisions involving key <I>x</I> in a hash table <I>T</I> of size <I>m</I> containing <I>n</I> keys. Equation (6.24) gives<P>
<img src="230_e.gif"><P>
Since <I>n</I> <img src="../images/lteq12.gif"> <I>m</I>, we have E [<I>C<SUB>x</I></SUB>] &lt; 1.      <P>
But how easy is it to design a universal class of hash functions? It is quite easy, as a little number theory will help us prove. Let us choose our table size <I>m</I> to be prime (as in the division method). We decompose a key <I>x</I> into <I>r</I>+ 1 bytes (i.e., characters, or fixed-width binary substrings), so that <I>x</I> = <img src="../images/lftwdchv.gif"><I>x</I><SUB>0</SUB>,<I> x</I><SUB>1</SUB>,<I>. . . </I>,<I> x<SUB>r</I></SUB><img src="../images/wdrtchv.gif">; the only requirement is that the maximum value of a byte should be less than <I>m</I>. Let <I>a</I> = <img src="../images/lftwdchv.gif"><I>a</I><SUB>0</SUB>,<I> a</I><SUB>1</SUB>,<I> . . .</I> ,<I> a<SUB>r</I></SUB><img src="../images/wdrtchv.gif"> denote a sequence of <I>r</I> + 1 elements chosen randomly from the set {0,1, . . . , <I>m</I> - 1}. We define a corresponding hash function <img src="231_a.gif">:<P>
<img src="231_b.gif"><P>
<h4><a name="07cd_1457">(12.3)<a name="07cd_1457"></sub></sup></h4><P>
With this definition,<P>
<img src="231_c.gif"><P>
<h4><a name="07cd_1458">(12.4)<a name="07cd_1458"></sub></sup></h4><P>
has <I>m<SUP>r</I>+1</SUP> members.<P>
<a name="07cd_1459">Theorem 12.4<a name="07cd_1459"><P>
The class <img src="231_d.gif"> defined by equations (12.3) and (12.4) is a universal class of hash functions.<P>
<I><B>Proof     </I></B>Consider any pair of distinct keys <I>x, y</I>. Assume that <I>x</I><SUB>0</SUB> <img src="../images/noteq.gif"> <I>y</I><SUB>0</SUB>. (A similar argument can be made for a difference in any other byte position.) For any fixed values of <I>a</I><SUB>1</SUB>, <I>a</I><SUB>2</SUB>, . . . , <I>a<SUB>r</I></SUB>, there is exactly one value of <I>a</I><SUB>0</SUB> that satisfies the equation <I>h</I>(<I>x</I>) = <I>h</I>(<I>y</I>); this <I>a</I><SUB>0</SUB> is the solution to<P>
<img src="231_e.gif"><P>
<a name="07cd_1454">To see this property, note that since <I>m</I> is prime, the nonzero quantity <I>x</I><SUB>0</SUB><I> - y</I><SUB>0</SUB> has a multiplicative inverse modulo <I>m</I>, and thus there is a unique solution for <I>a</I><SUB>0</SUB> modulo <I>m</I>. (See Section 33.4.) Therefore, each pair of keys <I>x</I> and <I>y</I> collides for exactly <I>m<SUP>r</I></SUP> values of <I>a</I>, since they collide exactly once for each possible value of <img src="../images/lftwdchv.gif"><I>a</I><SUB>l</SUB>,<I> a</I><SUB>2</SUB>,<I> . . .</I>,<I> a<SUB>r</I></SUB><img src="../images/wdrtchv.gif"><I> </I>(i.e., for the unique value of <I>a</I><SUB><FONT FACE="Courier New" SIZE=2>0</FONT></SUB> noted above). Since there are <I>m<SUP>r+</I>l</SUP> possible values for the sequence a, keys <I>x</I> and <I>y</I> collide with probability exactly <I>m<SUP>r</SUP>/m<SUP>r+</I>1</SUP> = 1/<I>m</I>. Therefore, <img src="231_f.gif"> is universal.      <P>
<P>







<h2><a name="07ce_1456">Exercises<a name="07ce_1456"></h2><P>
<a name="07ce_1457">12.3-1<a name="07ce_1457"><P>
<a name="07ce_1455">Suppose we wish to search a linked list of length <I>n</I>, where each element contains a key <I>k</I> along with a hash value <I>h</I>(<I>k</I>). Each key is a long character string. How might we take advantage of the hash values when searching the list for an element with a given key?<P>
<a name="07ce_1458">12.3-2<a name="07ce_1458"><P>
Suppose a string of <I>r</I> characters is hashed into <I>m</I> slots by treating it as a radix-128 number and then using the division method. The number <I>m</I> is easily represented as a 32-bit computer word, but the string of <I>r</I> characters, treated as a radix-128 number, takes many words. How can we apply the division method to compute the hash value of the character string without using more than a constant number of words of storage outside the string itself?<P>
<a name="07ce_1459">12.3-3<a name="07ce_1459"><P>
Consider a version of the division method in which <I>h</I>(<I>k</I>) = <I>k</I> mod <I>m</I>, where <I>m = </I>2<I><SUP>p</SUP> - I and k</I> is a character string interpreted in radix 2<I><SUP>p</I></SUP>. Show that if string <I>x</I> can be derived from string <I>y</I> by permuting its characters, then <I>x</I> and <I>y</I> hash to the same value. Give an example of an application in which this property would be undesirable in a hash function.<P>
<a name="07ce_145a">12.3-4<a name="07ce_145a"><P>
Consider a hash table of size <I>m</I> = 1000 and the hash function <I>h</I>(<I>k</I>) = <FONT FACE="Times New Roman" SIZE=2><img src="../images/hfbrdl12.gif"><I>m </I></FONT>(<I>k A</I> mod 1)<FONT FACE="Times New Roman" SIZE=2><img src="../images/hfbrdr12.gif"></FONT> for <I>A</I> = <img src="232_a.gif">. Compute the locations to which the keys 61, 62, 63, 64, and 65 are mapped.<P>
<a name="07ce_145b">12.3-5<a name="07ce_145b"><P>
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -