<pre>delete <I>x</I> from the list <I>T</I>[<I>h</I>(<I>key</I>[<I>x</I>])]</pre><P>
The worst-case running time for insertion is <I>O</I>(1). For searching, the worst-case running time is proportional to the length of the list; we shall analyze this more closely below. Deletion of an element <I>x</I> can be accomplished in <I>O</I>(1) time if the lists are doubly linked. (If the lists are singly linked, we must first find <I>x</I> in the list <I>T</I>[<I>h</I>(<I>key</I>[<I>x</I>])], so that the <I>next</I> link of <I>x</I>'s predecessor can be properly set to splice <I>x</I> out; in this case, deletion and searching have essentially the same running time.)<P>
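The three dictionary operations can be sketched as follows (a minimal illustration in Python; the class and method names are invented for this sketch, and Python's built-in lists stand in for the linked lists of the text, so <I>delete</I> here takes time proportional to the chain length rather than the <I>O</I>(1) a doubly linked list would give):<P>

```python
class ChainedHashTable:
    """Hash table with collisions resolved by chaining (illustrative sketch)."""

    def __init__(self, m):
        self.m = m                       # number of slots
        self.T = [[] for _ in range(m)]  # T[j] is the chain for slot j

    def h(self, key):
        return key % self.m              # a simple illustrative hash function

    def insert(self, key, value):
        # insert at the head of the chain: O(1)
        self.T[self.h(key)].insert(0, (key, value))

    def search(self, key):
        # walk the chain T[h(key)]: time proportional to its length
        for k, v in self.T[self.h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        # with a Python list we must scan for key, as with a singly
        # linked list; a doubly linked list would make this O(1)
        chain = self.T[self.h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return
```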
<P>
<h2>Analysis of hashing with chaining</h2><P>
<a name="07c6_1444">How well does hashing with chaining perform? In particular, how long does it take to search for an element with a given key?<P>
<a name="07c6_1445">Given a hash table <I>T</I> with <I>m</I> slots that stores <I>n</I> elements, we define the <I><B>load factor</B></I> <img src="../images/alpha12.gif"> for <I>T</I> as <I>n/m</I>, that is, the average number of elements stored in a chain. Our analysis will be in terms of <img src="../images/alpha12.gif">; that is, we imagine <img src="../images/alpha12.gif"> staying fixed as <I>n</I> and <I>m</I> go to infinity. (Note that <img src="../images/alpha12.gif"> can be less than, equal to, or greater than 1.)<P>
The worst-case behavior of hashing with chaining is terrible: all <I>n</I> keys hash to the same slot, creating a list of length <I>n</I>. The worst-case time for searching is thus <img src="../images/bound.gif">(<I>n</I>) plus the time to compute the hash function--no better than if we used one linked list for all the elements. Clearly, hash tables are not used for their worst-case performance.<P>
<a name="07c6_1446">The average performance of hashing depends on how well the hash function <I>h</I> distributes the set of keys to be stored among the <I>m</I> slots, on the average. Section 12.3 discusses these issues, but for now we shall assume that any given element is equally likely to hash into any of the <I>m</I> slots, independently of where any other element has hashed to. We call this the assumption of <I><B>simple uniform hashing</B></I>.<P>
We assume that the hash value <I>h</I>(<I>k</I>) can be computed in <I>O</I>(1) time, so that the time required to search for an element with key <I>k</I> depends linearly on the length of the list <I>T</I>[<I>h</I>(<I>k</I>)]. Setting aside the <I>O</I>(1) time required to compute the hash function and access slot <I>h</I>(<I>k</I>), let us consider the expected number of elements examined by the search algorithm, that is, the number of elements in the list <I>T</I>[<I>h</I>(<I>k</I>)] that are checked to see if their keys are equal to <I>k</I>. We shall consider two cases. In the first, the search is unsuccessful: no element in the table has key <I>k</I>. In the second, the search successfully finds an element with key <I>k</I>.<P>
<a name="07c6_1448">Theorem 12.1<a name="07c6_1448"><P>
In a hash table in which collisions are resolved by chaining, an unsuccessful search takes time <img src="../images/bound.gif">(1 + <img src="../images/alpha12.gif"> ), on the average, under the assumption of simple uniform hashing.<P>
<I><B>Proof </B></I>Under the assumption of simple uniform hashing, any key <I>k</I> is equally likely to hash to any of the <I>m</I> slots. The average time to search unsuccessfully for a key <I>k</I> is thus the average time to search to the end of one of the <I>m</I> lists. The average length of such a list is the load factor <img src="../images/alpha12.gif"> = <I>n/m</I>. Thus, the expected number of elements examined in an unsuccessful search is <img src="../images/alpha12.gif">, and the total time required (including the time for computing <I>h</I>(<I>k</I>)) is <img src="../images/bound.gif">(1 + <img src="../images/alpha12.gif">). <P>
<a name="07c6_1449">Theorem 12.2<a name="07c6_1449"><P>
<a name="07c6_1447">In a hash table in which collisions are resolved by chaining, a successful search takes time <img src="../images/bound.gif">(1 + <img src="../images/alpha12.gif">), on the average, under the assumption of simple uniform hashing.<P>
<I><B>Proof </B></I>We assume that the key being searched for is equally likely to be any of the <I>n</I> keys stored in the table. We also assume that the <FONT FACE="Courier New" SIZE=2>CHAINED</FONT>-<FONT FACE="Courier New" SIZE=2>HASH</FONT>-<FONT FACE="Courier New" SIZE=2>INSERT</FONT> procedure inserts a new element at the end of the list instead of the front. (By Exercise 12.2-3, the average successful search time is the same whether new elements are inserted at the front of the list or at the end.) The expected number of elements examined during a successful search is 1 more than the number of elements examined when the sought-for element was inserted (since every new element goes at the end of the list). To find the expected number of elements examined, we therefore take the average, over the <I>n</I> items in the table, of 1 plus the expected length of the list to which the <I>i</I>th element is added. The expected length of that list is (<I>i</I> - 1)/<I>m</I>, and so the expected number of elements examined in a successful search is<P>
<img src="225_a.gif"><P>
Thus, the total time required for a successful search (including the time for computing the hash function) is <img src="../images/bound.gif">(2 + <img src="../images/alpha12.gif">/2 - 1/2<I>m</I>) = <img src="../images/bound.gif">(1 + <img src="../images/alpha12.gif">). <P>
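The closed form used in the proof can be verified exactly with a short computation (a sketch in Python using exact rational arithmetic; the values of <I>n</I> and <I>m</I> are arbitrary examples):<P>

```python
from fractions import Fraction

def expected_successful_search(n, m):
    # average, over the n elements, of 1 plus the expected length
    # (i - 1)/m of the list at the moment the ith element was added
    return sum(1 + Fraction(i - 1, m) for i in range(1, n + 1)) / n

n, m = 100, 25
alpha = Fraction(n, m)  # the load factor n/m

# agrees with the 1 + alpha/2 - 1/(2m) inside Theta(2 + alpha/2 - 1/2m)
assert expected_successful_search(n, m) == 1 + alpha / 2 - Fraction(1, 2 * m)
```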
What does this analysis mean? If the number of hash-table slots is at least proportional to the number of elements in the table, we have <I>n</I> = <I>O</I>(<I>m</I>) and, consequently, <img src="../images/alpha12.gif"> = <I>n/m </I>= <I>O</I>(<I>m</I>)<I>/m</I> = <I>O</I>(1). Thus, searching takes constant time on average. Since insertion takes <I>O</I>(1) worst-case time (see Exercise 12.2-3), and deletion takes <I>O</I>(1) worst-case time when the lists are doubly linked, all dictionary operations can be supported in <I>O</I>(1) time on average.<P>
<P>
<h2><a name="07c7_1449">Exercises<a name="07c7_1449"></h2><P>
<a name="07c7_144a">12.2-1<a name="07c7_144a"><P>
Suppose we use a random hash function <I>h</I> to hash <I>n</I> distinct keys into an array <I>T</I> of length <I>m</I>. What is the expected number of collisions? More precisely, what is the expected cardinality of {(<I>x,y</I>): <I>x</I> &ne; <I>y</I> and <I>h</I>(<I>x</I>) = <I>h</I>(<I>y</I>)}?<P>
<a name="07c7_144b">12.2-2<a name="07c7_144b"><P>
Demonstrate the insertion of the keys 5, 28, 19, 15, 20, 33, 12, 17, 10 into a hash table with collisions resolved by chaining. Let the table have 9 slots, and let the hash function be <I>h</I>(<I>k</I>) = <I>k</I> mod 9.<P>
<a name="07c7_144c">12.2-3<a name="07c7_144c"><P>
Argue that the expected time for a successful search with chaining is the same whether new elements are inserted at the front or at the end of a list. (<I>Hint:</I> Show that the expected successful search time is the same for <I>any </I>two orderings of any list.)<P>
<a name="07c7_144d">12.2-4<a name="07c7_144d"><P>
Professor Marley hypothesizes that substantial performance gains can be obtained if we modify the chaining scheme so that each list is kept in sorted order. How does the professor's modification affect the running time for successful searches, unsuccessful searches, insertions, and deletions?<P>
<a name="07c7_144e">12.2-5<a name="07c7_144e"><P>
<a name="07c7_1448">Suggest how storage for elements can be allocated and deallocated within the hash table itself by linking all unused slots into a free list. Assume that one slot can store a flag and either one element plus a pointer or two pointers. All dictionary and free-list operations should run in <I>O</I>(1) expected time. Does the free list need to be doubly linked, or does a singly linked free list suffice?<P>
<a name="07c7_144f">12.2-6<a name="07c7_144f"><P>
Show that if |<I>U</I>| > <I>nm</I>, there is a subset of <I>U</I> of size <I>n</I> consisting of keys that all hash to the same slot, so that the worst-case searching time for hashing with chaining is <img src="../images/bound.gif">(<I>n</I>).<P>
<P>
<P>
<h1><a name="07c8_144a">12.3 Hash functions<a name="07c8_144a"></h1><P>
<a name="07c8_1449">In this section, we discuss some issues regarding the design of good hash functions and then present three schemes for their creation: hashing by division, hashing by multiplication, and universal hashing.<P>
<h2>What makes a good hash function?</h2><P>
A good hash function satisfies (approximately) the assumption of simple uniform hashing: each key is equally likely to hash to any of the <I>m</I> slots. More formally, let us assume that each key is drawn independently from <I>U </I>according to a probability distribution <I>P</I>; that is, <I>P(k)</I> is the probability that <I>k</I> is drawn. Then the assumption of simple uniform hashing is that<P>
<img src="227_a.gif"><P>
<h4><a name="07c9_144b">(12.1)<a name="07c9_144b"></h4><P>
Unfortunately, it is generally not possible to check this condition, since <I>P </I>is usually unknown.<P>
Sometimes (rarely) we do know the distribution <I>P</I>. For example, suppose the keys are known to be random real numbers <I>k</I> independently and uniformly distributed in the range 0 <img src="../images/lteq12.gif"> <I>k</I> < 1. In this case, the hash function<P>
<pre><I>h</I>(<I>k</I>) = <img src="../images/hfbrdl12.gif"><I>km</I><img src="../images/hfbrdr12.gif"></pre><P>
can be shown to satisfy equation (12.1).<P>
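A quick empirical check (a sketch in Python; the slot count and trial count are arbitrary choices) shows how this hash function spreads uniformly distributed keys evenly over the slots:<P>

```python
import random

def h(k, m):
    # floor(k * m) for a real key k with 0 <= k < 1
    return int(k * m)

random.seed(42)
m, trials = 8, 80000
counts = [0] * m
for _ in range(trials):
    counts[h(random.random(), m)] += 1

# each slot should receive close to trials / m = 10000 keys
print(counts)
```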
<a name="07c9_144a">In practice, heuristic techniques can be used to create a hash function that is likely to perform well. Qualitative information about <I>P</I> is sometimes useful in this design process. For example, consider a compiler's symbol table, in which the keys are arbitrary character strings representing identifiers in a program. It is common for closely related symbols, such as <FONT FACE="Courier New" SIZE=2>pt</FONT> and <FONT FACE="Courier New" SIZE=2>pts</FONT>, to occur in the same program. A good hash function would minimize the chance that such variants hash to the same slot.<P>
A common approach is to derive the hash value in a way that is expected to be independent of any patterns that might exist in the data. For example, the "division method" (discussed further below) computes the hash value as the remainder when the key is divided by a specified prime number. Unless that prime is somehow related to patterns in the probability distribution <I>P</I>, this method gives good results.<P>
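A small illustration of why the choice of divisor matters (a sketch in Python; the keys are arbitrary examples): if <I>m</I> is a power of 2, the remainder is just the low-order bits of the key, so keys that agree in those bits always collide, whereas a prime divisor depends on all the bits:<P>

```python
def division_hash(k, m):
    # the division method: the hash value is the remainder of k divided by m
    return k % m

keys = [0x1234, 0x5634, 0x9A34]  # all share the low-order byte 0x34

# m = 256 = 2**8 looks only at that byte: every key lands in slot 52
assert {division_hash(k, 256) for k in keys} == {0x34}

# a prime divisor such as 251 separates them
assert len({division_hash(k, 251) for k in keys}) == 3
```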
Finally, we note that some applications of hash functions might require stronger properties than are provided by simple uniform hashing. For example, we might want keys that are "close" in some sense to yield hash values that are far apart. (This property is especially desirable when we are using linear probing, defined in Section 12.4.)<P>
<P>
<h2>Interpreting keys as natural numbers</h2><P>
Most hash functions assume that the universe of keys is the set <B>N</B> = {0,1,2, . . .} of natural numbers. Thus, if the keys are not natural numbers, a way must be found to interpret them as natural numbers. For example, a key that is a character string can be interpreted as an integer expressed in suitable radix notation. Thus, the identifier <FONT FACE="Courier New" SIZE=2>pt</FONT> might be interpreted as the pair of decimal integers (112,116), since <FONT FACE="Courier New" SIZE=2>p</FONT> = 112 and <FONT FACE="Courier New" SIZE=2>t</FONT> = 116 in the ASCII character set; then, expressed as a radix-128 integer, <FONT FACE="Courier New" SIZE=2>pt</FONT> becomes (112 <img src="../images/dot10.gif"> 128) + 116 = 14452. It is usually straightforward in any given application to devise some such simple method for interpreting each key as a (possibly large) natural number. In what follows, we shall assume that the keys are natural numbers.<P>
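The radix-128 interpretation described above can be written directly (a sketch in Python, where <FONT FACE="Courier New" SIZE=2>ord</FONT> gives the ASCII code of a character; the function name is invented for this sketch):<P>

```python
def key_as_natural_number(s, radix=128):
    # interpret the characters of s as the digits of a radix-128 integer,
    # most significant character first
    n = 0
    for ch in s:
        n = n * radix + ord(ch)
    return n

# "pt": p = 112, t = 116, so (112 * 128) + 116 = 14452
assert key_as_natural_number("pt") == 14452
```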
<P>