📄 atom.html.svn-base
字号:
<em CLASS="Fragment"><allocate a new entry 39></em> return p->str;}</pre><pre CLASS="Chunk"><em CLASS="Fragment"><macros 37>+=</em>#define NELEMS(x) ((sizeof (x))/(sizeof ((x)[0])))</pre><p CLASS="Noindent">The definition of <em CLASS="Code">NELEMS</em> illustrates a common Cidiom: The number of elements in an array is the size of the array divided by the size ofeach element. <em CLASS="Code">sizeof</em> is a compile-time operator, so this computationapplies only to arrays whose size is known at compile time. As this definitionillustrates, macro parameters are italicized to highlight where they are used in the macrobody.</p><p>If <em CLASS="Code">str[0</em>..<em CLASS="Code">len-1]</em> isn't in the table, <em CLASS="Code">Atom_new</em> adds it by allocating a <em CLASS="Code">struct atom</em> andenough additional space to hold the sequence, copying <em CLASS="Code">str[0</em>..<em CLASS="Code">len-1]</em> into the additional space and linking the new entry onto thebeginning of the list emanating from em CLASS="Code">buckets[h]. The entrycould be appended to the end of the list, but adding it at the front of the list issimpler.</p><pre CLASS="Chunk"><em CLASS="Fragment"><allocate a new entry 39>=</em>p = ALLOC(sizeof (*p) + len + 1);p->len = len;p->str = (char *)(p + 1);if (len > 0) memcpy(p->str, str, len);p->str[len] = '\0';p->link = buckets[h];buckets[h] = p;</pre><pre CLASS="Chunk"><em CLASS="Fragment"><includes 34>+=</em>#include "mem.h"</pre><p CLASS="Noindent"><em CLASS="Code">ALLOC</em> is <em CLASS="Code">Mem</em>'s primaryallocation function, and it mimics the standard library function <em CLASS="Code">malloc</em>:its argument is the number of bytes needed. <em CLASS="Code">Atom_new</em> cannot use <em CLASS="Code">Mem</em>'s <em CLASS="Code">NEW</em>, which is illustrated in <em CLASS="Code">Stack_push</em>, because the number of bytes depends on <em CLASS="Code">len</em>;<em CLASS="Code">NEW</em> applies only when the number of bytes is known at compile time.The call to <em CLASS="Code">ALLOC</em> above allocates the space for both the <em CLASS="Code">atom</em> structure and for the sequence, and the sequence is stored in theimmediately succeeding bytes.</p><p>Hashing the sequence passed to <em CLASS="Code">Atom_new</em> involves computing anunsigned number to represent the sequence. Ideally, these hash numbers should bedistributed uniformly over the range zero to <em CLASS="Code">NELEMS(buckets)</em>-1 for <em>N</em>sequences. If they are so distributed, each list in <em CLASS="Code">buckets</em> willhave <em>N</em>/<em CLASS="Code">NELEMS(buckets)</em> elements, and the average time tosearch for a sequence will be <em>N</em>/2<em CLASS="Code">種ELEMS(buckets)</em>. If <em>N</em>is less than, say, 2<em CLASS="Code">種ELEMS(buckets)</em>, the search time isessentially a constant.</p><p>Hashing is a well-studied subject, and there are many good hash functions. <em CLASS="Code">Atom_new</em> uses a simple table-lookup algorithm:</p><pre CLASS="Chunk"><em CLASS="Fragment"><</em><em CLASS="Code">h = </em><em CLASS="Fragment">hash </em><em CLASS="Code">str[0</em>..<em CLASS="Code">len-1]</em><em CLASS="Fragment"> 39>=</em>for (h = 0, i = 0; i < len; i++) h = (h<<1) + scatter[(unsigned char)str[i]];</pre><p CLASS="Noindent"><em CLASS="Code">scatter</em> is a 256-entry array that maps bytes torandom numbers, which were generated by calling the standard library function <em CLASS="Code">rand</em>. Experience shows that this simple approach helps to more uniformlydistribute the hash values. Casting <em CLASS="Code">str[i]</em> to an unsigned characteravoids C's ambiguity about "plain" characters: they can be signed or unsigned.Without the cast, values of <em CLASS="Code">str[i]</em> that exceed 127 would yieldnegative indices on machines that use signed characters.</p><pre CLASS="Chunk"><em CLASS="Fragment"><data 36>+=</em>static unsigned long scatter[] = {2078917053, 143302914, 1027100827, 1953210302, 755253631, 2002600785,1405390230, 45248011, 1099951567, 433832350, 2018585307, 438263339,813528929, 1703199216, 618906479, 573714703, 766270699, 275680090,1510320440, 1583583926, 1723401032, 1965443329, 1098183682, 1636505764,980071615, 1011597961, 643279273, 1315461275, 157584038, 1069844923,471560540, 89017443, 1213147837, 1498661368, 2042227746, 1968401469,1353778505, 1300134328, 2013649480, 306246424, 1733966678, 1884751139,744509763, 400011959, 1440466707, 1363416242, 973726663, 59253759,1639096332, 336563455, 1642837685, 1215013716, 154523136, 593537720,704035832, 1134594751, 1605135681, 1347315106, 302572379, 1762719719,269676381, 774132919, 1851737163, 1482824219, 125310639, 1746481261,1303742040, 1479089144, 899131941, 1169907872, 1785335569, 485614972,907175364, 382361684, 885626931, 200158423, 1745777927, 1859353594,259412182, 1237390611, 48433401, 1902249868, 304920680, 202956538,348303940, 1008956512, 1337551289, 1953439621, 208787970, 1640123668,1568675693, 478464352, 266772940, 1272929208, 1961288571, 392083579,871926821, 1117546963, 1871172724, 1771058762, 139971187, 1509024645,109190086, 1047146551, 1891386329, 994817018, 1247304975, 1489680608,706686964, 1506717157, 579587572, 755120366, 1261483377, 884508252,958076904, 1609787317, 1893464764, 148144545, 1415743291, 2102252735,1788268214, 836935336, 433233439, 2055041154, 2109864544, 247038362,299641085, 834307717, 1364585325, 23330161, 457882831, 1504556512,1532354806, 567072918, 404219416, 1276257488, 1561889936, 1651524391,618454448, 121093252, 1010757900, 1198042020, 876213618, 124757630,2082550272, 1834290522, 1734544947, 1828531389, 1982435068, 1002804590,1783300476, 1623219634, 1839739926, 69050267, 1530777140, 1802120822,316088629, 1830418225, 488944891, 1680673954, 1853748387, 946827723,1037746818, 1238619545, 1513900641, 1441966234, 367393385, 928306929,946006977, 985847834, 1049400181, 1956764878, 36406206, 1925613800,2081522508, 2118956479, 1612420674, 1668583807, 1800004220, 1447372094,523904750, 1435821048, 923108080, 216161028, 1504871315, 306401572,2018281851, 1820959944, 2136819798, 359743094, 1354150250, 1843084537,1306570817, 244413420, 934220434, 672987810, 1686379655, 1301613820,1601294739, 484902984, 139978006, 503211273, 294184214, 176384212,281341425, 228223074, 147857043, 1893762099, 1896806882, 1947861263,1193650546, 273227984, 1236198663, 2116758626, 489389012, 593586330,275676551, 360187215, 267062626, 265012701, 719930310, 1621212876,2108097238, 2026501127, 1865626297, 894834024, 552005290, 1404522304,48964196, 5816381, 1889425288, 188942202, 509027654, 36125855,365326415, 790369079, 264348929, 513183458, 536647531, 13672163,313561074, 1730298077, 286900147, 1549759737, 1699573055, 776289160,2143346068, 1975249606, 1136476375, 262925046, 92778659, 1856406685,1884137923, 53392249, 1735424165, 1602280572};</pre><p><em CLASS="Code">Atom_length</em> can't hash its argument because it doesn't know itslength. But the argument must be an atom, so <em CLASS="Code">Atom_length</em> can simplyscream through the lists in <em CLASS="Code">buckets</em> comparing pointers. If it findsthe atom, it returns the atom's length:</p><pre CLASS="Chunk"><em CLASS="Fragment"><functions 35>+=</em>int Atom_length(const char *str) { struct atom *p; int i; assert(str); for (i = 0; i < NELEMS(buckets); i++) for (p = buckets[i]; p; p = p->link) if (p->str == str) return p->len; assert(0); return 0;}</pre><p CLASS="Noindent"><em CLASS="Code">assert(0)</em> implements the checked runtime errorthat <em CLASS="Code">Atom_length</em> must be called only with an atom, not just apointer to a string. <em CLASS="Code">assert(0)</em> is also used to signal conditionsthat are not supposed to occur—so-called "can't-happen" conditions.</p><h2 CLASS="Section">Further Reading</h2><p CLASS="Noindent">Atoms have long been used in LISP, which is the source of their name,and in string-manipulation languages, such as SNOBOL4, which implemented strings almostexactly as described in this chapter (Griswold 1972). The C compiler <a HREF="http://www.cs.princeton.edu/software/lcc/"><em CLASS="Code">lcc</em></a> (Fraser andHanson 1995) has a module that is similar to <em CLASS="Code">Atom</em> and is thepredecessor to <em CLASS="Code">Atom</em>'s implementation. <em CLASS="Code">lcc</em>stores the strings for <em>all</em> identifiers and constants that appear in the sourceprogram in a single table, and never deallocates them. Doing so never consumes too muchstorage because the number of distinct strings in C programs is remarkably smallregardless of the size of the source programs.</p><p>Sedgewick (1990) and Knuth (1973b) describe hashing in detail and give guidelines forwriting good hash functions. The hash function used in <em CLASS="Code">Atom</em> (and in <em CLASS="Code">lcc</em>) was suggested by Hans Boehm.</p><h2 CLASS="Section">Exercises</h2><ol CLASS="Exercises"> <li CLASS="Exercise">Most texts recommend using a prime number for the size of <em CLASS="Code">buckets</em>. Using a prime and a good hash function usually gives a better distribution of the lengths of the lists hanging off of <em CLASS="Code">buckets</em>. <em CLASS="Code">Atom</em> uses a power of two, which is sometimes explicitly cited as a bad choice. Write a program to generate or read, say, 10,000 typical strings and measure <em CLASS="Code">Atom_new</em>'s speed and the distribution of the lengths of the lists. Then change <em CLASS="Code">buckets</em> so that it has 2,039 entries (the largest prime less than 2,048), change the statement <pre CLASS="ExerciseCode">h &= NELEMS(buckets)-1;</pre> <p CLASS="Exercise">to</p> <pre CLASS="ExerciseCode">h %= NELEMS(buckets);</pre> <p CLASS="Exercise">and repeat the measurements. Does using a prime help? How much does your conclusion depend on your specific machine?</p> </li> <li CLASS="Exercise">Scour the literature for better hash functions; likely sources are Knuth (1973b), similar texts on algorithms and data structures and the papers they cite, and texts on compilers, such as Aho, Sethi, and Ullman (1986). Try these functions and measure their benefits.</li> <li CLASS="Exercise">Explain why <em CLASS="Code">Atom_new</em> doesn't use the standard C library function <em CLASS="Code">strncmp</em> to compare sequences.</li> <li CLASS="Exercise">Here's another way to declare the atom structure:<pre CLASS="ExerciseCode">struct atom { struct atom *link; int len; char str[1];};</pre> <p CLASS="Exercise">A <em CLASS="Code">struct</em> <em CLASS="Code">atom</em> for a string of <em CLASS="Code">len</em> bytes is allocated by <em CLASS="Code">ALLOC(sizeof(*p)+len)</em>, which allocates space for the <em CLASS="Code">link</em> and <em CLASS="Code">len</em> fields, and a <em CLASS="Code">str</em> field long enough to hold <em CLASS="Code">len</em>+1 bytes. This approach avoids the time and space required for the extra indirection induced by declaring <em CLASS="Code">str</em> to be a pointer. Unfortunately, this "trick" violates the C standard, because clients access the bytes beyond <em CLASS="Code">str[0]</em>, and the effect of these accesses is undefined. Implement this approach and measure the cost of the indirection. Are the savings worth violating the standard?</p> </li> <li CLASS="Exercise"><em CLASS="Code">Atom_new</em> compares the <em CLASS="Code">len</em> field of <em CLASS="Code">struct atom</em>s with the length of the incoming sequence to avoid comparing sequences of different lengths. If the hash numbers (not the indices into <em CLASS="Code">buckets</em>) for each atom were also stored in <em CLASS="Code">struct</em> <em CLASS="Code">atom</em>s, they could be compared, too. Implement this "improvement" and measure the benefits. Is it worthwhile?</li> <li CLASS="Exercise"><em CLASS="Code">Atom_length</em> is slow. Revise <em CLASS="Code">Atom</em>'s implementation so that <em CLASS="Code">Atom_length</em>'s running time is approximately the same as that of <em CLASS="Code">Atom_new</em>.</li> <li CLASS="Exercise">The <em CLASS="Code">Atom</em> interface evolved to its present form because its functions were the ones that clients used most often. There are other functions and designs that might be useful, which this exercise and those that follow explore. Implement<pre CLASS="ExerciseCode">extern void Atom_init(int hint);</pre> <p CLASS="Exercise">where <em CLASS="Code">hint</em> estimates the number of atoms the client expects to create. What checked runtime errors would you add to constrain when <em CLASS="Code">Atom_init</em> could be called?</p> </li> <li CLASS="Exercise">There are several functions to deallocate atoms that extensions to the <em CLASS="Code">Atom</em> interface might provide. For example, the functions<pre CLASS="ExerciseCode">extern void Atom_free (char *str);extern void Atom_reset(void);</pre> <p CLASS="Exercise">could deallocate the atom given by <em CLASS="Code">str</em> and deallocate all atoms, respectively. Implement these functions. Don't forget to specify and implement appropriate checked runtime errors.</p> </li> <li CLASS="Exercise">Some clients start execution by installing a bunch of strings as atoms for later use. Implement<pre CLASS="ExerciseCode">extern void Atom_vload(const char *str, ...);extern void Atom_aload(const char *strs[]);</pre> <p CLASS="Exercise"><em CLASS="Code">Atom_vload</em> installs the strings given in the variable length argument list up to a null pointer, and <em CLASS="Code">Atom_aload</em> does the same for a null-terminated array of pointers to strings.</p> </li> <li CLASS="Exercise">Copying the strings can be avoided if the client promises not to deallocate them, which is trivially true for string constants. Implement<pre CLASS="ExerciseCode">extern const char *Atom_add(const char *str, int len);</pre> <p CLASS="Exercise">which works like <em CLASS="Code">Atom_new</em> but doesn't make a copy of the sequence. If you provide <em CLASS="Code">Atom_add</em> and <em CLASS="Code">Atom_free</em> (and <em CLASS="Code">Atom_reset</em> from Exercise 3.8, what checked runtime errors must be specified and implemented?</p> </li></ol><address style="font-size: 7pt"> $Revision$ $Date$ </address></body></html>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -