📄 chapter6.html
字号:
its running time is likely to grow quadratically with the number of inputwords.) How can we organize the data to copy efficiently with a list orarbitrary words?<p>One solution is to keep the set of words seen so far sorted at all times, byplacing each word into its proper position in the order as it arrives. Thisshouldn't be done by shifting words in a linear array, though - that alsotakes too long. Instead we will use a data structure called a <em>binarytree</em>.<p>The tree contains one ``node'' per distinct word; each node contains<ul><li>A pointer to the text of the word,<li>A count of the number of occurrences,<li>A pointer to the left child node,<li>A pointer to the right child node.</ul>No node may have more than two children; it might have only zero or one.<p>The nodes are maintained so that at any node the left subtree contains onlywords that are lexicographically less than the word at the node, and theright subtree contains only words that are greater. This is the tree for thesentence ``now is the time for all good men to come to the aid of theirparty'', as built by inserting each word as it is encountered:<p align="center"><img src="pic63.gif"><p>To find out whether a new word is already in the tree, start at the root andcompare the new word to the word stored at that node. If they match, thequestion is answered affirmatively. If the new record is less than the treeword, continue searching at the left child, otherwise at the right child. Ifthere is no child in the required direction, the new word is not in the tree,and in fact the empty slot is the proper place to add the new word. Thisprocess is recursive, since the search from any node uses a search from oneof its children. Accordingly, recursive routines for insertion and printingwill be most natural.<p>Going back to the description of a node, it is most conveniently representedas a structure with four components:<pre> struct tnode { /* the tree node: */ char *word; /* points to the text */ int count; /* number of occurrences */ struct tnode *left; /* left child */ struct tnode *right; /* right child */ };</pre>This recursive declaration of a node might look chancy, but it's correct. Itis illegal for a structure to contain an instance of itself, but<pre> struct tnode *left;</pre>declares <tt>left</tt> to be a pointer to a <tt>tnode</tt>, not a <tt>tnode</tt>itself.<p>Occasionally, one needs a variation of self-referential structures: twostructures that refer to each other. The way to handle this is:<pre> struct t { ... struct s *p; /* p points to an s */ }; struct s { ... struct t *q; /* q points to a t */ };</pre>The code for the whole program is surprisingly small, given a handful ofsupporting routines like <tt>getword</tt> that we have already written. The mainroutine reads words with <tt>getword</tt> and installs them in the tree with<tt>addtree</tt>.<pre> #include <stdio.h> #include <ctype.h> #include <string.h> #define MAXWORD 100 struct tnode *addtree(struct tnode *, char *); void treeprint(struct tnode *); int getword(char *, int); /* word frequency count */ main() { struct tnode *root; char word[MAXWORD]; root = NULL; while (getword(word, MAXWORD) != EOF) if (isalpha(word[0])) root = addtree(root, word); treeprint(root); return 0; }</pre>The function <tt>addtree</tt> is recursive. A word is presented by<tt>main</tt> to the top level (the root) of the tree. At each stage, thatword is compared to the word already stored at the node, and is percolateddown to either the left or right subtree by a recursive call to<tt>adtree</tt>. Eventually, the word either matches something already in thetree (in which case the count is incremented), or a null pointer isencountered, indicating that a node must be created and added to the tree. Ifa new node is created, <tt>addtree</tt> returns a pointer to it, which isinstalled in the parent node.<pre> struct tnode *talloc(void); char *strdup(char *); /* addtree: add a node with w, at or below p */ struct treenode *addtree(struct tnode *p, char *w) { int cond; if (p == NULL) { /* a new word has arrived */ p = talloc(); /* make a new node */ p->word = strdup(w); p->count = 1; p->left = p->right = NULL; } else if ((cond = strcmp(w, p->word)) == 0) p->count++; /* repeated word */ else if (cond < 0) /* less than into left subtree */ p->left = addtree(p->left, w); else /* greater than into right subtree */ p->right = addtree(p->right, w); return p; }</pre>Storage for the new node is fetched by a routine <tt>talloc</tt>, which returnsa pointer to a free space suitable for holding a tree node, and the new wordis copied into a hidden space by <tt>strdup</tt>. (We will discuss these routinesin a moment.) The count is initialized, and the two children are made null.This part of the code is executed only at the leaves of the tree, when a newnode is being added. We have (unwisely) omitted error checking on the valuesreturned by <tt>strdup</tt> and <tt>talloc</tt>.<p><tt>treeprint</tt> prints the tree in sorted order; at each node, it printsthe left subtree (all the words less than this word), then the word itself,then the right subtree (all the words greater). If you feel shaky about howrecursion works, simulate <tt>treeprint</tt> as it operates on the tree shownabove.<pre> /* treeprint: in-order print of tree p */ void treeprint(struct tnode *p) { if (p != NULL) { treeprint(p->left); printf("%4d %s\n", p->count, p->word); treeprint(p->right); } }</pre>A practical note: if the tree becomes ``unbalanced'' because the words don'tarrive in random order, the running time of the program can grow too much.As a worst case, if the words are already in order, this program does anexpensive simulation of linear search. There are generalizations of thebinary tree that do not suffer from this worst-case behavior, but we willnot describe them here.<p>Before leaving this example, it is also worth a brief digression on a problemrelated to storage allocators. Clearly it's desirable that there be only onestorage allocator in a program, even though it allocates different kinds ofobjects. But if one allocator is to process requests for, say, pointers to<tt>char</tt>s and pointers to <tt>struct tnode</tt>s, two questions arise. First,how does it meet the requirement of most real machines that objects of certaintypes must satisfy alignment restrictions (for example, integers often mustbe located at even addresses)? Second, what declarations can cope with thefact that an allocator must necessarily return different kinds of pointers?<p>Alignment requirements can generally be satisfied easily, at the cost of somewasted space, by ensuring that the allocator always returns a pointer thatmeets <em>all</em> alignment restrictions. The <tt>alloc</tt> of <ahref="chapter5.html">Chapter 5</a> does not guarantee any particularalignment, so we will use the standard library function <tt>malloc</tt>, whichdoes. In <a href="chapter8.html">Chapter 8</a> we will show one way toimplement <tt>malloc</tt>.<p>The question of the type declaration for a function like <tt>malloc</tt> is avexing one for any language that takes its type-checking seriously. In C, theproper method is to declare that <tt>malloc</tt> returns a pointer to<tt>void</tt>, then explicitly coerce the pointer into the desired type witha cast. <tt>malloc</tt> and related routines are declared in the standardheader <tt><stdlib.h></tt>. Thus <tt>talloc</tt> can be written as<pre> #include <stdlib.h> /* talloc: make a tnode */ struct tnode *talloc(void) { return (struct tnode *) malloc(sizeof(struct tnode)); }</pre><tt>strdup</tt> merely copies the string given by its argument into a safeplace, obtained by a call on <tt>malloc</tt>:<pre> char *strdup(char *s) /* make a duplicate of s */ { char *p; p = (char *) malloc(strlen(s)+1); /* +1 for '\0' */ if (p != NULL) strcpy(p, s); return p; }</pre><tt>malloc</tt> returns <tt>NULL</tt> if no space is available; <tt>strdup</tt>passes that value on, leaving error-handling to its caller.<p>Storage obtained by calling <tt>malloc</tt> may be freed for re-use bycalling <tt>free</tt>; see <a href="chapter8.html">Chapters 8</a> and<a href="chapter7.html">7</a>.<p><strong>Exercise 6-2.</strong> Write a program that reads a C program and printsin alphabetical order each group of variable names that are identical in thefirst 6 characters, but different somewhere thereafter. Don't count wordswithin strings and comments. Make 6 a parameter that can be set from thecommand line.<p><strong>Exercise 6-3.</strong> Write a cross-referencer that prints a list of all wordsin a document, and for each word, a list of the line numbers on which itoccurs. Remove noise words like ``the,'' ``and,'' and so on.<p><strong>Exercise 6-4.</strong> Write a program that prints the distinct words in itsinput sorted into decreasing order of frequency of occurrence. Precede eachword by its count.<h2><a name="s6.6">6.6 Table Lookup</a></h2>In this section we will write the innards of a table-lookup package, toillustrate more aspects of structures. This code is typical of what might befound in the symbol table management routines of a macro processor or acompiler. For example, consider the <tt>#define</tt> statement. When a linelike<pre> #define IN 1</pre>is encountered, the name <tt>IN</tt> and the replacement text <tt>1</tt> arestored in a table. Later, when the name <tt>IN</tt> appears in a statementlike<pre> state = IN;</pre>it must be replaced by <tt>1</tt>.<p>There are two routines that manipulate the names and replacement texts.<tt>install(s,t)</tt> records the name <tt>s</tt> and the replacement text<tt>t</tt> in a table; <tt>s</tt> and <tt>t</tt> are just character strings.<tt>lookup(s)</tt> searches for <tt>s</tt> in the table, and returns a pointer tothe place where it was found, or <tt>NULL</tt> if it wasn't there.<p>The algorithm is a hash-search - the incoming name is converted into a smallnon-negative integer, which is then used to index into an array of pointers.An array element points to the beginning of a linked list of blocks describingnames that have that hash value. It is <tt>NULL</tt> if no names have hashed tothat value.<p align="center"><img src=pic64.gif><p>A block in the list is a structure containing pointers to the name, thereplacement text, and the next block in the list. A null next-pointer marksthe end of the list.<pre> struct nlist { /* table entry: */ struct nlist *next; /* next entry in chain */ char *name; /* defined name */ char *defn; /* replacement text */ };</pre>The pointer array is just<pre> #define HASHSIZE 101 static struct nlist *hashtab[HASHSIZE]; /* pointer table */</pre>The hashing function, which is used by both <tt>lookup</tt> and <tt>install</tt>,adds each character value in the string to a scrambled combination of theprevious ones and returns the remainder modulo the array size. This is not thebest possible hash function, but it is short and effective.<pre> /* hash: form hash value for string s */ unsigned hash(char *s) { unsigned hashval; for (hashval = 0; *s != '\0'; s++) hashval = *s + 31 * hashval; return hashval % HASHSIZE; }</pre>Unsigned arithmetic ensures that the hash value is non-negative.<p>The hashing process produces a starting index in the array <tt>hashtab</tt>; ifthe string is to be found anywhere, it will be in the list of blocks beginningthere. The search is performed by <tt>lookup</tt>. If <tt>lookup</tt> finds the entryalready present, it returns a pointer to it; if not, it returns <tt>NULL</tt>.<pre> /* lookup: look for s in hashtab */ struct nlist *lookup(char *s) { struct nlist *np; for (np = hashtab[hash(s)]; np != NULL; np = np->next) if (strcmp(s, np->name) == 0) return np; /* found */ return NULL; /* not found */ }</pre>The <tt>for</tt> loop in <tt>lookup</tt> is the standard idiom for walkingalong a linked list:<pre> for (ptr = head; ptr != NULL; ptr = ptr->next) ...</pre>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -