📄 gb_words.w
字号:
% This file is part of the Stanford GraphBase (c) Stanford University 1993@i boilerplate.w %<< legal stuff: PLEASE READ IT BEFORE MAKING ANY CHANGES!@i gb_types.w\def\title{GB\_WORDS}\font\logosl=logosl10\prerequisites{GB\_\,GRAPH}{GB\_\,IO}@* Introduction. This GraphBase module provides two external subroutines:$$\vcenter{\halign{#\hfil\cr |words|, a routine that creates a graph based on five-letter words;\cr |find_word|, a routine that looks for a given vertex in such a graph.\cr}}$$Examples of the use of these routines can be found in two demo programs,{\sc WORD\_\,COMPONENTS} and {\sc LADDERS}.@(gb_words.h@>=extern Graph *words();extern Vertex *find_word();@ The subroutine call |words(n,wt_vector,wt_threshold,seed)|constructs a graph based on the five-letter words in \.{words.dat}.Each vertex of the graph corresponds to a single five-letter word. Twowords are adjacent in the graph if they are the same except in oneletter position. For example, `\.{words}' is adjacent to other words such as`\.{cords}', `\.{wards}', `\.{woods}', `\.{worms}', and `\.{wordy}'.The constructed graph has at most |n| vertices; indeed, it has exactly|n| vertices if there are enough qualifying words. A word qualifiesif its ``weight'' is |wt_threshold| or more, when weights arecomputed from a table pointed to by~|wt_vector| according to rulesdescribed below. (If parameter~|wt_vector|is |NULL|, i.e., \.{NULL}, default weights are used.) The fourth parameter,|seed|, is the seed of a random number generator.All words of \.{words.dat} will be sorted by weight. The first vertex ofthe graph will be the word of largestweight, the second vertex will have second-largest weight, and so on.Words of equal weight will appear in pseudo-random order, as determinedby the value of |seed| in a system-independent fashion.The first |n| words in order of decreasing weight are chosen to bevertices of the graph. However, if fewer than |n| words have weight |>=wt_threshold|, the graph will contain only the words that qualify. Insuch cases the graph will have fewer than |n| vertices---possibly none at all.Exception: The special case |n=0| is equivalent to the case when |n|has been set to the highest possible value. It causes all qualifyingwords to appear.@ Every word in \.{words.dat} has been classified as `common' (\.*), `advanced'(\.+), or `unusual' (\.\ ). Each word has also been assigned sevenfrequency counts $c_1$, \dots,~$c_7$, separated by commas; these counts showhow often the word has occurred in different publication contexts:$$\vcenter{\halign{$c_#$ times in &#\hfil\cr1&the American Heritage Intermediate Corpus of elementary school material;\cr2&the Brown Corpus of reading material from America;\cr3&the Lancaster-Oslo/Bergen Corpus of reading material from Britain;\cr4&the Melbourne-Surrey Corpus of newspaper material from Australia;\cr5&the Revised Standard Version of the Bible;\cr6&{\sl The \TEX/book\/} and {\sl The {\logosl METAFONT\kern1pt}book\/} by D. E. Knuth;\cr7&{\sl Concrete Mathematics\/} by Graham, Knuth, and Patashnik.\cr}}$$@^Graham, Ronald Lewis@>@^Knuth, Donald Ervin@>@^Patashnik, Oren@>For example, one of the entries in \.{words.dat} is$$\.{happy*774,92,121,2,26,8,1}$$indicating a common word with $c_1=774$, \dots, $c_7=1$.Parameter |wt_vector| points to an array of nine integers$(a,b,w_1,\ldots,w_7)$.The weight of each word is computed from these nine numbers by using theformula$$c_1w_1+\cdots+c_7w_7+ \cases{a,&if the word is `common';\cr b,&if the word is `advanced';\cr 0,&if the word is `unusual'.\cr}$$The components of |wt_vector| must be chosen so that$$\max\bigl(\vert a\vert, \vert b\vert\bigr) + C_1\vert w_1\vert + \cdots +C_7\vert w_7\vert < 2^{30},$$where $C_j$ is the maximum value of $c_j$ in the file; this restrictionensures that the |words| procedure will produce the same results on allcomputer systems.@ The maximum frequency counts actually present are $C_1=15194$, $C_2=3560$,$C_3=4467$, $C_4=460$, $C_5=6976$, $C_6=756$, and $C_7=362$; these can befound in the entries for the common words `\.{shall}', `\.{there}',`\.{which}', and `\.{would}'.The default weights are $a=100$, $b=10$, $c_1=4$, $c_2=c_3=2$, $c_4=c_5=c_6=c_7=1$.File \.{words.dat} contains 5757 words, of which 3300 are `common', 1194 are`advanced', and 1263 are `unusual'. Included among the unusual words are891 having $c_1=\cdots=c_7=0$; such wordswill always have weight zero, regardless of the weight vector parameter.@<Private variables@>=static long max_c[]={15194,3560,4467,460,6976,756,362}; /* maximum counts $C_j$ */static long default_wt_vector[]={100,10,4,2,2,1,1,1,1}; /* use this if |wt_vector=NULL| */@ Examples: If you call |words(2000,NULL,0,0)|, you get a graph with2000 of the most common five-letter words of English, using thedefault weights. The GraphBase programs are designed to besystem-independent, so that identical graphs will be obtained byeverybody who asks for |words(2000,NULL,0,0)|. Equivalent experimentson algorithms for graph manipulation can therefore be performed byresearchers in different parts of the world.The subroutine call |words(2000,NULL,0,s)| will produce slightlydifferent graphs when the random seed |s| varies, because some wordshave equal weight. However, the graph for any particular value of~|s|will be the same on all computers. The seed value can be any integerin the range $0\le s<2^{31}$.Suppose you call |words(0,w,1,0)|, with |w| defined by the \CEE/ declaration$$\hbox{|long w[9] = {1};|}$$this means that $a=1$ and $b=w_1=\cdots=w_7=0$. Therefore you'll get a graphcontaining only the 3300 `common' words. Similarly, it's possible to obtainonly the $3300+1194=4494$ non-`unusual' words, by specifying the weight vector$$\hbox{|long w[9] = {1,1};|}$$this makes $a=b=1$ and $w_1=\cdots=w_7=0$. In both of these examples, thequalifying words all have weight~1, so the vertices of the graph will appearin pseudo-random order.If |w| points to an array of nine 0's, the call |words(n,w,0,s)| gives arandom sample of |n| words, depending on |s| in a system-independent fashion.If the entries of the weight vector are all nonnegative, and if theweight threshold is zero, every word of \.{words.dat} will qualify. Thusyou will obtain a graph with $\min(n,5757)$ vertices.If |w| points to an array with {\sl negative\/} weights, the call|words(n,w,-0x7fffffff,0)| selects |n| of the {\sl least\/} commonwords in \.{words.dat}.@ If the |words| routine encounters a problem, it returns |NULL|, after puttinga code number into the external variable |panic_code|. This code numberidentifies the type of failure. Otherwise |words| returns a pointer to thenewly created graph, which will be represented with the data structuresexplained in {\sc GB\_\,GRAPH}. (The external variable |panic_code| is itselfdefined in {\sc GB\_\,GRAPH}.)@d panic(c) @+{@+gb_free(node_blocks); panic_code=c;@+gb_trouble_code=0;@+return NULL;@+}@ Now let's get going on the program. The \CEE/ file \.{gb\_words.c} beginsas follows:@p#include "gb_io.h" /* we will use the {\sc GB\_\,IO} routines for input */#include "gb_flip.h" /* we will use the {\sc GB\_\,FLIP} routines for random numbers */#include "gb_graph.h" /* we will use the {\sc GB\_\,GRAPH} data structures */#include "gb_sort.h" /* and |gb_linksort| for sorting */@h@#@<Type declarations@>@;@<Private variables@>@;@<Private functions@>@;@#Graph *words(n,wt_vector,wt_threshold,seed) unsigned long n; /* maximum number of vertices desired */ long wt_vector[]; /* pointer to array of weights */ long wt_threshold; /* minimum qualifying weight */ long seed; /* random number seed */{@+@<Local variables@>@;@# gb_init_rand(seed); @<Check that |wt_vector| is valid@>; @<Input the qualifying words to a linked list, computing their weights@>; @<Sort and output the words, determining adjacencies@>; if (gb_trouble_code) { gb_recycle(new_graph); panic(alloc_fault); /* oops, we ran out of memory somewhere back there */ } return new_graph;}@ @<Local var...@>=Graph *new_graph; /* the graph constructed by |words| */@* Validating the weights. The first job that |words| needs to tackle iscomparatively trivial:We want to verify the condition$$\max\bigl(\vert a\vert, \vert b\vert\bigr) + C_1\vert w_1\vert + \cdots +C_7\vert w_7\vert < 2^{30}.\eqno(*)$$This proves to be an interesting exercise in ``portable\CEE/ programming,'' because we don't want to risk integer overflow.Our approach is to do thecalculation first in floating point arithmetic, thereby ruling out casesthat are clearly unacceptable. Once that test is passed, we can safelytest the condition with ordinary integer arithmetic. Floatingpoint arithmetic is system dependent, but we use it carefully so thatsystem-independent results are obtained.@<Check that |wt_vector| is valid@>=if (!wt_vector) wt_vector=default_wt_vector;else {@+register double flacc; register long *p,*q; register long acc; @<Use floating point arithmetic to check that |wt_vector| isn't totally off base@>; @<Use integer arithmetic to check that |wt_vector| is truly OK@>;}@ The floating-point calculations are facilitated by a routine thatconverts an integer to its absolute value, expressed as a |double|:@<Private functions@>=static double flabs(x) long x;{@+if (x>=0) return (double)x; return -((double)x);}@ Although floating point arithmetic is system dependent, we can certainlyassume that at least 16 bits of precision are used. This implies thatthe difference between |flabs(x)| and $\vert x\vert$ must be lessthan $2^{14}$. Also, if $x$ and $y$ are nonnegative values less than $2^{31}$,the difference between their floating-point sum and their true sum must beless than $2^{14}$.The floating point calculations in the following test will never reject avalid weight vector. For if condition $(*)$ holds, the floating-point value of$\max(\hbox{|flabs(a)|},\hbox{|flabs(b)|})+C_1*|flabs|(w_1)+\cdots+C_7*|flabs|(w_7)$ will be less than $2^{30}+(8+C_1+\cdots+C_7)2^{14}$,which is less than $2^{30}+2^{29}$.@<Use float...@>=p=wt_vector;flacc=flabs(*p++);if (flacc<flabs(*p)) flacc=flabs(*p); /* now $|flacc|=\max(\vert a\vert,\vert b\vert)$ */for (q=&max_c[0]; q<&max_c[7]; q++) flacc += *q * flabs(*++p);if (flacc>=(double)0x60000000) /* this constant is $6\times2^{28}=2^{30}+2^{29}$ */ panic(very_bad_specs); /* whoa; the weight vector is way too big */@ Conversely, if the floating point test just made is passed, the truevalue of the sum will be less than $2^{30}+2^{29}+2^{29}=2^{31}$; henceinteger overflow will never occur when we make the following morerefined test:@<Use int...@>=p=wt_vector;acc=iabs(*p++);if (acc<iabs(*p)) acc=iabs(*p); /* now $|acc|=\max(\vert a\vert,\vert b\vert)$ */for (q=&max_c[0]; q<&max_c[7]; q++) acc += *q * iabs(*++p);if (acc>=0x40000000) panic(bad_specs); /* the weight vector is a bit too big */@ @<Private f...@>=static long iabs(x) long x;{@+if (x>=0) return (long)x; return -((long)x);}@* The input phase. Now we're ready to read \.{words.dat}.@<Local...@>=register long wt; /* the weight of the current word */char word[5]; /* the current five-letter word */long nn=0; /* the number of qualifying words found so far */@ As we read the words, we will form a linked list of nodes containingeach qualifying word and its weight, using the memory managementroutines of {\sc GB\_\,GRAPH} to allocate space for 111 nodes at atime. These nodes should be returned to available memory later, so wewill keep them in a separate area under local control.The nodes start out with |key| and |link| fields, as required by the|gb_linksort| routine, which we'll use to sort by weight. The sort key must benonnegative; we obtain it by adding $2^{30}$ to the weight.
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -