📄 refer

📁 unix v7是最后一个广泛发布的研究型UNIX版本
💻
📖 第 1 页 / 共 3 页
字号:
12 3 下一页
.... refer | tbl | nroff -ms.de UC\\s-2\\$1\\s0\\$2...ds . \&\s+2.\s0.if t .ds -- \(em.if n .ds -- --.TR 69.TM 77-1274-17 39199 39199-11.ND October 27, 1977.ND June 21, 1978.TLSome Applications of Inverted Indexes on the UNIX System.AU "MH 2C-572" 6377M. E. Lesk.AI.MH.AB.LP.ft BI. Some Applications of Inverted Indexes \- Overview.ft R.PPThis memorandum describes a set of programs whichmake inverted indexes toUNIX*text files, and theirapplication toretrieving and formatting citations for documents prepared using.I troff..PPThese indexing and searching programs make keywordindexes to volumes of material too large for linear searching.Searches for combinations of single words can be performed quickly.The programs are divided intotwo phases.  The first makes an index from the originaldata; the second searches the index and retrievesitems.Both of these phases are further divided into two partsto separate the data-dependent and algorithm dependentcode..PPThe major current application of these programs isthe.I troffpreprocessor.I refer.A list of 4300 references is maintained on line,containing primarily papers written and cited bylocal authors.Whenever one of these references is requiredin a paper, a few words from the title or author listwill retrieve it, and the user need not bother to re-enterthe exact citation.Alternatively, authors can use their own lists of papers..PPThis memorandum is of interest tothose who are interested in facilities for searching largebut relatively unchanging text files ontheUNIXsystem,and those who are interested in handling bibliographiccitations withUNIX.I troff..LP.ft BII. Updating Publication Lists.PPThis section is a brief note describing theauxiliary programs for managing the updatingprocessing.It is written to aid clerical users inmaintaining lists of references.Primarily, the programs described permit a largeamount of individual control over the contentof publication lists while retaining theusefulness of the files to other users..LP.ft BIII. Manual Pages.PPThis section contains the pages from theUNIX programmer's manualfor the.I lookall,.I pubindex,and.I refercommands.It is useful for reference..sp\l'3i'.br* UNIX is a Trademark of Bell Laboratories..AE.CS 10 4 14 0 0 4.NHIntroduction..PPThe.UXsystemhas many utilities(e.g. \fIgrep, awk, lex, egrep, fgrep, ...\fR)to search through files of text,but most of them are based on a linear scan through theentire file, using some deterministic automaton..ev 1.ps 8.vs 10p.evThis memorandum discusses a program which uses invertedindexes.[%A D. Knuth%T The Art of Computer Programming: Vol. 3, Sorting and Searching%I Addison-Wesley%C Reading, Mass.%D 1977%O See section 6.5..]and can thus be used on much larger data bases..PPAs with any indexing system, of course, there are some disadvantages;once an index is made, the files that have been indexed can not be changedwithout remaking the index.Thus applications are restrictedto those making many searchesof relatively stable data.Furthermore, these programs depend on hashing, and can onlysearch for exact matches of whole keywords.It is not possible to look forarithmetic or logical expressions (e.g. ``date greater than 1970'') orfor regular expression searching such as that in.I lex..[lex lesk cstr.].PPCurrently there are two uses of this software,the.I referpreprocessor to format references,and the.I lookallcommand to search through all text files onthe.UXsystem..PPThe remaining sections of this memorandum discussthe searching programs and their uses.Section 2 explains the operation of the searching algorithm and describesthe data collected for use with the.I lookallcommand.The more important application,.I referhas a user's description in section 3.Section 4 goes into more detail onreference filesfor the benefit of those whowish to add references to data bases orwrite new.I troffmacros for use with.I refer.The options to make.I refercollect identical citations, or otherwise relocate and adjust references,are described in section 5.The.UXmanual sections for.I "refer, lookall,"and associated commands are attached as appendices..NHSearching..PPThe indexing and searching process is divided into two phases,each made of two parts.These areshown below..IP A.Construct the index..RS.IP (1)Find keys \*(-- turn the input files into a sequence of tags and keys,where each tag identifies a distinct item in the inputand the keys for each such item are the strings under which it isto be indexed..IP (2)Hash and sort \*(--prepare a set of inverted indexes from which, given a set of keys,the appropriate item tags can be found quickly..RE.IP B.Retrieve an item in response to a query..RS.IP (3)Search \*(--Given some keys, look through the files prepared by the hashingand sorting facility and derive the appropriate tags..IP (4)Deliver \*(--Given the tags, find the original items.  This completes thesearching process..RE.LPThe first phase, making the index, is presumably done relatively infrequently.It should, of course, be done whenever the data beingindexed change.In contrast, the second phase, retrieving items,is presumably done often, and must be rapid..PPAn effort is made to separate code which depends on the databeing handled from code which depends on the searching procedure.The search algorithm is involved only in steps(2) and (3), while knowledge of the actual data files isneeded only by steps (1) and (4).Thus it is easy to adapt to different data files or differentsearch algorithms..PPTo start with, it is necessary to have some way of selectingor generating keys from input files.For dealing with files that are basically English, we havea key-making program which automatically selects wordsand passes them to the hashing and sorting program (step 2).The format used has one line for each input item,arrangedas follows:.DSname:start,length (tab) key1 key2 key3 ....DEwhere.I nameis the file name,.I startis the starting byte number,and.I lengthis the number of bytes in the entry..PPThese lines are the only input used to make theindex.The first field (the file name, byte position, and byte count)is the tag of the itemand can be used to retrieve it quickly.Normally, an item is either a whole file or a section of a filedelimited by blank lines.After the tab, the second field contains the keys.The keys, if selected by the automatic program, areany alphanumeric strings whichare not among the 100 most frequent words in Englishand which are not entirely numeric (except for four-digitnumbers beginning 19, which are accepted as dates).Keys are truncated to six characters and converted to lower case.Some selection is needed if the original items are very large.We normally just take the first.I nkeys, with.I nless than 100 or so; this replaces any attempt at intelligent selection.One file in our system isa complete English dictionary; it would presumably be retrieved for all queries..PPTo generate an inverted index to the list of record tags and keys,the keysare hashedand sorted to produce an index.What is wanted, ideally, is a series of lists showing the tags associatedwith each key.To condense this,what is actually produced is a list showing the tags associatedwith each hash code, and thus with some set of keys.To speed up access and further save space,a set of three or possibly four files is produced.These files are:.KS.bd 2 2.TScenter;c clI l.File	Contentsentry	Pointers to posting file	for each hash codeposting	Lists of tag pointers for	each hash codetag	Tags for each itemkey	Keys for each item	(optional).TE.bd 2.KEThe posting file comprises the real data: it contains a sequence of listsof items posted under each hash code.  To speed up searching,the entry file is an array of pointers into the posting file, one per potentialhash code.Furthermore, the items in the lists in the posting file are not referred to by theircomplete tag, but just by an address in the tag file, whichgives the complete tags.The key file is optional and contains a copy of the keysused in the indexing..PPThe searching process starts with a query, containing several keys.The goal is to obtain all items which were indexed under these keys.The query keys are hashed, and the pointers in the entry file usedto access the lists in the posting file.  These listsare addresses in the tag file of documents posted under thehash codes derived from the query.The common items from all lists are determined;this must include the items indexed by every key, but may alsocontain some items which are false drops, since items referenced bythe correct hash codes need not actually have contained the correct keys.Normally, if there are several keys in the query, there are notlikely to be many false drops in the final combined list even thougheach hash code is somewhat ambiguous.The actual tags are then obtained from the tag file, and to guard againstthe possibility that an item has false-dropped on some hash codein the query, the original items are normally obtained from the deliveryprogram (4) and the query keys checked against themby string comparison..PPUsually, therefore, the check for bad drops is made against the original file.However, if the key derivation procedure is complex, it may be preferableto check against the keys fed to program (2).In this case the optional key file which contains thekeys associated with each item is generated, and the item tag is supplementedby a string.DS;start,length.DEwhich indicates the starting byte number in the key file and the length ofthe string of keys for each item.This file is not usually necessary with the presentkey-selection program, since the keysalways appear in the original document..PPThere is also an option(\f3-C\f2n\|\f1)for coordination level searching.This retrieves items which match all but.I nof the query keys.The items are retrieved in the order of the numberof keys that they match.Of course,.I nmust be less than the number of query keys (nothing isretrieved unless it matches at least one key)..PPAs an example, consider one set of 4377 references, comprising660,000 bytes.This included 51,000 keys, of which 5,900 were distinctkeys.The hash table is kept full to save space (at the expense of time);995 of 997 possible hash codes were used.The total set of index files (no key file) included 171,000 bytes,about 26% of the original file size.It took 8 minutes of processor time tohash, sort, and write the index.To search for a single query with the resulting index took 1.9 secondsof processor time,while to find the same paperwith a sequential linear searchusing.I grep(reading all of the tags and keys)took 12.3 seconds of processor time..PPWe have also used this software to index all of the English stored on our.UXsystem.This is the index searched by the.I lookallcommand.On a typical day there were29,000 files in our user file system, containing about 152,000,000bytes.Of these5,300 files, containing 32,000,000 bytes (about 21%)were English text.The total number of `words' (determined mechanically)was 5,100,000.Of these 227,000 were selected as keys;19,000 were distinct, hashing to 4,900 (of 5,000 possible) different hash codes.Theresulting inverted file indexes used 845,000 bytes, or about2.6% of the size of the original files.The particularly small indexes are caused by thefact that keys are taken from only the first 50 non-common words ofsome very long input files..PPEven this large \f2lookall\f1 index can be searched quickly.For example, to find this documentby looking for the keys``lesk inverted indexes''required1.7 seconds of processor timeand system time.By comparison, just to search the 800,000 byte dictionary (smaller than eventhe inverted indexes, let alone the 32,000,000 bytes of text files) with.I greptakes 29 seconds of processor time.The.I lookallprogram is thus useful when looking for a document which you believeis stored on-line, but do not know where.  For example, many memosfrom the Computing Science Research Center are in its.UXfile system, butit is oftendifficult to guess where a particular memo might be (it might have severalauthors, each with many directories, and have been worked on bya secretary with yet more directories).Instructions for the use of the.I lookallcommand are given in the manual section, shownin the appendix to this memorandum..PPThe only indexes maintained routinely are those of publication lists andall English files.To make other indexes, the programs for making keys, sorting them,searching the indexes, and delivering answers must be used.Since they are usually invoked as parts of higher-level commands,they are not in the default commanddirectory, but are available to any user in the directory.I /usr/lib/refer .Three programs are of interest:.I mkey ,which isolates keys from input files;.I inv ,which makes an index from a set of keys;and.I hunt ,which searches the index and delivers the items.Note that the two parts of the retrieval phase are combined intoone program, to avoid the excessive system work and delay whichwould result from running these as separate processes..PPThese three commands have a large number of options to adapt to differentkinds of input.The user not interested in the detailed description that now follows mayskip to section 3, which describes the.I referprogram, a packaged-up version of these tools specificallyoriented towards formatting references..PP.BMake Keys..RThe program.I mkeyis the key-making program corresponding to step (1) in phase A.Normally, it reads its input from the file names given as arguments,and if there are no arguments it reads from the standard input.It assumes that blank lines in the input delimitseparate items, for each of which a different line ofkeys should be generated.The lines of keys are written on the standard output.Keys are any alphanumeric string in the input notamong the most frequent words in English and not entirely numeric(except that all-numeric strings are acceptable if theyare between 1900 and 1999).In the output, keys are translated to lower case, and truncatedto six characters in length; any associated punctuation is removed.The following flag arguments are recognized by.I mkey:.TScenter;lB lw(4i).\-c \f2name	T{Name of file of common words;default is.I /usr/lib/eign.T}\-f \f2name	T{Read a list of files from.I nameand take each as an input argument.T}\-i \f2chars	T{Ignore all lines which begin with `%' followed by any characterin.I chars .T}\-k\f2n	T{Use at most.I nkeys per input item.T}\-l\f2n	T{Ignore items shorter than.I nletters long.T}\-n\f2m	T{Ignore as a key any word in the first.I mwords of the list of common English words.The default is 100.T}\-s	T{Remove the labels.I (file:start,length)from the output; just give the keys.Used when searching rather than indexing.T}\-w	T{Each whole file is a separate item;blank lines in files are irrelevant.T}.TE.PPThe normal arguments for indexing references arethe defaults, which are.I "\-c /usr/lib/eign" ,.I \-n100 ,and.I \-l3 .For searching, the.I \-soption is also needed.When the big.I lookallindex of all English files is run,the options are.I \-w ,.I \-k50 ,and.I "\-f (filelist)" .When running on textual input,the.I mkey
12 3 下一页
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -