word form in one or more index files to obtain all data file addresses of the synsets
containing the word form. Each address is the byte offset (in the data file corresponding
to the syntactic category of the index file) at which the synset's information begins. The
information pertaining to a single synset is encoded as described in the Data Files
section below.
One shortcoming of the database's structure is that although all the files are in
ASCII, and are therefore editable and in theory extensible, in practice editing them by
hand is almost impossible. One of the Grinder's primary functions is the calculation of
addresses for the synsets in the data files. Editing any of the database files would
most likely invalidate the byte offsets of every subsequent synset, and would thus break
any search that follows those addresses. At present, building a WordNet database
requires running the Grinder over all of the lexicographers' source files at the same
time.
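The fragility of byte-offset addressing can be illustrated with a small sketch. The record layout below is hypothetical and far simpler than the real data files, and the names build_offsets and read_line_at are invented for this illustration. It only shows that an index of byte offsets goes stale as soon as an earlier line changes length.

```python
def build_offsets(lines):
    """Return the byte offset at which each line begins (toy stand-in for the Grinder)."""
    offsets = []
    position = 0
    for line in lines:
        offsets.append(position)
        position += len(line.encode("ascii")) + 1  # +1 for the end-of-line character
    return offsets


def read_line_at(blob, offset):
    """Read one line starting at a byte offset, as a retrieval program would."""
    end = blob.index(b"\n", offset)
    return blob[offset:end].decode("ascii")


# Hypothetical, simplified data lines (not the real WordNet encoding).
original = ["entity", "object", "artifact"]
index = dict(zip(original, build_offsets(original)))
original_blob = ("\n".join(original) + "\n").encode("ascii")

# Hand-edit the second line without recomputing the offsets.
edited = ["entity", "physical object", "artifact"]
edited_blob = ("\n".join(edited) + "\n").encode("ascii")

print(read_line_at(original_blob, index["artifact"]))  # the intended line
print(read_line_at(edited_blob, index["artifact"]))    # lands mid-line after the edit
```

Every offset after the edited line is wrong, which is why the offsets have to be recomputed for the whole file rather than patched locally.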
The descriptions of the Index and Data files that follow are brief and are intended to
provide only a glimpse into the structure, syntax, and organization of the database. More
detailed descriptions can be found in the manual page wndb(5) included in the
"WordNet Reference Manual".
Index Files
Word forms in an index file are in lower case regardless of how they were entered in
the lexicographers' files. The files are sorted according to the ASCII character set
collating sequence and can be searched quickly with a binary search.
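A minimal sketch of such a lookup follows, assuming the index lines are already in memory and that the word form is the first space-separated field of each line. The sample lines are placeholders rather than real index entries, and find_entry is a name invented here, not a WordNet library function.

```python
def find_entry(sorted_lines, word_form):
    """Binary-search sorted index lines for the one whose first field is word_form."""
    key = word_form.lower()  # index files store word forms in lower case
    low, high = 0, len(sorted_lines)
    while low < high:
        mid = (low + high) // 2
        lemma = sorted_lines[mid].split(" ", 1)[0]
        # For ASCII text, Python's string comparison matches the ASCII
        # collating sequence used to sort the file.
        if lemma < key:
            low = mid + 1
        else:
            high = mid
    if low < len(sorted_lines) and sorted_lines[low].split(" ", 1)[0] == key:
        return sorted_lines[low]
    return None


# Placeholder lines: a word form followed by made-up fields.
INDEX_LINES = sorted(["bank 2 @ ~ 100 200", "bar 1 @ 300", "dog 1 @ ~ 400"])

print(find_entry(INDEX_LINES, "Bar"))
print(find_entry(INDEX_LINES, "zebra"))
```

A production version would more likely search the file on disk by seeking to byte positions, but the comparison logic would be the same.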
Each index file begins with several lines containing a copyright notice, version
number and license agreement, followed by the data lines. Each line of data contains the
following information: the sense count from the on-line dictionary; a list of the relational
pointer types used in all synsets containing the word (this is used by the retrieval
software to indicate to a user which searches are applicable); a list of indices which are
byte offsets into the corresponding data file, one for each occurrence of the word form in
a synset. Each data line is terminated with an end-of-line character.
Data Files
A data file contains information corresponding to the synsets that were defined in
the lexicographers' files with pointers resolved to byte offsets in data.pos files.
Each data file begins with several lines containing a copyright notice, version
number and license agreement. This is followed by a list of the names of all the input
files that were specified to the Grinder, in the order that they were given on the command
line, followed by the data lines. Each line of data contains an encoding of the
information entered by the lexicographer for a synset, as well as additional information
provided by the Grinder which is useful to the retrieval software and other programs.
Each data line is terminated with an end-of-line character. In the data files, word forms
in a synset match the orthographic representation entered in the lexicographers' files.
The first piece of information on each line is the byte offset, or address, of the
synset. This is slightly redundant, since almost any computer program that reads a synset
from a data file knows the byte offset that it read it from; however, this piece of
information is useful when using UNIX utilities like grep to trace synsets and pointers
without the use of sophisticated software. It also provides a unique "key" for a synset,
if a user's application requires one. An integer, corresponding to the location in the list
of file names of the file from which the synset originated, follows. This can be used by
retrieval software to annotate the display of a synset with the name of the originating file,
and can be helpful for distinguishing senses. A list of word forms, relational pointers,
and verb frames follows. An optional textual gloss is the final component of a data line.
Relational pointers are represented by several pieces of information. The symbol
for the pointer comes first, followed by the address of the target synset and its syntactic
category (necessary for pointers that cross over into a different syntactic category),
followed by a field which differentiates lexical and semantic pointers. If a lexical pointer
is being represented, this field indicates which word forms in the source and target
synsets the pointer pertains to. For a semantic pointer, this field is 0.
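The pointer fields described above can be decoded along the following lines. This sketch assumes the encoding documented in wndb(5) for later releases, in which the final field is four hexadecimal digits: the first two give the word number in the source synset and the last two the word number in the target synset, with 0000 marking a semantic pointer. The Pointer type and parse_pointer function are names invented for this illustration, and the offsets in the examples are made up.

```python
from typing import NamedTuple, Optional


class Pointer(NamedTuple):
    symbol: str                 # pointer symbol, e.g. "@" or "!"
    target_offset: int          # byte offset of the target synset
    target_pos: str             # syntactic category of the target
    source_word: Optional[int]  # 1-based word number, None if semantic
    target_word: Optional[int]  # 1-based word number, None if semantic

    @property
    def is_semantic(self) -> bool:
        return self.source_word is None


def parse_pointer(fields):
    """Decode the four fields of one relational pointer."""
    symbol, offset, pos, source_target = fields
    source = int(source_target[:2], 16)
    target = int(source_target[2:], 16)
    if source == 0 and target == 0:
        # Semantic pointer: relates whole synsets, not individual word forms.
        return Pointer(symbol, int(offset), pos, None, None)
    return Pointer(symbol, int(offset), pos, source, target)


# Illustrative values only; the offsets are not real synset addresses.
semantic = parse_pointer("@ 00001740 n 0000".split())
lexical = parse_pointer("! 00002000 a 0102".split())
print(semantic)
print(lexical)
```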
Retrieving Lexical Information
In order to give a user access to information in the database, an interface is required.
Interfaces enable end users to retrieve the lexical data and display it via a window-based
tool or the command line. When considering the role of the interface, it is important to
recognize the difference between a printed dictionary and a lexical database. WordNet's
interface software creates its responses to a user's requests on the fly. Unlike an on-line
version of a printed dictionary, where information is stored in a fixed format and
displayed on demand, WordNet's information is stored in a format that would be
meaningless to an ordinary reader. The interface provides a user with a variety of ways
to retrieve and display lexical information. Different interfaces can be created to serve
the purposes of different users, but all of them will draw on the same underlying lexical
database, and may use the same software functions that interface to the database files.
User interfaces to WordNet can take on many forms. The standard interface is an X
Windows application, which has been ported to several computer platforms. Microsoft
Windows and Macintosh interfaces have also been written. An alternative command line
interface allows the user to retrieve the same data, with exactly the same output as the
window-based interfaces, although the specification of the retrieval criteria is more
cumbersome, and the whole effect is less impressive. Nevertheless, the command line
interface is useful because some users do not have access to windowing environments.
Shell scripts and other programs can also be written around the command line interface.
The search process is the same regardless of the type of search requested. The first
step is to retrieve the index entry located in the appropriate index file. This will contain a
list of addresses of the synsets in the data file in which the word appears. Then each of
the synsets in the data file is searched for the requested information, which is retrieved
and formatted for output. Searching is complicated by the fact that each synset
containing the search word also contains pointers to other synsets in the data file that may
need to be retrieved and displayed, depending on the search type. For example, each
synset in the hypernymic pathway points to the next synset in the hierarchy. If a user
requests a recursive search on hypernyms, retrieval is repeated until a synset is
encountered that contains no further hypernym pointers.
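The recursive retrieval can be sketched as follows. The in-memory dictionary stands in for the data file, keyed by made-up addresses, and the three synsets are toy data rather than actual WordNet content. The function name trace_hypernyms is hypothetical and is not the library's traceptrs().

```python
# Toy stand-in for a data file: address -> synset record.
SYNSETS = {
    100: {"words": ["dog"], "hypernyms": [200]},
    200: {"words": ["canine"], "hypernyms": [300]},
    300: {"words": ["animal"], "hypernyms": []},
}


def trace_hypernyms(offset, synsets, depth=0):
    """Follow hypernym pointers until a synset has none, indenting by depth."""
    synset = synsets[offset]
    lines = ["  " * depth + ", ".join(synset["words"])]
    for parent_offset in synset["hypernyms"]:
        lines.extend(trace_hypernyms(parent_offset, synsets, depth + 1))
    return lines


print("\n".join(trace_hypernyms(100, SYNSETS)))
```

The recursion stops on its own when a synset's pointer list is empty, which mirrors the termination condition described above. A synset with more than one hypernym simply produces more than one indented branch.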
The user interfaces to WordNet and other software tools rely upon a library of
functions that interface to the database files. A fairly comprehensive set of functions is
provided: they perform searches and retrievals, morphology, and various other utility
functions. Appendix C contains a brief description of these functions. The structured,
flexible design of the library provides a simple programming interface to the WordNet
database. Low-level, complex, and utility functions are included. The user interface
software depends upon the more complex functions to perform the actual data retrieval
and formatting of the search results for display to the user. Low-level functions provide
basic access to the lexical data in the index and data files, while shielding the
programmer from the details of opening files, reading files, and parsing a line of data.
These functions return the requested information in a data structure that can be
interpreted and used as required by the application. Utility functions allow simple
manipulations of the search strings.
The basic searching function, findtheinfo(), receives as its input arguments a word
form, syntactic category, and search type; findtheinfo() calls a low-level function to find
the corresponding entry in the index file, and for each sense calls the appropriate function
to trace the pointer corresponding to the search type. Most traces are done with the
function traceptrs(), but specialized functions exist for search types which do not
conform to the standard hierarchical search. As a synset is retrieved from the database, it
is formatted as required by the search type into a large output buffer. The resulting
buffer, containing all of the formatted synsets for all of the senses of the search word, is
returned to the caller. The calling function simply has to print the buffer returned from
findtheinfo().
This general search and retrieval algorithm is used in several different ways to
implement the user interfaces to WordNet. Search types vary by syntactic category but
correspond to the relational pointers listed in Table 2. Hierarchical searches may be
performed on all relational pointers except for antonyms and "also see". In addition, a
call to findtheinfo() may retrieve polysemy information, verb sentence frames, or noun
coordinate terms (those with the same hypernym as the search string).
The searching function does not perform morphological operations; therefore, calls
to findtheinfo() are made from within a loop that calls morphstr() to translate the search
string into one or more base forms before calling the searching function.
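The shape of that calling loop can be sketched as below. Both helpers are hypothetical stand-ins: base_forms plays the role of repeated morphstr() calls using a single toy rule (dropping a final "s"), which is not Morphy's actual rule set, and the search argument plays the role of findtheinfo().

```python
def base_forms(search_string):
    """Toy stand-in for morphstr(): yield the string itself, then candidate base forms."""
    candidates = [search_string]
    if search_string.endswith("s"):
        candidates.append(search_string[:-1])  # toy rule, not Morphy's real rules
    return candidates


def search_all_forms(search_string, lexicon, search):
    """Call the search function once for every base form present in the lexicon."""
    results = []
    for form in base_forms(search_string.lower()):
        if form in lexicon:
            results.append(search(form))
    return results


# Usage with a one-word toy lexicon and a trivial search function.
print(search_all_forms("Dogs", {"dog"}, lambda word: "found " + word))
```

Keeping morphology outside the searching function means the search code only ever sees strings that are expected to be in the database.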
X Windows Interface
This section gives the reader an idea of the look and feel of the X
Windows interface to the WordNet database. The Microsoft Windows and Macintosh
interfaces are very similar. The command line interface provides the same functions, but
the user must specify the search string and search type, as well as other options, on the
command line. The command line interface allows multiple searches on a search string
with a single command, but a separate command line must be constructed for each search
word.
The command xwn runs the xwordnet program in the background, freeing up the
window from which it was started for other tasks. The xwordnet window provides full
access to the WordNet database. The standard X Windows mouse functions are used to
open and close the xwordnet window, move the window, and change its size. Help on
the general operation of xwordnet can be obtained by pressing the middle mouse button
with the cursor in the top part of the window.
Searching the Database
The top part of the xwordnet window provides a buffer for entering a search string
and buttons corresponding to syntactic categories and options. Below this area a status
line indicates which type of search is being displayed in the large buffer below.
To search the WordNet database, a user moves the cursor into the large, horizontal
box below "Enter Search Word:" and enters a search string, followed by a carriage
return. A single word, hyphenated word, or collocation may be entered. A highlighted
button indicates each syntactic category in the WordNet database that contains the search
string. If the search string is not present exactly as typed (except for case, which is
ignored), a morphological process is applied to the search string in an attempt to
automatically generate a form that is present in WordNet. See the section on Morphy
for a discussion of this process.
Holding any mouse button on a highlighted part-of-speech button reveals a pull-down
menu of searches specific to that syntactic category. All of the searches available
for the search string are highlighted. The user selects a search by scrolling down with the
mouse until the desired search type is in reverse video, then releasing the mouse button.
The retrieval is then performed and the formatted results are displayed in the lower
window. The status line shows the type of search that was selected.
Although most searches return very quickly, the WordNet hierarchies can be quite
deep and broad, and some retrievals can take a long time. While a search is running, the
mouse pointer displays as a watchface when the mouse is in the upper part of the window
(above the output buffer), and the message Searching... is displayed in the output buffer.
By default, all of the senses found in WordNet that match the selected search are
displayed. The search may be restricted to one or more specific senses by entering a
comma-separated list of sense numbers in the "Sense Number:" box. These numbers
are used for one search only, and the box is cleared after the search is completed.
Options
The Options menu displays a list of options that are not directly associated with
WordNet searches. The Help, Textual Gloss, and Log options are toggles. Help and Log
are initially Off, and Textual Gloss is initially On. An option is toggled by highlighting
the option and releasing the mouse button. The following options are available:
[1] The Help option is used to display information that is helpful in understanding