relation to the noun that it modifies. Each word form in WordNet is known by its
orthographic representation, syntactic category, semantic field, and sense number.
Together, these data make a ``key'' which uniquely identifies each word form in the
database.
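
As a rough illustration of the four-part key, the sketch below models it as a Python named tuple. The class name, field names, and sample values are assumptions made for this example; they are not taken from the WordNet source files.

```python
from typing import NamedTuple


class WordFormKey(NamedTuple):
    """Hypothetical model of the four-part key described above."""
    form: str            # orthographic representation
    category: str        # syntactic category, e.g. "noun" or "verb"
    semantic_field: str  # illustrative label for the semantic field
    sense: int           # sense number


# Two senses of the same spelling yield distinct keys (sample values are invented).
bank_1 = WordFormKey("bank", "noun", "object", 1)
bank_2 = WordFormKey("bank", "noun", "group", 2)

# Because the key is hashable, it can index a table of word forms directly.
index = {bank_1: "sloping land beside water", bank_2: "financial institution"}
```

Changing any one of the four fields produces a different key, which is what allows the same spelling to appear many times in the database without ambiguity.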

Relational Pointers

     Relational pointers represent the relations between the word forms in a synset and
other synsets, and are either lexical or semantic. Lexical relations exist between
relational adjectives and the nouns that they relate to, and between adverbs and the
adjectives from which they are derived. The semantic relation between adjectives and
the nouns for which they express values is encoded as an attribute. The semantic relation
between noun attributes and the adjectives expressing their values is also encoded.
Presently these are the only pointers that cross from one syntactic category to another.
Antonyms are also lexically related. Synonymy of word forms is implied by inclusion in
the same synset. Table 2 summarizes the relational pointers by syntactic category.
Meronymy is further specified by appending one of the following characters to the
meronymy pointer: p to indicate a part of something; s to indicate the substance of
something; m to indicate a member of some group. Holonymy is specified in the same
manner, each pointer representing the semantic relation opposite to the corresponding
meronymy relation.
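
The suffix convention can be sketched as a small decoder. The base symbols # and % are the meronym and holonym symbols listed in Table 2; the function names and the returned labels are assumptions for illustration only.

```python
BASE = {"#": "meronym", "%": "holonym"}            # symbols from Table 2
KIND = {"p": "part", "s": "substance", "m": "member"}


def describe_pointer(symbol: str) -> tuple[str, str]:
    """Split a two-character pointer such as '#p' into (relation, kind)."""
    if len(symbol) != 2 or symbol[0] not in BASE or symbol[1] not in KIND:
        raise ValueError(f"not a meronymy/holonymy pointer: {symbol!r}")
    return BASE[symbol[0]], KIND[symbol[1]]


def opposite(symbol: str) -> str:
    """Return the pointer for the opposite relation, keeping the kind suffix."""
    describe_pointer(symbol)  # raises ValueError if the symbol is malformed
    return ("%" if symbol[0] == "#" else "#") + symbol[1]
```

For example, `describe_pointer("#p")` yields `("meronym", "part")`, and `opposite("#m")` yields `"%m"`, the member holonym.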
     Many pointers are reflexive, meaning that if a synset contains a pointer to another
synset, the other synset should contain a corresponding reflexive pointer back to the
original synset. The Grinder automatically generates the relations for missing reflexive
pointers of the types listed in Table 3.
     A relational pointer can be entered by the lexicographer in one of two ways. If a
pointer is to represent a relation between synsets - a semantic relation - it is entered
following the list of word forms in the synset. Hypernymy always relates one synset to
another, and is an example of a semantic relation. The lexicographer can also enclose a
word form and a list of pointers within square brackets ([...]) to define a lexical relation
between word forms. Relational adjectives are entered in this manner, showing the
lexical relation between the adjective and the noun that it pertains to.

Table 2   WordNet Relational Pointers

Noun           Verb            Adjective            Adverb

Antonym !      Antonym !       Antonym !            Antonym !
Hyponym ~      Troponym ~      Similar &            Derived from \
Hypernym @     Hypernym @      Relational Adj. \
Meronym #      Entailment *    Also See
Holonym %      Cause >         Attribute =
Attribute =    Also See

Table 3  Reflexive Pointers

Pointer      Reflect

Antonym      Antonym
Hyponym      Hypernym
Hypernym     Hyponym
Holonym      Meronym
Meronym      Holonym
Similar to   Similar to
Attribute    Attribute

Verb Sentence Frames

 Each verb synset contains a list of verb frames illustrating the types of simple
sentences in which the verbs in the synset can be used. A list of verb frames can be
restricted to a word form by using the square bracket syntax described above. See
Appendix B for a list of the verb sentence frames.

Synset Syntax

     Strings in the source files that conform to the following syntactic rules are treated as
synsets. Note that this is a brief description of the general synset syntax and is not a
formal description of the source file format. A formal specification is found in the
manual page wninput(5) of the ``WordNet Reference Manual''.

     [1] Each synset begins with a left curly bracket ({).
     [2] Each synset is terminated with a right curly bracket (}).
     [3] Each synset contains a list of one or more word forms, each followed by a
         comma.
     [4] To code semantic relations, the list of word forms is followed by a list of
         relational pointers using the following syntax: a word form (optionally preceded
         by "filename:" to indicate a word form in a different lexicographer file) followed
         by a comma, followed by a relational pointer symbol.
     [5] For verb synsets, "frames:" is followed by a comma separated list of applicable
         verb frames. The verb frames follow all relational pointers.
     [6] To code lexical relations, a word form is followed by a list of elements from [4]
         and/or [5] inside square brackets ([...]).
     [7] To code adjective clusters, each part of a cluster (a head synset, optionally
         followed by satellite synsets) is separated from other parts of a cluster by a line
         containing only hyphens. Each entire cluster is enclosed in square brackets.
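
Rules [1] through [5] can be sketched as a small whitespace-based parser. This is a simplified illustration, not the Grinder's yacc/lex grammar: it ignores the square-bracket syntax of rule [6], the adjective clusters of rule [7], and anything else covered only by wninput(5). The function name, the returned dictionary layout, and the sample synsets in the usage note are assumptions.

```python
def parse_synset(text):
    """Parse one simplified synset string into words, pointers, and frames."""
    text = text.strip()
    if not (text.startswith("{") and text.endswith("}")):        # rules [1], [2]
        raise ValueError("a synset must be enclosed in curly brackets")

    tokens = text[1:-1].split()
    words, pointers, frames = [], [], []

    for i, tok in enumerate(tokens):
        if tok == "frames:":                                     # rule [5]
            frame_text = "".join(tokens[i + 1:])
            frames = [int(n) for n in frame_text.split(",") if n]
            break
        head, sep, symbol = tok.rpartition(",")
        if not sep:
            raise ValueError(f"expected a comma in {tok!r}")
        if symbol == "":                                         # rule [3]: "word,"
            if pointers:
                raise ValueError("word forms must precede pointers")
            words.append(head)
        else:                                                    # rule [4]: "word,@"
            filename, _, word = head.rpartition(":")
            pointers.append((filename or None, word, symbol))

    if not words:
        raise ValueError("a synset needs at least one word form")
    return {"words": words, "pointers": pointers, "frames": frames}
```

Calling `parse_synset("{ stroll, saunter, walk,@ frames: 2,22 }")` on this invented synset returns two word forms, one hypernym pointer to `walk`, and the frame numbers 2 and 22. A pointer written as `noun.animal:canine,@` is returned with `noun.animal` as its file name.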

Archive System

     The lexicographers' source files are maintained in an archive system based on the
Unix Revision Control System (RCS) for managing multiple revisions of text files. The
archive system has been established for several reasons - to allow the reconstruction of
any version of the WordNet database, to keep a history of all the changes to
lexicographers' files, to prevent people from making conflicting changes to the same file,
and to ensure that it is always possible to produce an up-to-date version of the WordNet
database. The programs in the archive system are Unix shell scripts that wrap RCS
commands in a manner that maintains the desired control over the lexicographers' source
files and provides a user-friendly interface for the lexicographers.
     The reserve command extracts from the archive the most recent revision of a given
file or files and locks the file for as long as a user is working on it. The review command
extracts from the archive the most recent revision of a given file or files for the purpose
of examination only; the file is therefore not locked. To discourage changes,
review files do not have write permission, since any such changes could not be
incorporated into the archive. The restore command verifies the integrity of a reserved
file and returns it to the archive system. The release command is used to break a lock
placed on a file with the reserve command. This is generally used if the lexicographer
decides that changes should not be returned to the archive. The whose command is used
to find out whether files are currently reserved, and if so, by whom.
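
The locking behaviour of these commands can be modelled with a toy in-memory class. This is only a sketch of the semantics described above: the real programs are shell scripts around RCS, and the class name, method signatures, and the `verify` callback (standing in for the integrity check) are all assumptions.

```python
class Archive:
    """Toy in-memory model of the archive commands; not the actual shell scripts."""

    def __init__(self, files):
        # Each file keeps a list of revisions, oldest first.
        self.revisions = {name: [text] for name, text in files.items()}
        self.locks = {}  # file name -> user holding the lock

    def reserve(self, name, user):
        """Return the latest revision and lock the file for this user."""
        if name in self.locks:
            raise RuntimeError(f"{name} is reserved by {self.locks[name]}")
        self.locks[name] = user
        return self.revisions[name][-1]

    def review(self, name):
        """Return the latest revision for reading only; no lock is taken."""
        return self.revisions[name][-1]

    def restore(self, name, user, text, verify):
        """Check the edited file, store it as a new revision, and unlock."""
        if self.locks.get(name) != user:
            raise RuntimeError(f"{name} is not reserved by {user}")
        if not verify(text):
            raise ValueError(f"{name} failed verification")
        self.revisions[name].append(text)
        del self.locks[name]

    def release(self, name):
        """Break the lock without storing any changes."""
        self.locks.pop(name, None)

    def whose(self, name):
        """Return the user holding the lock, or None if the file is free."""
        return self.locks.get(name)
```

In this model a second `reserve` on a locked file fails, which is how conflicting edits to the same file are prevented, while `review` always succeeds because it never takes a lock.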

Grinder Utility

     The Grinder is a versatile utility with the primary purpose of compiling the
lexicographers' files into a database format that facilitates machine retrieval of the
information in WordNet. The Grinder has several options that control its operation on a
set of input files. To build a complete WordNet database, all of the lexicographers' files
must be processed at the same time. The Grinder is also used as a verification tool to
ensure the syntactic integrity of the lexicographers' files when they are returned to the
archive system with the restore command.

Implementation

     The Grinder is a multi-pass compiler that is coded in C. The first pass uses a parser,
written in yacc and lex, to verify that the syntax of the input files conforms to the
specification of the input grammar and lexical items, and builds an internal representation
of the parsed synsets. Additional passes refer only to this internal representation of the
lexicographic data. Pass one attempts to find as many syntactic and structural errors as
possible. Syntactic errors are those in which the input file fails to conform to the input
grammar's specification, and structural errors refer to relational pointers that cannot be
resolved for some reason. Usually these errors occur because the lexicographer has made
a typographical error, such as constructing a pointer to a non-existent file, or has failed to
specify a sense number when referring to an ambiguous word form. Pass one cannot
detect structural errors in pointers to files that are not processed together. When used
as a verification tool, as from the restore command, only pass one is run.
     In its second pass, the Grinder resolves all of the semantic and lexical pointers. To
do this, the pointers that were specified in each synset are examined in turn, and the
target of each pointer (either a synset or a word form in a synset) is found. The source
pointer is then resolved by adding an entry to the internal data structure which notes the
``location'' of the target. In the case of reflexive pointers, the target pointer's synset is
then searched for a corresponding reflexive pointer. If found, the data structure
representing the reflexive pointer is modified to note the ``location'' of its target, the
original source pointer. If a reflexive pointer is not found, the Grinder automatically
creates one with all the pertinent information.
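
The handling of reflexive pointers in the second pass can be sketched as follows. The mapping reproduces Table 3; everything else is an assumption for illustration: synsets are reduced to plain identifiers, only synset-to-synset pointers are modelled, and the dictionary-of-sets layout is not the Grinder's actual linked-list structure.

```python
REFLECT = {  # pointer -> reflexive pointer, as listed in Table 3
    "antonym": "antonym",
    "hyponym": "hypernym",
    "hypernym": "hyponym",
    "holonym": "meronym",
    "meronym": "holonym",
    "similar to": "similar to",
    "attribute": "attribute",
}


def add_missing_reflexive(synsets):
    """Resolve pointers and create any missing reflexive pointers.

    synsets maps a synset id to a set of (pointer_type, target_id) pairs.
    Raises KeyError for a pointer whose target is not among the synsets,
    and returns the list of (synset_id, pointer_type, target_id) created.
    """
    created = []
    for source, pointers in list(synsets.items()):
        for ptype, target in sorted(pointers):   # sorted() copies, so adding is safe
            if target not in synsets:
                raise KeyError(f"unresolved pointer {ptype} -> {target}")
            reverse = REFLECT.get(ptype)
            if reverse and (reverse, source) not in synsets[target]:
                synsets[target].add((reverse, source))
                created.append((target, reverse, source))
    return created
```

Given a synset `dog` with a hypernym pointer to `canine` and no pointer in the other direction, the function adds a hyponym pointer from `canine` back to `dog`. Running it a second time creates nothing, since every reflexive pointer is then already present.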
     A subsequent pass through the list of word forms assigns a polysemy index value, or
sense count, to each word form found in the on-line dictionary. There is a separate sense
count for each syntactic category that the word form is found in. The Grinder's final pass
generates the WordNet database.
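
A minimal sketch of the sense count, assuming that it equals the number of synsets in a given syntactic category that contain the word form, and that forms are compared in lower case. The function name and the input layout are hypothetical.

```python
from collections import Counter


def sense_counts(synsets):
    """Count senses per (word form, syntactic category).

    synsets is an iterable of (category, word_forms) pairs.
    """
    counts = Counter()
    for category, word_forms in synsets:
        # A form listed twice in one synset still counts as a single sense.
        for form in {f.lower() for f in word_forms}:
            counts[(form, category)] += 1
    return counts
```

With two noun synsets and one verb synset containing the invented form `bank`, the result holds a count of 2 under `("bank", "noun")` and a separate count of 1 under `("bank", "verb")`.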

Internal Representation

     The internal representation of the lexicographic data is a network of interrelated
linked lists. A hash table of word forms is created as the lexicographers' files are parsed.
Lower-case strings are used as keys; the original orthographic word form, if not in
lower-case, is retained as part of the data structure for inclusion in the database files. As
the parser processes an input file, it calls functions which create data structures for the
word forms, pointers, and verb frames in a synset. Once an entire synset has been
parsed, a data structure is created for it which includes pointers to the various structures
representing the word forms, pointers, and verb frames. All of the synsets from the input
files are maintained as a single linked list. The Grinder's different passes access the
structures either through the linked list of synsets or the hash table of word forms. A 
list of synsets that specify each word form is maintained for the purposes of resolving 
pointers and generating the database's index files.
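
The two access paths can be sketched with ordinary Python containers. A list stands in for the linked list of synsets and a dict for the hash table; the class and field names are assumptions, and storing the first spelling seen as the original orthography is a simplification of what the text describes.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    words: list
    pointers: list = field(default_factory=list)
    frames: list = field(default_factory=list)


@dataclass
class WordEntry:
    original: str                                # orthography as first written
    synsets: list = field(default_factory=list)  # synsets that specify this form


class Lexicon:
    def __init__(self):
        self.synsets = []  # stands in for the single linked list of synsets
        self.table = {}    # stands in for the hash table of word forms

    def add(self, synset):
        """Record a parsed synset and index each of its word forms."""
        self.synsets.append(synset)
        for word in synset.words:
            # Lower-case strings are the keys; the original spelling is kept.
            entry = self.table.setdefault(word.lower(), WordEntry(original=word))
            entry.synsets.append(synset)
```

A pass that must visit every synset walks `synsets`, while a pass that resolves a pointer by word form looks the form up in `table` and follows the entry's list of synsets.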

WordNet Database

     For each syntactic category, two files represent the WordNet database - index.pos
and data.pos, where pos is either noun, verb, adj or adv (the actual file names may be
different on platforms other than Sun-4). The database is in an ASCII format that is
human- and machine-readable, and is easily accessible to those who wish to use it with
their own applications. Each index file is an alphabetized list of all of the word forms in
WordNet for the corresponding syntactic category. Each data file contains all of the
lexicographic data gathered from the lexicographers' files for the corresponding syntactic
category, with relational pointers resolved to addresses in data files.
     The index and data files are interrelated. Part of each entry in an index file is a list
of one or more byte offsets, each indicating the starting address of a synset in a data file.
The first step to the retrieval of synsets or other information is typically a search for a