Vocab(3)                                                              Vocab(3)

NAME
       Vocab - Vocabulary indexing for SRILM

SYNOPSIS
       #include <Vocab.h>

DESCRIPTION
       The Vocab class represents sets of string tokens as typically used
       for vocabularies, word class names, etc. Additionally, Vocab provides
       a mapping from such string tokens (type VocabString) to integers
       (type VocabIndex). VocabIndex values are typically used to index
       words in language models to conserve space and speed up comparisons,
       etc. Thus, Vocab essentially implements a symbol table into which
       strings can be ``interned.''

TYPES
       VocabIndex
              A non-negative integer for representing a string internally.

       VocabString
              A character array representing a vocabulary item (e.g., a
              word).

CONSTANTS
       maxWordLength
              Maximum number of characters in a VocabString.

       Vocab_None
              A special VocabIndex used to denote no vocabulary item and to
              terminate VocabIndex arrays.

       Vocab_Unknown
       Vocab_SentStart
       Vocab_SentEnd
       Vocab_Pause
              Default VocabString values for some common, predefined
              vocabulary items: unknown word, sentence begin, sentence end,
              and pause, respectively.

CLASS MEMBERS
       Vocab(VocabIndex start = 0, VocabIndex end = 0x7fffffff)
              When initializing a Vocab object, start and end optionally set
              the minimum and maximum VocabIndex values assigned by the
              vocabulary. Indices are allocated in increasing order starting
              at start.

       VocabIndex addWord(VocabString name)
              Looks up the index of a word string name, adding the word if
              it is not already part of the vocabulary.

       VocabString getWord(VocabIndex index)
              Returns the VocabString for index, or 0 if the index isn't
              defined.
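
       The symbol-table idea behind addWord() and getWord() can be
       illustrated with a small standalone sketch. This is not SRILM code:
       ToyVocab, ToyIndex and Toy_None are hypothetical names invented for
       the illustration, and the storage layout (a hash map plus a deque)
       is an assumption, not a statement about how Vocab is implemented.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for VocabIndex / Vocab_None (illustration only).
typedef unsigned int ToyIndex;
const ToyIndex Toy_None = static_cast<ToyIndex>(-1);

// A toy interning table: each distinct string gets one small integer.
class ToyVocab {
public:
    // Return the index of name, assigning the next free index if it is new.
    ToyIndex addWord(const std::string &name) {
        std::unordered_map<std::string, ToyIndex>::const_iterator it =
            byName.find(name);
        if (it != byName.end()) {
            return it->second;          // already interned
        }
        assert(byIndex.size() < Toy_None);   // index space not exhausted
        ToyIndex idx = static_cast<ToyIndex>(byIndex.size());
        byIndex.push_back(name);        // deque: existing strings stay put
        byName.emplace(name, idx);
        return idx;
    }

    // Return the string for index, or 0 (a null pointer) if undefined.
    const char *getWord(ToyIndex index) const {
        return index < byIndex.size() ? byIndex[index].c_str() : 0;
    }

private:
    std::unordered_map<std::string, ToyIndex> byName;  // string -> index
    std::deque<std::string> byIndex;                   // index -> string
};
```

       Once words are interned this way, two words can be compared by
       comparing their integer indices instead of their characters, which
       is the space and speed benefit the DESCRIPTION refers to.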

       VocabIndex getIndex(VocabString name)
              Returns the VocabIndex for word name, or Vocab_None if the
              word isn't defined. (Unlike addWord(), this will not extend
              the vocabulary if the word is undefined.)

       void remove(VocabString name)
       void remove(VocabIndex index)
              Deletes a vocabulary item, either by name or by index.

       unsigned int numWords()
              Returns the number of current vocabulary entries.

       VocabIndex highIndex()
              Returns the highest VocabIndex value assigned so far. The next
              word added will receive an index that is one greater. When
              allocating various meaningful vocabulary subsets into
              contiguous ranges, this function can be used to determine the
              corresponding boundaries in VocabIndex space; these values
              can then be used to test subset membership, etc.

       VocabIndex unkIndex
              The index of the unknown word (by default assigned to
              Vocab_Unknown).

       VocabIndex ssIndex
              The index of the sentence-start tag (by default assigned to
              Vocab_SentStart).

       VocabIndex seIndex
              The index of the sentence-end tag (by default assigned to
              Vocab_SentEnd).

       VocabIndex pauseIndex
              The index of the pause tag (by default assigned to
              Vocab_Pause).

       Boolean unkIsWord
              When true, the unknown word is considered a regular word
              (default false).

       Boolean toLower
              When true, all word strings are mapped to lowercase. This is
              convenient for combining vocabularies, language models, etc.,
              whose vocabularies differ only in the case convention
              (default false).

       Boolean isNonEvent(VocabString word)
       Boolean isNonEvent(VocabIndex word)
              Tests a word string or index for being a ``non-event'', i.e.,
              a token that is not assigned probability in a language model.
              By default, sentence-start, pauses, and unknown words are
              non-events.

       unsigned read(File &file)
              Reads word strings from a file and adds them to the
              vocabulary. For convenience, only the first word on each line
              is significant (so extra information could be contained in
              such a file). Returns the number of words read.

       void write(File &file, Boolean sorted = true)
              Writes the vocabulary strings to a file in a format compatible
              with read(). The sorted argument controls whether the output
              is lexicographically sorted.

       Often one wants to manipulate not single vocabulary items, but
       strings of them, e.g., to represent sentences. Word strings are
       represented as self-delimiting arrays of type VocabString * or
       VocabIndex *. The last element in a string is 0 or Vocab_None,
       respectively.

       unsigned getWords(const VocabIndex *wids, VocabString *words,
       unsigned max)
              Extends getWord() to strings of words. The result is placed in
              words, which must have room for at least max words. Returns
              the actual number of indices in wids.

       unsigned addWords(const VocabString *words, VocabIndex *wids,
       unsigned max)
              Extends addWord() to strings of words. The result is placed in
              wids, which must have room for at least max indices. Returns
              the actual number of words in words.

       unsigned getIndices(const VocabString *words, VocabIndex *wids,
       unsigned max)
              Extends getIndex() to strings of words. The result is placed
              in wids, which must have room for at least max indices.
              Returns the actual number of words in words.

FUNCTIONS
       The following static member functions are utilities to manipulate
       strings of vocabulary items, independent of a particular vocabulary.

       unsigned parseWords(char *line, VocabString *words, unsigned max)
              Parses a character string line into whitespace-delimited
              words. On return, words contains pointers to null-terminated
              substrings of line (whose contents are modified in the
              process). words must have room for at least max pointers.
              Returns the actual number of words parsed.

       unsigned length(const VocabIndex *words)
       unsigned length(const VocabString *words)
              Returns the number of items in a word string.

       Boolean contains(const VocabIndex *words, VocabIndex word)
              Returns true if word occurs among words.

       VocabIndex *reverse(VocabIndex *words)
       VocabString *reverse(VocabString *words)
              Reverses a string of words in place (and returns it as a
              result).

       void write(File &file, const VocabString *words)
              Writes a string of space-delimited words to a file.

       int compare(VocabIndex word1, VocabIndex word2)
       int compare(VocabString word1, VocabString word2)
              Compares two vocabulary items lexicographically. Returns -1,
              0, +1 for less than, equal, or greater than, respectively.

       int compare(const VocabIndex *words1, const VocabIndex *words2)
       int compare(const VocabString *words1, const VocabString *words2)
              Extends the order of compare() to strings of words.
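
       The in-place tokenization and the self-delimiting (null-terminated)
       word arrays described for parseWords(), length() and reverse() can
       be sketched as follows. The toy* functions are hypothetical
       stand-ins written for this illustration, not the SRILM source; in
       particular, reserving one of the max slots for the terminating null
       pointer is an assumption of the sketch.

```cpp
#include <cassert>
#include <cctype>

// Split line in place into whitespace-delimited words (hypothetical helper).
// The separator that ends each word is overwritten with '\0', so every
// words[i] points into the caller's buffer. Stores at most max - 1 word
// pointers followed by a null pointer, and returns the number of words.
unsigned toyParseWords(char *line, const char **words, unsigned max) {
    assert(max > 0);                    // need room for the terminator
    unsigned n = 0;
    char *p = line;
    while (n + 1 < max) {
        while (*p && std::isspace(static_cast<unsigned char>(*p))) ++p;
        if (!*p) break;                 // only whitespace was left
        words[n++] = p;                 // start of the next word
        while (*p && !std::isspace(static_cast<unsigned char>(*p))) ++p;
        if (*p) *p++ = '\0';            // cut the word off in place
    }
    words[n] = 0;                       // self-delimiting array
    return n;
}

// Count the items before the terminating null pointer.
unsigned toyLength(const char *const *words) {
    unsigned n = 0;
    while (words[n] != 0) ++n;
    return n;
}

// Reverse a word string in place and return it.
const char **toyReverse(const char **words) {
    unsigned n = toyLength(words);
    for (unsigned i = 0; i < n / 2; ++i) {
        const char *tmp = words[i];
        words[i] = words[n - 1 - i];
        words[n - 1 - i] = tmp;
    }
    return words;
}
```

       Because the word pointers alias the input buffer, the buffer must be
       writable and must outlive the words array; a string literal cannot
       be passed as line.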
              For compatibility with the C library calling conventions,
              compare() cannot be a member function of a Vocab object. For
              index-based comparisons the associated vocabulary needs to be
              set globally. This is achieved by calling the compareIndex()
              member function of a Vocab object.

       ostream &operator<< (ostream &, const VocabString *words)
       ostream &operator<< (ostream &, const VocabIndex *words)
              These operators output strings of words to a stream. For the
              second variant, the Vocab object used for interpreting
              indices needs to be identified globally by calling the use()
              member function on the object.

ITERATORS
       The VocabIter class provides iteration over vocabularies. An
       iteration returns the elements of a Vocab in some unspecified, but
       deterministic order.

       When copied or used in initialization of other objects, VocabIter
       objects retain the current ``position'' in an iteration. This allows
       nested iterations that enumerate all pairs of distinct elements, etc.

       NOTE: While an iteration over a Vocab object is ongoing, no
       modifications are allowed to the object, except removal of the
       ``current'' vocabulary item.

       VocabIter(Vocab &vocab, Boolean sorted = false)
              Creates an iteration over vocab. If sorted is set to true, the
              vocabulary items will be enumerated in lexicographic order.

       void init()
              Reinitializes the iteration to its beginning.

       VocabString next()
       VocabString next(VocabIndex &index)
              Steps the iteration and returns the next word string.
              Optionally, the associated word index is returned in index.
              Returns 0 if the vocabulary is exhausted.

SEE ALSO
       LM(3), File(3)

BUGS
       There is no good way to synchronize VocabIndex values across
       multiple Vocab objects.

AUTHOR
       Andreas Stolcke <stolcke@speech.sri.com>.
       Copyright 1995, 1996 SRI International

SRILM                     $Date: 1996/07/13 01:35:40 $                Vocab(3)