ngram-format(5)                                          ngram-format(5)

NAME
       ngram-format - File format for ARPA backoff N-gram models

SYNOPSIS
       \data\
       ngram 1=n1
       ngram 2=n2
       ...
       ngram N=nN

       \1-grams:
       p       w               [bow]
       ...

       \2-grams:
       p       w1 w2           [bow]
       ...

       \N-grams:
       p       w1 ... wN
       ...

       \end\

DESCRIPTION
       The so-called ARPA (or Doug Paul) format for N-gram backoff models
       starts with a header, introduced by the keyword \data\, listing the
       number of N-grams of each length. Following that, N-grams are listed
       one per line, grouped into sections by length, each section starting
       with the keyword \N-grams:, where N is the length of the N-grams to
       follow. Each N-gram line starts with the logarithm (base 10) of the
       conditional probability p of that N-gram, followed by the words
       w1...wN making up the N-gram. These are optionally followed by the
       logarithm (base 10) of the backoff weight for the N-gram. The keyword
       \end\ concludes the model representation.

       Backoff weights are required only for those N-grams that form a
       prefix of longer N-grams in the model. The highest-order N-grams in
       particular will not need backoff weights (they would be useless).

       Since log(0) (minus infinity) has no portable representation, such
       values are mapped to a large negative number. However, the designated
       dummy value (-99 in SRILM) is interpreted as log(0) when read back
       from file into memory.

       The correctness of the N-gram counts n1, n2, ... in the header is not
       enforced by SRILM software when reading models (although a warning is
       printed when an inconsistency is encountered). This allows easy
       textual insertion or deletion of parameters in a model file. The
       proper format can be recovered by passing the model through the
       command

              ngram -order N -lm input -write-lm output

       Note that the format is self-delimiting, allowing multiple models to
       be stored in one file, or to be surrounded by ancillary information.

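       The layout described above can be illustrated with a minimal reader
       sketch in Python. It is not the SRILM implementation: the names
       parse_arpa, LOG_ZERO and SAMPLE are hypothetical, and the sketch
       assumes whitespace-separated fields, a single model per input, and
       that only the probability field uses the -99 dummy value.

```python
import math
import re

LOG_ZERO = -99.0  # dummy value the man page gives for log10(0) in SRILM


def parse_arpa(text):
    """Parse the first ARPA model found in text (illustrative sketch).

    Returns (declared, ngrams): declared maps N -> count from the header,
    ngrams maps N -> {word tuple: (log10 prob, log10 backoff or None)}.
    """
    declared, ngrams = {}, {}
    order = None       # N of the section currently being read
    in_model = False   # True once \data\ has been seen
    for raw in text.splitlines():
        line = raw.strip()
        if not in_model:
            # Text before \data\ is ancillary information and is skipped.
            in_model = line == "\\data\\"
            continue
        if line == "\\end\\":
            break      # the format is self-delimiting
        if not line:
            continue
        header = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
        section = re.fullmatch(r"\\(\d+)-grams:", line)
        if order is None and header:
            declared[int(header.group(1))] = int(header.group(2))
        elif section:
            order = int(section.group(1))
            ngrams[order] = {}
        elif order is not None:
            fields = line.split()
            logprob = float(fields[0])
            words = tuple(fields[1:1 + order])
            # One extra trailing field is the optional backoff weight.
            bow = float(fields[1 + order]) if len(fields) > 1 + order else None
            if logprob == LOG_ZERO:
                logprob = -math.inf  # read the dummy value back as log(0)
            ngrams[order][words] = (logprob, bow)
    return declared, ngrams


# Made-up toy model, surrounded by ancillary text.
SAMPLE = r"""
some ancillary text
\data\
ngram 1=3
ngram 2=2

\1-grams:
-99 <s> -0.30103
-0.47712 a -0.30103
-0.47712 </s>

\2-grams:
-0.30103 <s> a
-0.30103 a </s>

\end\
"""

declared, ngrams = parse_arpa(SAMPLE)
for n, table in ngrams.items():
    # Like the behaviour described above, a mismatch only warns.
    if declared.get(n) != len(table):
        print(f"warning: header declares {declared.get(n)} {n}-grams, "
              f"found {len(table)}")
```

       In this sketch a count mismatch produces a warning rather than an
       error, so a hand-edited file still loads. The highest-order entries
       come back with a backoff weight of None.
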
       Some extensions of N-gram models in SRILM store additional parameters
       after a basic N-gram section in the standard format.

SEE ALSO
       ngram(1), ngram-count(1), lm-scripts(1), pfsg-scripts(1).

BUGS
       The ARPA format does not allow N-grams that have only a backoff
       weight associated with them, but no conditional probability. This
       makes the format less general than would otherwise be useful (e.g.,
       to support pruned models, or ones containing a mix of words and
       classes). The ngram-count(1) tool satisfies this constraint by
       inserting dummy probabilities where necessary.

       For simplicity, an N-gram model containing N-grams up to length N is
       referred to in the SRILM programs as an N-th order model, although
       technically it represents a Markov model of order N-1.

       There is no way to specify words with embedded whitespace.

AUTHOR
       The ARPA backoff format was developed by Doug Paul at MIT Lincoln
       Labs for research sponsored by the U.S. Department of Defense
       Advanced Research Projects Agency (ARPA).
       Man page by Andreas Stolcke <stolcke@speech.sri.com>.
       Copyright 1999, 2004 SRI International

SRILM File Formats       $Date: 2004/02/27 03:33:40 $       ngram-format(5)