📄 ngram-format.html
字号:
<! $Id: ngram-format.5,v 1.4 2004/02/27 03:33:40 stolcke Exp $><HTML><HEADER><TITLE>ngram-format</TITLE><BODY><H1>ngram-format</H1><H2> NAME </H2>ngram-format - File format for ARPA backoff N-gram models<H2> SYNOPSIS </H2><BR><B>\data\</B><BR><B>ngram 1=</B><I>n1</I><BR><B>ngram 2=</B><I>n2</I><BR>...<BR><B>ngram</B> <I>N</I><B>=</B><I>nN</I><BR><B>\1-grams:</B><BR><I>p</I> <I>w</I> [<I>bow</I>]<BR>...<BR><B>\2-grams:</B><BR><I>p</I> <I>w1 w2</I> [<I>bow</I>]<BR>...<BR><B>\</B><I>N</I><B>-grams:</B><BR><I>p</I> <I>w1</I> ... <I>wN</I><BR>...<BR><B>\end\</B><H2> DESCRIPTION </H2>The so-called ARPA (or Doug Paul) format for N-gram backoff modelsstarts with a header, introduced by the keyword <B>\data\</B>,listing the number of N-grams of each length.Following that, N-grams are listed one per line, grouped into sectionsby length, each section starting with the keyword <B>\</B><I>N</I><B>-gram:</B>,where<I> N </I>is the length of the N-grams to follow.Each N-gram line starts with the logarithm (base 10) of conditional probability <I> p </I>of that N-gram, followed by the words<I>w1</I>...<I>wN</I>making up the N-gram.These are optionally followed by the logarithm (base 10) of thebackoff weight for the N-gram.The keyword <B>\end\</B>concludes the model representation.<P>Backoff weights are required only for those N-gramsthat form a prefix of longer N-grams in the model.The highest-order N-grams in particular will not need backoff weights(they would be useless).<P>Since log(0) (minus infinity) has no portable representation, such valuesare mapped to a large negative number.However, the designated dummy value (-99 in SRILM) is interpreted as log(0)when read back from file into memory.<P>The correctness of the N-gram counts <I>n1</I>,<I></I><I>n2</I>,<I></I>... in the header is not enforced by SRILM software when reading models (although a warning is printed when an inconsistency is encountered).This allows easy textual insertion or deletion of parameters in a model file.The proper format can be recovered by passsing the model throughthe command<BR> ngram -order <I>N</I> -lm <I>input</I> -write-lm <I>output</I><P>Note that the format is self-delimiting, allowing multiple models tobe stored in one file, or to be surrounded by ancillary information.Some extensions of N-gram models in SRILM store additional parameters after a basic N-gram section in the standard format.<H2> SEE ALSO </H2><A HREF="ngram.html">ngram(1)</A>, <A HREF="ngram-count.html">ngram-count(1)</A>, <A HREF="lm-scripts.html">lm-scripts(1)</A>, <A HREF="pfsg-scripts.html">pfsg-scripts(1)</A>.<H2> BUGS </H2>The ARPA format does not allow N-grams that have only a backoff weightassociated with them, but no conditional probability.This makes the format less general than would otherwise be useful(e.g., to support pruned models, or ones containing a mix of words andclasses). The<A HREF="ngram-count.html">ngram-count(1)</A>tool satisfies this constraint by inserting dummy probabilities wherenecessary.<P>For simplicity, an N-gram model containing N-grams up to length<I> N </I>is referred to in the SRILM programs as an <I>N</I>-th<I></I>order model, although techncally it represents a Markov model of order <I>N</I>-1.<I></I><H2> BUGS </H2>There is no way to specify words with embedded whitespace.<H2> AUTHOR </H2>The ARPA backoff format was developed by Doug Paul at MIT Lincoln Labsfor research sponsored by the U.S. Department of DefenseAdvanced Research Project Agency (ARPA).<BR>Man page by Andreas Stolcke <stolcke@speech.sri.com>.<BR>Copyright 1999, 2004 SRI International</BODY></HTML>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -