📄 NGramProcessLM.java
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.lm;

import com.aliasi.corpus.TextHandler;

import com.aliasi.io.BitInput;
import com.aliasi.io.BitOutput;

import com.aliasi.stats.Model;

import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.Strings;

import java.io.Externalizable;
import java.io.InputStream;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

import java.util.LinkedList;

/**
 * An <code>NGramProcessLM</code> provides a dynamic conditional
 * process language model for which training, estimation, and
 * pruning may be interleaved.  A process language model normalizes
 * probabilities for a given length of input.
 *
 * <P>The model may be compiled to an object output stream; the
 * compiled model read back in will be an instance of {@link
 * CompiledNGramProcessLM}.
 *
 * <P>This class implements a generative language model based on the
 * chain rule, as specified by {@link LanguageModel.Conditional}.
 * The maximum likelihood estimator (see {@link CharSeqCounter})
 * is smoothed by linear interpolation with the next lower-order
 * context model:
 *
 * <blockquote><code>
 * P'(c<sub><sub>k</sub></sub>|c<sub><sub>j</sub></sub>,...,c<sub><sub>k-1</sub></sub>)
 * <br>
 * = lambda(c<sub><sub>j</sub></sub>,...,c<sub><sub>k-1</sub></sub>)
 *   * P<sub><sub>ML</sub></sub>(c<sub><sub>k</sub></sub>|c<sub><sub>j</sub></sub>,...,c<sub><sub>k-1</sub></sub>)
 * <br>
 * + (1-lambda(c<sub><sub>j</sub></sub>,...,c<sub><sub>k-1</sub></sub>))
 *   * P'(c<sub><sub>k</sub></sub>|c<sub><sub>j+1</sub></sub>,...,c<sub><sub>k-1</sub></sub>)
 * </code></blockquote>
 *
 * <p>The <code>P<sub><sub>ML</sub></sub></code> terms in the above
 * definition are maximum likelihood estimates based on frequency:
 *
 * <blockquote><pre>
 * P<sub><sub>ML</sub></sub>(c<sub><sub>k</sub></sub>|c<sub><sub>j</sub></sub>,...,c<sub><sub>k-1</sub></sub>)
 * = count(c<sub><sub>j</sub></sub>,...,c<sub><sub>k-1</sub></sub>,c<sub><sub>k</sub></sub>)
 * / extCount(c<sub><sub>j</sub></sub>,...,c<sub><sub>k-1</sub></sub>)</pre></blockquote>
 *
 * The <code>count</code> is just the number of times a given string
 * has been seen in the data, whereas <code>extCount</code> is the
 * number of times an extension to the string has been seen in the
 * data:
 *
 * <blockquote><pre>
 * extCount(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>)
 * = <big><big>Σ</big></big><sub><sub>d</sub></sub> count(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>,d)</pre></blockquote>
 *
 * <p>In the parametric Witten-Bell method, the interpolation ratio
 * <code>lambda</code> is defined based on extensions of the context
 * of estimation to be:
 *
 * <blockquote><code>
 * lambda(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>)
 * <br> = extCount(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>)
 * <br> / (extCount(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>)
 *        + L * numExtensions(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>))
 * </code></blockquote>
 *
 * where
 * <code>c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub></code>
 * is the conditioning context for estimation, <code>extCount</code>
 * is as defined above, and <code>numExtensions</code> is the number
 * of distinct single-character extensions of a context:
 *
 * <blockquote><pre>
 * numExtensions(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>)
 * = cardinality( { d | count(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>,d) > 0 } )</pre></blockquote>
 *
 * and where <code>L</code> is a hyperparameter of the distribution
 * (described below).
 *
 * <p>As a base case, <code>P(c<sub><sub>k</sub></sub>)</code> is
 * interpolated with the uniform distribution
 * <code>P<sub><sub>U</sub></sub></code>, with interpolation defined
 * as usual with the argument to <code>lambda</code> being the
 * empty (i.e. zero length) sequence:
 *
 * <blockquote><pre>
 * P(d) = lambda() * P<sub><sub>ML</sub></sub>(d)
 *      + (1-lambda()) * P<sub><sub>U</sub></sub>(d)</pre></blockquote>
 *
 * The uniform distribution <code>P<sub><sub>U</sub></sub></code>
 * depends only on the number of possible characters used in training
 * and test data:
 *
 * <blockquote><pre>
 * P<sub><sub>U</sub></sub>(c) = 1/alphabetSize</pre></blockquote>
 *
 * where <code>alphabetSize</code> is the maximum number of distinct
 * characters in this model.
 *
 * <P>The free hyperparameter <code>L</code> in the smoothing equation
 * determines the balance between higher-order and lower-order models.
 * A higher value for <code>L</code> gives more of the weight to
 * lower-order contexts.  As the amount of data grows against a fixed
 * alphabet of characters, the impact of <code>L</code> is reduced.
 * In Witten and Bell's original paper, the hyperparameter
 * <code>L</code> was set to 1.0, which is not a particularly good
 * choice for most text sources.  A value of the lambda factor that is
 * roughly the length of the longest n-gram seems to be a good rule of
 * thumb.
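 *
 * <p>Putting the above definitions together, the full recursive
 * estimate may be rendered as the following sketch.  This is
 * illustrative only: <code>pML</code>, <code>extCount</code>,
 * <code>numExtensions</code>, <code>L</code>, and
 * <code>ALPHABET_SIZE</code> stand in for the definitions above and
 * are not methods or fields of this class:
 *
 * <blockquote><pre>
 * // Sketch: Witten-Bell interpolated estimate P'(c|ctx).
 * // Assumes a trained model, so counts are non-zero.
 * static double estimate(String ctx, char c) {
 *     double ext = extCount(ctx);
 *     double lambda = ext / (ext + L * numExtensions(ctx));
 *     double backoff = (ctx.length() == 0)
 *         ? 1.0 / ALPHABET_SIZE              // uniform base case
 *         : estimate(ctx.substring(1), c);   // drop left-most char
 *     return lambda * pML(ctx, c) + (1.0 - lambda) * backoff;
 * }</pre></blockquote>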
 *
 * <P>Methods are provided for computing a sample cross-entropy rate
 * for a character sequence.  The sample cross-entropy
 * <code>H(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>;P<sub><sub>M</sub></sub>)</code>
 * for the sequence
 * <code>c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub></code> in
 * probability model <code>P<sub><sub>M</sub></sub></code> is defined
 * to be the negative average log (base 2) probability of the
 * characters in the sequence according to the model.  In symbols:
 *
 * <blockquote><code>
 * H(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>;P<sub><sub>M</sub></sub>)
 * = (-log<sub><sub>2</sub></sub> P<sub><sub>M</sub></sub>(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>))/n
 * </code></blockquote>
 *
 * The cross-entropy rate of a distribution <code>P'</code>
 * with respect to a distribution <code>P</code> is defined by:
 *
 * <blockquote><code>
 * H(P',P)
 * = - <big><big><big>Σ</big></big></big><sub><sub>x</sub></sub>
 * P(x) * log<sub><sub>2</sub></sub> P'(x)
 * </code></blockquote>
 *
 * The Shannon-McMillan-Breiman theorem shows that as the length of
 * the sample drawn from the true distribution <code>P</code> grows,
 * the sample cross-entropy rate approaches the actual cross-entropy
 * rate.  In symbols:
 *
 * <blockquote><code>
 * H(P,P<sub><sub>M</sub></sub>)
 * = lim<sub><sub>n->infinity</sub></sub>
 * H(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>;P<sub><sub>M</sub></sub>)
 * </code></blockquote>
 *
 * The entropy of a distribution <code>P</code> is defined by its
 * cross-entropy against itself, <code>H(P,P)</code>.  A
 * distribution's entropy is a lower bound on its cross-entropy; in
 * symbols, <code>H(P',P) >= H(P,P)</code> for all distributions
 * <code>P'</code>.
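 *
 * <p>For example, the sample cross-entropy rate of a test sequence
 * may be computed directly from the joint log (base 2) estimate
 * (a usage sketch; the test string is hypothetical):
 *
 * <blockquote><pre>
 * NGramProcessLM lm = ...;
 * String test = "hello world";
 * // log2Estimate(test) returns log2 PM(test); negate and
 * // average over the number of characters
 * double sampleXEntropyRate
 *     = -lm.log2Estimate(test) / test.length();</pre></blockquote>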
 *
 * <h3>Pruning</h3>
 *
 * <P>Models may be pruned by pruning the underlying substring
 * counter for the language model.  This counter is returned by
 * the method {@link #substringCounter()}.  See the class
 * documentation for the return type {@link TrieCharSeqCounter}
 * for more information.
 *
 * <h3>Serialization</h3>
 *
 * <p>Models may be serialized in the usual way by creating an object
 * output object and writing the object:
 *
 * <blockquote><pre>
 * NGramProcessLM lm = ...;
 * ObjectOutput out = ...;
 * out.writeObject(lm);</pre></blockquote>
 *
 * Reading just reverses the process:
 *
 * <blockquote><pre>
 * ObjectInput in = ...;
 * NGramProcessLM lm = (NGramProcessLM) in.readObject();</pre></blockquote>
 *
 * Serialization is based on the methods {@link #writeTo(OutputStream)}
 * and {@link #readFrom(InputStream)}.  These write compressed forms of
 * the model to streams in binary format.
 *
 * <p><b>Warning:</b> The object input and output used for
 * serialization must extend {@link InputStream} and {@link
 * OutputStream}.  The only implementations of {@link ObjectInput} and
 * {@link ObjectOutput} as of the 1.6 JDK do extend the streams, so
 * this will only be a problem with customized object input or output
 * objects.  If you need this method to work with custom input and
 * output objects that do not extend the corresponding streams, drop
 * us a line and we can perhaps refactor the output methods to remove
 * this restriction.
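 *
 * <p>A complete round trip through the stream-based methods looks
 * like the following sketch.  It assumes {@link #readFrom(InputStream)}
 * is the static factory counterpart of {@link #writeTo(OutputStream)},
 * and the training string is hypothetical; training here goes through
 * the {@link TextHandler} method <code>handle(char[],int,int)</code>:
 *
 * <blockquote><pre>
 * NGramProcessLM lm = new NGramProcessLM(5);
 * char[] cs = "sample training text".toCharArray();
 * lm.handle(cs,0,cs.length);   // train on the character sequence
 *
 * ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
 * lm.writeTo(bytesOut);        // compressed binary format
 *
 * ByteArrayInputStream bytesIn
 *     = new ByteArrayInputStream(bytesOut.toByteArray());
 * NGramProcessLM lm2 = NGramProcessLM.readFrom(bytesIn);
 * // lm2 estimates match lm's for any sequence</pre></blockquote>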
 *
 * <h3>References</h3>
 *
 * <P>For information on the Witten-Bell interpolation method, see:
 * <ul>
 * <li>
 * Witten, Ian H. and Timothy C. Bell. 1991. The zero-frequency
 * problem: estimating the probabilities of novel events in adaptive
 * text compression. <i>IEEE Transactions on Information Theory</i>
 * <b>37</b>(4).
 * </ul>
 *
 * @author  Bob Carpenter
 * @version 3.5.1
 * @since   LingPipe2.0
 */
public class NGramProcessLM
    implements Model<CharSequence>,
               LanguageModel.Process,
               LanguageModel.Conditional,
               LanguageModel.Dynamic,
               TextHandler,
               Serializable {

    private final TrieCharSeqCounter mTrieCharSeqCounter;
    private final int mMaxNGram;
    private double mLambdaFactor;
    private int mNumChars;
    private double mUniformEstimate;
    private double mLog2UniformEstimate;

    /**
     * Constructs an n-gram process language model with the specified
     * maximum n-gram length.  The number of characters is set to the
     * default of {@link Character#MAX_VALUE} and the interpolation
     * parameter is set to the default of being equal to the n-gram
     * length.
     *
     * @param maxNGram Maximum length n-gram for which counts are
     * stored.
     */
    public NGramProcessLM(int maxNGram) {
        this(maxNGram,Character.MAX_VALUE);
    }

    /**
     * Constructs an n-gram language process with the specified
     * maximum n-gram length and maximum number of characters.  The
     * interpolation hyperparameter will be set to the same value as
     * the maximum n-gram length.
     *
     * @param maxNGram Maximum length n-gram for which counts are
     * stored.
     * @param numChars Maximum number of characters in training data.
     * @throws IllegalArgumentException If the maximum n-gram is
     * less than 1 or if the number of characters is not between 1 and
     * the maximum number of characters.
     */
    public NGramProcessLM(int maxNGram, int numChars) {
        this(maxNGram,numChars,maxNGram);
    }

    /**
     * Constructs an n-gram language process with the specified
     * maximum n-gram length, number of characters, and interpolation
     * ratio hyperparameter.
     *
     * @param maxNGram Maximum length n-gram for which counts are
     * stored.
     * @param numChars Maximum number of characters in training data.
     * @param lambdaFactor Central value of interpolation
     * hyperparameter explored.
     * @throws IllegalArgumentException If the maximum n-gram is
     * less than 1, the number of characters is not between 1 and
     * the maximum number of characters, or if the lambda factor
     * is not greater than or equal to 0.
     */
    public NGramProcessLM(int maxNGram, int numChars, double lambdaFactor) {
        this(numChars,lambdaFactor,new TrieCharSeqCounter(maxNGram));
    }

    /**