📄 NGramProcessLM.java
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.lm;

import com.aliasi.corpus.TextHandler;

import com.aliasi.io.BitInput;
import com.aliasi.io.BitOutput;

import com.aliasi.stats.Model;

import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.Strings;

import java.io.Externalizable;
import java.io.InputStream;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

import java.util.LinkedList;

/**
 * An <code>NGramProcessLM</code> provides a dynamic conditional
 * process language model for which training, estimation, and
 * pruning may be interleaved.  A process language model normalizes
 * probabilities for a given length of input.
 *
 * <P>The model may be compiled to an object output stream; the
 * compiled model read back in will be an instance of {@link
 * CompiledNGramProcessLM}.
 *
 * <P>This class implements a generative language model based on the
 * chain rule, as specified by {@link LanguageModel.Conditional}.
 * The maximum likelihood estimator (see {@link CharSeqCounter})
 * is smoothed by linear interpolation with the next lower-order
 * context model:
 *
 * <blockquote><code>
 * P'(c<sub><sub>k</sub></sub>|c<sub><sub>j</sub></sub>,...,c<sub><sub>k-1</sub></sub>)
 * <br>
 * = lambda(c<sub><sub>j</sub></sub>,...,c<sub><sub>k-1</sub></sub>)
 *   * P<sub><sub>ML</sub></sub>(c<sub><sub>k</sub></sub>|c<sub><sub>j</sub></sub>,...,c<sub><sub>k-1</sub></sub>)
 * <br>
 * + (1-lambda(c<sub><sub>j</sub></sub>,...,c<sub><sub>k-1</sub></sub>))
 *   * P'(c<sub><sub>k</sub></sub>|c<sub><sub>j+1</sub></sub>,...,c<sub><sub>k-1</sub></sub>)
 * </code></blockquote>
 *
 * <p>The <code>P<sub><sub>ML</sub></sub></code> terms in the above
 * definition are maximum likelihood estimates based on frequency:
 *
 * <blockquote><pre>
 * P<sub><sub>ML</sub></sub>(c<sub><sub>k</sub></sub>|c<sub><sub>j</sub></sub>,...,c<sub><sub>k-1</sub></sub>)
 * = count(c<sub><sub>j</sub></sub>,...,c<sub><sub>k-1</sub></sub>,c<sub><sub>k</sub></sub>)
 * / extCount(c<sub><sub>j</sub></sub>,...,c<sub><sub>k-1</sub></sub>)</pre></blockquote>
 *
 * The <code>count</code> is just the number of times a given string
 * has been seen in the data, whereas <code>extCount</code> is the
 * number of times an extension to the string has been seen in the
 * data:
 *
 * <blockquote><pre>
 * extCount(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>)
 * = <big><big>Σ</big></big><sub><sub>d</sub></sub> count(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>,d)</pre></blockquote>
 *
 * <p>In the parametric Witten-Bell method, the interpolation ratio
 * <code>lambda</code> is defined based on extensions of the context
 * of estimation to be:
 *
 * <blockquote><code>
 * lambda(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>)
 * <br> = extCount(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>)
 * <br> / (extCount(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>)
 *        + L * numExtensions(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>))
 * </code></blockquote>
 *
 * where
 * <code>c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub></code>
 * is the conditioning context for estimation, <code>extCount</code>
 * is as defined above, and <code>numExtensions</code> is the number
 * of distinct single-character extensions of a context:
 *
 * <blockquote><pre>
 * numExtensions(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>)
 * = cardinality( { d | count(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>,d) > 0 } )</pre></blockquote>
 *
 * and where <code>L</code> is a hyperparameter of the distribution
 * (described below).
 *
 * <p>As a base case, <code>P(c<sub><sub>k</sub></sub>)</code> is
 * interpolated with the uniform distribution
 * <code>P<sub><sub>U</sub></sub></code>, with interpolation defined
 * as usual with the argument to <code>lambda</code> being the
 * empty (i.e. zero length) sequence:
 *
 * <blockquote><pre>
 * P(d) = lambda() * P<sub><sub>ML</sub></sub>(d)
 *      + (1-lambda()) * P<sub><sub>U</sub></sub>(d)</pre></blockquote>
 *
 * The uniform distribution <code>P<sub><sub>U</sub></sub></code>
 * depends only on the number of possible characters used in training
 * and test data:
 *
 * <blockquote><pre>
 * P<sub><sub>U</sub></sub>(c) = 1/alphabetSize</pre></blockquote>
 *
 * where <code>alphabetSize</code> is the maximum number of distinct
 * characters in this model.
 *
 * <P>The free hyperparameter <code>L</code> in the smoothing equation
 * determines the balance between higher-order and lower-order models.
 * A higher value for <code>L</code> gives more of the weight to
 * lower-order contexts.  As the amount of data grows against a fixed
 * alphabet of characters, the impact of <code>L</code> is reduced.
 * In Witten and Bell's original paper, the hyperparameter
 * <code>L</code> was set to 1.0, which is not a particularly good
 * choice for most text sources.  A value of the lambda factor that is
 * roughly the length of the longest n-gram seems to be a good rule of
 * thumb.
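 *
 * <p>Putting the above definitions together, the full recursive
 * estimate may be rendered as the following sketch.  This is
 * illustrative only: <code>pML</code>, <code>extCount</code>,
 * <code>numExtensions</code>, <code>L</code>, and
 * <code>ALPHABET_SIZE</code> stand in for the definitions above and
 * are not methods or fields of this class:
 *
 * <blockquote><pre>
 * // Sketch: Witten-Bell interpolated estimate P'(c|ctx).
 * // Assumes a trained model, so counts are non-zero.
 * static double estimate(String ctx, char c) {
 *     double ext = extCount(ctx);
 *     double lambda = ext / (ext + L * numExtensions(ctx));
 *     double backoff = (ctx.length() == 0)
 *         ? 1.0 / ALPHABET_SIZE              // uniform base case
 *         : estimate(ctx.substring(1), c);   // drop left-most char
 *     return lambda * pML(ctx, c) + (1.0 - lambda) * backoff;
 * }</pre></blockquote>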
 *
 * <P>Methods are provided for computing a sample cross-entropy rate
 * for a character sequence.  The sample cross-entropy
 * <code>H(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>;P<sub><sub>M</sub></sub>)</code>
 * for the sequence
 * <code>c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub></code> in
 * probability model <code>P<sub><sub>M</sub></sub></code> is defined
 * to be the negative average log (base 2) probability of the
 * characters in the sequence according to the model.  In symbols:
 *
 * <blockquote><code>
 * H(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>;P<sub><sub>M</sub></sub>)
 * = (-log<sub><sub>2</sub></sub> P<sub><sub>M</sub></sub>(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>))/n
 * </code></blockquote>
 *
 * The cross-entropy rate of a distribution <code>P'</code>
 * with respect to a distribution <code>P</code> is defined by:
 *
 * <blockquote><code>
 * H(P',P)
 * = - <big><big><big>Σ</big></big></big><sub><sub>x</sub></sub>
 * P(x) * log<sub><sub>2</sub></sub> P'(x)
 * </code></blockquote>
 *
 * The Shannon-McMillan-Breiman theorem shows that as the length of
 * the sample drawn from the true distribution <code>P</code> grows,
 * the sample cross-entropy rate approaches the actual cross-entropy
 * rate.  In symbols:
 *
 * <blockquote><code>
 * H(P,P<sub><sub>M</sub></sub>)
 * = lim<sub><sub>n->infinity</sub></sub>
 * H(c<sub><sub>1</sub></sub>,...,c<sub><sub>n</sub></sub>;P<sub><sub>M</sub></sub>)
 * </code></blockquote>
 *
 * The entropy of a distribution <code>P</code> is defined by its
 * cross-entropy against itself, <code>H(P,P)</code>.  A
 * distribution's entropy is a lower bound on its cross-entropy; in
 * symbols, <code>H(P',P) >= H(P,P)</code> for all distributions
 * <code>P'</code>.
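 *
 * <p>For example, the sample cross-entropy rate of a test sequence
 * may be computed directly from the joint log (base 2) estimate
 * (a usage sketch; the test string is hypothetical):
 *
 * <blockquote><pre>
 * NGramProcessLM lm = ...;
 * String test = "hello world";
 * // log2Estimate(test) returns log2 PM(test); negate and
 * // average over the number of characters
 * double sampleXEntropyRate
 *     = -lm.log2Estimate(test) / test.length();</pre></blockquote>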
 *
 * <h3>Pruning</h3>
 *
 * <P>Models may be pruned by pruning the underlying substring
 * counter for the language model.  This counter is returned by
 * the method {@link #substringCounter()}.  See the class
 * documentation for the return type {@link TrieCharSeqCounter}
 * for more information.
 *
 * <h3>Serialization</h3>
 *
 * <p>Models may be serialized in the usual way by creating an object
 * output object and writing the object:
 *
 * <blockquote><pre>
 * NGramProcessLM lm = ...;
 * ObjectOutput out = ...;
 * out.writeObject(lm);</pre></blockquote>
 *
 * Reading just reverses the process:
 *
 * <blockquote><pre>
 * ObjectInput in = ...;
 * NGramProcessLM lm = (NGramProcessLM) in.readObject();</pre></blockquote>
 *
 * Serialization is based on the methods {@link #writeTo(OutputStream)}
 * and {@link #readFrom(InputStream)}.  These write compressed forms of
 * the model to streams in binary format.
 *
 * <p><b>Warning:</b> The object input and output used for
 * serialization must extend {@link InputStream} and {@link
 * OutputStream}.  The only implementations of {@link ObjectInput} and
 * {@link ObjectOutput} as of the 1.6 JDK do extend the streams, so
 * this will only be a problem with customized object input or output
 * objects.  If you need this method to work with custom input and
 * output objects that do not extend the corresponding streams, drop
 * us a line and we can perhaps refactor the output methods to remove
 * this restriction.
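 *
 * <p>A complete round trip through the stream-based methods looks
 * like the following sketch.  It assumes {@link #readFrom(InputStream)}
 * is the static factory counterpart of {@link #writeTo(OutputStream)},
 * and the training string is hypothetical; training here goes through
 * the {@link TextHandler} method <code>handle(char[],int,int)</code>:
 *
 * <blockquote><pre>
 * NGramProcessLM lm = new NGramProcessLM(5);
 * char[] cs = "sample training text".toCharArray();
 * lm.handle(cs,0,cs.length);   // train on the character sequence
 *
 * ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
 * lm.writeTo(bytesOut);        // compressed binary format
 *
 * ByteArrayInputStream bytesIn
 *     = new ByteArrayInputStream(bytesOut.toByteArray());
 * NGramProcessLM lm2 = NGramProcessLM.readFrom(bytesIn);
 * // lm2 estimates match lm's for any sequence</pre></blockquote>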
 *
 * <h3>References</h3>
 *
 * <P>For information on the Witten-Bell interpolation method, see:
 * <ul>
 * <li>
 * Witten, Ian H. and Timothy C. Bell. 1991. The zero-frequency
 * problem: estimating the probabilities of novel events in adaptive
 * text compression. <i>IEEE Transactions on Information Theory</i>
 * <b>37</b>(4).
 * </ul>
 *
 * @author  Bob Carpenter
 * @version 3.5.1
 * @since   LingPipe2.0
 */
public class NGramProcessLM
    implements Model<CharSequence>,
               LanguageModel.Process,
               LanguageModel.Conditional,
               LanguageModel.Dynamic,
               TextHandler,
               Serializable {

    private final TrieCharSeqCounter mTrieCharSeqCounter;
    private final int mMaxNGram;
    private double mLambdaFactor;
    private int mNumChars;
    private double mUniformEstimate;
    private double mLog2UniformEstimate;

    /**
     * Constructs an n-gram process language model with the specified
     * maximum n-gram length.  The number of characters is set to the
     * default of {@link Character#MAX_VALUE} and the interpolation
     * parameter is set to the default of being equal to the n-gram
     * length.
     *
     * @param maxNGram Maximum length n-gram for which counts are
     * stored.
     */
    public NGramProcessLM(int maxNGram) {
        this(maxNGram,Character.MAX_VALUE);
    }

    /**
     * Constructs an n-gram language process with the specified
     * maximum n-gram length and maximum number of characters.  The
     * interpolation hyperparameter will be set to the same value as
     * the maximum n-gram length.
     *
     * @param maxNGram Maximum length n-gram for which counts are
     * stored.
     * @param numChars Maximum number of characters in training data.
     * @throws IllegalArgumentException If the maximum n-gram is
     * less than 1 or if the number of characters is not between 1 and
     * the maximum number of characters.
     */
    public NGramProcessLM(int maxNGram, int numChars) {
        this(maxNGram,numChars,maxNGram);
    }

    /**
     * Constructs an n-gram language process with the specified
     * maximum n-gram length, number of characters, and interpolation
     * ratio hyperparameter.
     *
     * @param maxNGram Maximum length n-gram for which counts are
     * stored.
     * @param numChars Maximum number of characters in training data.
     * @param lambdaFactor Central value of interpolation
     * hyperparameter explored.
     * @throws IllegalArgumentException If the maximum n-gram is
     * less than 1, the number of characters is not between 1 and
     * the maximum number of characters, or if the lambda factor
     * is not greater than or equal to 0.
     */
    public NGramProcessLM(int maxNGram, int numChars, double lambdaFactor) {
        this(numChars,lambdaFactor,new TrieCharSeqCounter(maxNGram));
    }

    /**