📄 lmclassifier.java

📁 一个自然语言处理的Java开源工具包。LingPipe目前已有很丰富的功能
💻 JAVA
字号:
/* * LingPipe v. 3.5 * Copyright (C) 2003-2008 Alias-i * * This program is licensed under the Alias-i Royalty Free License * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i * Royalty Free License Version 1 for more details. * * You should have received a copy of the Alias-i Royalty Free License * Version 1 along with this program; if not, visit * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211, * +1 (718) 290-9170. */package com.aliasi.classify;import com.aliasi.stats.MultivariateDistribution;import com.aliasi.lm.LanguageModel;import com.aliasi.util.Scored;import com.aliasi.util.ScoredObject;import com.aliasi.util.Strings;import java.util.Arrays;import java.util.HashMap;/** * An <code>LMClassifier</code> performs joint probability-based * classification of character sequences into non-overlapping * categories based on language models for each category and a * multivariate distribution over categories.  Thus the subclass of * <code>Classification</code> returned by the classify method is * {@link JointClassification}.  In addition to joint and conditional * probabilities of categories given the input, the score of the * returned joint classification is the character plus category * sample entropy rate. * * <P>A language-model classifier is constructed from a fixed, finite * set of categories which are assumed to have disjoint * (non-overlapping) sets of members.  The categories are represented * as simple strings.  Each category is assigned a language model. * Furthermore, a multivariate distribution over the set of categories * assigns marginal category probabilities. * * <P>Joint log probabilities are determined in the usual way: * * <blockquote><code> *    log<sub><sub>2</sub></sub> P(cs,cat) *    = log<sub><sub>2</sub></sub> P(cs|cat) *      + log<sub><sub>2</sub></sub> P(cat) * </code></blockquote> * * where <code>P(cs|cat)</code> is the probability of the character * sequence <code>cs</code> in the language model for category * <code>cat</code> and where <code>P(cat)</code> is the probability * assigned by the multivariate distribution over categories.  Scores * are defined to be adjusted sample cross-entropy rates: * * <blockquote><code> * score(cs,cat) * <br>&nbsp; = (log<sub><sub>2</sub></sub> P(cs,cat)) / (cs.length() + 2) * <br>&nbsp; = (log<sub><sub>2</sub></sub> P(cs|cat) *    + log<sub><sub>2</sub></sub> P(cat)) / (cs.length() + 2) * </code></blockquote> * * Note that the contribution of the category probability to the score * approaches zero as the sample size grows and the data overwhelms * the pre-data expectation.  Also note that each category has its * estimate divided by the same amount, so the probabilistic ordering * is preserved.  If the language models are process models, the * cross-entroy rate is just * <code>(log<sub><sub>2</sub></sub>P(cs|cat))/cs.length(); for * process models, add one to the denominator to account for * figuratively generating the end-of-character-sequence symbol. * * <P>Note that maximizing joint probabilities is the same as * maximizing conditional probabilities because the character sequence * <code>cs</code> is constant: * * <blockquote><code> * ARGMAX<sub><sub>cat</sub></sub> P(cat|cs) * <br> =  ARGMAX<sub><sub>cat</sub></sub> P(cs,cat) / P(cs) * <br> =  ARGMAX<sub><sub>cat</sub></sub> P(cs,cat) * </code></blockquote> * * A computation of conditional estimates <code>P(cat|cs)</code> given * the joint estimates is defined in {@link JointClassification}. * * <P>To ensure consistent estimates, all of the language models * should either be process language models or sequence language * models over the same set of characters, depending on whether * probability normalization is over fixed length sequences or over * all strings.  On the other hand, the models themselves may be a * mixture of n-gram lengths and smoothing parameters, or even in the * case of sequence models, tokenized models and sequence character * models. * * <P>Boolean classifiers for membership can be constructed with this * class by means of a positive language model and a negative model. * A character sequence is considered an instance of the category if * they are more likely in the positive model than the negative model. * There are several strategies for constructing anti-models.  The * most common methodology is to build an anti-model from an unbiased * sample of negative cases, but this requires supervision for * negative cases and tends to bias toward the model with more * training data given the way language model cross-entropy tends to * go down with more training data in general.  Another approach is to * build a weaker model from the same training data as the positive * model, for instance by using lower order n-grams for the negative * model.  The simplest approach is to use a uniform negative model, * which amounts to a cross-entropy rejection threshold; this is the * basis of {@link BinaryLMClassifier}. * * <P>Language model classifiers may be trained using {@link * DynamicLMClassifier} using a trainable multivariate estimator and * dynamic language models. * * @author  Bob Carpenter * @version 3.1 * @since   LingPipe2.0 */public class LMClassifier<L extends LanguageModel, M extends MultivariateDistribution>    implements Classifier<CharSequence,JointClassification> {    final L[] mLanguageModels;    final M mCategoryDistribution;    final HashMap<String,L> mCategoryToModel;    final String[] mCategories;    /**     * Construct a joint classifier for character sequences     * classifying over a specified set of categories, with a     * multivariate distribution over those categories and a language     * model per category.  The language models are supplied in a     * parallel array to the categories.     *     * <P>The category distribution is the marginal over categories,     * and each language model provides a conditional estimate given     * its category.  Categorization is as described in the class     * documentation.     *     * @param categories Array of categories for classification.     * @param languageModels A parallel array of language models for     * the categories.     * @param categoryDistribution The marginal distribution over the     * categories for classification.     * @throws IllegalArgumentException If there are not at least two     * categories, or if the category and language model arrays are not     * the same lenght.     */    public LMClassifier(String[] categories,                        L[] languageModels,                        M categoryDistribution) {        if (categories.length < 2) {            String msg = "Require at least two categories."                + " Found categories.length=" + categories.length;            throw new IllegalArgumentException(msg);        }        if (categories.length != categoryDistribution.numDimensions()) {            String msg = "Require same number of categories as dimensions."                + " Found categories.length="                + categories.length                + " Found categoryDistribution.numDimensions()="                + categoryDistribution.numDimensions();            throw new IllegalArgumentException(msg);        }        mCategories = categories;        if (categories.length != languageModels.length) {            String msg = "Categories and language models must be same length."                + " Found categories length=" + categories.length                + " Found language models length=" + languageModels.length;            throw new IllegalArgumentException(msg);        }        mLanguageModels = languageModels;        mCategoryDistribution = categoryDistribution;        mCategoryToModel = new HashMap<String,L>();        for (int i = 0; i < categories.length; ++i)            mCategoryToModel.put(categories[i],languageModels[i]);    }    /**     * Returns the array of categories for this classifier.     *     * <P>This method copies the array and thus changes to     * it do not affect the categories for this classifier.     *     * @return The array of categories for this classifier.     */    public String[] categories() {        return (String[]) mCategories.clone();    }    /**     * Returns the language model for the specified category.  The     * model for a specified category is used to provide estimates of     * <code>P(cSeq|category)</code>, the conditional probability of a     * character sequence given the specified category as described in     * the class documentation above.     *     * <P>Changes to the returned model affect this classifier's     * behavior.     *     * @param category The specified category.     * @return The language model for the specified category.     * @throws IllegalArgumentException If the category is not known.     */    public L languageModel(String category) {        for (int i = 0; i < mCategories.length; ++i)            if (category.equals(mCategories[i]))                return mLanguageModels[i];        String msg = "Category not known.  Category=" + category;        throw new IllegalArgumentException(msg);    }    /**     * Returns a multivariate distribution over categories for this     * classifier.  This is method returns <code>P(category)</code>,     * the marginal distribution over categories used during     * classification as described in the class documentation.     *     * <P>Changes to the returned distribution affect this classifier's     * behavior.     *     * @return The distribution over categories.     */    public M categoryDistribution() {        return mCategoryDistribution;    }    /**     * Returns the joint classification of the specified character sequence.     *     * @param cSeq Character sequence being classified.     * @return Joint classification of the specified character     * sequence.     * @throws IllegalArgumentException If the specified object is not     * a character sequence.     */    public JointClassification classify(CharSequence cSeq) {        if (!(cSeq instanceof CharSequence)) {            String msg = "LM Classification requires CharSequence input."                + " Found class=" + (cSeq == null ? null : cSeq.getClass());            throw new IllegalArgumentException(msg);        }        return classifyJoint(Strings.toCharArray(cSeq),0,cSeq.length());    }    /**     * A convenience method returning a joint classification over a     * character array slice.  Note that     * <code>estimateJoint(cs,start,end)</code> returns the same result     * as <code>esimtateJoint(new String(cs,start,end-start))</code>.     *     * @param cs Underlying character array.     * @param start Index of first character in slice.     * @param end One plus the index of the last character in the slice.     * @throws IllegalArgumentException If the start index is less than zero     * or greater than the end index or if the end index is not within bounds.     */    public JointClassification classifyJoint(char[] cs, int start, int end) {        Strings.checkArgsStartEnd(cs,start,end);        ScoredObject[] estimates = new ScoredObject[categories().length];        for (int i = 0; i < categories().length; ++i) {            String category = categories()[i];            LanguageModel model = mLanguageModels[i];            double charsGivenCatLogProb                = model.log2Estimate(new String(cs,start,end-start));            double catLogProb                = mCategoryDistribution.log2Probability(category);            double charsCatJointLogProb = charsGivenCatLogProb + catLogProb;            estimates[i]                = new ScoredObject(category,                                   charsCatJointLogProb);        }        return toJointClassification(estimates,                                     end-start+2); // divide by length + 1    }    static JointClassification toJointClassification(ScoredObject[] estimates,                                                     double length) {        Arrays.sort(estimates,Scored.REVERSE_SCORE_COMPARATOR);        String[] categories = new String[estimates.length];        double[] jointEstimates = new double[estimates.length];        double[] scores = new double[estimates.length];        for (int i = 0; i < estimates.length; ++i) {            categories[i] = estimates[i].getObject().toString();            jointEstimates[i] = estimates[i].score();            scores[i] = jointEstimates[i] / length;        }        return new JointClassification(categories,scores,jointEstimates);    }}
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -