📄 tfidfclassifiertrainer.java

📁 An open-source Java toolkit for natural language processing. LingPipe already offers a rich set of features.
💻 JAVA
📖 Page 1 of 2
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.classify;

import com.aliasi.corpus.ClassificationHandler;

import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.Compilable;
import com.aliasi.util.FeatureExtractor;
import com.aliasi.util.ObjectToDoubleMap;
import com.aliasi.util.ScoredObject;

import com.aliasi.symbol.MapSymbolTable;

import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.io.Serializable;

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * A <code>TfIdfClassifierTrainer</code> provides a framework for
 * training discriminative classifiers based on term-frequency (TF)
 * and inverse document frequency (IDF) weighting of features.
 *
 * <h3>Construction</h3>
 *
 * <p>A <code>TfIdfClassifierTrainer</code> is constructed from a
 * feature extractor of a specified type.  If the instance is to
 * be compiled, the feature extractor must be either serializable
 * or compilable.
 *
 * <h3>Training</h3>
 *
 * <p>Categories may be added dynamically.  The initial classifier
 * will be empty and not defined for any categories.
 *
 * <p>A TF/IDF classifier trainer is trained through the {@link
 * ClassificationHandler}.  Specifically, the method
 * <code>handle(E,Classification)</code> is called, the generic
 * object being the training instance and the classification
 * being a simple first-best classification.
 *
 * <p>For multiple training examples of the same category,
 * their feature vectors are added together to produce
 * the raw category vectors.
 *
 * <h3>Classification</h3>
 *
 * <p>The compiled models perform scored classification.  That is,
 * they implement the method <code>classify(E)</code> to return a
 * <code>ScoredClassification</code>.  The scores assigned to the
 * different categories are normalized dot products after term
 * frequency and inverse document frequency weighting.
 *
 * <p>Suppose training supplied <code>n</code> training
 * categories <code>cat[0], ..., cat[n-1]</code>, with
 * associated raw feature vectors <code>v[0], ..., v[n-1]</code>.
 * The dimensions of these vectors are the features, so that
 * if <code>f</code> is a feature, <code>v[i][f]</code> is
 * the raw score for the feature <code>f</code> in
 * category <code>cat[i]</code>.
 *
 * <p>First, the inverse document frequency weighting of
 * each term is defined:
 *
 * <pre>
 *     idf(f) = ln (n / df(f))</pre>
 *
 * where <code>df(f)</code> is the document frequency of
 * feature <code>f</code>, defined to be the number of
 * distinct categories in which feature <code>f</code> is
 * defined.  This has the effect of upweighting the scores of
 * features that occur in few categories and downweighting
 * the scores of features that occur in many categories.  A
 * feature that occurs in every category receives weight
 * <code>ln(1) = 0</code>.
 *
 * <p>Term frequency normalization dampens the term
 * frequencies using square roots:
 *
 * <pre>
 *     tf(x) = sqrt(x)</pre>
 *
 * This produces a linear relation in pairwise growth rather than the
 * usual quadratic one derived from a simple cross-product.
 *
 * <p>The weighted feature vectors are as follows:
 *
 * <pre>
 *     v'[i][f] = tf(v[i][f]) * idf(f)</pre>
 *
 * <p>Given an instance to classify, first the feature
 * extractor is used to produce a raw feature vector
 * <code>x</code>.  This is then normalized in the same
 * way as the document vectors <code>v[i]</code>, namely:
 *
 * <pre>
 *     x'[f] = tf(x[f]) * idf(f)</pre>
 *
 * The resulting query vector <code>x'</code> is then compared
 * against each normalized document vector <code>v'[i]</code>
 * using vector cosine, which defines its classification score:
 *
 * <pre>
 *     score(v'[i],x')
 *     = cos(v'[i],x')
 *     = v'[i] * x' / ( length(v'[i]) * length(x') )</pre>
 *
 * where <code>v'[i] * x'</code> is the vector dot product:
 *
 * <pre>
 *     <big><big>&Sigma;</big></big><sub><sub>f</sub></sub> v'[i][f] * x'[f]</pre>
 *
 * and where the length of a vector is defined to be
 * the square root of its dot product with itself:
 *
 * <pre>
 *     length(y) = sqrt(y * y)</pre>
 *
 * <p>Cosine scores will vary between <code>-1</code> and
 * <code>1</code>.  The cosine is only <code>1</code> between two
 * vectors if they point in the same direction; that is, one is a
 * positive scalar multiple of the other.  The cosine is only
 * <code>-1</code> between two vectors if they point in opposite
 * directions; that is, one is a negative scalar multiple of the other.
 * The cosine is <code>0</code> for two vectors that are orthogonal,
 * that is, at right angles to each other.  If all the values
 * in all of the category vectors and the query vector are
 * positive, cosine will run between <code>0</code> and <code>1</code>.
 *
 * <p><i>Warning:</i> Because of floating-point arithmetic rounding,
 * these results about signs and bounds are not strictly guaranteed to
 * hold; instances may return cosines slightly below <code>-1</code>
 * or above <code>1</code>, or not return exactly <code>0</code> for
 * orthogonal vectors.
 *
 * <h3>Serialization</h3>
 *
 * <p>A TF/IDF classifier trainer may be serialized at any point.
 * The object read back in will be an instance of the same
 * class with the same parametric type for the objects being
 * classified.  During serialization, the feature extractor
 * will be serialized if it's serializable, or compiled if
 * it's compilable but not serializable.  If the feature extractor
 * is neither serializable nor compilable, serialization will
 * throw an error.
 *
 * <h3>Compilation</h3>
 *
 * <p>At any point, a TF/IDF classifier trainer may be compiled to an
 * object output stream.  The object read back in will be an instance of
 * <code>Classifier&lt;E,ScoredClassification&gt;</code>.  During
 * compilation, the feature extractor will be compiled if it's
 * compilable, or serialized if it's serializable but not compilable.
 * If the feature extractor is neither compilable nor serializable,
 * compilation will throw an error.
 *
 * <h3>Reverse Indexing</h3>
 *
 * <p>The TF/IDF classifier indexes instances by means of
 * their feature values.
 *
 * @author  Bob Carpenter
 * @version 3.1.2
 * @since   LingPipe3.1
 */
public class TfIdfClassifierTrainer<E>
    implements ClassificationHandler<E,Classification>,
               Compilable, Serializable {

    final FeatureExtractor mFeatureExtractor;
    final Map<Integer,ObjectToDoubleMap<Integer>> mFeatureToCategoryCount;
    final MapSymbolTable mFeatureSymbolTable;
    final MapSymbolTable mCategorySymbolTable;

    /**
     * Construct a TF/IDF classifier trainer based on the specified
     * feature extractor.  This feature extractor must be either
     * serializable or compilable if the resulting trainer is to be
     * compilable.
     *
     * @param featureExtractor Feature extractor for examples.
     */
    public TfIdfClassifierTrainer(FeatureExtractor<E> featureExtractor) {
        this(featureExtractor,
             new HashMap<Integer,ObjectToDoubleMap<Integer>>(),
             new MapSymbolTable(),
             new MapSymbolTable());
    }

    TfIdfClassifierTrainer(FeatureExtractor<E> featureExtractor,
                           Map<Integer,ObjectToDoubleMap<Integer>> featureToCategoryCount,
                           MapSymbolTable featureSymbolTable,
                           MapSymbolTable categorySymbolTable) {
        mFeatureExtractor = featureExtractor;
        mFeatureToCategoryCount = featureToCategoryCount;
        mFeatureSymbolTable = featureSymbolTable;
        mCategorySymbolTable = categorySymbolTable;
    }

    /**
     * Return the set of categories for which at least one training
     * instance has been seen.  The resulting set is immutable.
     *
     * @return The set of categories for this trainer.
     */
    public Set<String> categories() {
        return mCategorySymbolTable.symbolSet();
    }

    /**
     * Train the classifier on the specified object with the specified
     * classification.
     *
     * @param input Classified object.
     * @param classification Classification of the object.
     */
    public void handle(E input, Classification classification) {
        String category = classification.bestCategory();
        int categoryId = mCategorySymbolTable.getOrAddSymbol(category);
        Map<String,? extends Number> featureVector
            = mFeatureExtractor.features(input);
        // Add each feature value to the running total for this category.
        for (Map.Entry<String,? extends Number> entry
                 : featureVector.entrySet()) {
            String feature = entry.getKey();
            double value = entry.getValue().doubleValue();
            int featureId = mFeatureSymbolTable.getOrAddSymbol(feature);
            ObjectToDoubleMap<Integer> categoryCounts
                = mFeatureToCategoryCount.get(featureId);
            if (categoryCounts == null) {
                // First time this feature is seen in any category.
                categoryCounts = new ObjectToDoubleMap<Integer>();
                mFeatureToCategoryCount.put(featureId,categoryCounts);
            }
            categoryCounts.increment(categoryId,value);
        }
    }

    /**
     * Compile this trainer to the specified object output.
     *
     * @param out Stream to which a compiled classifier is written.
     * @throws UnsupportedOperationException If the underlying feature
     * extractor is neither compilable nor serializable.
     */
    public void compileTo(ObjectOutput out) throws IOException {
        out.writeObject(new Externalizer<E>(this));
    }

    // called via reflection during serialization
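The listing breaks off here and continues on page 2. To make the scoring scheme in the class documentation concrete, here is a minimal standalone sketch of the weighting and cosine steps, using plain `java.util` maps in place of LingPipe's symbol tables. The class name `TfIdfCosineSketch` and all of its methods are hypothetical and are not part of LingPipe. The sketch assumes the inverse document frequency is `ln(n / df(f))`, which matches the documented upweighting of rare features. It also makes two choices the documentation does not specify: query features never seen in training are dropped, and a zero-length vector scores `0`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical standalone sketch of TF/IDF cosine scoring; not LingPipe code. */
final class TfIdfCosineSketch {

    /** idf(f) = ln(n / df(f)), with df(f) the number of categories defining f. */
    static double idf(int docFrequency, int numCategories) {
        return Math.log((double) numCategories / docFrequency);
    }

    /** tf(x) = sqrt(x), dampening raw counts. */
    static double tf(double rawCount) {
        return Math.sqrt(rawCount);
    }

    /** Counts, for each feature, how many category vectors define it. */
    static Map<String, Integer> docFrequencies(Map<String, Map<String, Double>> categoryVectors) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Double> vector : categoryVectors.values()) {
            for (String feature : vector.keySet()) {
                df.merge(feature, 1, Integer::sum);
            }
        }
        return df;
    }

    /** v'[f] = tf(v[f]) * idf(f). Assumption: features unseen in training are dropped. */
    static Map<String, Double> weight(Map<String, Double> raw, Map<String, Integer> df, int n) {
        Map<String, Double> weighted = new HashMap<>();
        for (Map.Entry<String, Double> e : raw.entrySet()) {
            Integer d = df.get(e.getKey());
            if (d != null) {
                weighted.put(e.getKey(), tf(e.getValue()) * idf(d, n));
            }
        }
        return weighted;
    }

    /** Sparse dot product over the features the two vectors share. */
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    /** cos(a,b) = a*b / (length(a) * length(b)). Assumption: 0 if either length is 0. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double denominator = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
        return denominator == 0.0 ? 0.0 : dot(a, b) / denominator;
    }

    /** Scores a raw query vector against every raw category vector. */
    static Map<String, Double> scores(Map<String, Map<String, Double>> categoryVectors,
                                      Map<String, Double> query) {
        int n = categoryVectors.size();
        Map<String, Integer> df = docFrequencies(categoryVectors);
        Map<String, Double> weightedQuery = weight(query, df, n);
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> e : categoryVectors.entrySet()) {
            result.put(e.getKey(), cosine(weight(e.getValue(), df, n), weightedQuery));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> sports = new HashMap<>();
        sports.put("ball", 4.0);
        sports.put("the", 9.0);
        Map<String, Double> finance = new HashMap<>();
        finance.put("stock", 4.0);
        finance.put("the", 9.0);
        Map<String, Map<String, Double>> categories = new HashMap<>();
        categories.put("sports", sports);
        categories.put("finance", finance);

        Map<String, Double> query = new HashMap<>();
        query.put("ball", 1.0);
        query.put("the", 1.0);
        System.out.println(scores(categories, query));
    }
}
```

In this toy data the feature `the` occurs in both categories, so its weight is `ln(2/2) = 0` and it contributes nothing to either score; only the category-specific features separate the two categories.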