// CompiledSpellChecker.java
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.spell;

import com.aliasi.lm.CompiledNGramProcessLM;

import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.BoundedPriorityQueue;
import com.aliasi.util.Compilable;
import com.aliasi.util.Iterators;
import com.aliasi.util.ObjectToSet;
import com.aliasi.util.Scored;
import com.aliasi.util.ScoredObject;
import com.aliasi.util.SmallSet;
import com.aliasi.util.Strings;

import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/**
 * The <code>CompiledSpellChecker</code> class implements a first-best
 * spell checker based on models of what users are likely to mean and
 * what errors they are likely to make in expressing their meaning.
 * This class is based on a character language model, which represents
 * likely user intentions, and a weighted edit distance, which
 * represents how noise is introduced into the signal via typos,
 * brainos, or other sources such as case-normalization, diacritic
 * removal, bad character encodings, etc.
 *
 * <P>The usual way of creating a compiled checker is through an
 * instance of {@link TrainSpellChecker}.  The result of compiling the
 * spell checker training class and reading it back in is a compiled
 * spell checker.  Only the basic models, weighted edit distance, and
 * token set are supplied through compilation; all other parameters
 * described below need to be set after an instance is read in from
 * its compiled form.  The token set may be null at construction time
 * and may be set later.
 *
 * <P>This class adopts the noisy-channel model approach to decoding
 * likely user intentions given received signals.  Spelling correction
 * simply returns the most likely intended message given the message
 * actually received.  In symbols:
 *
 * <blockquote><code>
 * didYouMean(received)
 * <br>= ArgMax<sub><sub>intended</sub></sub> P(intended | received)
 * <br>= ArgMax<sub><sub>intended</sub></sub> P(intended,received) / P(received)
 * <br>= ArgMax<sub><sub>intended</sub></sub> P(intended,received)
 * <br>= ArgMax<sub><sub>intended</sub></sub> P(intended) * P(received|intended)
 * </code></blockquote>
 *
 * The estimator <code>P(intended)</code>, called the <i>source
 * model</i>, estimates which signals are likely to be sent along the
 * channel.  For instance, the source might be a model of a user's
 * intent in entering information on a web page.  The estimator
 * <code>P(received|intended)</code>, called the <i>channel model</i>,
 * estimates how intended messages are likely to be garbled.
 *
 * <P>For this class, the source language model must be a compiled
 * n-gram character language model.  Compiled models are required for
 * the efficiency of their suffix-tree encodings in evaluating
 * sequences of characters.  Optimizing held-out sample cross-entropy
 * is not necessarily the best approach to building these language
 * models, because they are being used here in a discriminative
 * fashion, much as in language-model-based classification, tagging,
 * or chunking.
 *
 * <P>For this class, the channel model must be a weighted edit
 * distance.
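The decoding rule above can be sketched in plain Java. This is only a schematic illustration of the argmax over <code>log P(intended) + log P(received|intended)</code>; the class itself computes these scores with its compiled language model and weighted edit distance, and the toy candidate sets and log probabilities below are invented for the example:

```java
import java.util.Map;

public class NoisyChannelSketch {

    // Pick the intended string maximizing
    // log P(intended) + log P(received|intended).
    // Working in log space turns the product
    // P(intended) * P(received|intended) into a sum.
    static String didYouMean(String received,
                             Map<String, Double> sourceLogProb,
                             Map<String, Double> channelLogProb) {
        String best = received;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : sourceLogProb.entrySet()) {
            // channel keys are "intended|received"; missing = unreachable
            Double channel = channelLogProb.get(e.getKey() + "|" + received);
            if (channel == null) continue;
            double score = e.getValue() + channel;
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical toy distributions: "the" is a far likelier
        // intention than "teh", even after paying for one transposition.
        Map<String, Double> source = Map.of("the", -2.0, "teh", -12.0);
        Map<String, Double> channel =
            Map.of("the|teh", -3.0,   // one transposition
                   "teh|teh", -0.5);  // exact match
        System.out.println(didYouMean("teh", source, channel)); // prints "the"
    }
}
```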
 * For traditional spelling correction, this is a model of typos and
 * brainos.  There are two static constant weighted edit distances
 * supplied in this class which are useful for other decoding tasks.
 * The {@link #CASE_RESTORING} distance may be used to restore case in
 * single-case text.  The {@link #TOKENIZING} model may be used to
 * tokenize untokenized text, and is used in our <a
 * href="http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html">Chinese
 * tokenization demo</a>.
 *
 * <P>All input is normalized for whitespace, which consists of
 * removing initial and final whitespace and reducing all other space
 * sequences to a single space character.  A single space character is
 * used as the initial context for the source language model.  A
 * single final uneditable space character is estimated at the end of
 * the language model, thus adapting the process language model to be
 * used as a bounded sequence language model, just as in the language
 * model package itself.
 *
 * <P>This class optionally restricts corrections to sequences of
 * valid tokens.  The valid tokens are supplied as a set, either at
 * construction time or later.  If the set of valid tokens is
 * <code>null</code>, then the output is not token sensitive, and
 * results may include tokens that are not in the training data.
 * Token-matching is case sensitive.
 *
 * <P>If a set of valid tokens is supplied, then a tokenizer factory
 * should also be supplied to carry out tokenization normalization on
 * input.  This tokenizer factory will be used to separate input
 * tokens with single spaces.  This tokenization may also be done
 * externally and the normalized text passed into the
 * <code>didYouMean</code> method; this approach makes sense if the
 * tokenization is happening elsewhere already.
 *
 * <P>There are a number of tuning parameters for this class.  The
 * coarsest form of tuning simply sets whether or not particular edits
 * may be performed.
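The whitespace normalization described above (strip initial and final whitespace, then collapse every internal whitespace run to a single space) can be sketched in plain Java. This is an illustration of the documented behavior, not LingPipe's internal implementation:

```java
public class WhitespaceNorm {

    // Strip leading/trailing whitespace and collapse every internal
    // run of whitespace to a single space character, matching the
    // normalization the spell checker applies to its input.
    static String normalize(String input) {
        return input.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Tabs, newlines, and repeated spaces all collapse to one space.
        System.out.println(normalize("  spell   \t checker\n demo "));
        // prints "spell checker demo"
    }
}
```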
 * For instance, {@link #setAllowDelete(boolean)}
 * is used to turn deletion on or off.  Although edits with negative
 * infinity scores will never be used, it is more efficient to simply
 * disallow them if they are all infinite.  This is used in the
 * Chinese tokenizer, for instance, to allow only insertions and
 * matches.
 *
 * <P>There are three scoring parameters that determine how expensive
 * input characters are to edit.  The first of these is {@link
 * #setKnownTokenEditCost(double)}, which provides a penalty to be
 * added to the cost of editing characters that fall within known
 * tokens.  This value is only used for token-sensitive correctors.
 * Setting this to a low value makes it less likely to suggest an edit
 * on a known token.  The default value is -2.0, which on a log (base
 * 2) scale makes editing characters in known tokens 1/4 as likely
 * as editing characters in unknown tokens.
 *
 * <P>The next two scoring parameters provide penalties for editing
 * the first or second character in a token, whether it is known or
 * not.  In most cases, users make more mistakes later in words than
 * in the first few characters.  These values are controlled
 * independently through values provided at construction time or by
 * using the methods {@link #setFirstCharEditCost(double)} and {@link
 * #setSecondCharEditCost(double)}.  The default values for these are
 * -2.0 and -1.0 respectively.
 *
 * <P>The final tuning parameter is controlled with {@link
 * #setNumConsecutiveInsertionsAllowed(int)}, which determines how
 * many characters may be inserted in a row.  The default value is 1,
 * and setting this to 2 or higher may seriously slow down correction,
 * especially if it is not token sensitive.
 *
 * <P>Search is further controlled by an n-best parameter, which
 * specifies the number of ongoing hypotheses considered after
 * inspecting each character.  This value is settable either in the
 * constructor or, for models compiled from a trainer, by using the
 * method {@link #setNBest(int)}.
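The log (base 2) arithmetic behind those default penalties is easy to check directly: a -2.0 penalty corresponds to a probability factor of 2<sup>-2</sup> = 1/4, and penalties add in log space just as the underlying probabilities multiply. A small standalone sketch (the variable names are merely labels for the documented defaults):

```java
public class EditPenaltyMath {
    public static void main(String[] args) {
        double knownTokenEditCost = -2.0;  // documented default
        double secondCharEditCost = -1.0;  // documented default

        // A -2.0 penalty on a log (base 2) scale multiplies the
        // hypothesis probability by 2^-2 = 1/4.
        System.out.println(Math.pow(2.0, knownTokenEditCost));  // prints 0.25

        // Penalties combine additively in log space: editing the second
        // character of a known token costs -2.0 + -1.0 = -3.0, i.e. a
        // probability factor of 1/8.
        System.out.println(
            Math.pow(2.0, knownTokenEditCost + secondCharEditCost));  // prints 0.125
    }
}
```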
 * The lower this value, the faster
 * the resulting spelling correction.  But the danger is that with
 * low values, there may be search errors, where the correct
 * hypothesis is pruned because it did not look promising enough
 * early on.  In general, this value should be set as low as
 * possible without causing search errors.
 *
 * <P>This class requires external concurrent-read/synchronous-write
 * (CRSW) synchronization.  All of the methods beginning with
 * <code>set</code> must be executed exclusively in order to guarantee
 * consistent results; all other methods may be executed concurrently.
 * The {@link #didYouMean(String)} method for spelling correction may
 * be called concurrently with the same blocking and thread safety
 * constraints as the underlying language model and edit distance,
 * both of which are called repeatedly by this method.  If both the
 * language model and edit distance are thread safe and non-blocking,
 * as in all of LingPipe's implementations, then
 * <code>didYouMean</code> will also be concurrently executable and
 * non-blocking.
 *
 * <h4>Blocking Corrections</h4>
 *
 * <p>There are two ways to block tokens from being edited.  The first
 * is by setting a minimum length for edited tokens.  Standard language
 * models trained on texts tend to overestimate the likelihood of
 * queries that contain well-known short words or phrases like
 * <code>of</code> or <code>a</code>.  The method {@link
 * #setMinimumTokenLengthToCorrect(int)} sets a minimum token length
 * that is corrected.  The default value is <code>0</code>.
 *
 * <p>The second way to block corrections is to provide a set of
 * tokens that are never corrected.  One way to construct such a set
 * during training is by taking large-count tokens from the counter
 * returned by {@link TrainSpellChecker#tokenCounter()}.
 *
 * <p>Note that these methods are heuristics that move the spelling
 * corrector in the same direction as two existing parameters.  First,
 * there is the pair of methods {@link #setFirstCharEditCost(double)}
 * and {@link #setSecondCharEditCost(double)}, which make it less
 * likely to edit the first two characters (which are all of the
 * characters in a two-character token).  Second, there is a flexible
 * penalty for editing known tokens that may be set with {@link
 * #setKnownTokenEditCost(double)}.
 *
 * <p>Blocking corrections has a positive effect on speed, because
 * it eliminates any search over the tokens that are excluded from
 * correction.
 *
 * <h4>N-best Output</h4>
 *
 * It is possible to retrieve a list of possible spelling corrections,
 * ordered by plausibility.  The method {@link
 * #didYouMeanNBest(String)} returns an iterator over corrections in
 * decreasing order of likelihood.  Note that the same exact string
 * may be proposed more than once as a correction, because
 * alternative edits may lead to the same result.  For instance,
 * "a" may be turned into "b" by substitution in one
 * step, or by deletion and insertion (or insertion then deletion)
 * in two steps.  These alternatives typically have different scores,
 * and only the highest-scoring one is maintained at any given stage
 * of the algorithm by the first-best analyzer.
 *
 * <p>The n-best analyzer needs a much wider n-best list in order
 * to return sensible results, especially for very long inputs.  The
 * specified n-best size for the spell checker should, in fact, be
 * substantially larger than the desired number of n-best results.
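The duplicate handling described above, where only the highest-scoring derivation of each distinct output string survives, can be sketched with a map from correction to maximum score. The hypothesis strings and scores below are invented for illustration; they do not come from an actual model:

```java
import java.util.HashMap;
import java.util.Map;

public class NBestMerge {

    // Keep only the highest score per distinct corrected string,
    // as the first-best analyzer does when different edit sequences
    // produce the same output.
    static Map<String, Double> mergeByMax(String[] corrections,
                                          double[] scores) {
        Map<String, Double> best = new HashMap<>();
        for (int i = 0; i < corrections.length; ++i)
            best.merge(corrections[i], scores[i], Math::max);
        return best;
    }

    public static void main(String[] args) {
        // "b" reached by one substitution (-1.5) and separately by
        // deletion plus insertion (-4.0); only -1.5 should survive.
        String[] hyps   = { "b",  "b",  "a"  };
        double[] scores = { -1.5, -4.0, -6.0 };
        System.out.println(mergeByMax(hyps, scores));
    }
}
```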
 *
 * @author Bob Carpenter
 * @version 3.6
 * @since LingPipe2.0
 */
public class CompiledSpellChecker implements SpellChecker {

    CompiledNGramProcessLM mLM;
    WeightedEditDistance mEditDistance;
    Set<String> mTokenSet;
    Set<String> mDoNotEditTokens = SmallSet.<String>create();
    int mNBestSize;

    boolean mAllowInsert = true;
    boolean mAllowDelete = true;
    boolean mAllowMatch = true;
    boolean mAllowSubstitute = true;
    boolean mAllowTranspose = true;

    int mMinimumTokenLengthToCorrect = 0;
    int mNumConsecutiveInsertionsAllowed = 1;

    double mKnownTokenEditCost;
    double mFirstCharEditCost;
    double mSecondCharEditCost;

    TokenizerFactory mTokenizerFactory;

    TokenTrieNode mTokenPrefixTrie;

    /**
     * Construct a compiled spell checker based on the specified
     * language model and similarity edit distance, set of valid
     * output tokens, maximum n-best size per character, and the
     * specified edit penalties for editing known tokens or the first
     * or second characters of tokens.  The set of do-not-edit tokens
     * is initially empty; set it using {@link #setDoNotEditTokens(Set)}.
     *
     * <p>The weighted edit distance is required to be a similarity
     * measure for compatibility with the order of log likelihoods in
     * the source (language) model.  See {@link WeightedEditDistance}
     * for more information about similarity versus dissimilarity
     * distance measures.
     *
     * <P>If the set of tokens is <code>null</code>, the constructed
     * spelling checker will not be token-sensitive.  That is, it
     * will allow edits to strings which are not tokens in the token
     * set.
     *
     * @param lm Source language model.
     * @param editDistance Channel edit distance model.
     * @param factory Tokenizer factory for tokenizing inputs.
     * @param tokenSet Set of valid tokens for outputs, or
     * <code>null</code> if output is not token sensitive.
     * @param nBestSize Size of n-best list for spell checking.
     * @param knownTokenEditCost Penalty for editing known tokens per edit.
     * @param firstCharEditCost Penalty for editing while scanning the
     * first character in a token.
     * @param secondCharEditCost Penalty for editing while scanning
     * the second character in a token.
     */
    public CompiledSpellChecker(CompiledNGramProcessLM lm,
                                WeightedEditDistance editDistance,
                                TokenizerFactory factory,
                                Set<String> tokenSet,
                                int nBestSize,
                                double knownTokenEditCost,
                                double firstCharEditCost,
                                double secondCharEditCost) {
        mLM = lm;
        mEditDistance = editDistance;
        mTokenizerFactory = factory;
        mNBestSize = nBestSize;
        setTokenSet(tokenSet);
        mKnownTokenEditCost = knownTokenEditCost;
        mFirstCharEditCost = firstCharEditCost;
        mSecondCharEditCost = secondCharEditCost;
    }

    /**
     * Construct a compiled spell checker based on the specified