// CompiledSpellChecker.java
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.spell;

import com.aliasi.lm.CompiledNGramProcessLM;

import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.BoundedPriorityQueue;
import com.aliasi.util.Compilable;
import com.aliasi.util.Iterators;
import com.aliasi.util.ObjectToSet;
import com.aliasi.util.Scored;
import com.aliasi.util.ScoredObject;
import com.aliasi.util.SmallSet;
import com.aliasi.util.Strings;

import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/**
 * The <code>CompiledSpellChecker</code> class implements a first-best
 * spell checker based on models of what users are likely to mean and
 * what errors they are likely to make in expressing their meaning.
 * This class is based on a character language model, which represents
 * likely user intentions, and a weighted edit distance, which
 * represents how noise is introduced into the signal via typos,
 * brainos, or other sources such as case-normalization, diacritic
 * removal, bad character encodings, etc.
 *
 * <P>The usual way of creating a compiled checker is through an
 * instance of {@link TrainSpellChecker}.  The result of compiling the
 * spell checker training class and reading it back in is a compiled
 * spell checker.  Only the basic models, weighted edit distance, and
 * token set are supplied through compilation; all other parameters
 * described below need to be set after an instance is read in from
 * its compiled form.  The token set may be null at construction time
 * and may be set later.
 *
 * <P>This class adopts the noisy-channel model approach to decoding
 * likely user intentions given received signals.  Spelling correction
 * simply returns the most likely intended message given the message
 * actually received.  In symbols:
 *
 * <blockquote><code>
 * didYouMean(received)
 * <br>= ArgMax<sub><sub>intended</sub></sub> P(intended | received)
 * <br>= ArgMax<sub><sub>intended</sub></sub> P(intended,received) / P(received)
 * <br>= ArgMax<sub><sub>intended</sub></sub> P(intended,received)
 * <br>= ArgMax<sub><sub>intended</sub></sub> P(intended) * P(received|intended)
 * </code></blockquote>
 *
 * The estimator <code>P(intended)</code>, called the <i>source
 * model</i>, estimates which signals are likely to be sent along the
 * channel.  For instance, the source might be a model of a user's
 * intent in entering information on a web page.  The estimator
 * <code>P(received|intended)</code>, called the <i>channel model</i>,
 * estimates how intended messages are likely to be garbled.
 *
 * <P>For this class, the source language model must be a compiled
 * n-gram character language model.  Compiled models are required for
 * the efficiency of their suffix-tree encodings in evaluating
 * sequences of characters.  Optimizing held-out sample cross-entropy
 * is not necessarily the best approach to building these language
 * models, because they are being used here in a discriminative
 * fashion, much as in language-model-based classification, tagging,
 * or chunking.
 *
 * <P>For this class, the channel model must be a weighted edit
 * distance.
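The decoding rule above can be sketched in plain Java. This is only a schematic illustration of the argmax over <code>log P(intended) + log P(received|intended)</code>; the class itself computes these scores with its compiled language model and weighted edit distance, and the toy candidate sets and log probabilities below are invented for the example:

```java
import java.util.Map;

public class NoisyChannelSketch {

    // Pick the intended string maximizing
    // log P(intended) + log P(received|intended).
    // Working in log space turns the product
    // P(intended) * P(received|intended) into a sum.
    static String didYouMean(String received,
                             Map<String, Double> sourceLogProb,
                             Map<String, Double> channelLogProb) {
        String best = received;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : sourceLogProb.entrySet()) {
            // channel keys are "intended|received"; missing = unreachable
            Double channel = channelLogProb.get(e.getKey() + "|" + received);
            if (channel == null) continue;
            double score = e.getValue() + channel;
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical toy distributions: "the" is a far likelier
        // intention than "teh", even after paying for one transposition.
        Map<String, Double> source = Map.of("the", -2.0, "teh", -12.0);
        Map<String, Double> channel =
            Map.of("the|teh", -3.0,   // one transposition
                   "teh|teh", -0.5);  // exact match
        System.out.println(didYouMean("teh", source, channel)); // prints "the"
    }
}
```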
 * For traditional spelling correction, this is a model of typos and
 * brainos.  There are two static constant weighted edit distances
 * supplied in this class which are useful for other decoding tasks.
 * The {@link #CASE_RESTORING} distance may be used to restore case in
 * single-case text.  The {@link #TOKENIZING} model may be used to
 * tokenize untokenized text, and is used in our <a
 * href="http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html">Chinese
 * tokenization demo</a>.
 *
 * <P>All input is normalized for whitespace, which consists of
 * removing initial and final whitespace and reducing all other space
 * sequences to a single space character.  A single space character is
 * used as the initial context for the source language model.  A
 * single final uneditable space character is estimated at the end of
 * the language model, thus adapting the process language model to be
 * used as a bounded sequence language model, just as in the language
 * model package itself.
 *
 * <P>This class optionally restricts corrections to sequences of
 * valid tokens.  The valid tokens are supplied as a set, either at
 * construction time or later.  If the set of valid tokens is
 * <code>null</code>, then the output is not token sensitive, and
 * results may include tokens that are not in the training data.
 * Token-matching is case sensitive.
 *
 * <P>If a set of valid tokens is supplied, then a tokenizer factory
 * should also be supplied to carry out tokenization normalization on
 * input.  This tokenizer factory will be used to separate input
 * tokens with single spaces.  This tokenization may also be done
 * externally and the normalized text passed into the
 * <code>didYouMean</code> method; this approach makes sense if the
 * tokenization is happening elsewhere already.
 *
 * <P>There are a number of tuning parameters for this class.  The
 * coarsest form of tuning simply sets whether or not particular edits
 * may be performed.
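The whitespace normalization described above (strip initial and final whitespace, then collapse every internal whitespace run to a single space) can be sketched in plain Java. This is an illustration of the documented behavior, not LingPipe's internal implementation:

```java
public class WhitespaceNorm {

    // Strip leading/trailing whitespace and collapse every internal
    // run of whitespace to a single space character, matching the
    // normalization the spell checker applies to its input.
    static String normalize(String input) {
        return input.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Tabs, newlines, and repeated spaces all collapse to one space.
        System.out.println(normalize("  spell   \t checker\n demo "));
        // prints "spell checker demo"
    }
}
```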
 * For instance, {@link #setAllowDelete(boolean)}
 * is used to turn deletion on or off.  Although edits with negative
 * infinity scores will never be used, it is more efficient to simply
 * disallow them if they are all infinite.  This is used in the
 * Chinese tokenizer, for instance, to allow only insertions and
 * matches.
 *
 * <P>There are three scoring parameters that determine how expensive
 * input characters are to edit.  The first of these is {@link
 * #setKnownTokenEditCost(double)}, which provides a penalty to be
 * added to the cost of editing characters that fall within known
 * tokens.  This value is only used for token-sensitive correctors.
 * Setting this to a low value makes it less likely to suggest an edit
 * on a known token.  The default value is -2.0, which on a log (base
 * 2) scale makes editing characters in known tokens 1/4 as likely
 * as editing characters in unknown tokens.
 *
 * <P>The next two scoring parameters provide penalties for editing
 * the first or second character in a token, whether it is known or
 * not.  In most cases, users make more mistakes later in words than
 * in the first few characters.  These values are controlled
 * independently through values provided at construction time or by
 * using the methods {@link #setFirstCharEditCost(double)} and {@link
 * #setSecondCharEditCost(double)}.  The default values for these are
 * -2.0 and -1.0 respectively.
 *
 * <P>The final tuning parameter is controlled with {@link
 * #setNumConsecutiveInsertionsAllowed(int)}, which determines how
 * many characters may be inserted in a row.  The default value is 1,
 * and setting this to 2 or higher may seriously slow down correction,
 * especially if it is not token sensitive.
 *
 * <P>Search is further controlled by an n-best parameter, which
 * specifies the number of ongoing hypotheses considered after
 * inspecting each character.  This value is settable either in the
 * constructor or, for models compiled from a trainer, by using the
 * method {@link #setNBest(int)}.
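The log (base 2) arithmetic behind those default penalties is easy to check directly: a -2.0 penalty corresponds to a probability factor of 2<sup>-2</sup> = 1/4, and penalties add in log space just as the underlying probabilities multiply. A small standalone sketch (the variable names are merely labels for the documented defaults):

```java
public class EditPenaltyMath {
    public static void main(String[] args) {
        double knownTokenEditCost = -2.0;  // documented default
        double secondCharEditCost = -1.0;  // documented default

        // A -2.0 penalty on a log (base 2) scale multiplies the
        // hypothesis probability by 2^-2 = 1/4.
        System.out.println(Math.pow(2.0, knownTokenEditCost));  // prints 0.25

        // Penalties combine additively in log space: editing the second
        // character of a known token costs -2.0 + -1.0 = -3.0, i.e. a
        // probability factor of 1/8.
        System.out.println(
            Math.pow(2.0, knownTokenEditCost + secondCharEditCost));  // prints 0.125
    }
}
```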
 * The lower this value, the faster
 * the resulting spelling correction.  But the danger is that with
 * low values, there may be search errors, where the correct
 * hypothesis is pruned because it did not look promising enough
 * early on.  In general, this value should be set as low as
 * possible without causing search errors.
 *
 * <P>This class requires external concurrent-read/synchronous-write
 * (CRSW) synchronization.  All of the methods beginning with
 * <code>set</code> must be executed exclusively in order to guarantee
 * consistent results; all other methods may be executed concurrently.
 * The {@link #didYouMean(String)} method for spelling correction may
 * be called concurrently with the same blocking and thread safety
 * constraints as the underlying language model and edit distance,
 * both of which are called repeatedly by this method.  If both the
 * language model and edit distance are thread safe and non-blocking,
 * as in all of LingPipe's implementations, then
 * <code>didYouMean</code> will also be concurrently executable and
 * non-blocking.
 *
 * <h4>Blocking Corrections</h4>
 *
 * <p>There are two ways to block tokens from being edited.  The first
 * is by setting a minimum length for edited tokens.  Standard language
 * models trained on texts tend to overestimate the likelihood of
 * queries that contain well-known short words or phrases like
 * <code>of</code> or <code>a</code>.  The method {@link
 * #setMinimumTokenLengthToCorrect(int)} sets a minimum token length
 * that is corrected.  The default value is <code>0</code>.
 *
 * <p>The second way to block corrections is to provide a set of
 * tokens that are never corrected.  One way to construct such a set
 * during training is by taking large-count tokens from the counter
 * returned by {@link TrainSpellChecker#tokenCounter()}.
 *
 * <p>Note that these methods are heuristics that move the spelling
 * corrector in the same direction as two existing parameters.  First,
 * there is the pair of methods {@link #setFirstCharEditCost(double)}
 * and {@link #setSecondCharEditCost(double)}, which make it less
 * likely to edit the first two characters (which are all of the
 * characters in a two-character token).  Second, there is a flexible
 * penalty for editing known tokens that may be set with {@link
 * #setKnownTokenEditCost(double)}.
 *
 * <p>Blocking corrections has a positive effect on speed, because
 * it eliminates any search over the tokens that are excluded from
 * correction.
 *
 * <h4>N-best Output</h4>
 *
 * It is possible to retrieve a list of possible spelling corrections,
 * ordered by plausibility.  The method {@link
 * #didYouMeanNBest(String)} returns an iterator over corrections in
 * decreasing order of likelihood.  Note that the same exact string
 * may be proposed more than once as a correction, because
 * alternative edits may lead to the same result.  For instance,
 * "a" may be turned into "b" by substitution in one
 * step, or by deletion and insertion (or insertion then deletion)
 * in two steps.  These alternatives typically have different scores,
 * and only the highest-scoring one is maintained at any given stage
 * of the algorithm by the first-best analyzer.
 *
 * <p>The n-best analyzer needs a much wider n-best list in order
 * to return sensible results, especially for very long inputs.  The
 * specified n-best size for the spell checker should, in fact, be
 * substantially larger than the desired number of n-best results.
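The duplicate handling described above, where only the highest-scoring derivation of each distinct output string survives, can be sketched with a map from correction to maximum score. The hypothesis strings and scores below are invented for illustration; they do not come from an actual model:

```java
import java.util.HashMap;
import java.util.Map;

public class NBestMerge {

    // Keep only the highest score per distinct corrected string,
    // as the first-best analyzer does when different edit sequences
    // produce the same output.
    static Map<String, Double> mergeByMax(String[] corrections,
                                          double[] scores) {
        Map<String, Double> best = new HashMap<>();
        for (int i = 0; i < corrections.length; ++i)
            best.merge(corrections[i], scores[i], Math::max);
        return best;
    }

    public static void main(String[] args) {
        // "b" reached by one substitution (-1.5) and separately by
        // deletion plus insertion (-4.0); only -1.5 should survive.
        String[] hyps   = { "b",  "b",  "a"  };
        double[] scores = { -1.5, -4.0, -6.0 };
        System.out.println(mergeByMax(hyps, scores));
    }
}
```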
 *
 * @author Bob Carpenter
 * @version 3.6
 * @since LingPipe2.0
 */
public class CompiledSpellChecker implements SpellChecker {

    CompiledNGramProcessLM mLM;
    WeightedEditDistance mEditDistance;
    Set<String> mTokenSet;
    Set<String> mDoNotEditTokens = SmallSet.<String>create();
    int mNBestSize;

    boolean mAllowInsert = true;
    boolean mAllowDelete = true;
    boolean mAllowMatch = true;
    boolean mAllowSubstitute = true;
    boolean mAllowTranspose = true;

    int mMinimumTokenLengthToCorrect = 0;
    int mNumConsecutiveInsertionsAllowed = 1;

    double mKnownTokenEditCost;
    double mFirstCharEditCost;
    double mSecondCharEditCost;

    TokenizerFactory mTokenizerFactory;

    TokenTrieNode mTokenPrefixTrie;

    /**
     * Construct a compiled spell checker based on the specified
     * language model and similarity edit distance, set of valid
     * output tokens, maximum n-best size per character, and the
     * specified edit penalties for editing known tokens or the first
     * or second characters of tokens.  The set of do-not-edit tokens
     * is initially empty; set it using {@link #setDoNotEditTokens(Set)}.
     *
     * <p>The weighted edit distance is required to be a similarity
     * measure for compatibility with the order of log likelihoods in
     * the source (language) model.  See {@link WeightedEditDistance}
     * for more information about similarity versus dissimilarity
     * distance measures.
     *
     * <P>If the set of tokens is <code>null</code>, the constructed
     * spelling checker will not be token-sensitive.  That is, it
     * will allow edits to strings which are not tokens in the token
     * set.
     *
     * @param lm Source language model.
     * @param editDistance Channel edit distance model.
     * @param factory Tokenizer factory for tokenizing inputs.
     * @param tokenSet Set of valid tokens for outputs, or
     * <code>null</code> if output is not token sensitive.
     * @param nBestSize Size of n-best list for spell checking.
     * @param knownTokenEditCost Penalty for editing known tokens per edit.
     * @param firstCharEditCost Penalty for editing while scanning the
     * first character in a token.
     * @param secondCharEditCost Penalty for editing while scanning
     * the second character in a token.
     */
    public CompiledSpellChecker(CompiledNGramProcessLM lm,
                                WeightedEditDistance editDistance,
                                TokenizerFactory factory,
                                Set<String> tokenSet,
                                int nBestSize,
                                double knownTokenEditCost,
                                double firstCharEditCost,
                                double secondCharEditCost) {
        mLM = lm;
        mEditDistance = editDistance;
        mTokenizerFactory = factory;
        mNBestSize = nBestSize;
        setTokenSet(tokenSet);
        mKnownTokenEditCost = knownTokenEditCost;
        mFirstCharEditCost = firstCharEditCost;
        mSecondCharEditCost = secondCharEditCost;
    }

    /**
     * Construct a compiled spell checker based on the specified