
CharLmHmmChunker.java

An open-source Java toolkit for natural language processing. LingPipe currently offers a rich set of features.
Language: Java
Page 1 of 2
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.chunk;

import com.aliasi.corpus.TagHandler;
import com.aliasi.corpus.ChunkHandler;
import com.aliasi.corpus.ChunkHandlerAdapter;

import com.aliasi.hmm.AbstractHmmEstimator;
import com.aliasi.hmm.HiddenMarkovModel;
import com.aliasi.hmm.HmmDecoder;

import com.aliasi.symbol.SymbolTable;

import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.Compilable;
import com.aliasi.util.Strings;

import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.io.IOException;

import java.util.HashSet;
import java.util.Iterator;

/**
 * A <code>CharLmHmmChunker</code> employs a hidden Markov model
 * estimator and tokenizer factory to learn a chunker.  The estimator
 * used is an instance of {@link AbstractHmmEstimator} for underlying
 * HMM estimation.  It uses a tokenizer factory to break the chunks
 * down into sequences of tokens and tags.
 *
 * <h4>Training</h4>
 *
 * <p>This class implements the {@link ChunkHandler} and {@link
 * TagHandler} interfaces, either of which may be used to supply
 * training instances.  Every training event is used to train the
 * underlying HMM.  Training instances are supplied through the chunk
 * handler in the usual way.
 *
 * <p>Training instances for the tag handler
 * require the standard BIO tagging scheme, in which the first token in
 * a chunk of type <code><i>X</i></code> is tagged
 * <code>B-<i>X</i></code> (&quot;begin&quot;), with all subsequent
 * tokens in the same chunk tagged <code>I-<i>X</i></code>
 * (&quot;in&quot;).  All tokens not in chunks are tagged
 * <code>O</code>.  For example, the tags required for training are:
 *
 * <blockquote><pre>
 * Yesterday       O
 * afternoon       O
 * ,               O
 * John            B-PER
 * J               I-PER
 * .               I-PER
 * Smith           I-PER
 * traveled        O
 * to              O
 * Washington      O
 * .               O</pre></blockquote>
 *
 * This is the same tagging scheme supplied in several corpora (Penn
 * BioIE, CoNLL, etc.)  Note that this is <i>not</i> the same tag
 * scheme used for the underlying HMM.  The simpler tag scheme shown
 * above is first converted to the more fine-grained tag scheme
 * described in the class documentation for {@link HmmChunker}.
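 *
 * <p>For illustration only, the example above could be supplied
 * through the {@link TagHandler} interface roughly as follows.  This
 * sketch is not part of the original documentation; it assumes the
 * three-array <code>handle</code> signature of {@link TagHandler} and
 * passes <code>null</code> whitespaces for brevity:
 *
 * <blockquote><pre>
 * String[] toks = { &quot;Yesterday&quot;, &quot;afternoon&quot;, &quot;,&quot;,
 *                   &quot;John&quot;, &quot;J&quot;, &quot;.&quot;, &quot;Smith&quot;,
 *                   &quot;traveled&quot;, &quot;to&quot;, &quot;Washington&quot;, &quot;.&quot; };
 * String[] tags = { &quot;O&quot;, &quot;O&quot;, &quot;O&quot;,
 *                   &quot;B-PER&quot;, &quot;I-PER&quot;, &quot;I-PER&quot;, &quot;I-PER&quot;,
 *                   &quot;O&quot;, &quot;O&quot;, &quot;O&quot;, &quot;O&quot; };
 * charLmHmmChunker.handle(toks, null, tags);</pre></blockquote>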
 *
 * <h4>Training with a Dictionary</h4>
 *
 * This chunker may be trained with dictionary entries through the
 * method {@link #trainDictionary(CharSequence cSeq, String type)}.
 * Calling this method trains the emission probabilities for
 * the relevant tags determined by tokenizing the specified character
 * sequence (after conversion to the underlying tag scheme defined
 * in {@link HmmChunker}).
 *
 * <p><b>Warning:</b> It is not enough to just train with a dictionary.
 * Dictionaries do not train the contexts in which elements show up.
 * Ordinary training data must also be supplied, and this data must have
 * some elements which are not part of chunks in order to train the
 * out tags.  If only a dictionary is used to train, null pointer exceptions
 * will show up at run time.
 *
 * <p>For example, calling
 *
 * <blockquote><pre>
 * charLmHmmChunker.trainDictionary(&quot;Washington&quot;, &quot;LOCATION&quot;);</pre></blockquote>
 *
 * would provide the token &quot;Washington&quot; as a training case
 * for emission from the tag <code>W_LOCATION</code>--the <code>W_</code>
 * prefix appears because <code>trainDictionary</code> uses the richer
 * tag set of {@link HmmChunker}.  Alternatively, calling:
 *
 * <blockquote><pre>
 * charLmHmmChunker.trainDictionary(&quot;John J. Smith&quot;, &quot;PERSON&quot;);</pre></blockquote>
 *
 * would train the tag <code>B_PERSON</code>
 * with the token &quot;John&quot;, the tag <code>M_PERSON</code>
 * with the tokens &quot;J&quot; and &quot;.&quot;,
 * and the tag <code>E_PERSON</code> with the
 * token &quot;Smith&quot;.  Furthermore, in this case, the transition
 * probabilities receive training instances for the three
 * transitions: <code>B_PERSON</code> to <code>M_PERSON</code>,
 * <code>M_PERSON</code> to <code>M_PERSON</code>, and finally,
 * <code>M_PERSON</code> to <code>E_PERSON</code>.
 *
 * <p>Note that there is no method to train non-chunk tokens, because
 * the categories assigned to them are context-specific, being
 * determined by the surrounding tokens.  An effective way to train
 * out categories in general is to supply them as part of entire
 * sentences that have no chunks in them, as shown in the sketch
 * below.  Note that this only trains
 * the begin-sentence, end-sentence and internal tags for non-chunked
 * tokens.
 *
 * <p>To be useful, the dictionary entries must match the chunks that
 * should be found.  For instance, in the MUC training data, there are
 * many instances of <code>USAir</code>, the name of a United States
 * airline.  It might be thought that stock listings would help the
 * extraction of company names, but in fact, the company is
 * &quot;officially&quot; known as <code>USAirways Group</code>.
 *
 * <p>It is also important that training with dictionaries not be
 * done with huge diffuse dictionaries that wind up smoothing the
 * language models too much.  For example, training just locations
 * with a two-million-entry location gazetteer, once per entry, will
 * leave obscure locations with estimates close to those of New York
 * or Beijing.
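 *
 * <p>For instance, the out tags could be trained by supplying a
 * chunk-free sentence through the {@link ChunkHandler} interface.
 * This sketch is not part of the original documentation; it assumes
 * the {@link ChunkingImpl} constructor taking a character sequence:
 *
 * <blockquote><pre>
 * ChunkingImpl chunking = new ChunkingImpl(&quot;It rained all afternoon.&quot;);
 * // no chunks added, so every token trains an out tag
 * charLmHmmChunker.handle(chunking);</pre></blockquote>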
 *
 * <h4>Tag Smoothing</h4>
 *
 * <p>The constructor {@link
 * #CharLmHmmChunker(TokenizerFactory,AbstractHmmEstimator,boolean)}
 * accepts a flag that determines whether to smooth tag transition
 * probabilities.  If the flag is set to <code>true</code> in the
 * constructor, every time a new symbol is seen in the training data,
 * all of its relevant underlying tags are added to the symbol table
 * and all legal transitions among them and all other tags are
 * incremented by one.
 *
 * <p>If smoothing is turned off, only tag-tag transitions seen in the
 * training data are allowed.
 *
 * <p>The begin-sentence and end-sentence tags are automatically added
 * in the constructor, so that if no training data is provided, a
 * chunking with no chunks is returned.  This smoothing may not be
 * turned off.  Thus there will always be a non-zero probability in
 * the underlying HMM of starting with the tag <code>BB_O_BOS</code> or
 * <code>WW_O_BOS</code>, and of ending with the tag <code>EE_O_BOS</code>
 * or <code>WW_O_BOS</code>.  There will also always be a non-zero
 * probability of transitioning from
 * <code>BB_O_BOS</code> to <code>MM_O</code> and
 * to <code>EE_O_BOS</code>, and of transitioning from <code>MM_O</code> to
 * <code>MM_O</code> and <code>EE_O_BOS</code>.
 *
 * <h4>Compilation</h4>
 *
 * <p>This class implements the {@link Compilable} interface.  To
 * compile a static model from the current state of training, call the
 * method {@link #compileTo(ObjectOutput)}.  Reading an
 * object from the corresponding object input stream will produce a
 * compiled HMM chunker of class {@link HmmChunker}, with the same
 * estimates as the current state of the chunker being compiled.
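 *
 * <p>For instance, compilation may also be done in memory.  This
 * sketch is not part of the original documentation; it assumes the
 * static convenience method
 * {@link AbstractExternalizable#compile(Compilable)}:
 *
 * <blockquote><pre>
 * CharLmHmmChunker chunker = ...;
 * // ... supply training instances ...
 * HmmChunker compiledChunker
 *     = (HmmChunker) AbstractExternalizable.compile(chunker);</pre></blockquote>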
 *
 * <h4>Caching</h4>
 *
 * <p>Caching is turned off on the HMM decoder for this class by default.
 * If caching is turned on for instances of this class (through the
 * method {@link #getDecoder()} inherited from
 * <code>HmmChunker</code>), then training instances will fail to be
 * reflected in cached estimates and the results may be inconsistent
 * and may lead to exceptions.  Caching may be turned on once there
 * will be no more training instances, but in this case, it is
 * almost always more efficient to just compile the model and turn
 * caching on for that.
 *
 * <p>After compilation, the returned chunker will have caching turned
 * off by default.  To turn on caching for the compiled model, which is
 * highly recommended for efficiency, retrieve the HMM decoder and set
 * its cache.  For instance, to set up caching for both log estimates
 * and linear estimates, use the code:
 *
 * <blockquote><pre>
 * ObjectInput objIn = ...;
 * HmmChunker chunker = (HmmChunker) objIn.readObject();
 * HmmDecoder decoder = chunker.getDecoder();
 * decoder.setEmissionCache(new FastCache(1000000));
 * decoder.setEmissionLog2Cache(new FastCache(1000000));
 * </pre></blockquote>
 *
 * <h3>Reserved Tag</h3>
 *
 * <p>The tag <code>BOS</code> is reserved for use by the system
 * for encoding document start/end positions.  See {@link HmmChunker}
 * for more information.
 *
 * @author  Bob Carpenter
 * @version 3.1
 * @since   LingPipe2.2
 */
public class CharLmHmmChunker extends HmmChunker
    implements Compilable, ChunkHandler, TagHandler {

    private final AbstractHmmEstimator mHmmEstimator;
    private final TokenizerFactory mTokenizerFactory;
    private final HashSet mTagSet = new HashSet();
    private final boolean mSmoothTags;

    /**
     * Construct a <code>CharLmHmmChunker</code> from the specified
     * tokenizer factory and hidden Markov model estimator.  Smoothing
     * is turned off by default.  See {@link
     * #CharLmHmmChunker(TokenizerFactory,AbstractHmmEstimator,boolean)}
     * for more information.
     *
     * @param tokenizerFactory Tokenizer factory to tokenize chunks.
     * @param hmmEstimator Underlying HMM estimator.
     */
    public CharLmHmmChunker(TokenizerFactory tokenizerFactory,
                            AbstractHmmEstimator hmmEstimator) {
        this(tokenizerFactory,hmmEstimator,false);
    }

    /**
     * Construct a <code>CharLmHmmChunker</code> from the specified
     * tokenizer factory, HMM estimator and tag-smoothing flag.
     *
     * <p>If smoothing is turned on, then every time a new entity
     * type is seen in the training data, all possible underlying
     * tags involving that type are added to the symbol table,
     * and every legal transition among these tags and all other tags
     * is incremented by a count of 1.
     *
     * <p>The tokenizer factory must be compilable in order for the
     * model to be compiled.  If it is not compilable, then attempting
     * to compile the model will raise an exception.
     *
     * @param tokenizerFactory Tokenizer factory to tokenize chunks.
     * @param hmmEstimator Underlying HMM estimator.
     * @param smoothTags Set to <code>true</code> for tag smoothing.
     */
    public CharLmHmmChunker(TokenizerFactory tokenizerFactory,
                            AbstractHmmEstimator hmmEstimator,
                            boolean smoothTags) {
        super(tokenizerFactory,new HmmDecoder(hmmEstimator));
        mHmmEstimator = hmmEstimator;
        mTokenizerFactory = tokenizerFactory;
        mSmoothTags = smoothTags;
        smoothBoundaries();
    }

    /**
     * Returns the underlying hidden Markov model estimator for this
     * chunker estimator.  This is the actual estimator used by this
     * class, so changes to it will affect this class's chunk
     * estimates.
     *
     * @return The underlying HMM estimator.
     */
    public AbstractHmmEstimator getHmmEstimator() {
        return mHmmEstimator;
    }

    /**
     * Return the tokenizer factory for this chunker.
     *
     * @return The tokenizer factory for this chunker.
     */
    public TokenizerFactory getTokenizerFactory() {
        return mTokenizerFactory;
    }

    /**
     * Train the underlying hidden Markov model based on the specified
     * character sequence being of the specified type.  As described
     * in the class documentation above, this only trains the emission
     * probabilities and internal transitions for the character
     * sequence, based on the underlying tokenizer factory.
     *
     * <p><b>Warning:</b> Chunkers cannot only be trained with
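
The listing ends here at page 1 of 2, truncated mid-comment. For orientation, a minimal end-to-end usage sketch follows. It is not part of the source file: the estimator parameters, sentence, and chunk offsets are made up, and it assumes the LingPipe 3.5 API for HmmCharLmEstimator, IndoEuropeanTokenizerFactory, ChunkFactory, ChunkingImpl, and AbstractExternalizable.compile.

import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.chunk.ChunkFactory;
import com.aliasi.chunk.ChunkingImpl;
import com.aliasi.chunk.HmmChunker;
import com.aliasi.hmm.HmmCharLmEstimator;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

public class ChunkerSketch {
    public static void main(String[] args) throws Exception {
        // Character-LM HMM estimator: 8-gram character model,
        // 256-character alphabet, interpolation ratio 8.0
        // (illustrative settings only; tune for real data).
        HmmCharLmEstimator estimator = new HmmCharLmEstimator(8, 256, 8.0);

        CharLmHmmChunker chunker =
            new CharLmHmmChunker(new IndoEuropeanTokenizerFactory(),
                                 estimator,
                                 true);  // smooth tag transitions

        // One training chunking: "John Smith" spans characters 0..10
        // and "Washington" spans characters 23..33 of the sentence.
        ChunkingImpl chunking =
            new ChunkingImpl("John Smith traveled to Washington.");
        chunking.add(ChunkFactory.createChunk(0, 10, "PERSON"));
        chunking.add(ChunkFactory.createChunk(23, 33, "LOCATION"));
        chunker.handle(chunking);

        // Compile to a static HmmChunker for decoding.
        HmmChunker compiled =
            (HmmChunker) AbstractExternalizable.compile(chunker);
        System.out.println(compiled.chunk("Mary traveled to Beijing."));
    }
}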
