HmmChunker.java
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */
package com.aliasi.chunk;

import com.aliasi.corpus.ChunkTagHandlerAdapter;
import com.aliasi.hmm.HmmDecoder;
import com.aliasi.hmm.TagWordLattice;
import com.aliasi.symbol.SymbolTable;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.BoundedPriorityQueue;
import com.aliasi.util.Iterators;
import com.aliasi.util.Scored;
import com.aliasi.util.ScoredObject;
import com.aliasi.util.Strings;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

/**
 * An <code>HmmChunker</code> uses a hidden Markov model to perform
 * chunking over tokenized character sequences.  Instances contain a
 * hidden Markov model, a decoder for the model, and a tokenizer
 * factory.
 *
 * <p>Chunking results are available through three related methods.
 * The method {@link #chunk(CharSequence)} and its sister method
 * {@link #chunk(char[],int,int)} implement the {@link Chunking}
 * interface by returning the first-best chunking for the argument
 * character sequence or slice.  The method {@link
 * #nBest(char[],int,int,int)} returns an iterator over complete
 * chunkings and their joint probability estimates in descending
 * order of probability, with the last argument supplying an upper
 * bound on the number of chunkings returned.  Finally, the method
 * {@link #nBestChunks(char[],int,int,int)} returns an iterator over
 * the chunks themselves, this time in descending order of each
 * chunk's conditional probability given the input (i.e. descending
 * confidence), with the final argument providing an upper bound on
 * the number of such chunks returned.
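 *
 * <p>As a minimal usage sketch, the following hypothetical helper
 * (not part of this class) prints the first-best chunks for a text,
 * assuming a fully trained and configured chunker:
 *
 * <blockquote><pre>
 * static void printChunks(HmmChunker chunker, String text) {
 *     Chunking chunking = chunker.chunk(text);
 *     for (Chunk chunk : chunking.chunkSet())   // chunks in first-best analysis
 *         System.out.println(chunk.type()
 *                            + " "
 *                            + chunking.charSequence()
 *                                  .subSequence(chunk.start(),chunk.end()));
 * }</pre></blockquote>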
 *
 * <h4>Internal Token/Tag Encoding</h4>
 *
 * <p>The chunker requires a hidden Markov model whose states
 * conform to a token-by-token encoding of a chunking.  This
 * class assumes the following encoding:
 *
 * <blockquote><table border="1" cellpadding="5">
 * <tr><th align="left">Tag</th>
 *     <th align="left">Description of Tokens to which it is Assigned</th>
 *     <th align="left">May Follow</th>
 *     <th align="left">May Precede</th></tr>
 * <tr><td><code>B_<i>X</i></code></td>
 *     <td>Initial (begin) token of chunk of type <code><i>X</i></code></td>
 *     <td><code>E_<i>Y</i>, W_<i>Y</i>, EE_O_<i>X</i>, WW_O_<i>X</i></code></td>
 *     <td><code>M_<i>X</i>, E_<i>X</i></code></td></tr>
 * <tr><td><code>M_<i>X</i></code></td>
 *     <td>Interior (middle) token of chunk of type <code><i>X</i></code></td>
 *     <td><code>B_<i>X</i>, M_<i>X</i></code></td>
 *     <td><code>M_<i>X</i>, E_<i>X</i></code></td></tr>
 * <tr><td><code>E_<i>X</i></code></td>
 *     <td>Final (end) token of chunk of type <code><i>X</i></code></td>
 *     <td><code>B_<i>X</i>, M_<i>X</i></code></td>
 *     <td><code>B_<i>Y</i>, W_<i>Y</i>, BB_O_<i>X</i>, WW_O_<i>Y</i></code></td></tr>
 * <tr><td><code>W_<i>X</i></code></td>
 *     <td>Token by itself comprising a (whole) chunk of type <code><i>X</i></code></td>
 *     <td><code>E_<i>Y</i>, W_<i>Y</i>, EE_O_<i>X</i>, WW_O_<i>X</i></code></td>
 *     <td><code>B_<i>Y</i>, W_<i>Y</i>, BB_O_<i>X</i>, WW_O_<i>Y</i></code></td></tr>
 * <tr><td><code>BB_O_<i>X</i></code></td>
 *     <td>Token not in chunk, previous token ending chunk of type <code><i>X</i></code></td>
 *     <td><code>E_<i>X</i>, W_<i>X</i></code></td>
 *     <td><code>MM_O, EE_O_<i>Y</i></code></td></tr>
 * <tr><td><code>MM_O</code></td>
 *     <td>Token, previous token and following token not in chunk</td>
 *     <td><code>BB_O_<i>Y</i>, MM_O</code></td>
 *     <td><code>MM_O, EE_O_<i>Y</i></code></td></tr>
 * <tr><td><code>EE_O_<i>X</i></code></td>
 *     <td>Token and previous token not in a chunk, following
 *         token begins chunk of type <code><i>X</i></code></td>
 *     <td><code>BB_O_<i>Y</i>, MM_O</code></td>
 *     <td><code>B_<i>X</i>, W_<i>X</i></code></td></tr>
 * <tr><td><code>WW_O_<i>X</i></code></td>
 *     <td>Token not in chunk, previous token ended a chunk,
 *         following token begins chunk of type <code><i>X</i></code></td>
 *     <td><code>E_<i>Y</i>, W_<i>Y</i></code></td>
 *     <td><code>B_<i>X</i>, W_<i>X</i></code></td></tr>
 * </table></blockquote>
 *
 * The intention here is that the <code><i>X</i></code> tags in the
 * last two columns (legal followers and preceders) match the tag
 * in the first column, whereas the <code><i>Y</i></code> tags
 * vary freely.
 *
 * Note that this encoding produces the following number of states:
 *
 * <blockquote><pre>
 * numTags = (7 * numTypes) + 1</pre></blockquote>
 *
 * Not all transitions between states are legal; the ones ruled out
 * in the table above must receive zero probability estimates.  The
 * number of legal transitions is given by:
 *
 * <blockquote><pre>
 * numTransitions = 5*numTypes<sup>2</sup> + 13*numTypes + 1</pre></blockquote>
 *
 * For instance, with three chunk types there are 7&times;3 + 1 = 22
 * states and 5&times;9 + 13&times;3 + 1 = 85 legal transitions.
 *
 * <p>By including an indication of the position in a chunk, an HMM
 * is able to model tokens that start and end chunks, as well as
 * those that fall in the middle of chunks or make up chunks on
 * their own.  It also models tokens that precede or follow chunks
 * of a given type.  For instance, consider the following
 * tokenization and tagging, with implicit <code>W_OOS</code> tags
 * for the out-of-sentence (<code>OOS</code>) positions:
 *
 * <blockquote><pre>
 * (W_OOS)
 * Yesterday    BB_O_OOS
 * afternoon    MM_O
 * ,            EE_O_PER
 * John         B_PER
 * J            M_PER
 * .            M_PER
 * Smith        E_PER
 * traveled     BB_O_PER
 * to           EE_O_LOC
 * Washington   W_LOC
 * .            WW_O_OOS
 * (W_OOS)</pre></blockquote>
 *
 * First note that the person chunk <code>John J. Smith</code>
 * consists of four tokens: <code>John</code> with a begin-person
 * tag, <code>J</code> and <code>.</code> with middle-person tags,
 * and <code>Smith</code> with an end-person tag.  In contrast, the
 * token <code>Washington</code> makes up a location chunk all by
 * itself.
 *
 * <p>There are several flavors of tags assigned to tokens that are
 * not part of chunks, based on the status of the surrounding tokens.
 * First, <code>BB_O_OOS</code> is the tag assigned to
 * <code>Yesterday</code>, because it is an out token that follows
 * (an implicit) <code>OOS</code> tag.  That is, it is the first out
 * token following out-of-sentence.  This allows the tag to capture
 * the capitalization pattern of sentence-initial tokens that are
 * not part of chunks.  The interior token <code>afternoon</code> is
 * simply assigned <code>MM_O</code>; its context does not allow it
 * to see any surrounding chunks.  At the other end of the sentence,
 * the final period token is assigned the tag <code>WW_O_OOS</code>,
 * because it precedes the (implicit) <code>OOS</code>
 * (out-of-sentence) chunk.  This allows some discrimination between
 * sentence-final punctuation and other punctuation.
 *
 * <p>Next note that the token <code>traveled</code> is assigned to
 * the category of first tokens following a person, whereas
 * <code>to</code> is assigned to the category of final tokens
 * preceding a location.  Finally, note the tag <code>MM_O</code>
 * assigned to the token <code>afternoon</code>, which appears
 * between two other tokens that are not part of chunks.
 *
 * <p>If taggings of this sort are required rather than chunkings,
 * the HMM decoder may be retrieved via {@link #getDecoder()} and
 * used along with the tokenizer factory retrieved through {@link
 * #getTokenizerFactory()} to produce taggings.
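 *
 * <p>A minimal sketch of such a tagging (the helper name and
 * variables are hypothetical; error handling is elided):
 *
 * <blockquote><pre>
 * static String[] tagTokens(HmmChunker chunker, String text) {
 *     char[] cs = text.toCharArray();
 *     Tokenizer tokenizer
 *         = chunker.getTokenizerFactory().tokenizer(cs,0,cs.length);
 *     String[] toks = tokenizer.tokenize();
 *     return chunker.getDecoder().firstBest(toks);  // one tag per token
 * }</pre></blockquote>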
 *
 * <h4>Training</h4>
 *
 * <p>The class {@link CharLmHmmChunker} may be used to train a
 * chunker using an HMM estimator such as {@link
 * com.aliasi.hmm.HmmCharLmEstimator} to estimate the HMM.  This
 * estimator uses bounded character language models to estimate
 * emission probabilities.
 *
 * <h4>Caching</h4>
 *
 * <p>The efficiency of the chunker may be improved (at the expense
 * of increased memory usage) if caching is turned on for the HMM
 * decoder.  The first-best and n-best results use only log
 * probabilities, and hence require only caching of the log
 * probabilities.  The confidence-based estimator uses only the
 * linear estimates, and hence requires only caching of the linear
 * probabilities.
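 *
 * <p>A minimal configuration sketch, assuming a trained model
 * <code>hmm</code> and a factory <code>tokenizerFactory</code>
 * (hypothetical names; any mutable maps may serve as caches):
 *
 * <blockquote><pre>
 * HmmDecoder decoder = new HmmDecoder(hmm);
 * decoder.setEmissionLog2Cache(new java.util.HashMap()); // first-best, n-best
 * decoder.setEmissionCache(new java.util.HashMap());     // confidence chunking
 * HmmChunker chunker = new HmmChunker(tokenizerFactory,decoder);
 * </pre></blockquote>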
 *
 * @author  Bob Carpenter
 * @version 3.0
 * @since   LingPipe2.2
 */
public class HmmChunker implements NBestChunker, ConfidenceChunker {

    private final TokenizerFactory mTokenizerFactory;
    private final HmmDecoder mDecoder;

    /**
     * Construct a chunker from the specified tokenizer factory
     * and hidden Markov model decoder.  The decoder should operate
     * over a tag set as specified in the class documentation above.
     * Once constructed, the tokenizer factory and decoder may not
     * be changed.
     *
     * <p>See the note in the class documentation concerning caching
     * in the decoder.  A typical application will configure the
     * cache of the decoder before creating an HMM chunker.  See the
     * class documentation for {@link HmmDecoder}, as well as the
     * method documentation for {@link HmmDecoder#setEmissionCache(Map)}
     * and {@link HmmDecoder#setEmissionLog2Cache(Map)} for more
     * information.
     *
     * @param tokenizerFactory Tokenizer factory for tokenization.
     * @param decoder Hidden Markov model decoder.
     */
    public HmmChunker(TokenizerFactory tokenizerFactory,
                      HmmDecoder decoder) {
        mTokenizerFactory = tokenizerFactory;
        mDecoder = decoder;
    }

    /**
     * Returns the underlying hidden Markov model decoder for this
     * chunker.  This is the actual decoder used by this chunker, so
     * any changes to it will affect this chunker.
     *
     * <p>The decoder provides access to the underlying hidden
     * Markov model for this chunker.
     *
     * @return The decoder for this chunker.
     */
    public HmmDecoder getDecoder() {
        return mDecoder;
    }

    /**
     * Returns the underlying tokenizer factory for this chunker.
     *
     * @return The tokenizer factory for this chunker.
     */
    public TokenizerFactory getTokenizerFactory() {
        return mTokenizerFactory;
    }

    /**
     * Returns a chunking of the specified character slice.  This is
     * the chunking determined by the first-best hidden Markov model
     * tags given the tokenization performed by the underlying
     * factory.  More information about the underlying HMM is
     * provided in the class documentation above.
     *
     * @param cs Array of characters.
     * @param start Index of first character.
     * @param end Index of one past last character.
     * @return First-best chunking of the specified character slice.
     * @throws IndexOutOfBoundsException If the specified indices are out
     * of bounds of the specified character array.
     */
    public Chunking chunk(char[] cs, int start, int end) {
        Strings.checkArgsStartEnd(cs,start,end);
        Tokenizer tokenizer
            = mTokenizerFactory.tokenizer(cs,start,end-start);
        ArrayList tokList = new ArrayList();
        ArrayList whiteList = new ArrayList();
        tokenizer.tokenize(tokList,whiteList);
        String[] toks = toStringArray(tokList);
        String[] whites = toStringArray(whiteList);
        String[] tags = mDecoder.firstBest(toks);
        decodeNormalize(tags);
        return ChunkTagHandlerAdapter.toChunkingBIO(toks,whites,tags);
    }

    /**
     * Returns a chunking of the specified character sequence.  This
     * is the chunking determined by the first-best hidden Markov
     * model tags given the tokenization performed by the underlying
     * factory.  More information about the underlying HMM is
     * provided in the class documentation above.
     *
     * @param cSeq Character sequence to chunk.
     * @return First-best chunking of the specified character
     * sequence.
     */
    public Chunking chunk(CharSequence cSeq) {
        char[] cs = Strings.toCharArray(cSeq);
        return chunk(cs,0,cs.length);
    }

    /**
     * Returns a size-bounded iterator over scored objects whose
     * scores are joint probability estimates of tags and tokens and
     * whose objects are chunkings.  The maximum number of chunkings
     * returned is specified by the last argument.  This limit
     * should be set as low as possible to reduce the memory
     * requirements of n-best search.
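     *
     * <p>A minimal iteration sketch, assuming a chunker
     * <code>chunker</code> and an input string <code>text</code>
     * (hypothetical names):
     *
     * <blockquote><pre>
     * char[] cs = text.toCharArray();
     * Iterator&lt;ScoredObject&lt;Chunking&gt;&gt; it
     *     = chunker.nBest(cs,0,cs.length,10);
     * while (it.hasNext()) {
     *     ScoredObject&lt;Chunking&gt; so = it.next();
     *     double jointLogProb = so.score();   // joint estimate (log probability)
     *     Chunking chunking = so.getObject();
     * }</pre></blockquote>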
+ " Found maxNBest=" + maxNBest; throw new IllegalArgumentException(msg); } String[][] toksWhites = getToksWhites(cs,start,end); Iterator it = mDecoder.nBest(toksWhites[0],maxNBest); return new NBestIt(it,toksWhites); } /** * Returns an iterator over scored objects with conditional * probability estimates for scores and chunks as objects. This * returns the possible chunks in confidence-ranked order, with * scores being the conditional probability of the chunk given the * input. If the results are exhausted, or after the specified * number of n-best results have been returned, the iterator will * return <code>false</code> when its <code>hasNext()</code> * method is called. * * @param cs Array of characters. * @param start Index of first character. * @param end Index of one past last character. * @param maxNBest Maximum number of chunks returned. * @return Iterator over scored objects containing chunkings and * joint probability estimates. * @throws IndexOutOfBoundsException If the specified indices are out