HmmChunker.java
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */
package com.aliasi.chunk;

import com.aliasi.corpus.ChunkTagHandlerAdapter;
import com.aliasi.hmm.HmmDecoder;
import com.aliasi.hmm.TagWordLattice;
import com.aliasi.symbol.SymbolTable;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.BoundedPriorityQueue;
import com.aliasi.util.Iterators;
import com.aliasi.util.Scored;
import com.aliasi.util.ScoredObject;
import com.aliasi.util.Strings;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

/**
 * An <code>HmmChunker</code> uses a hidden Markov model to perform
 * chunking over tokenized character sequences.  Instances contain a
 * hidden Markov model, a decoder for the model, and a tokenizer
 * factory.
 *
 * <p>Chunking results are available through three related methods.
 * The method {@link #chunk(CharSequence)} and its sister method
 * {@link #chunk(char[],int,int)} implement the {@link Chunking}
 * interface by returning the first-best chunking for the argument
 * character sequence or slice.  The method {@link
 * #nBest(char[],int,int,int)} returns an iterator over complete
 * chunkings and their joint probability estimates in descending
 * order of probability, with the last argument supplying an upper
 * bound on the number of chunkings returned.  Finally, the method
 * {@link #nBestChunks(char[],int,int,int)} returns an iterator over
 * the chunks themselves, this time in descending order of each
 * chunk's conditional probability given the input (i.e. descending
 * confidence), with the final argument providing an upper bound on
 * the number of such chunks returned.
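 *
 * <p>As a minimal usage sketch, the following hypothetical helper
 * (not part of this class) prints the first-best chunks for a text,
 * assuming a fully trained and configured chunker:
 *
 * <blockquote><pre>
 * static void printChunks(HmmChunker chunker, String text) {
 *     Chunking chunking = chunker.chunk(text);
 *     for (Chunk chunk : chunking.chunkSet())   // chunks in first-best analysis
 *         System.out.println(chunk.type()
 *                            + " "
 *                            + chunking.charSequence()
 *                                  .subSequence(chunk.start(),chunk.end()));
 * }</pre></blockquote>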
 *
 * <h4>Internal Token/Tag Encoding</h4>
 *
 * <p>The chunker requires a hidden Markov model whose states
 * conform to a token-by-token encoding of a chunking.  This
 * class assumes the following encoding:
 *
 * <blockquote><table border="1" cellpadding="5">
 * <tr><th align="left">Tag</th>
 *     <th align="left">Description of Tokens to which it is Assigned</th>
 *     <th align="left">May Follow</th>
 *     <th align="left">May Precede</th></tr>
 * <tr><td><code>B_<i>X</i></code></td>
 *     <td>Initial (begin) token of chunk of type <code><i>X</i></code></td>
 *     <td><code>E_<i>Y</i>, W_<i>Y</i>, EE_O_<i>X</i>, WW_O_<i>X</i></code></td>
 *     <td><code>M_<i>X</i>, E_<i>X</i></code></td></tr>
 * <tr><td><code>M_<i>X</i></code></td>
 *     <td>Interior (middle) token of chunk of type <code><i>X</i></code></td>
 *     <td><code>B_<i>X</i>, M_<i>X</i></code></td>
 *     <td><code>M_<i>X</i>, E_<i>X</i></code></td></tr>
 * <tr><td><code>E_<i>X</i></code></td>
 *     <td>Final (end) token of chunk of type <code><i>X</i></code></td>
 *     <td><code>B_<i>X</i>, M_<i>X</i></code></td>
 *     <td><code>B_<i>Y</i>, W_<i>Y</i>, BB_O_<i>X</i>, WW_O_<i>Y</i></code></td></tr>
 * <tr><td><code>W_<i>X</i></code></td>
 *     <td>Token by itself comprising a (whole) chunk of type <code><i>X</i></code></td>
 *     <td><code>E_<i>Y</i>, W_<i>Y</i>, EE_O_<i>X</i>, WW_O_<i>X</i></code></td>
 *     <td><code>B_<i>Y</i>, W_<i>Y</i>, BB_O_<i>X</i>, WW_O_<i>Y</i></code></td></tr>
 * <tr><td><code>BB_O_<i>X</i></code></td>
 *     <td>Token not in chunk, previous token ending chunk of type <code><i>X</i></code></td>
 *     <td><code>E_<i>X</i>, W_<i>X</i></code></td>
 *     <td><code>MM_O, EE_O_<i>Y</i></code></td></tr>
 * <tr><td><code>MM_O</code></td>
 *     <td>Token, previous token and following token not in chunk</td>
 *     <td><code>BB_O_<i>Y</i>, MM_O</code></td>
 *     <td><code>MM_O, EE_O_<i>Y</i></code></td></tr>
 * <tr><td><code>EE_O_<i>X</i></code></td>
 *     <td>Token and previous token not in a chunk, following
 *         token begins chunk of type <code><i>X</i></code></td>
 *     <td><code>BB_O_<i>Y</i>, MM_O</code></td>
 *     <td><code>B_<i>X</i>, W_<i>X</i></code></td></tr>
 * <tr><td><code>WW_O_<i>X</i></code></td>
 *     <td>Token not in chunk, previous token ended a chunk,
 *         following token begins chunk of type <code><i>X</i></code></td>
 *     <td><code>E_<i>Y</i>, W_<i>Y</i></code></td>
 *     <td><code>B_<i>X</i>, W_<i>X</i></code></td></tr>
 * </table></blockquote>
 *
 * The intention here is that the <code><i>X</i></code> tags in the
 * last two columns (legal followers and preceders) match the tag
 * in the first column, whereas the <code><i>Y</i></code> tags
 * vary freely.
 *
 * Note that this encoding produces the following number of states:
 *
 * <blockquote><pre>
 * numTags = (7 * numTypes) + 1</pre></blockquote>
 *
 * Not all transitions between states are legal; the ones ruled out
 * in the table above must receive zero probability estimates.  The
 * number of legal transitions is given by:
 *
 * <blockquote><pre>
 * numTransitions = 5*numTypes<sup>2</sup> + 13*numTypes + 1</pre></blockquote>
 *
 * For instance, with three chunk types there are 7&times;3 + 1 = 22
 * states and 5&times;9 + 13&times;3 + 1 = 85 legal transitions.
 *
 * <p>By including an indication of the position in a chunk, an HMM
 * is able to model tokens that start and end chunks, as well as
 * those that fall in the middle of chunks or make up chunks on
 * their own.  It also models tokens that precede or follow chunks
 * of a given type.  For instance, consider the following
 * tokenization and tagging, with implicit <code>W_OOS</code> tags
 * for the out-of-sentence (<code>OOS</code>) positions:
 *
 * <blockquote><pre>
 * (W_OOS)
 * Yesterday    BB_O_OOS
 * afternoon    MM_O
 * ,            EE_O_PER
 * John         B_PER
 * J            M_PER
 * .            M_PER
 * Smith        E_PER
 * traveled     BB_O_PER
 * to           EE_O_LOC
 * Washington   W_LOC
 * .            WW_O_OOS
 * (W_OOS)</pre></blockquote>
 *
 * First note that the person chunk <code>John J. Smith</code>
 * consists of four tokens: <code>John</code> with a begin-person
 * tag, <code>J</code> and <code>.</code> with middle-person tags,
 * and <code>Smith</code> with an end-person tag.  In contrast, the
 * token <code>Washington</code> makes up a location chunk all by
 * itself.
 *
 * <p>There are several flavors of tags assigned to tokens that are
 * not part of chunks, based on the status of the surrounding tokens.
 * First, <code>BB_O_OOS</code> is the tag assigned to
 * <code>Yesterday</code>, because it is an out token that follows
 * (an implicit) <code>OOS</code> tag.  That is, it is the first out
 * token following out-of-sentence.  This allows the tag to capture
 * the capitalization pattern of sentence-initial tokens that are
 * not part of chunks.  The interior token <code>afternoon</code> is
 * simply assigned <code>MM_O</code>; its context does not allow it
 * to see any surrounding chunks.  At the other end of the sentence,
 * the final period token is assigned the tag <code>WW_O_OOS</code>,
 * because it precedes the (implicit) <code>OOS</code>
 * (out-of-sentence) chunk.  This allows some discrimination between
 * sentence-final punctuation and other punctuation.
 *
 * <p>Next note that the token <code>traveled</code> is assigned to
 * the category of first tokens following a person, whereas
 * <code>to</code> is assigned to the category of final tokens
 * preceding a location.  Finally, note the tag <code>MM_O</code>
 * assigned to the token <code>afternoon</code>, which appears
 * between two other tokens that are not part of chunks.
 *
 * <p>If taggings of this sort are required rather than chunkings,
 * the HMM decoder may be retrieved via {@link #getDecoder()} and
 * used along with the tokenizer factory retrieved through {@link
 * #getTokenizerFactory()} to produce taggings.
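 *
 * <p>A minimal sketch of such a tagging (the helper name and
 * variables are hypothetical; error handling is elided):
 *
 * <blockquote><pre>
 * static String[] tagTokens(HmmChunker chunker, String text) {
 *     char[] cs = text.toCharArray();
 *     Tokenizer tokenizer
 *         = chunker.getTokenizerFactory().tokenizer(cs,0,cs.length);
 *     String[] toks = tokenizer.tokenize();
 *     return chunker.getDecoder().firstBest(toks);  // one tag per token
 * }</pre></blockquote>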
 *
 * <h4>Training</h4>
 *
 * <p>The class {@link CharLmHmmChunker} may be used to train a
 * chunker using an HMM estimator such as {@link
 * com.aliasi.hmm.HmmCharLmEstimator} to estimate the HMM.  This
 * estimator uses bounded character language models to estimate
 * emission probabilities.
 *
 * <h4>Caching</h4>
 *
 * <p>The efficiency of the chunker may be improved (at the expense
 * of increased memory usage) if caching is turned on for the HMM
 * decoder.  The first-best and n-best results use only log
 * probabilities, and hence require only caching of the log
 * probabilities.  The confidence-based estimator uses only the
 * linear estimates, and hence requires only caching of the linear
 * probabilities.
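 *
 * <p>A minimal configuration sketch, assuming a trained model
 * <code>hmm</code> and a factory <code>tokenizerFactory</code>
 * (hypothetical names; any mutable maps may serve as caches):
 *
 * <blockquote><pre>
 * HmmDecoder decoder = new HmmDecoder(hmm);
 * decoder.setEmissionLog2Cache(new java.util.HashMap()); // first-best, n-best
 * decoder.setEmissionCache(new java.util.HashMap());     // confidence chunking
 * HmmChunker chunker = new HmmChunker(tokenizerFactory,decoder);
 * </pre></blockquote>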
 *
 * @author  Bob Carpenter
 * @version 3.0
 * @since   LingPipe2.2
 */
public class HmmChunker implements NBestChunker, ConfidenceChunker {

    private final TokenizerFactory mTokenizerFactory;
    private final HmmDecoder mDecoder;

    /**
     * Construct a chunker from the specified tokenizer factory
     * and hidden Markov model decoder.  The decoder should operate
     * over a tag set as specified in the class documentation above.
     * Once constructed, the tokenizer factory and decoder may not
     * be changed.
     *
     * <p>See the note in the class documentation concerning caching
     * in the decoder.  A typical application will configure the
     * cache of the decoder before creating an HMM chunker.  See the
     * class documentation for {@link HmmDecoder}, as well as the
     * method documentation for {@link HmmDecoder#setEmissionCache(Map)}
     * and {@link HmmDecoder#setEmissionLog2Cache(Map)} for more
     * information.
     *
     * @param tokenizerFactory Tokenizer factory for tokenization.
     * @param decoder Hidden Markov model decoder.
     */
    public HmmChunker(TokenizerFactory tokenizerFactory,
                      HmmDecoder decoder) {
        mTokenizerFactory = tokenizerFactory;
        mDecoder = decoder;
    }

    /**
     * Returns the underlying hidden Markov model decoder for this
     * chunker.  This is the actual decoder used by this chunker, so
     * any changes to it will affect this chunker.
     *
     * <p>The decoder provides access to the underlying hidden
     * Markov model for this chunker.
     *
     * @return The decoder for this chunker.
     */
    public HmmDecoder getDecoder() {
        return mDecoder;
    }

    /**
     * Returns the underlying tokenizer factory for this chunker.
     *
     * @return The tokenizer factory for this chunker.
     */
    public TokenizerFactory getTokenizerFactory() {
        return mTokenizerFactory;
    }

    /**
     * Returns a chunking of the specified character slice.  This is
     * the chunking determined by the first-best hidden Markov model
     * tags given the tokenization performed by the underlying
     * factory.  More information about the underlying HMM is
     * provided in the class documentation above.
     *
     * @param cs Array of characters.
     * @param start Index of first character.
     * @param end Index of one past last character.
     * @return First-best chunking of the specified character slice.
     * @throws IndexOutOfBoundsException If the specified indices are out
     * of bounds of the specified character array.
     */
    public Chunking chunk(char[] cs, int start, int end) {
        Strings.checkArgsStartEnd(cs,start,end);
        Tokenizer tokenizer
            = mTokenizerFactory.tokenizer(cs,start,end-start);
        ArrayList tokList = new ArrayList();
        ArrayList whiteList = new ArrayList();
        tokenizer.tokenize(tokList,whiteList);
        String[] toks = toStringArray(tokList);
        String[] whites = toStringArray(whiteList);
        String[] tags = mDecoder.firstBest(toks);
        decodeNormalize(tags);
        return ChunkTagHandlerAdapter.toChunkingBIO(toks,whites,tags);
    }

    /**
     * Returns a chunking of the specified character sequence.  This
     * is the chunking determined by the first-best hidden Markov
     * model tags given the tokenization performed by the underlying
     * factory.  More information about the underlying HMM is
     * provided in the class documentation above.
     *
     * @param cSeq Character sequence to chunk.
     * @return First-best chunking of the specified character
     * sequence.
     */
    public Chunking chunk(CharSequence cSeq) {
        char[] cs = Strings.toCharArray(cSeq);
        return chunk(cs,0,cs.length);
    }

    /**
     * Returns a size-bounded iterator over scored objects whose
     * scores are joint probability estimates of tags and tokens and
     * whose objects are chunkings.  The maximum number of chunkings
     * returned is specified by the last argument.  This limit
     * should be set as low as possible to reduce the memory
     * requirements of n-best search.
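     *
     * <p>A minimal iteration sketch, assuming a chunker
     * <code>chunker</code> and an input string <code>text</code>
     * (hypothetical names):
     *
     * <blockquote><pre>
     * char[] cs = text.toCharArray();
     * Iterator&lt;ScoredObject&lt;Chunking&gt;&gt; it
     *     = chunker.nBest(cs,0,cs.length,10);
     * while (it.hasNext()) {
     *     ScoredObject&lt;Chunking&gt; so = it.next();
     *     double jointLogProb = so.score();   // joint estimate (log probability)
     *     Chunking chunking = so.getObject();
     * }</pre></blockquote>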
+ " Found maxNBest=" + maxNBest; throw new IllegalArgumentException(msg); } String[][] toksWhites = getToksWhites(cs,start,end); Iterator it = mDecoder.nBest(toksWhites[0],maxNBest); return new NBestIt(it,toksWhites); } /** * Returns an iterator over scored objects with conditional * probability estimates for scores and chunks as objects. This * returns the possible chunks in confidence-ranked order, with * scores being the conditional probability of the chunk given the * input. If the results are exhausted, or after the specified * number of n-best results have been returned, the iterator will * return <code>false</code> when its <code>hasNext()</code> * method is called. * * @param cs Array of characters. * @param start Index of first character. * @param end Index of one past last character. * @param maxNBest Maximum number of chunks returned. * @return Iterator over scored objects containing chunkings and * joint probability estimates. * @throws IndexOutOfBoundsException If the specified indices are out