ExactDictionaryChunker.java
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.dict;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.ChunkFactory;
import com.aliasi.chunk.Chunking;
import com.aliasi.chunk.ChunkingImpl;

import com.aliasi.tokenizer.LowerCaseFilterTokenizer;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

import com.aliasi.util.Scored;
import com.aliasi.util.ScoredObject;
import com.aliasi.util.Strings;

import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/**
 * An exact dictionary chunker extracts chunks based on exact matches
 * of tokenized dictionary entries.
 *
 * <p>All dictionary entry categories are converted to strings from
 * generic objects using {@link Object#toString()}.
 *
 * <p>An exact dictionary chunker may be configured either to
 * extract all matching chunks, or to restrict the results to a
 * consistent set of non-overlapping chunks.  These non-overlapping
 * chunks are taken to be the left-most, longest-matching,
 * highest-scoring, or alphabetically preceding in type according
 * to the following definitions.
 * A chunk with span
 * <code>(start1,end1)</code> overlaps a chunk with span
 * <code>(start2,end2)</code> if and only if either end
 * point of the second chunk lies within the first chunk:
 *
 * <ul>
 * <li> <code>start1 &lt;= start2 &lt; end1</code>, or
 * <li> <code>start1 &lt; end2 &lt;= end1</code>.
 * </ul>
 *
 * For instance, <code>(0,1)</code> and <code>(1,3)</code> do
 * not overlap, but
 * <code>(0,1)</code> overlaps <code>(0,2)</code>,
 * <code>(1,2)</code> overlaps <code>(0,2)</code>, and
 * <code>(1,7)</code> overlaps <code>(2,3)</code>.
 *
 * <p>A chunk <code>chunk1=(start1,end1):type1@score1</code> dominates
 * another chunk <code>chunk2=(start2,end2):type2@score2</code> if and
 * only if the chunks overlap and:
 *
 * <ul>
 * <li> <code>start1 &lt; start2</code> (leftmost), or
 * <li> <code>start1 == start2</code>
 *      and <code>end1 &gt; end2</code> (longest), or
 * <li> <code>start1 == start2</code>, <code>end1 == end2</code>
 *      and <code>score1 &gt; score2</code> (highest scoring), or
 * <li> <code>start1 == start2</code>, <code>end1 == end2</code>,
 *      <code>score1 == score2</code> and
 *      <code>type1 &lt; type2</code> (alphabetical).
 * </ul>
 *
 * To construct a non-overlapping result, all dominated chunks are
 * removed.
 *
 * <p>If the chunker is specified to be case sensitive, the exact
 * dictionary entries must match.  If it is not case sensitive, all
 * matching will be done after applying string normalization using
 * {@link java.lang.String#toLowerCase()}.
 *
 * <p>Matching ignores whitespace as defined by the specified
 * tokenizer factory.  The tokenizer factory should have
 * character-for-character aligned tokens with the input.  That is, it
 * should not do stemming, stopword removal, etc., or this chunker
 * will not be able to calculate string positions.
 * Safe tokenizer
 * factories include {@link
 * com.aliasi.tokenizer.IndoEuropeanTokenizerFactory}, {@link
 * com.aliasi.tokenizer.RegExTokenizerFactory}, and {@link
 * com.aliasi.tokenizer.CharacterTokenizerFactory}; unsafe ones
 * include the {@link com.aliasi.tokenizer.NGramTokenizerFactory} and
 * anything user-defined constructed with a filter tokenizer,
 * including {@link
 * com.aliasi.tokenizer.NormalizeWhiteSpaceFilterTokenizer}, {@link
 * com.aliasi.tokenizer.StopFilterTokenizer} or a {@link
 * com.aliasi.tokenizer.PorterStemmerFilterTokenizer}.
 *
 * <p>Chunking is thread safe, and may be run concurrently.  The
 * return-all-matches flag should not be changed with {@link
 * #setReturnAllMatches(boolean)} while a chunking is running, as it
 * may affect whether that running chunking returns all matches.  Once
 * constructed, the tokenizer's behavior should not change.
 *
 * <p><i>Implementation Note:</i> This class is implemented using the
 * Aho-Corasick algorithm, a generalization of the Knuth-Morris-Pratt
 * string-matching algorithm to sets of strings.  Aho-Corasick is
 * linear in the number of tokens in the input plus the number of
 * output chunks.  Memory requirements are only an array of integers
 * as long as the longest phrase (a circular queue for holding start
 * points of potential chunks) and the memory required by the chunking
 * implementation for the result (which may be as large as quadratic
 * in the size of the input, or may be very small if there are not
 * many matches).  Compilation of the Aho-Corasick tree is done in the
 * constructor and is linear in the number of dictionary entries with
 * a constant factor as high as the maximum phrase length; this can be
 * improved to a constant factor using suffix-tree-like speedups, but
 * it didn't seem worth the complexity here when the dictionaries
 * would be long-lived.
 *
 * <ul>
 * <li>Aho, Alfred V. and Margaret J. Corasick. 1975.
 * Efficient string
 * matching: an aid to bibliographic search. <i>CACM</i>,
 * <b>18</b>(6):333-340.
 * <li><a href="http://en.wikipedia.org/wiki/Aho-Corasick_algorithm"
 *     >Wikipedia: Aho-Corasick Algorithm</a>
 * <br /><small>[Entry sketchy at the time of writing, but good links.]</small>
 * <li>Gusfield, Dan. 1997. <i>Algorithms on Strings, Trees and Sequences.</i>
 * Cambridge University Press.
 * <br /><small>[Best book on string algorithms out there.  Great explanation
 * of Aho-Corasick.]</small>
 * </ul>
 *
 * @author  Bob Carpenter
 * @version 3.2.0
 * @since   LingPipe2.3.1
 */
public class ExactDictionaryChunker implements Chunker {

    final TrieNode mTrieRootNode;
    final TokenizerFactory mTokenizerFactory;
    final boolean mCaseSensitive;

    boolean mReturnAllMatches;

    int mMaxPhraseLength = 0;

    /**
     * Construct an exact dictionary chunker from the specified
     * dictionary and tokenizer factory which is case sensitive and
     * returns all matches.  See the class documentation above for
     * more information on chunking and requirements for tokenizer
     * factories.
     *
     * <p>After construction, this class does not use the dictionary
     * and will not be sensitive to changes in the underlying
     * dictionary.
     *
     * @param dict Dictionary forming the basis of the chunker.
     * @param factory Tokenizer factory underlying chunker.
     */
    public ExactDictionaryChunker(Dictionary<String> dict,
                                  TokenizerFactory factory) {
        this(dict,factory,true,true);
    }

    /**
     * Construct an exact dictionary chunker from the specified
     * dictionary and tokenizer factory, returning all matches or not
     * as specified.  See the class documentation above for more
     * information on chunking.
     *
     * <p>After construction, this class does not use the dictionary
     * and will not be sensitive to changes in the underlying
     * dictionary.
     *
     * @param dict Dictionary forming the basis of the chunker.
     * @param factory Tokenizer factory underlying chunker.
     * @param returnAllMatches <code>true</code> if chunker should return
     * all matches.
     * @param caseSensitive <code>true</code> if chunker is case
     * sensitive.
     */
    public ExactDictionaryChunker(Dictionary<String> dict,
                                  TokenizerFactory factory,
                                  boolean returnAllMatches,
                                  boolean caseSensitive) {
        mTokenizerFactory = factory;
        mReturnAllMatches = returnAllMatches;
        mCaseSensitive = caseSensitive;
        mTrieRootNode = compileTrie(dict);
    }

    /**
     * Returns the tokenizer factory underlying this chunker.  Once
     * set in the constructor, the tokenizer factory may not be
     * changed.  If the tokenizer factory allows dynamic
     * reconfiguration, it should not be reconfigured or inconsistent
     * results may be returned.
     *
     * @return The tokenizer factory for this chunker.
     */
    public TokenizerFactory tokenizerFactory() {
        return mTokenizerFactory;
    }

    /**
     * Returns <code>true</code> if this dictionary chunker is
     * case sensitive.  Case sensitivity must be defined at
     * construction time and may not be reset.
     *
     * @return Whether this chunker is case sensitive.
     */
    public boolean caseSensitive() {
        return mCaseSensitive;
    }

    /**
     * Returns <code>true</code> if this chunker returns all matches.
     *
     * @return Whether this chunker returns all matches.
     */
    public boolean returnAllMatches() {
        return mReturnAllMatches;
    }

    /**
     * Sets whether to return all matches to the specified condition.
     *
     * <p>Note that setting this while running a chunking in another
     * thread may affect that chunking.
     *
     * @param returnAllMatches <code>true</code> if all matches should
     * be returned.
     */
    public void setReturnAllMatches(boolean returnAllMatches) {
        mReturnAllMatches = returnAllMatches;
    }

    /**
     * Returns the chunking for the specified character sequence.
     * Whether all matching chunks are returned depends on whether
     * this chunker is configured to return all matches or not.
     * See the class documentation above for more information.
     *
     * @param cSeq Character sequence to chunk.
     * @return The chunking for the specified character sequence.
     */
    public Chunking chunk(CharSequence cSeq) {
        char[] cs = Strings.toCharArray(cSeq);
        return chunk(cs,0,cs.length);
    }

    final Tokenizer tokenizer(char[] cs, int start, int end) {
        Tokenizer tokenizer
            = mTokenizerFactory.tokenizer(cs,start,end-start);
        return mCaseSensitive
            ? tokenizer
            : new LowerCaseFilterTokenizer(tokenizer);
    }

    /**
     * Returns the chunking for the specified character slice.
     * Whether all matching chunks are returned depends on whether
     * this chunker is configured to return all matches or not.
     * See the class documentation above for more information.
     *
     * @param cs Underlying array of characters.
     * @param start Index of first character in slice.
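The overlap and domination rules spelled out in the class documentation can be sketched as a small standalone demo. This is illustrative only, not LingPipe code: the `Chunk` record and the `overlaps`/`dominates` method names below are assumptions, chosen to mirror the javadoc's `(start,end):type@score` notation.

```java
// Standalone sketch of the overlap/domination rules from the class
// documentation.  Hypothetical names; requires Java 16+ for records.
public class DominationDemo {

    // A chunk span with a type label and a score, written in the
    // javadoc as chunk=(start,end):type@score.
    record Chunk(int start, int end, String type, double score) {}

    // chunk1 overlaps chunk2 iff an end point of the second lies
    // within the first: start1 <= start2 < end1, or start1 < end2 <= end1.
    static boolean overlaps(Chunk c1, Chunk c2) {
        return (c1.start() <= c2.start() && c2.start() < c1.end())
            || (c1.start() < c2.end() && c2.end() <= c1.end());
    }

    // c1 dominates c2 iff they overlap and c1 wins the first tiebreak
    // that applies: leftmost, then longest, then highest scoring,
    // then alphabetically first in type.
    static boolean dominates(Chunk c1, Chunk c2) {
        if (!overlaps(c1, c2)) return false;
        if (c1.start() != c2.start()) return c1.start() < c2.start();
        if (c1.end() != c2.end()) return c1.end() > c2.end();
        if (c1.score() != c2.score()) return c1.score() > c2.score();
        return c1.type().compareTo(c2.type()) < 0;
    }

    public static void main(String[] args) {
        Chunk a = new Chunk(0, 1, "PER", 1.0);
        Chunk b = new Chunk(1, 3, "PER", 1.0);
        Chunk c = new Chunk(0, 2, "LOC", 1.0);
        System.out.println(overlaps(a, b));  // (0,1) vs (1,3): false
        System.out.println(overlaps(a, c));  // (0,1) vs (0,2): true
        System.out.println(dominates(c, a)); // same start, longer: true
    }
}
```

Building a non-overlapping result then amounts to discarding every chunk that some other chunk dominates, which matches the examples in the javadoc: `(0,1)` and `(1,3)` coexist, while `(0,2)` dominates `(0,1)`.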