ExactDictionaryChunker.java
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.dict;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.ChunkFactory;
import com.aliasi.chunk.Chunking;
import com.aliasi.chunk.ChunkingImpl;

import com.aliasi.tokenizer.LowerCaseFilterTokenizer;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

import com.aliasi.util.Scored;
import com.aliasi.util.ScoredObject;
import com.aliasi.util.Strings;

import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/**
 * An exact dictionary chunker extracts chunks based on exact matches
 * of tokenized dictionary entries.
 *
 * <p>All dictionary entry categories are converted to strings from
 * generic objects using {@link Object#toString()}.
 *
 * <p>An exact dictionary chunker may be configured either to
 * extract all matching chunks, or to restrict the results to a
 * consistent set of non-overlapping chunks.  These non-overlapping
 * chunks are taken to be the left-most, longest-matching,
 * highest-scoring, or alphabetically preceding in type according
 * to the following definitions.
 * A chunk with span
 * <code>(start1,end1)</code> overlaps a chunk with span
 * <code>(start2,end2)</code> if and only if either end
 * point of the second chunk lies within the first chunk:
 *
 * <ul>
 * <li> <code>start1 &lt;= start2 &lt; end1</code>, or
 * <li> <code>start1 &lt; end2 &lt;= end1</code>.
 * </ul>
 *
 * For instance, <code>(0,1)</code> and <code>(1,3)</code> do
 * not overlap, but
 * <code>(0,1)</code> overlaps <code>(0,2)</code>,
 * <code>(1,2)</code> overlaps <code>(0,2)</code>, and
 * <code>(1,7)</code> overlaps <code>(2,3)</code>.
 *
 * <p>A chunk <code>chunk1=(start1,end1):type1@score1</code> dominates
 * another chunk <code>chunk2=(start2,end2):type2@score2</code> if and
 * only if the chunks overlap and:
 *
 * <ul>
 * <li> <code>start1 &lt; start2</code> (leftmost), or
 * <li> <code>start1 == start2</code>
 *      and <code>end1 &gt; end2</code> (longest), or
 * <li> <code>start1 == start2</code>, <code>end1 == end2</code>
 *      and <code>score1 &gt; score2</code> (highest scoring), or
 * <li> <code>start1 == start2</code>, <code>end1 == end2</code>,
 *      <code>score1 == score2</code> and
 *      <code>type1 &lt; type2</code> (alphabetical).
 * </ul>
 *
 * To construct a non-overlapping result, all dominated chunks are
 * removed.
 *
 * <p>If the chunker is specified to be case sensitive, the exact
 * dictionary entries must match.  If it is not case sensitive, all
 * matching will be done after applying string normalization using
 * {@link java.lang.String#toLowerCase()}.
 *
 * <p>Matching ignores whitespace as defined by the specified
 * tokenizer factory.  The tokenizer factory should have
 * character-for-character aligned tokens with the input.  That is, it
 * should not do stemming, stopword removal, etc., or this chunker
 * will not be able to calculate string positions.
 * Safe tokenizer
 * factories include {@link
 * com.aliasi.tokenizer.IndoEuropeanTokenizerFactory}, {@link
 * com.aliasi.tokenizer.RegExTokenizerFactory}, and {@link
 * com.aliasi.tokenizer.CharacterTokenizerFactory}; unsafe ones
 * include the {@link com.aliasi.tokenizer.NGramTokenizerFactory} and
 * anything user-defined constructed with a filter tokenizer,
 * including {@link
 * com.aliasi.tokenizer.NormalizeWhiteSpaceFilterTokenizer}, {@link
 * com.aliasi.tokenizer.StopFilterTokenizer} or a {@link
 * com.aliasi.tokenizer.PorterStemmerFilterTokenizer}.
 *
 * <p>Chunking is thread safe, and may be run concurrently.  The
 * return-all-matches flag should not be changed with {@link
 * #setReturnAllMatches(boolean)} while a chunking is running, as it
 * may affect whether that running chunking returns all matches.  Once
 * constructed, the tokenizer's behavior should not change.
 *
 * <p><i>Implementation Note:</i> This class is implemented using the
 * Aho-Corasick algorithm, a generalization of the Knuth-Morris-Pratt
 * string-matching algorithm to sets of strings.  Aho-Corasick is
 * linear in the number of tokens in the input plus the number of
 * output chunks.  Memory requirements are only an array of integers
 * as long as the longest phrase (a circular queue for holding start
 * points of potential chunks) and the memory required by the chunking
 * implementation for the result (which may be as large as quadratic
 * in the size of the input, or may be very small if there are not
 * many matches).  Compilation of the Aho-Corasick tree is done in the
 * constructor and is linear in the number of dictionary entries with
 * a constant factor as high as the maximum phrase length; this can be
 * improved to a constant factor using suffix-tree-like speedups, but
 * it didn't seem worth the complexity here when the dictionaries
 * would be long-lived.
 *
 * <ul>
 * <li>Aho, Alfred V. and Margaret J. Corasick. 1975.
 * Efficient string
 * matching: an aid to bibliographic search. <i>CACM</i>,
 * <b>18</b>(6):333-340.
 * <li><a href="http://en.wikipedia.org/wiki/Aho-Corasick_algorithm"
 *     >Wikipedia: Aho-Corasick Algorithm</a>
 * <br /><small>[Entry sketchy at the time of writing, but good links.]</small>
 * <li>Gusfield, Dan. 1997. <i>Algorithms on Strings, Trees and Sequences.</i>
 * Cambridge University Press.
 * <br /><small>[Best book on string algorithms out there.  Great explanation
 * of Aho-Corasick.]</small>
 * </ul>
 *
 * @author  Bob Carpenter
 * @version 3.2.0
 * @since   LingPipe2.3.1
 */
public class ExactDictionaryChunker implements Chunker {

    final TrieNode mTrieRootNode;
    final TokenizerFactory mTokenizerFactory;
    final boolean mCaseSensitive;

    boolean mReturnAllMatches;

    int mMaxPhraseLength = 0;

    /**
     * Construct an exact dictionary chunker from the specified
     * dictionary and tokenizer factory which is case sensitive and
     * returns all matches.  See the class documentation above for
     * more information on chunking and requirements for tokenizer
     * factories.
     *
     * <p>After construction, this class does not use the dictionary
     * and will not be sensitive to changes in the underlying
     * dictionary.
     *
     * @param dict Dictionary forming the basis of the chunker.
     * @param factory Tokenizer factory underlying chunker.
     */
    public ExactDictionaryChunker(Dictionary<String> dict,
                                  TokenizerFactory factory) {
        this(dict,factory,true,true);
    }

    /**
     * Construct an exact dictionary chunker from the specified
     * dictionary and tokenizer factory, returning all matches or not
     * as specified.  See the class documentation above for more
     * information on chunking.
     *
     * <p>After construction, this class does not use the dictionary
     * and will not be sensitive to changes in the underlying
     * dictionary.
     *
     * @param dict Dictionary forming the basis of the chunker.
     * @param factory Tokenizer factory underlying chunker.
     * @param returnAllMatches <code>true</code> if chunker should return
     * all matches.
     * @param caseSensitive <code>true</code> if chunker is case
     * sensitive.
     */
    public ExactDictionaryChunker(Dictionary<String> dict,
                                  TokenizerFactory factory,
                                  boolean returnAllMatches,
                                  boolean caseSensitive) {
        mTokenizerFactory = factory;
        mReturnAllMatches = returnAllMatches;
        mCaseSensitive = caseSensitive;
        mTrieRootNode = compileTrie(dict);
    }

    /**
     * Returns the tokenizer factory underlying this chunker.  Once
     * set in the constructor, the tokenizer factory may not be
     * changed.  If the tokenizer factory allows dynamic
     * reconfiguration, it should not be reconfigured or inconsistent
     * results may be returned.
     *
     * @return The tokenizer factory for this chunker.
     */
    public TokenizerFactory tokenizerFactory() {
        return mTokenizerFactory;
    }

    /**
     * Returns <code>true</code> if this dictionary chunker is
     * case sensitive.  Case sensitivity must be defined at
     * construction time and may not be reset.
     *
     * @return Whether this chunker is case sensitive.
     */
    public boolean caseSensitive() {
        return mCaseSensitive;
    }

    /**
     * Returns <code>true</code> if this chunker returns all matches.
     *
     * @return Whether this chunker returns all matches.
     */
    public boolean returnAllMatches() {
        return mReturnAllMatches;
    }

    /**
     * Sets whether to return all matches to the specified condition.
     *
     * <p>Note that setting this while running a chunking in another
     * thread may affect that chunking.
     *
     * @param returnAllMatches <code>true</code> if all matches should
     * be returned.
     */
    public void setReturnAllMatches(boolean returnAllMatches) {
        mReturnAllMatches = returnAllMatches;
    }

    /**
     * Returns the chunking for the specified character sequence.
     * Whether all matching chunks are returned depends on whether
     * this chunker is configured to return all matches or not.
     * See the class documentation above for more information.
     *
     * @param cSeq Character sequence to chunk.
     * @return The chunking for the specified character sequence.
     */
    public Chunking chunk(CharSequence cSeq) {
        char[] cs = Strings.toCharArray(cSeq);
        return chunk(cs,0,cs.length);
    }

    final Tokenizer tokenizer(char[] cs, int start, int end) {
        Tokenizer tokenizer
            = mTokenizerFactory.tokenizer(cs,start,end-start);
        return mCaseSensitive
            ? tokenizer
            : new LowerCaseFilterTokenizer(tokenizer);
    }

    /**
     * Returns the chunking for the specified character slice.
     * Whether all matching chunks are returned depends on whether
     * this chunker is configured to return all matches or not.
     * See the class documentation above for more information.
     *
     * @param cs Underlying array of characters.
     * @param start Index of first character in slice.
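The overlap and domination rules spelled out in the class documentation can be sketched as a small standalone demo. This is illustrative only, not LingPipe code: the `Chunk` record and the `overlaps`/`dominates` method names below are assumptions, chosen to mirror the javadoc's `(start,end):type@score` notation.

```java
// Standalone sketch of the overlap/domination rules from the class
// documentation.  Hypothetical names; requires Java 16+ for records.
public class DominationDemo {

    // A chunk span with a type label and a score, written in the
    // javadoc as chunk=(start,end):type@score.
    record Chunk(int start, int end, String type, double score) {}

    // chunk1 overlaps chunk2 iff an end point of the second lies
    // within the first: start1 <= start2 < end1, or start1 < end2 <= end1.
    static boolean overlaps(Chunk c1, Chunk c2) {
        return (c1.start() <= c2.start() && c2.start() < c1.end())
            || (c1.start() < c2.end() && c2.end() <= c1.end());
    }

    // c1 dominates c2 iff they overlap and c1 wins the first tiebreak
    // that applies: leftmost, then longest, then highest scoring,
    // then alphabetically first in type.
    static boolean dominates(Chunk c1, Chunk c2) {
        if (!overlaps(c1, c2)) return false;
        if (c1.start() != c2.start()) return c1.start() < c2.start();
        if (c1.end() != c2.end()) return c1.end() > c2.end();
        if (c1.score() != c2.score()) return c1.score() > c2.score();
        return c1.type().compareTo(c2.type()) < 0;
    }

    public static void main(String[] args) {
        Chunk a = new Chunk(0, 1, "PER", 1.0);
        Chunk b = new Chunk(1, 3, "PER", 1.0);
        Chunk c = new Chunk(0, 2, "LOC", 1.0);
        System.out.println(overlaps(a, b));  // (0,1) vs (1,3): false
        System.out.println(overlaps(a, c));  // (0,1) vs (0,2): true
        System.out.println(dominates(c, a)); // same start, longer: true
    }
}
```

Building a non-overlapping result then amounts to discarding every chunk that some other chunk dominates, which matches the examples in the javadoc: `(0,1)` and `(1,3)` coexist, while `(0,2)` dominates `(0,1)`.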