📄 geniasentenceparser.java

📁 An open-source Java toolkit for natural language processing. LingPipe already offers a rich set of features.
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.corpus.parsers;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.ChunkFactory;
import com.aliasi.chunk.ChunkingImpl;

import com.aliasi.corpus.ChunkHandler;
import com.aliasi.corpus.Handler;
import com.aliasi.corpus.XMLParser;

import com.aliasi.sentences.SentenceChunker;

import com.aliasi.xml.DelegatingHandler;
import com.aliasi.xml.DelegateHandler;
import com.aliasi.xml.TextAccumulatorHandler;

import java.util.ArrayList;
import java.util.List;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * A <code>GeniaSentenceParser</code> provides a chunk parser for the
 * XML version of the GENIA corpus.  The type assigned to sentence
 * chunks is the constant {@link SentenceChunker#SENTENCE_CHUNK_TYPE}.
 * It only returns the sentences from citation abstracts, not
 * sentences in citation titles.
 *
 * <P>The following example is drawn from the initial part of the merged
 * 3.02 version of the GENIA corpus (with some content elided and replaced
 * by ellipses (<code>...</code>), but all spaces/linebreaks left as is):
 *
 * <blockquote><table border='1' cellpadding='5'><tr><td><pre>
&lt;set&gt;
&lt;article&gt;
&lt;articleinfo&gt;
&lt;bibliomisc&gt;MEDLINE:95369245&lt;/bibliomisc&gt;
&lt;/articleinfo&gt;
&lt;title&gt;
&lt;sentence&gt;...&lt;/sentence&gt;
&lt;/title&gt;
&lt;abstract&gt;
&lt;sentence&gt;&lt;w c=&quot;NN&quot;&gt;Activation&lt;/w&gt; &lt;w c=&quot;IN&quot;&gt;of&lt;/w&gt; &lt;w c=&quot;DT&quot;&gt;the&lt;/w&gt; &lt;cons lex=&quot;CD28_surface_receptor&quot; sem=&quot;G#protein_family_or_group&quot;&gt;&lt;cons lex=&quot;CD28&quot; sem=&quot;G#protein_molecule&quot;&gt;&lt;w c=&quot;NN&quot;&gt;CD28&lt;/w&gt;&lt;/cons&gt; &lt;w c=&quot;NN&quot;&gt;surface&lt;/w&gt; &lt;w c=&quot;NN&quot;&gt;receptor&lt;/w&gt;&lt;/cons&gt; &lt;w c=&quot;VBZ&quot;&gt;provides&lt;/w&gt; &lt;w c=&quot;DT&quot;&gt;a&lt;/w&gt; &lt;w c=&quot;JJ&quot;&gt;major&lt;/w&gt; &lt;w c=&quot;JJ&quot;&gt;costimulatory&lt;/w&gt; &lt;w c=&quot;NN&quot;&gt;signal&lt;/w&gt; &lt;w c=&quot;IN&quot;&gt;for&lt;/w&gt; &lt;cons lex=&quot;T_cell_activation&quot; sem=&quot;G#other_name&quot;&gt;&lt;w c=&quot;NN&quot;&gt;T&lt;/w&gt; &lt;w c=&quot;NN&quot;&gt;cell&lt;/w&gt; &lt;w c=&quot;NN&quot;&gt;activation&lt;/w&gt;&lt;/cons&gt; &lt;w c=&quot;VBG&quot;&gt;resulting&lt;/w&gt; &lt;w c=&quot;IN&quot;&gt;in&lt;/w&gt; &lt;w c=&quot;VBN&quot;&gt;enhanced&lt;/w&gt; &lt;w c=&quot;NN&quot;&gt;production&lt;/w&gt; &lt;w c=&quot;IN&quot;&gt;of&lt;/w&gt; &lt;cons lex=&quot;interleukin-2&quot; sem=&quot;G#protein_molecule&quot;&gt;&lt;w c=&quot;NN&quot;&gt;interleukin-2&lt;/w&gt;&lt;/cons&gt; &lt;w c=&quot;(&quot;&gt;(&lt;/w&gt;&lt;cons lex=&quot;IL-2&quot; sem=&quot;G#protein_molecule&quot;&gt;&lt;w c=&quot;NN&quot;&gt;IL-2&lt;/w&gt;&lt;/cons&gt;&lt;w c=&quot;)&quot;&gt;)&lt;/w&gt; &lt;w c=&quot;CC&quot;&gt;and&lt;/w&gt; &lt;cons lex=&quot;cell_proliferation&quot; sem=&quot;G#other_name&quot;&gt;&lt;w c=&quot;NN&quot;&gt;cell&lt;/w&gt; &lt;w c=&quot;NN&quot;&gt;proliferation&lt;/w&gt;&lt;/cons&gt;&lt;w c=&quot;.&quot;&gt;.&lt;/w&gt;&lt;/sentence&gt;
&lt;sentence&gt;...&lt;/sentence&gt;
...
 * </pre></td></tr></table></blockquote>
 *
 * All that is required is to pull all of the text content (including
 * informative spaces) from the sentence elements.
 *
 * <P>The GENIA corpus is available free of charge from:
 *
 * <UL>
 *
 * <LI><a href="http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/"
 *       >GENIA Project Home Page</a>
 * </UL>
 *
 * @author  Bob Carpenter
 * @version 2.1.1
 * @since   LingPipe2.1.1
 */
public class GeniaSentenceParser extends XMLParser {

    /**
     * Construct a GENIA sentence chunk parser with no designated chunk
     * handler.  Chunk handlers may be later set using the method
     * {@link #setHandler(Handler)}.
     *
     * @throws SAXException If there is an error configuring the
     * SAX XML reader required for parsing.
     */
    public GeniaSentenceParser() throws SAXException {
        super();
    }

    /**
     * Construct a GENIA sentence chunk parser with the specified
     * chunk handler.
     *
     * @param handler The chunk handler used to process sentences
     * found by this parser.
     * @throws SAXException If there is an error configuring the
     * SAX XML reader required for parsing.
     */
    public GeniaSentenceParser(ChunkHandler handler) throws SAXException {
        super(handler);
    }

    /**
     * Returns the embedded XML handler.  This method implements
     * the required method for the abstract superclass {@link XMLParser}.
     *
     * @return The XML handler for this class.
     */
    protected DefaultHandler getXMLHandler() {
        return new SetHandler(getChunkHandler());
    }

    /**
     * Sets the handler to the specified chunk handler.  If the handler
     * is not a chunk handler, an illegal argument exception will be
     * raised.
     *
     * @param handler New chunk handler.
     * @throws IllegalArgumentException If the handler is not a chunk
     * handler.
     */
    public void setHandler(Handler handler) {
        if (!(handler instanceof ChunkHandler)) {
            String msg = "Handler must be a chunk handler."
                + " Found handler with class=" + handler.getClass();
            throw new IllegalArgumentException(msg);
        }
        super.setHandler(handler);
    }

    /**
     * Returns the chunk handler for this sentence parser.  The result
     * will be the same as calling the superclass method {@link
     * #getHandler()}, but the result in this case is cast to type
     * <code>ChunkHandler</code>.
     *
     * @return The chunk handler for this sentence parser.
     */
    public ChunkHandler getChunkHandler() {
        return (ChunkHandler) getHandler();
    }

    /**
     * The tag used for sentence elements in GENIA, namely
     * <code>sentence</code>.
     */
    public static final String GENIA_SENTENCE_ELT = "sentence";

    /**
     * The tag used for abstract elements in GENIA, namely
     * <code>abstract</code>.
     */
    public static final String GENIA_ABSTRACT_ELT = "abstract";

    // Top-level handler: delegates each abstract element to an
    // AbstractHandler and converts its sentences into one chunking.
    private static class SetHandler extends DelegatingHandler {
        final ChunkHandler mChunkHandler;
        final AbstractHandler mAbstractHandler;

        SetHandler(ChunkHandler chunkHandler) {
            mChunkHandler = chunkHandler;
            mAbstractHandler = new AbstractHandler(this);
            setDelegate(GENIA_ABSTRACT_ELT,mAbstractHandler);
        }

        public void finishDelegate(String qName, DefaultHandler delegate) {
            if (qName.equals(GENIA_ABSTRACT_ELT)) {
                handleSentenceTexts(mAbstractHandler.getSentenceTexts());
            }
        }

        void handleSentenceTexts(List texts) {
            // Join the sentence texts with single spaces, recording
            // each sentence's length.
            StringBuffer sb = new StringBuffer();
            int numChunks = texts.size();
            int[] lengths = new int[numChunks];
            for (int i = 0; i < numChunks; i++) {
                if (i > 0) sb.append(" ");
                String text = (String) texts.get(i);
                sb.append(text);
                lengths[i] = text.length();
            }
            char[] cs = sb.toString().toCharArray();

            // Add one sentence chunk per text; the +1 skips the
            // separating space.
            int offset = 0;
            ChunkingImpl chunking = new ChunkingImpl(cs,0,cs.length);
            for (int i = 0; i < numChunks; i++) {
                Chunk chunk
                    = ChunkFactory
                    .createChunk(offset,offset+lengths[i],
                                 SentenceChunker.SENTENCE_CHUNK_TYPE);
                chunking.add(chunk);
                offset += lengths[i]+1;
            }
            mChunkHandler.handle(chunking);
        }
    }

    // Collects the trimmed, non-empty text of each sentence element
    // delegated to it.
    private static class AbstractHandler extends DelegateHandler {
        final ArrayList mSentTexts = new ArrayList();
        final TextAccumulatorHandler mSentenceHandler
            = new TextAccumulatorHandler();

        public AbstractHandler(DelegatingHandler parent) {
            super(parent);
            setDelegate(GENIA_SENTENCE_ELT, mSentenceHandler);
        }

        public void startDocument() {
            mSentTexts.clear();
        }

        public void finishDelegate(String qName, DefaultHandler delegate) {
            if (qName.equals(GENIA_SENTENCE_ELT)) {
                String text = mSentenceHandler.getText().trim();
                if (text.length() > 0) mSentTexts.add(text);
            }
        }

        List getSentenceTexts() {
            return mSentTexts;
        }
    }
}
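
The listing above depends on LingPipe's own handler classes, so it cannot be run on its own. As a rough illustration of the same idea (take the text of each sentence element inside an abstract, skip title sentences, join the texts with single spaces, and compute sentence offsets), here is a minimal sketch that uses only the JDK's built-in SAX parser. The names `GeniaSentenceSketch`, `Collector`, `abstractSentences` and `sentenceSpans` are hypothetical and not part of LingPipe, and the sample XML in `main` is a made-up fragment in the GENIA style, not real corpus data.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-alone sketch; it does not use LingPipe's
// DelegatingHandler or TextAccumulatorHandler.
public class GeniaSentenceSketch {

    /** Collects trimmed, non-empty sentence texts found inside abstract elements. */
    static class Collector extends DefaultHandler {
        final List<String> sentences = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private boolean inAbstract = false;
        private boolean inSentence = false;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (qName.equals("abstract")) {
                inAbstract = true;
            } else if (inAbstract && qName.equals("sentence")) {
                inSentence = true;
                text.setLength(0);  // start a fresh sentence buffer
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Keep all character data, including spaces between tokens.
            if (inSentence) text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("abstract")) {
                inAbstract = false;
            } else if (inSentence && qName.equals("sentence")) {
                inSentence = false;
                String s = text.toString().trim();
                if (!s.isEmpty()) sentences.add(s);
            }
        }
    }

    /** Parses the XML string and returns the abstract sentences in order. */
    static List<String> abstractSentences(String xml) {
        Collector collector = new Collector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), collector);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return collector.sentences;
    }

    /** Returns {start, end} offsets of each sentence in the space-joined text. */
    static int[][] sentenceSpans(List<String> sentences) {
        int[][] spans = new int[sentences.size()][];
        int offset = 0;
        for (int i = 0; i < sentences.size(); i++) {
            int len = sentences.get(i).length();
            spans[i] = new int[] { offset, offset + len };
            offset += len + 1;  // +1 skips the separating space
        }
        return spans;
    }

    public static void main(String[] args) {
        String xml = "<set><article>"
            + "<title><sentence><w c=\"NN\">Title</w></sentence></title>"
            + "<abstract>"
            + "<sentence><w c=\"NN\">Activation</w> <w c=\"IN\">of</w></sentence>"
            + "<sentence><w c=\"NN\">IL-2</w><w c=\".\">.</w></sentence>"
            + "</abstract></article></set>";
        List<String> sents = abstractSentences(xml);
        String joined = String.join(" ", sents);
        for (int[] span : sentenceSpans(sents)) {
            System.out.println(span[0] + "-" + span[1] + ": "
                               + joined.substring(span[0], span[1]));
        }
    }
}
```

Run on the sample fragment, `main` prints `0-13: Activation of` and `14-19: IL-2.`; the title sentence is skipped because it lies outside the abstract element. The boolean flags are a simplification of the delegate-handler design in the listing and assume that sentence elements are not nested.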
