chinesetokenizer.java

来自「「我是中國人」」· Java 代码 · 共 170 行

JAVA

170 行

package org.apache.lucene.analysis.cn;/* ==================================================================== * The Apache Software License, Version 1.1 * * Copyright (c) 2004 The Apache Software Foundation.  All rights * reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright *    notice, this list of conditions and the following disclaimer. * * 2. Redistributions in binary form must reproduce the above copyright *    notice, this list of conditions and the following disclaimer in *    the documentation and/or other materials provided with the *    distribution. * * 3. The end-user documentation included with the redistribution, *    if any, must include the following acknowledgment: *       "This product includes software developed by the *        Apache Software Foundation (http://www.apache.org/)." *    Alternately, this acknowledgment may appear in the software itself, *    if and wherever such third-party acknowledgments normally appear. * * 4. The names "Apache" and "Apache Software Foundation" and *    "Apache Lucene" must not be used to endorse or promote products *    derived from this software without prior written permission. For *    written permission, please contact apache@apache.org. * * 5. Products derived from this software may not be called "Apache", *    "Apache Lucene", nor may "Apache" appear in their name, without *    prior written permission of the Apache Software Foundation. * * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * ==================================================================== * * This software consists of voluntary contributions made by many * individuals on behalf of the Apache Software Foundation.  For more * information on the Apache Software Foundation, please see * <http://www.apache.org/>. */import java.io.Reader;import org.apache.lucene.analysis.*;/** * Title: ChineseTokenizer * Description: Extract tokens from the Stream using Character.getType() *              Rule: A Chinese character as a single token * Copyright:   Copyright (c) 2001 * Company: * * The difference between thr ChineseTokenizer and the * CJKTokenizer (id=23545) is that they have different * token parsing logic. *  * Let me use an example. If having a Chinese text * "C1C2C3C4" to be indexed, the tokens returned from the * ChineseTokenizer are C1, C2, C3, C4. And the tokens * returned from the CJKTokenizer are C1C2, C2C3, C3C4. * * Therefore the index the CJKTokenizer created is much * larger. * * The problem is that when searching for C1, C1C2, C1C3, * C4C2, C1C2C3 ... the ChineseTokenizer works, but the * CJKTokenizer will not work. * * @author Yiyi Sun * @version 1.0 * */public final class ChineseTokenizer extends Tokenizer {    public ChineseTokenizer(Reader in) {        input = in;    }    private int offset = 0, bufferIndex=0, dataLen=0;    private final static int MAX_WORD_LEN = 255;    private final static int IO_BUFFER_SIZE = 1024;    private final char[] buffer = new char[MAX_WORD_LEN];    private final char[] ioBuffer = new char[IO_BUFFER_SIZE];    private int length;    private int start;    private final void push(char c) {        if (length == 0) start = offset-1;            // start of token        buffer[length++] = Character.toLowerCase(c);  // buffer it    }    private final Token flush() {        if (length>0) {            //System.out.println(new String(buffer, 0, length));            return new Token(new String(buffer, 0, length), start, start+length);        }        else            return null;    }    public final Token next() throws java.io.IOException {        length = 0;        start = offset;        while (true) {            final char c;            offset++;            if (bufferIndex >= dataLen) {                dataLen = input.read(ioBuffer);                bufferIndex = 0;            };            if (dataLen == -1) return flush();            else                c = ioBuffer[bufferIndex++];            switch(Character.getType(c)) {            case Character.DECIMAL_DIGIT_NUMBER:            case Character.LOWERCASE_LETTER:            case Character.UPPERCASE_LETTER:                push(c);                if (length == MAX_WORD_LEN) return flush();                break;            case Character.OTHER_LETTER:                if (length>0) {                    bufferIndex--;                    return flush();                }                push(c);                return flush();            default:                if (length>0) return flush();                break;            }        }    }}

chinesetokenizer.java - 源码说明

本页面展示了「「我是中國人」」中的 chinesetokenizer.java 源码文件，采用 Java 编程语言编写，共 170 行代码。您可以在线阅读完整代码内容，也可以返回资源详情页下载完整源码包进行本地学习和开发。

虫虫开发者社区收录了大量与中文分词相关的技术资源，包括源代码、技术文档、电路图等，是电子工程师和嵌入式开发者的专业学习平台。

⌨️ 快捷键说明

复制代码Ctrl + C

搜索代码Ctrl + F

全屏模式F11

增大字号Ctrl + =

减小字号Ctrl + -

显示快捷键?