
htmlstrategy.java

From "lucene2.2.0版本" · Java code · 100 lines


/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.lucene.gdata.search.analysis;

import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.xpath.XPathExpressionException;

import org.apache.lucene.gdata.data.ServerBaseEntry;
import org.apache.lucene.gdata.search.config.IndexSchemaField;
import org.apache.xerces.xni.XNIException;
import org.apache.xerces.xni.parser.XMLDocumentFilter;
import org.apache.xerces.xni.parser.XMLInputSource;
import org.apache.xerces.xni.parser.XMLParserConfiguration;
import org.cyberneko.html.HTMLConfiguration;
import org.cyberneko.html.filters.ElementRemover;
import org.cyberneko.html.filters.Writer;
import org.w3c.dom.Node;

/**
 * This ContentStrategy applies the path to the Indexable and retrieves the
 * plain string content from the returned node. All of the node's text content
 * will be cleaned of any html tags.
 * 
 * @author Simon Willnauer
 * 
 */
public class HTMLStrategy extends
        org.apache.lucene.gdata.search.analysis.ContentStrategy {
    private static final String REMOVE_SCRIPT = "script";

    private static final String CHAR_ENCODING = "UTF-8";

    protected HTMLStrategy(IndexSchemaField fieldConfiguration) {
        super(fieldConfiguration);
    }

    /**
     * @see org.apache.lucene.gdata.search.analysis.ContentStrategy#processIndexable(org.apache.lucene.gdata.search.analysis.Indexable)
     */
    @Override
    public void processIndexable(
            Indexable<? extends Node, ? extends ServerBaseEntry> indexable)
            throws NotIndexableException {
        String path = this.config.getPath();
        Node node = null;
        try {
            node = indexable.applyPath(path);
        } catch (XPathExpressionException e1) {
            throw new NotIndexableException("Can not apply path -- " + path);
        }
        if (node == null)
            throw new NotIndexableException(
                    "Could not retrieve content for schema field: "
                            + this.config);
        StringReader contentReader = new StringReader(node.getTextContent());
        /*
         * remove all elements and script parts
         */
        ElementRemover remover = new ElementRemover();
        remover.removeElement(REMOVE_SCRIPT);
        StringWriter contentWriter = new StringWriter();
        Writer writer = new Writer(contentWriter, CHAR_ENCODING);
        XMLDocumentFilter[] filters = { remover, writer, };
        XMLParserConfiguration parser = new HTMLConfiguration();
        parser.setProperty("http://cyberneko.org/html/properties/filters",
                filters);
        XMLInputSource source = new XMLInputSource(null, null, null,
                contentReader, CHAR_ENCODING);
        try {
            parser.parse(source);
        } catch (XNIException e) {
            throw new NotIndexableException("Can not parse html -- ", e);
        } catch (IOException e) {
            throw new NotIndexableException("Can not parse html -- ", e);
        }
        this.content = contentWriter.toString();
    }

}
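
The effect of the filter chain above (drop script elements together with their content, strip every other tag but keep its text) can be illustrated without the NekoHTML and Xerces dependencies. The following is only a rough sketch using JDK regular expressions in place of a real HTML parser. The class name HtmlStripSketch and the method stripHtml are hypothetical and not part of Lucene, and the regex approach is a deliberate simplification that will not cope with malformed markup, comments, or entities the way HTMLStrategy's parser-based approach does.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only: approximates the result of HTMLStrategy's
// ElementRemover configuration with regular expressions instead of NekoHTML.
final class HtmlStripSketch {

    // A whole <script>...</script> element, case-insensitive, spanning lines.
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script\\b[^>]*>.*?</script\\s*>");

    // Any remaining opening or closing tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    // Runs of whitespace left behind by the removals.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripHtml(String html) {
        // 1. Remove script elements including their content.
        String withoutScripts = SCRIPT.matcher(html).replaceAll(" ");
        // 2. Remove all other tags but keep the text between them.
        String withoutTags = TAG.matcher(withoutScripts).replaceAll(" ");
        // 3. Collapse whitespace so the result is a single clean string.
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p><script>alert('x');</script>";
        System.out.println(stripHtml(html)); // prints: Hello world
    }
}
```

For indexing real feed content, a tolerant parser such as the one HTMLStrategy configures is the safer choice. The sketch is only meant to show what kind of string ends up in the content field.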

