⭐ 欢迎来到虫虫下载站! | 📦 资源下载 📁 资源专辑 ℹ️ 关于我们
⭐ 虫虫下载站

📄 textextractor.java

📁 MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections
💻 JAVA
字号:
package it.unimi.dsi.mg4j.util.parser.callback;/*		  * MG4J: Managing Gigabytes for Java * * Copyright (C) 2005-2007 Sebastiano Vigna  * *  This library is free software; you can redistribute it and/or modify it *  under the terms of the GNU Lesser General Public License as published by the Free *  Software Foundation; either version 2.1 of the License, or (at your option) *  any later version. * *  This library is distributed in the hope that it will be useful, but *  WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY *  or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License *  for more details. * *  You should have received a copy of the GNU Lesser General Public License *  along with this program; if not, write to the Free Software *  Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. * */import it.unimi.dsi.mg4j.util.MutableString;import it.unimi.dsi.mg4j.util.parser.Attribute;import it.unimi.dsi.mg4j.util.parser.BulletParser;import it.unimi.dsi.mg4j.util.parser.Element;import java.util.Map;/** A callback extracting text and titles. *  * <P>This callbacks extracts all text in the page, and the title. * The resulting * text is available through {@link #text}, and the title through {@link #title}. *  * <P>Note that {@link #text} and {@link #title} are never trimmed.  * @deprecated Moved to <code>dsiutils</code>.*/@Deprecatedpublic class TextExtractor extends DefaultCallback {	/** The text resulting from the parsing process. */	public final MutableString text = new MutableString();	/** The title resulting from the parsing process. */	public final MutableString title = new MutableString();	/** True if we are in the middle of the title. */	private boolean inTitle;	/**	 * Configure the parser to parse text. 	 */	public void configure( final BulletParser parser ) {		parser.parseText( true );		// To get the title.		parser.parseTags( true );	}	public void startDocument() {		text.length( 0 );		title.length( 0 );		inTitle = false;	}	public boolean characters( final char[] characters, final int offset, final int length, final boolean flowBroken ) {		text.append( characters, offset, length );		if ( inTitle ) title.append( characters, offset, length );		return true;	}		public boolean endElement( final Element element ) {		// No element is allowed inside a title.		inTitle = false;		if ( element.breaksFlow ) {			if ( inTitle ) title.append( ' ' );			text.append( ' ' );		}		return true;	}		public boolean startElement( final Element element, final Map<Attribute,MutableString> attrMapUnused ) {		// No element is allowed inside a title.		inTitle = element == Element.TITLE;		if ( element.breaksFlow ) {			if ( inTitle ) title.append( ' ' );			text.append( ' ' );		}		return true;	}	}

⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -