
📄 merge.java

📁 MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections
💻 JAVA
package it.unimi.dsi.mg4j.tool;

/*
 * MG4J: Managing Gigabytes for Java
 *
 * Copyright (C) 2005-2007 Sebastiano Vigna
 *
 *  This library is free software; you can redistribute it and/or modify it
 *  under the terms of the GNU Lesser General Public License as published by the Free
 *  Software Foundation; either version 2.1 of the License, or (at your option)
 *  any later version.
 *
 *  This library is distributed in the hope that it will be useful, but
 *  WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
 *  or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
 *  for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program; if not, write to the Free Software
 *  Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 *
 */

import it.unimi.dsi.fastutil.ints.IntHeapSemiIndirectPriorityQueue;
import it.unimi.dsi.fastutil.ints.IntIterator;
import it.unimi.dsi.mg4j.index.Index;
import it.unimi.dsi.mg4j.index.IndexIterator;
import it.unimi.dsi.mg4j.index.CompressionFlags.Coding;
import it.unimi.dsi.mg4j.index.CompressionFlags.Component;
import it.unimi.dsi.io.OutputBitStream;
import it.unimi.dsi.Util;

import java.io.Closeable;
import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.net.URISyntaxException;
import java.util.Map;

import org.apache.commons.configuration.ConfigurationException;
import org.apache.log4j.Logger;

import com.martiansoftware.jsap.JSAPException;

/** Merges several indices.
 *
 * <P>This class merges indices by performing a simple ordered list merge. Documents
 * appearing in two indices will cause an error.
 *
 * @author Sebastiano Vigna
 * @since 1.0
 *
 */

public class Merge extends Combine {
	@SuppressWarnings("unused")
	private static final Logger LOGGER = Util.getLogger( Merge.class );

	/** The reference array of the document queue. */
	protected int[] doc;

	/** The queue containing document pointers (for remapped indices). */
	protected IntHeapSemiIndirectPriorityQueue documentQueue;

	public Merge( final String outputBasename,
			final String[] inputBasename,
			final boolean metadataOnly,
			final int bufferSize,
			final Map<Component,Coding> writerFlags,
			final boolean interleaved,
			final boolean skips,
			final int quantum,
			final int height,
			final int skipBufferSize,
			final long logInterval ) throws IOException, ConfigurationException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException {
		super( outputBasename, inputBasename, metadataOnly, bufferSize, writerFlags, interleaved, skips, quantum, height, skipBufferSize, logInterval );
		doc = new int[ numIndices ];
		documentQueue = new IntHeapSemiIndirectPriorityQueue( doc, numIndices );
	}

	protected int combineNumberOfDocuments() {
		int n = 0;
		for( int i = 0; i < numIndices; i++ ) n = Math.max( n, index[ i ].numberOfDocuments );
		return n;
	}

	protected int combineSizes() throws IOException {
		int currDoc = 0, maxDocSize = 0;
		for( int i = 0; i < numIndices; i++ ) {
			final IntIterator sizes = sizes( i );
			currDoc = 0;
			int j = index[ i ].numberOfDocuments;
			int s;
			while( j-- != 0 ) {
				if ( ( s = sizes.nextInt() ) != 0 ) {
					if ( size[ currDoc ] != 0 ) throw new IllegalArgumentException( "Document " + currDoc + " has nonzero length in two indices" );
					size[ currDoc ] = s;
					if ( s > maxDocSize ) maxDocSize = s;
				}
				currDoc++;
			}
			if ( sizes instanceof Closeable ) ((Closeable)sizes).close();
		}
		return maxDocSize;
	}

	protected int combine( final int numUsedIndices ) throws IOException {
		// We gather the frequencies from the subindices and just add up. At the same time, we load the document queue.
		int totalFrequency = 0, currIndex, lastIndex = -1;
		for( int k = numUsedIndices; k-- != 0; ) {
			currIndex = usedIndex[ k ];
			totalFrequency += ( frequency[ currIndex ] = indexIterator[ currIndex ].frequency() );
			doc[ currIndex ] = indexIterator[ currIndex ].nextDocument();
			documentQueue.enqueue( currIndex );
		}

		indexWriter.newInvertedList();
		indexWriter.writeFrequency( totalFrequency );

		int currDoc = -1, count;
		OutputBitStream obs;
		Index i;
		IndexIterator ir;

		while( ! documentQueue.isEmpty() ) {
			// We extract the smallest document pointer, and enqueue it in the new index.
			if ( currDoc == doc[ currIndex = documentQueue.first() ] ) throw new IllegalStateException( "The indices to be merged contain document " + currDoc + " at least twice (once in index " + inputBasename[ lastIndex ] + " and once in index " + inputBasename[ currIndex ] + ")" );
			currDoc = doc[ currIndex ];

			obs = indexWriter.newDocumentRecord();
			indexWriter.writeDocumentPointer( obs, currDoc );
			i = index[ currIndex ];
			ir = indexIterator[ currIndex ];

			if ( i.hasPayloads ) indexWriter.writePayload( obs, ir.payload() );
			if ( i.hasCounts ) {
				count = ir.count();
				if ( hasCounts ) indexWriter.writePositionCount( obs, count );
				if ( i.hasPositions && hasPositions ) indexWriter.writeDocumentPositions( obs, ir.positionArray(), 0, count, size[ currDoc ] );
			}

			// If we just wrote the last document pointer of this term in index j, we dequeue it.
			if ( --frequency[ currIndex ] == 0 ) documentQueue.dequeue();
			else {
				doc[ currIndex ] = ir.nextDocument();
				documentQueue.changed();
			}
			lastIndex = currIndex;
		}

		return totalFrequency;
	}

	public static void main( String arg[] ) throws ConfigurationException, SecurityException, JSAPException, IOException, URISyntaxException, ClassNotFoundException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException {
		Combine.main( arg, Merge.class );
	}
}
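
The ordered list merge performed by combine() can be tried in isolation with the sketch below. It is not MG4J code: the class name OrderedMergeSketch and the method merge are hypothetical, plain int arrays stand in for the per-index posting lists, and java.util.PriorityQueue is swapped in for the fastutil IntHeapSemiIndirectPriorityQueue used above. It assumes every input list is sorted in increasing order and, like Merge, treats a document pointer seen twice as an error.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

public class OrderedMergeSketch {
	/** Merges sorted lists of document pointers; throws if a pointer appears twice. */
	public static int[] merge( final int[][] lists ) {
		// Each queue entry is { current document pointer, list index, position in that list }.
		final PriorityQueue<int[]> queue = new PriorityQueue<>( ( a, b ) -> Integer.compare( a[ 0 ], b[ 0 ] ) );
		int total = 0;
		for( int i = 0; i < lists.length; i++ ) {
			total += lists[ i ].length;
			if ( lists[ i ].length != 0 ) queue.add( new int[] { lists[ i ][ 0 ], i, 0 } );
		}

		final int[] result = new int[ total ];
		int n = 0, currDoc = -1;
		while( ! queue.isEmpty() ) {
			// Extract the smallest pending document pointer.
			final int[] top = queue.poll();
			if ( top[ 0 ] == currDoc ) throw new IllegalStateException( "Document " + currDoc + " appears at least twice" );
			currDoc = top[ 0 ];
			result[ n++ ] = currDoc;

			// Advance the list we just read from; an exhausted list is simply not re-enqueued.
			final int[] list = lists[ top[ 1 ] ];
			if ( ++top[ 2 ] < list.length ) {
				top[ 0 ] = list[ top[ 2 ] ];
				queue.add( top );
			}
		}
		return result;
	}

	public static void main( final String[] arg ) {
		// Prints [0, 1, 2, 3, 4, 5, 6, 7]
		System.out.println( Arrays.toString( merge( new int[][] { { 0, 3, 7 }, { 1, 4 }, { 2, 5, 6 } } ) ) );
	}
}
```

The semi-indirect queue in Merge orders index numbers by the values in the doc reference array, so advancing an iterator only needs doc[ currIndex ] to be updated and changed() to be called. The sketch removes and re-inserts an entry instead, which is simpler to read but does more heap work per posting.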
