minimalperfecthash.java
package it.unimi.dsi.mg4j.util;

/*
 * MG4J: Managing Gigabytes for Java
 *
 * Copyright (C) 2002-2007 Sebastiano Vigna
 *
 * This library is free software; you can redistribute it and/or modify it
 * under the terms of the GNU Lesser General Public License as published by the Free
 * Software Foundation; either version 2.1 of the License, or (at your option)
 * any later version.
 *
 * This library is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
 * or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License
 * for more details.
 *
 * You should have received a copy of the GNU Lesser General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 *
 */

import it.unimi.dsi.Util;
import it.unimi.dsi.fastutil.booleans.BooleanArrays;
import it.unimi.dsi.fastutil.ints.IntArrays;
import it.unimi.dsi.fastutil.io.BinIO;
import it.unimi.dsi.mg4j.index.AbstractTermMap;
import it.unimi.dsi.mg4j.index.TermMap;
import it.unimi.dsi.mg4j.io.FastBufferedReader;
import it.unimi.dsi.mg4j.io.FileLinesCollection;
import it.unimi.dsi.mg4j.io.LineIterator;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UnsupportedEncodingException;
import java.lang.reflect.InvocationTargetException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Date;
import java.util.Iterator;
import java.util.zip.GZIPInputStream;

import org.apache.log4j.Logger;

import cern.colt.GenericPermuting;
import cern.colt.GenericSorting;
import cern.colt.Swapper;
import cern.colt.function.IntComparator;
import cern.jet.random.engine.MersenneTwister;

import com.martiansoftware.jsap.FlaggedOption;
import com.martiansoftware.jsap.JSAP;
import com.martiansoftware.jsap.JSAPException;
import com.martiansoftware.jsap.JSAPResult;
import com.martiansoftware.jsap.Parameter;
import com.martiansoftware.jsap.SimpleJSAP;
import com.martiansoftware.jsap.Switch;
import com.martiansoftware.jsap.UnflaggedOption;
import com.martiansoftware.jsap.stringparsers.ForNameStringParser;

/** Order-preserving minimal perfect hash tables.
 *
 * <P>Given a list of terms without duplicates,
 * the constructors of this class find an order-preserving minimal perfect hash function for
 * the list. Subsequent calls to the {@link #getNumber(CharSequence)} method will return the ordinal position of
 * the provided character sequence (if it appeared in the list; otherwise, the result is a random position). The class
 * can then be saved by serialisation and reused later.
 *
 * <P>If you also need to establish whether a given term was or was not in the
 * list, you should consider using a {@link SignedMinimalPerfectHash}
 * implementing class instead.
 *
 * <P>This class is very scalable, and if you have enough memory it will handle
 * efficiently hundreds of millions of terms: in particular, the
 * {@linkplain #MinimalPerfectHash(String, String) offline constructor}
 * can build a map without loading the terms into memory.
 *
 * <P>To do its job, the class must store three vectors of weights that are used to compute
 * certain hash functions. By default, the vectors are as long as the longest term, but if
 * your collection is sorted you can ask (passing {@link #WEIGHT_UNKNOWN_SORTED_TERMS} to a constructor)
 * for the shortest possible vector length. This optimisation does not change the
 * memory footprint, but it can improve performance.
 *
 * <P>As a convenience, this class provides a main method that reads from
 * standard input a (possibly <samp>gzip</samp>'d) sequence of newline-separated terms, and
 * writes a serialised minimal perfect hash table for the given list.
 * You can specify on the command line the kind of table (e.g.,
 * {@link it.unimi.dsi.mg4j.util.HashCodeSignedMinimalPerfectHash}) and have it fetched by reflection.
 * As a convenience, all signed classes in MG4J have a main method invoking the {@linkplain #main(Class, String[])
 * parameterised main method} of this class,
 * which accepts the default class to be built. In this way, running the main method of
 * any signed class provides the same features.
 *
 * <P>For efficiency, there are also methods that access a minimal perfect hash
 * {@linkplain #getNumber(byte[], int, int) using byte arrays interpreted as ISO-8859-1} characters.
 *
 * <h3>Implementation Details</h3>
 *
 * <P>An object of this class uses about 1.26<var>n</var> integers between 0 and <var>n</var>-1 inclusive
 * to hash <var>n</var> terms. This is asymptotically optimal, but for small collections using an integer
 * wastes a large number of bits in each entry. At construction time, however, about 15<var>n</var> integers
 * (i.e., 60<var>n</var> bytes) are necessary.
 *
 * <P>The technique used here is suggested (but not fully described) in the second edition of <em>Managing
 * Gigabytes</em>. It is due to Havas, Majewski, Wormald and Czech, and it is fully described in
 * the monograph by Czech, Havas and Majewski on perfect hashing (“Perfect Hashing”,
 * <i>Theoret. Comput. Sci.</i> 182:1−143, 1997).
 *
 * <P>First, a triple of intermediate hash functions (generated via universal hashing) defines for each
 * term a 3-hyperedge in a hypergraph with 1.26<var>n</var> vertices. Each intermediate hash function
 * uses a vector of random weights; the length of the vector is by default the length of the longest
 * term, but if the collection is sorted it is possible to compute the minimum length of a prefix that will
 * distinguish any pair of terms.
 *
 * <P>Then, by successively
 * stripping hyperedges with one vertex of degree one, we create an ordering on the hyperedges.
 * Finally, we assign (in reverse order) to each vertex of a hyperedge a number in the range
 * 0 to <var>n</var>-1 in such a way that the sum of the three numbers modulo <var>n</var> is exactly
 * the original index of the term corresponding to the hyperedge. This is possible with high probability,
 * so we try until we succeed.
 *
 * <P>Note that the mathematical results guaranteeing that it is possible to find such a
 * function in expected constant time are <em>asymptotic</em>.
 * Thus, when the number of terms is less than {@link #TERM_THRESHOLD}, an instance
 * of this class just stores the terms in a vector and scans them linearly. This behaviour,
 * however, is completely transparent to the user.
 *
 * <h3>Rounds and Logging</h3>
 *
 * <P>Building a minimal perfect hash map may take hours. As happens with all probabilistic algorithms,
 * one can only give estimates of the expected time.
 *
 * <P>There are two probabilistic sources of problems: degenerate hyperedges and non-acyclic hypergraphs.
 * The ratio between the number of vertices and the number of terms guarantees that acyclicity holds with high
 * probability. However, when generating the hypergraph we use three hash functions, and it must never happen
 * that the values of two of those functions coincide on a given term. Because of the birthday paradox, the
 * probability of getting a nondegenerate edge is just
 * <blockquote>
 * (<var>m</var>-1)(<var>m</var>-2)/<var>m</var><sup>2</sup>,
 * </blockquote>
 * where <var>m</var>=1.26<var>n</var>. Since this must happen <var>n</var> times, we must raise
 * this probability to the <var>n</var>th power, and as <var>n</var> grows the value quickly (and monotonically)
 * approaches <i>e</i><sup>-3/1.26</sup>. So the expected number of trials to generate a random 3-hypergraph
 * would be bounded by <i>e</i><sup>3/1.26</sup>, which is about 11. Note that this bound
 * <em>does not depend on <var>n</var></em>.
 *
 * <p>However, starting from MG4J 1.2 this class deterministically patches the hash functions so that
 * degenerate edges are <em>extremely</em> unlikely (in fact, so unlikely that they never happen). One round of
 * generation should be sufficient to generate a valid hypergraph. The fix is performed by adding
 * {@link #NODE_OVERHEAD} nodes to the graph, and using them to remap the random hash functions in case
 * of clashes. The fix must be repeated, of course, each time {@link #getNumber(MutableString)} is called,
 * but in the vast majority of cases it reduces to two checks for equality with negative result.
 *
 * <P>Once the hypergraph has been generated, the stripping procedure may fail. However, the expected number
 * of trials tends to 1 as <var>n</var>
 * approaches infinity (Czech, Havas and Majewski, for instance, report that on a set of 50,000 terms
 * they consistently obtained one trial in more than 5000 experiments).
 *
 * <P>To help diagnose problems with the generation process,
 * this class will log at {@link org.apache.log4j.Level#INFO INFO} level
 * what's happening. In particular, it will signal whenever a degenerate
 * hyperedge is generated, as well as the various construction phases.
 *
 * <P>Note that if during the generation process the log warns more than once about duplicate hyperedges, you should
 * suspect that there are duplicates in the term list, as duplicate hyperedges are <em>extremely</em> unlikely.
 *
 * @author Sebastiano Vigna
 * @author Paolo Boldi
 * @since 0.1
 * @deprecated Use the new hashing facilities in Sux4J.
 */

@Deprecated
public class MinimalPerfectHash extends AbstractTermMap implements Serializable {
    private static final Logger LOGGER = Util.getLogger( MinimalPerfectHash.class );

    /** The number of nodes the graph will actually have, as a factor of the number of terms. This value guarantees that the graph will be acyclic with positive probability.
     */
    public static final float ENLARGEMENT_FACTOR = 1.26f;
    /** The overhead, in terms of graph nodes, that is used to solve node clashes deterministically. */
    public static final int NODE_OVERHEAD = 16;
    /** The minimum number of terms that will trigger the construction of a minimal perfect hash;
     * otherwise, terms are simply stored in a vector. */
    public static final int TERM_THRESHOLD = 16;
    /** A special value denoting that the weight length is unknown, and should be computed using
     * the maximum length of a term. */
    public static final int WEIGHT_UNKNOWN = -1;
    /** A special value denoting that the weight length is unknown, and should be computed assuming
     * that the terms appear in lexicographic order. */
    public static final int WEIGHT_UNKNOWN_SORTED_TERMS = -2;

    /** The number of buckets. */
    protected final int n;
    /** The number of vertices of the intermediate hypergraph, excluding the {@linkplain #NODE_OVERHEAD overhead nodes}. */
    protected final int m;
    /** The maximum amount of right shift to perform on a 32-bit number so as to obtain something greater than or equal to {@link #m}. */
    protected final int rightShift;
    /** Initialisation values for the intermediate hash functions. */
    protected final int[] init;
    /** Vector of weights used to compute the first intermediate hash function. */
    protected final int[] weight0;
    /** Vector of weights used to compute the second intermediate hash function. */
    protected final int[] weight1;
    /** Vector of weights used to compute the third intermediate hash function. */
    protected final int[] weight2;
    /** The length of the components of the weight vectors (it's faster than asking the length of the vectors). */
    protected final int weightLength;
    /** The final magic: the function turning the intermediate hash values into the final bucket. */
    protected final int[] g;
    /** Four times the number of buckets. */
    protected transient long n4;
    /** If {@link #n} is smaller than {@link #TERM_THRESHOLD}, a vector containing the terms.
     */
    protected transient CharSequence[] t;

    public static final long serialVersionUID = 2L;

    /* The following four methods MUST be kept synchronised. The reason why we duplicate code is
     * that we do not want the overhead of allocating an array when searching for a string.
     *
     * Note that we don't use shift-add-xor hash functions (as in BloomFilter). They are significantly
     * slower, and in this case (in which we certainly have enough weights to disambiguate any pair
     * of strings) they are not particularly advantageous.
     */

    /** Hashes a given term using the intermediate hash functions.
     *
     * @param term a term to hash.
     * @param h a three-element vector that will be filled with the three intermediate hash values.
     */
    protected void hash( final CharSequence term, final int[] h ) {
        // We need these three values to map the empty string correctly.
        int h0 = init[ 0 ], h1 = init[ 1 ], h2 = init[ 2 ];
        char c;
        int i = term.length();
        if ( weightLength < i ) i = weightLength;
        while( i-- != 0 ) {
            c = term.charAt( i );
            h0 ^= weight0[ i ] * c;
            h1 ^= weight1[ i ] * c;
            h2 ^= weight2[ i ] * c;
        }
        h0 = ( h0 >>> rightShift ) % m;
        h1 = ( h1 >>> rightShift ) % m;
        h2 = ( h2 >>> rightShift ) % m;
        // Deterministic fix: we move h1 or h2 into the overhead space if necessary.
        if ( h0 == h1 || h1 == h2 ) h1 = m + h1 % ( NODE_OVERHEAD / 2 );
        if ( h0 == h2 ) h2 = m + NODE_OVERHEAD / 2 + h2 % ( NODE_OVERHEAD / 2 );
        h[ 0 ] = h0;
        h[ 1 ] = h1;
        h[ 2 ] = h2;
    }

    /** Hashes a given term.
     *
     * @param term a term to be hashed.
     * @return the position of the given term in the generating collection, starting from 0. If the
     * term was not in the original collection, the result is a random position.
     */
    public int getNumber( final CharSequence term ) {
        if ( t != null ) return getFromT( term );
        // We need these three values to map the empty string correctly.
        int h0 = init[ 0 ], h1 = init[ 1 ], h2 = init[ 2 ];
        char c;
        int i = term.length();
        if ( weightLength < i ) i = weightLength;
        while( i-- != 0 ) {
            c = term.charAt( i );
            h0 ^= weight0[ i ] * c;
            h1 ^= weight1[ i ] * c;
            h2 ^= weight2[ i ] * c;
        }
        h0 = ( h0 >>> rightShift ) % m;
        h1 = ( h1 >>> rightShift ) % m;
        h2 = ( h2 >>> rightShift ) % m;
        // Deterministic fix: we move h1 or h2 into the overhead space if necessary.
        if ( h0 == h1 || h1 == h2 ) h1 = m + h1 % ( NODE_OVERHEAD / 2 );
        if ( h0 == h2 ) h2 = m + NODE_OVERHEAD / 2 + h2 % ( NODE_OVERHEAD / 2 );
        return (int)( ( (long)g[ h0 ] + g[ h1 ] + g[ h2 ] + n4 ) % n );
    }

    /** Hashes a given term.
     *
     * @param term a term to be hashed.
     * @return the position of the given term in the generating collection,
     * starting from 0. If the term was not in the original collection,
     * the result is a random position.
     */
    public int getNumber( final MutableString term ) {
        if ( t != null ) return getFromT( term );
        // We need these three values to map the empty string correctly.
        int h0 = init[ 0 ], h1 = init[ 1 ], h2 = init[ 2 ];
        char c;
        int i = term.length();
        final char[] a = term.array;
        if ( weightLength < i ) i = weightLength;
        while( i-- != 0 ) {
            c = a[ i ];
            h0 ^= weight0[ i ] * c;
            h1 ^= weight1[ i ] * c;
            h2 ^= weight2[ i ] * c;
        }
        h0 = ( h0 >>> rightShift ) % m;
        h1 = ( h1 >>> rightShift ) % m;
        h2 = ( h2 >>> rightShift ) % m;
        // Deterministic fix: we move h1 or h2 into the overhead space if necessary.
        if ( h0 == h1 || h1 == h2 ) h1 = m + h1 % ( NODE_OVERHEAD / 2 );
        if ( h0 == h2 ) h2 = m + NODE_OVERHEAD / 2 + h2 % ( NODE_OVERHEAD / 2 );
        return (int)( ( (long)g[ h0 ] + g[ h1 ] + g[ h2 ] + n4 ) % n );
    }
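The peel-then-assign scheme described in the class Javadoc (strip hyperedges that have a vertex of degree one, then walk the stripped edges in reverse order, fixing the free vertex of each edge so that its three values sum to the term's index modulo n) can be sketched on a toy, hand-built hypergraph. This is a standalone illustration, not MG4J code: the class name `ChmSketch` and the hard-wired edge triples are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy illustration of the Czech-Havas-Majewski assignment step; not part of MG4J.
public class ChmSketch {
    // One hand-built 3-hyperedge per term, i.e. the three intermediate
    // hash values that the real hash() would produce for that term.
    static final int[][] EDGES = { { 0, 1, 2 }, { 1, 2, 3 }, { 2, 3, 4 } };
    static final int M = 5;              // number of hypergraph vertices
    static final int N = EDGES.length;   // number of terms

    public static void main( String[] args ) {
        final int[] g = assign( EDGES, M, N );
        for ( int i = 0; i < N; i++ ) {
            final int[] e = EDGES[ i ];
            // The lookup is just the sum of the three g values modulo N.
            System.out.println( "term " + i + " -> " + ( g[ e[ 0 ] ] + g[ e[ 1 ] ] + g[ e[ 2 ] ] ) % N );
        }
    }

    /** Peels edges having a degree-one vertex, then assigns g in reverse peeling order. */
    static int[] assign( final int[][] edges, final int m, final int n ) {
        final int[] degree = new int[ m ];
        for ( int[] e : edges ) for ( int v : e ) degree[ v ]++;
        final boolean[] peeled = new boolean[ n ];
        final Deque<int[]> stack = new ArrayDeque<int[]>(); // pairs { edge index, free vertex }
        int removed = 0;
        while ( removed < n ) {
            boolean progress = false;
            for ( int i = 0; i < n; i++ ) {
                if ( peeled[ i ] ) continue;
                for ( int v : edges[ i ] ) {
                    if ( degree[ v ] == 1 ) {
                        // v appears only in edge i among the remaining edges: peel it.
                        stack.push( new int[] { i, v } );
                        peeled[ i ] = true;
                        for ( int w : edges[ i ] ) degree[ w ]--;
                        removed++;
                        progress = true;
                        break;
                    }
                }
            }
            if ( !progress ) throw new IllegalStateException( "hypergraph is not peelable" );
        }
        // Reverse order: fix the free vertex so the edge sums to the term's index mod n.
        // The free vertex is guaranteed untouched by all edges processed so far.
        final int[] g = new int[ m ];
        while ( !stack.isEmpty() ) {
            final int[] top = stack.pop();
            final int i = top[ 0 ], free = top[ 1 ];
            int sum = 0;
            for ( int v : edges[ i ] ) if ( v != free ) sum += g[ v ];
            g[ free ] = ( ( i - sum ) % n + n ) % n;
        }
        return g;
    }
}
```

Running the sketch prints each term mapped to its own ordinal position (`term 0 -> 0`, `term 1 -> 1`, `term 2 -> 2`), which is exactly the order-preserving property the real class obtains at scale.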