minimalperfecthash.java
package it.unimi.dsi.mg4j.util;

/*
 * MG4J: Managing Gigabytes for Java
 *
 * Copyright (C) 2002-2007 Sebastiano Vigna
 *
 * This library is free software; you can redistribute it and/or modify it
 * under the terms of the GNU Lesser General Public License as published by the Free
 * Software Foundation; either version 2.1 of the License, or (at your option)
 * any later version.
 *
 * This library is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
 * or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License
 * for more details.
 *
 * You should have received a copy of the GNU Lesser General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 *
 */

import it.unimi.dsi.Util;
import it.unimi.dsi.fastutil.booleans.BooleanArrays;
import it.unimi.dsi.fastutil.ints.IntArrays;
import it.unimi.dsi.fastutil.io.BinIO;
import it.unimi.dsi.mg4j.index.AbstractTermMap;
import it.unimi.dsi.mg4j.index.TermMap;
import it.unimi.dsi.mg4j.io.FastBufferedReader;
import it.unimi.dsi.mg4j.io.FileLinesCollection;
import it.unimi.dsi.mg4j.io.LineIterator;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UnsupportedEncodingException;
import java.lang.reflect.InvocationTargetException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Date;
import java.util.Iterator;
import java.util.zip.GZIPInputStream;

import org.apache.log4j.Logger;

import cern.colt.GenericPermuting;
import cern.colt.GenericSorting;
import cern.colt.Swapper;
import cern.colt.function.IntComparator;
import cern.jet.random.engine.MersenneTwister;

import com.martiansoftware.jsap.FlaggedOption;
import com.martiansoftware.jsap.JSAP;
import com.martiansoftware.jsap.JSAPException;
import com.martiansoftware.jsap.JSAPResult;
import com.martiansoftware.jsap.Parameter;
import com.martiansoftware.jsap.SimpleJSAP;
import com.martiansoftware.jsap.Switch;
import com.martiansoftware.jsap.UnflaggedOption;
import com.martiansoftware.jsap.stringparsers.ForNameStringParser;

/** Order-preserving minimal perfect hash tables.
 *
 * <P>Given a list of terms without duplicates,
 * the constructors of this class find an order-preserving minimal perfect hash function for
 * the list. Subsequent calls to the {@link #getNumber(CharSequence)} method will return the ordinal position of
 * the provided character sequence (if it appeared in the list; otherwise, the result is a random position). The class
 * can then be saved by serialisation and reused later.
 *
 * <P>If you also need to establish whether a given term was or was not in the
 * list, you should consider using a {@link SignedMinimalPerfectHash}
 * implementing class instead.
 *
 * <P>This class is very scalable, and if you have enough memory it will handle
 * efficiently hundreds of millions of terms: in particular, the
 * {@linkplain #MinimalPerfectHash(String, String) offline constructor}
 * can build a map without loading the terms into memory.
 *
 * <P>To do its job, the class must store three vectors of weights that are used to compute
 * certain hash functions. By default, the vectors are as long as the longest term, but if
 * your collection is sorted you can ask (passing {@link #WEIGHT_UNKNOWN_SORTED_TERMS} to a constructor)
 * for the shortest possible vector length. This optimisation does not change the
 * memory footprint, but it can improve performance.
 *
 * <P>As a convenience, this class provides a main method that reads from
 * standard input a (possibly <samp>gzip</samp>'d) sequence of newline-separated terms, and
 * writes a serialised minimal perfect hash table for the given list.
 * You can specify on the command line the kind of table (e.g.,
 * {@link it.unimi.dsi.mg4j.util.HashCodeSignedMinimalPerfectHash}) and have it fetched by reflection.
 * As a convenience, all signed classes in MG4J have a main method invoking the {@linkplain #main(Class, String[])
 * parameterised main method} of this class,
 * which accepts the default class to be built. In this way, running the main method of
 * any signed class provides the same features.
 *
 * <P>For efficiency, there are also methods that access a minimal perfect hash
 * {@linkplain #getNumber(byte[], int, int) using byte arrays interpreted as ISO-8859-1} characters.
 *
 * <h3>Implementation Details</h3>
 *
 * <P>An object of this class uses about 1.26<var>n</var> integers between 0 and <var>n</var>-1 inclusive
 * to hash <var>n</var> terms. This is asymptotically optimal, but for small collections using an integer
 * wastes a large number of bits in each entry. At construction time, however, about 15<var>n</var> integers
 * (i.e., 60<var>n</var> bytes) are necessary.
 *
 * <P>The technique used here is suggested (but not fully described) in the second edition of <em>Managing
 * Gigabytes</em>. It is due to Havas, Majewski, Wormald and Czech, and it is fully described in
 * the monograph by Czech, Havas and Majewski on perfect hashing (“Perfect Hashing”,
 * <i>Theoret. Comput. Sci.</i> 182:1−143, 1997).
 *
 * <P>First, a triple of intermediate hash functions (generated via universal hashing) defines for each
 * term a 3-hyperedge in a hypergraph with 1.26<var>n</var> vertices. Each intermediate hash function
 * uses a vector of random weights; the length of the vector is by default the length of the longest
 * term, but if the collection is sorted it is possible to compute the minimum length of a prefix that will
 * distinguish any pair of terms.
 *
 * <P>Then, by successively
 * stripping hyperedges with one vertex of degree one, we create an ordering on the hyperedges.
 * Finally, we assign (in reverse order) to each vertex of a hyperedge a number in the range
 * 0 to <var>n</var>-1 in such a way that the sum of the three numbers modulo <var>n</var> is exactly
 * the original index of the term corresponding to the hyperedge. This is possible with high probability,
 * so we try until we succeed.
 *
 * <P>Note that the mathematical results guaranteeing that it is possible to find such a
 * function in expected constant time are <em>asymptotic</em>.
 * Thus, when the number of terms is less than {@link #TERM_THRESHOLD}, an instance
 * of this class just stores the terms in a vector and scans them linearly. This behaviour,
 * however, is completely transparent to the user.
 *
 * <h3>Rounds and Logging</h3>
 *
 * <P>Building a minimal perfect hash map may take hours. As happens with all probabilistic algorithms,
 * one can only give estimates of the expected time.
 *
 * <P>There are two probabilistic sources of problems: degenerate hyperedges and non-acyclic hypergraphs.
 * The ratio between the number of vertices and the number of terms guarantees that acyclicity holds with high
 * probability. However, when generating the hypergraph we use three hash functions, and it must never happen
 * that the values of two of those functions coincide on a given term. Because of the birthday paradox, the
 * probability of getting a nondegenerate edge is just
 * <blockquote>
 * (<var>m</var>-1)(<var>m</var>-2)/<var>m</var><sup>2</sup>,
 * </blockquote>
 * where <var>m</var>=1.26<var>n</var>. Since this must happen <var>n</var> times, we must raise
 * this probability to the <var>n</var>th power, and as <var>n</var> grows the value quickly (and monotonically)
 * approaches <i>e</i><sup>-3/1.26</sup>. So the expected number of trials to generate a random 3-hypergraph
 * would be bounded by <i>e</i><sup>3/1.26</sup>, which is about 11. Note that this bound
 * <em>does not depend on <var>n</var></em>.
 *
 * <p>However, starting from MG4J 1.2 this class deterministically patches the hash functions so that
 * degenerate edges are <em>extremely</em> unlikely (in fact, so unlikely that they never happen). One round of
 * generation should be sufficient to generate a valid hypergraph. The fix is performed by adding
 * {@link #NODE_OVERHEAD} nodes to the graph, and using them to remap the random hash functions in case
 * of clashes. The fix must be repeated, of course, each time {@link #getNumber(MutableString)} is called,
 * but in the vast majority of cases it reduces to two checks for equality with negative result.
 *
 * <P>Once the hypergraph has been generated, the stripping procedure may fail. However, the expected number
 * of trials tends to 1 as <var>n</var>
 * approaches infinity (Czech, Havas and Majewski, for instance, report that on a set of 50,000 terms
 * they consistently obtained one trial in more than 5000 experiments).
 *
 * <P>To help diagnose problems with the generation process,
 * this class will log at {@link org.apache.log4j.Level#INFO INFO} level
 * what's happening. In particular, it will signal whenever a degenerate
 * hyperedge is generated, as well as the various construction phases.
 *
 * <P>Note that if during the generation process the log warns more than once about duplicate hyperedges, you should
 * suspect that there are duplicates in the term list, as duplicate hyperedges are <em>extremely</em> unlikely.
 *
 * @author Sebastiano Vigna
 * @author Paolo Boldi
 * @since 0.1
 * @deprecated Use the new hashing facilities in Sux4J.
 */

@Deprecated
public class MinimalPerfectHash extends AbstractTermMap implements Serializable {
    private static final Logger LOGGER = Util.getLogger( MinimalPerfectHash.class );

    /** The number of nodes the graph will actually have, as a factor of the number of terms. This value guarantees that the graph will be acyclic with positive probability.
     */
    public static final float ENLARGEMENT_FACTOR = 1.26f;
    /** The overhead, in terms of graph nodes, that is used to solve node clashes deterministically. */
    public static final int NODE_OVERHEAD = 16;
    /** The minimum number of terms that will trigger the construction of a minimal perfect hash;
     * otherwise, terms are simply stored in a vector. */
    public static final int TERM_THRESHOLD = 16;
    /** A special value denoting that the weight length is unknown, and should be computed using
     * the maximum length of a term. */
    public static final int WEIGHT_UNKNOWN = -1;
    /** A special value denoting that the weight length is unknown, and should be computed assuming
     * that the terms appear in lexicographic order. */
    public static final int WEIGHT_UNKNOWN_SORTED_TERMS = -2;

    /** The number of buckets. */
    protected final int n;
    /** The number of vertices of the intermediate hypergraph, excluding the {@linkplain #NODE_OVERHEAD overhead nodes}. */
    protected final int m;
    /** The maximum amount of right shift to perform on a 32-bit number so as to obtain something greater than or equal to {@link #m}. */
    protected final int rightShift;
    /** Initialisation values for the intermediate hash functions. */
    protected final int[] init;
    /** Vector of weights used to compute the first intermediate hash function. */
    protected final int[] weight0;
    /** Vector of weights used to compute the second intermediate hash function. */
    protected final int[] weight1;
    /** Vector of weights used to compute the third intermediate hash function. */
    protected final int[] weight2;
    /** The length of the components of the weight vectors (it's faster than asking the length of the vectors). */
    protected final int weightLength;
    /** The final magic: the function turning the intermediate hash values into the final bucket. */
    protected final int[] g;
    /** Four times the number of buckets. */
    protected transient long n4;
    /** If {@link #n} is smaller than {@link #TERM_THRESHOLD}, a vector containing the terms.
     */
    protected transient CharSequence[] t;

    public static final long serialVersionUID = 2L;

    /* The following four methods MUST be kept synchronised. The reason why we duplicate code is
     * that we do not want the overhead of allocating an array when searching for a string.
     *
     * Note that we don't use shift-add-xor hash functions (as in BloomFilter). They are significantly
     * slower, and in this case (in which we certainly have enough weights to disambiguate any pair
     * of strings) they are not particularly advantageous.
     */

    /** Hashes a given term using the intermediate hash functions.
     *
     * @param term a term to hash.
     * @param h a three-element vector that will be filled with the three intermediate hash values.
     */
    protected void hash( final CharSequence term, final int[] h ) {
        // We need these three values to map the empty string correctly.
        int h0 = init[ 0 ], h1 = init[ 1 ], h2 = init[ 2 ];
        char c;
        int i = term.length();
        if ( weightLength < i ) i = weightLength;
        while( i-- != 0 ) {
            c = term.charAt( i );
            h0 ^= weight0[ i ] * c;
            h1 ^= weight1[ i ] * c;
            h2 ^= weight2[ i ] * c;
        }
        h0 = ( h0 >>> rightShift ) % m;
        h1 = ( h1 >>> rightShift ) % m;
        h2 = ( h2 >>> rightShift ) % m;
        // Deterministic fix: we move h1 or h2 into the overhead space if necessary.
        if ( h0 == h1 || h1 == h2 ) h1 = m + h1 % ( NODE_OVERHEAD / 2 );
        if ( h0 == h2 ) h2 = m + NODE_OVERHEAD / 2 + h2 % ( NODE_OVERHEAD / 2 );
        h[ 0 ] = h0;
        h[ 1 ] = h1;
        h[ 2 ] = h2;
    }

    /** Hashes a given term.
     *
     * @param term a term to be hashed.
     * @return the position of the given term in the generating collection, starting from 0. If the
     * term was not in the original collection, the result is a random position.
     */
    public int getNumber( final CharSequence term ) {
        if ( t != null ) return getFromT( term );
        // We need these three values to map the empty string correctly.
        int h0 = init[ 0 ], h1 = init[ 1 ], h2 = init[ 2 ];
        char c;
        int i = term.length();
        if ( weightLength < i ) i = weightLength;
        while( i-- != 0 ) {
            c = term.charAt( i );
            h0 ^= weight0[ i ] * c;
            h1 ^= weight1[ i ] * c;
            h2 ^= weight2[ i ] * c;
        }
        h0 = ( h0 >>> rightShift ) % m;
        h1 = ( h1 >>> rightShift ) % m;
        h2 = ( h2 >>> rightShift ) % m;
        // Deterministic fix: we move h1 or h2 into the overhead space if necessary.
        if ( h0 == h1 || h1 == h2 ) h1 = m + h1 % ( NODE_OVERHEAD / 2 );
        if ( h0 == h2 ) h2 = m + NODE_OVERHEAD / 2 + h2 % ( NODE_OVERHEAD / 2 );
        return (int)( ( (long)g[ h0 ] + g[ h1 ] + g[ h2 ] + n4 ) % n );
    }

    /** Hashes a given term.
     *
     * @param term a term to be hashed.
     * @return the position of the given term in the generating collection,
     * starting from 0. If the term was not in the original collection,
     * the result is a random position.
     */
    public int getNumber( final MutableString term ) {
        if ( t != null ) return getFromT( term );
        // We need these three values to map the empty string correctly.
        int h0 = init[ 0 ], h1 = init[ 1 ], h2 = init[ 2 ];
        char c;
        int i = term.length();
        final char[] a = term.array;
        if ( weightLength < i ) i = weightLength;
        while( i-- != 0 ) {
            c = a[ i ];
            h0 ^= weight0[ i ] * c;
            h1 ^= weight1[ i ] * c;
            h2 ^= weight2[ i ] * c;
        }
        h0 = ( h0 >>> rightShift ) % m;
        h1 = ( h1 >>> rightShift ) % m;
        h2 = ( h2 >>> rightShift ) % m;
        // Deterministic fix: we move h1 or h2 into the overhead space if necessary.
        if ( h0 == h1 || h1 == h2 ) h1 = m + h1 % ( NODE_OVERHEAD / 2 );
        if ( h0 == h2 ) h2 = m + NODE_OVERHEAD / 2 + h2 % ( NODE_OVERHEAD / 2 );
        return (int)( ( (long)g[ h0 ] + g[ h1 ] + g[ h2 ] + n4 ) % n );
    }
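The peel-then-assign scheme described in the class Javadoc (strip hyperedges that have a vertex of degree one, then walk the stripped edges in reverse order, fixing the free vertex of each edge so that its three values sum to the term's index modulo n) can be sketched on a toy, hand-built hypergraph. This is a standalone illustration, not MG4J code: the class name `ChmSketch` and the hard-wired edge triples are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy illustration of the Czech-Havas-Majewski assignment step; not part of MG4J.
public class ChmSketch {
    // One hand-built 3-hyperedge per term, i.e. the three intermediate
    // hash values that the real hash() would produce for that term.
    static final int[][] EDGES = { { 0, 1, 2 }, { 1, 2, 3 }, { 2, 3, 4 } };
    static final int M = 5;              // number of hypergraph vertices
    static final int N = EDGES.length;   // number of terms

    public static void main( String[] args ) {
        final int[] g = assign( EDGES, M, N );
        for ( int i = 0; i < N; i++ ) {
            final int[] e = EDGES[ i ];
            // The lookup is just the sum of the three g values modulo N.
            System.out.println( "term " + i + " -> " + ( g[ e[ 0 ] ] + g[ e[ 1 ] ] + g[ e[ 2 ] ] ) % N );
        }
    }

    /** Peels edges having a degree-one vertex, then assigns g in reverse peeling order. */
    static int[] assign( final int[][] edges, final int m, final int n ) {
        final int[] degree = new int[ m ];
        for ( int[] e : edges ) for ( int v : e ) degree[ v ]++;
        final boolean[] peeled = new boolean[ n ];
        final Deque<int[]> stack = new ArrayDeque<int[]>(); // pairs { edge index, free vertex }
        int removed = 0;
        while ( removed < n ) {
            boolean progress = false;
            for ( int i = 0; i < n; i++ ) {
                if ( peeled[ i ] ) continue;
                for ( int v : edges[ i ] ) {
                    if ( degree[ v ] == 1 ) {
                        // v appears only in edge i among the remaining edges: peel it.
                        stack.push( new int[] { i, v } );
                        peeled[ i ] = true;
                        for ( int w : edges[ i ] ) degree[ w ]--;
                        removed++;
                        progress = true;
                        break;
                    }
                }
            }
            if ( !progress ) throw new IllegalStateException( "hypergraph is not peelable" );
        }
        // Reverse order: fix the free vertex so the edge sums to the term's index mod n.
        // The free vertex is guaranteed untouched by all edges processed so far.
        final int[] g = new int[ m ];
        while ( !stack.isEmpty() ) {
            final int[] top = stack.pop();
            final int i = top[ 0 ], free = top[ 1 ];
            int sum = 0;
            for ( int v : edges[ i ] ) if ( v != free ) sum += g[ v ];
            g[ free ] = ( ( i - sum ) % n + n ) % n;
        }
        return g;
    }
}
```

Running the sketch prints each term mapped to its own ordinal position (`term 0 -> 0`, `term 1 -> 1`, `term 2 -> 2`), which is exactly the order-preserving property the real class obtains at scale.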