📄 stdhash.c
字号:
/* Copyright (c) 2000, The Johns Hopkins University * All rights reserved. * * The contents of this file are subject to a license (the ``License'') * that is the exact equivalent of the BSD license as of July 23, 1999. * You may not use this file except in compliance with the License. The * specific language governing the rights and limitations of the License * can be found in the file ``STDUTIL_LICENSE'' found in this * distribution. * * Software distributed under the License is distributed on an AS IS * basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. * * The Original Software is: * The Stdutil Library * * Contributors: * Creator - John Lane Schultz (jschultz@cnds.jhu.edu) * The Center for Networking and Distributed Systems * (CNDS - http://www.cnds.jhu.edu) */ #include <stdlib.h>#include <string.h>#ifdef USE_DMALLOC#include <dmalloc.h>#endif#include "stdutil/stdutil.h"#include "stdutil/stderror.h"#include "stdutil/stdhash.h"/* if all the necessary types exist then you can use stdhash */#ifdef STDRAND_EXISTS/* stdhash is a table based implementation of a dictionary data structure. In particular, it is an open-address double hashing hashtable (ref: Sedgewick: Algorithms in C, 3rd Ed.) that maps non-unique keys to values. Values can be zero length which allows this dictionary to function also as a hashset of keys. Both keys and values are contained by value inside the dictionary, just like all of the other data structures in the stdutil library. The user must specify a key equality relation and a key hashcode relation for his keys. Several such functions are defined by stdkvp.[ch] for commonly used key types. _The user must be very careful when crafting such a pair of fcns_: (1) equals(k1, k2) MUST be reflexive, symmetric, and transitive (2) hashcode(k) MUST be invariant (always the same for a particular key) (3) if equals(k1, k2) then hashcode(k1) == hashcode(k2) MUST hold ** Note that condition 3 is fairly stringent!! Also the user should consider the following: (4) valid values for the key hashcode function are in the range: [0, 2^(sizeof(size_t) * 8 - 1)). The most significant bit can legally be set, but it will be ignored -- possibly causing more hashcode collisions than expected. (5) if !equals(k1, k2) then the probability that hashcode(k1) == hashcode(k2) should not be significantly more than 1 / 2^(sizeof(size_t) * 8 - 1) for best performance. At the core of this hashtable implementation are two hash fcns: The first hash fcn maps a key's hashcode to the range [0, n) (where n is the capacity of the hashtable). This mapping is used to find where in the table a search for that key should begin. The second hash fcn returns an invariant offset (based on the hashcode) for the key that is non-zero, less than and relatively prime to n. This offset is used to jump through the hashtable when searching for a particular key. Example: searching for a key in the table... First, the hashcode is computed for the key. Based on that, the first hash fcn returns an index in the table, ind, at which to begin searching for the key. If ind contains no key-val pair then the key does not exist in the table and our search is done. If ind contains a pair and it's key matches the search key then our search is done. If ind contains a pair but it's key doesn't match the search key (both its hashcode and equals return a match) then I examine index (ind += offset) (mod n). I continue this process until I find a key match or I find an empty index. Because offset is relatively prime to n, I can check every index in the table in this manner in n iterations (the sequence of indexes generated by this process for a particular key is called a probe sequence). However, by keeping the table sparsely populated the search will always terminate, and in practice the search will on average terminate after a constant number of steps. The first hash fcn (see implementation below: hash) is a kick ass implementation of the strongly 2-universal family of hash fcns that maps integers from the range [0, m) to the range [0, n): h(i)[a, b] = ((a * i + b) (mod p)) (mod n) Where n < m <= p <= 2m (usually n << m), p is prime, and a and b are uniformly chosen random parameters of the function (0 < a < p, 0 <= b < p). However, rather than using a large prime p, I use 2^(sizeof(size_t) * 8). This causes problems if either a or i is a multiple of 2. To get around these problems I simply force a and i to be odd (see next paragraph for how). Because of this, a * i is relatively prime to my "p" and I get all of the nice properties of doing multiplication modulo a prime. However, because I did the math mod 2^(sizeof(size_t) * 8) the mod p operation is done implicitly by the standard definition of multiplication and addition of size_t types in C. Furthermore, I implement the second mod operation (mod n) by a bitwise AND operation. I do this by _making the capacity (n) of the table a power of 2_ and then (x % n) == (x & (n-1))! One catch though: because a * i is always odd, b always determines what the lowest bit of y = a * i + b is. If b is odd then y is always even, if b is even then y is always odd. So, the lowest bit doesn't contribute anything and I need to remove it. That is why in the hash() fcn below I right shift the result y. So the two expensive modulo operations above are replaced by a single bitwise AND operation! I force a to be odd by uniformly picking a number in the range [0, 2^(sizeof(size_t) * 8)) and then OR'ing it with 0x1. I force i in the range [0, 2^(sizeof(size_t) * 8 - 1)) to be odd by multiplying it by 2 and then OR'ing it with 0x1. This operation maps i to odd integers in the range [1, 2^(sizeof(size_t) * 8)). I often refer to these numbers as expanded hashcodes or exphcodes. I choose b uniformly from the range [0, 2^(sizeof(size_t) * 8)), but the lowest bit does not contribute anything (see explanation in above paragraph). The second hash fcn is simply the lowest lg(n) bits of the expanded hashcode for a key. This value is always odd making it relatively prime to n (table capacity), _which is a power of 2_. This allows us to iteratively search through the table. This hash function is simply a bitwise AND operation between the expanded hashcode and the. Another important thing to note about this hashtable is that because it uses an open-addressing scheme, key removals are more complex than in other schemes. Instead of actually deleting the node and saying it is empty, I leave the node in its space but I set the cached, expanded hashcode to an impossible value (0) to indicate that it was removed, but searchs should not terminate here. This is because some insert probe sequence might have depended on the removed node when it was inserted and so I can't simply delete the node. But, if a user tries to insert into the table and such an "inactive" node is found it can be used by the new pair being inserted. Leaving these nodes in the table causes still another problem. If the nodes still exist in the table, then the load factor thresholds must be modified to take this into account when deciding whether or not to grow, shrink, or rehash the table. If they didn't, the table could be filled with "inactive" nodes and searches might never terminate. In this implementation I track the number of nodes in the table by the num_nodes variable. I make decisions on whether to resize or rehash the table based on: size, num_nodes, low_thresh and high_thresh. Some more points: I cache expanded hashcodes for keys in stdhash_nodes. Obviously a key should be constant, meaning that it shouldn't change once it is inserted in to an associative data structure. Anyway, I depend on that fact and property 3 (above) strongly when I am searching for keys. When rehashing occurs I don't free and reallocate active nodes. I allocate a new table and free those inactive nodes (exphcode == 0) that are acting only as place holders (see prior paragraph). Then, all of the active nodes are inserted into the new table. */# ifdef STD_CONSTRUCT_CHECKS# define IS_HASH_INITED(hash) ((hash)->init_val == STDHASH_INITED)# define INIT_HASH(hash) ((hash)->init_val = STDHASH_INITED)# define UNINIT_HASH(hash) ((hash)->init_val = ~STDHASH_INITED)# define IS_IT_INITED(it) ((it)->it_init_val == STDHASH_IT_INITED)# define INIT_IT(it) ((it)->it_init_val = STDHASH_IT_INITED)# else# define IS_HASH_INITED(hash)# define INIT_HASH(hash)# define UNINIT_HASH(hash)# define IS_IT_INITED(it)# define INIT_IT(it)# endif/* macros for table positions (stdhash_node**s) */# define POS_EMPTY(node_pos) ((*(node_pos)) == 0)# define POS_INACTIVE(node_pos) ((*(node_pos))->exphcode == 0)# define IS_VALID_IT(it) \((it)->node_pos >= (it)->hash->begin && (it)->node_pos < (it)->hash->table_end && \ !POS_EMPTY((it)->node_pos) && !POS_INACTIVE((it)->node_pos))# define IS_LEGAL_IT(it) (IS_VALID_IT(it) || (it)->node_pos == (it)->hash->table_end)/* macros for table nodes (stdhash_node*s) */# define NKEY(node) ((node)->kv.key)# define NVAL(node) ((node)->kv.value)inline static int get_cap_and_threshs(size_t size_request, size_t *new_cap, size_t *high_thresh, size_t *low_thresh) { size_t tmp_new_cap = stdgood_pow2_cap(size_request <<= 1); /* <<= !!! */ if (tmp_new_cap <= size_request && size_request != 0) return STD_MEM_FAILURE; *new_cap = tmp_new_cap; *high_thresh = tmp_new_cap >> 1; /* let table get 50% full before realloc'ing again */ *low_thresh = tmp_new_cap >> 3; /* if table drops below 12.5% full shrink */ return STD_SUCCESS;}/* allocs and sets the key-val pair for a node */inline static stdhash_node *init_node(stdhash *h, stdhash_node *node, size_t exphcode, const void *key, const void *val) { if (!(node->kv.key = (void*) malloc(h->ksize))) return 0; if (h->vsize != 0) { if (!(node->kv.value = (void*) malloc(h->vsize))) { free(node->kv.key); return 0; } } else node->kv.value = 0; node->exphcode = exphcode; memcpy(node->kv.key, key, h->ksize); memcpy(node->kv.value, val, h->vsize); return node;}/* allocs a node and inits it to contain the passed key-val pair */inline static stdhash_node *make_node(stdhash *h, size_t exphcode, const void *key, const void *val) { stdhash_node *node; if (!(node = (stdhash_node*) malloc(sizeof(stdhash_node)))) return 0; if (!init_node(h, node, exphcode, key, val)) { free(node); return 0; } return node;}/* frees a node entirely */inline static void free_node(stdhash_node *node) { free(node->kv.key); if (node->kv.value != 0) free(node->kv.value); free(node);}/* see comments at top of file: expands hcode of key into an exphcode */inline static size_t exp_hcode(size_t hcode) { return (hcode << 1) | 0x1;}/* see comments at top of file: computes both hashfcns for a given exphcode and a stdhash */inline static void hash(size_t exphcode, size_t a, size_t b, size_t new_cap_min1, size_t *hash_val, size_t *offset) { *hash_val = (((a * exphcode + b) >> 1) & new_cap_min1); /* [0, n) */ *offset = (exphcode & new_cap_min1); /* (0, n) and always odd */}/* pick new values for a and b -- used when I rehash the table *//* When a is extremely small our second hash fcn is more likely to produce the same offset for two different hashcodes that have the same hash value. In fact, if a == 1 then all keys that have the same hash value will have the same offset. So here I discard very small values of a. Therefore, a is actually an odd number in the range [13, 2^(sizeof(size_t) * 8)). */inline static void update_params(stdhash *h) { do { h->a = stdrand(h->ai); } while (h->a < 12); h->a |= 0x1; /* a needs to be odd */ h->b = stdrand(h->bi); /* lowest bit of b contributes nothing */}/* get next with no safe checking */inline static stdhash_node **next(const stdhash *h, stdhash_node **curr) { for (++curr; curr != h->table_end && (POS_EMPTY(curr) || POS_INACTIVE(curr)); ++curr); return curr;}/* get prev with no safe checking */inline static stdhash_node **prev(const stdhash *h, stdhash_node **curr) { for (--curr; POS_EMPTY(curr) || POS_INACTIVE(curr); --curr); return curr;}/* Find the position of the first active node in the table and return it */inline static stdhash_node **find_begin(const stdhash *h) { return next(h, h->table - 1);}/* This function is called when it has been determined that the active pairs in h need to be put into a different table. It also determines what the new capacity and thresholds should be. */inline static int rehash(stdhash *h, size_t request_size) { stdhash_node **curr = h->table, **end = h->table_end; stdhash_node **search, **t = 0, **t_end; size_t new_cap, new_cap_min1, new_high, new_low, hash_val, offset; STD_SAFE_CHECK(request_size >= h->size); /* don't allow "lossy" rehashes */ /* get a good capacity for this table -- aim for capacity to be around 4 times size */ if (get_cap_and_threshs(request_size, &new_cap, &new_high, &new_low) != 0 || (new_cap != 0 && !(t = (stdhash_node**) calloc(new_cap, sizeof(stdhash_node*))))) return STD_MEM_FAILURE; t_end = t + new_cap; new_cap_min1 = new_cap - 1; update_params(h); /* pick new a and b parameters */ for (; curr != end; ++curr) { if (POS_EMPTY(curr)) /* ignore empty positions */ continue; else if (POS_INACTIVE(curr)) /* free inactive nodes */ free_node(*curr); else { /* insert active node into t */ hash((*curr)->exphcode, h->a, h->b, new_cap_min1, &hash_val, &offset); search = t + hash_val; while (!POS_EMPTY(search)) /* simple search: find an empty spot */ if ((search += offset) >= t_end) search = t + (search - t_end); *search = *curr; /* insert active node */ } } if (h->table) free(h->table); h->table = t; h->table_end = t_end; h->begin = find_begin(h); /* find first active element or end */ h->cap_min1 = new_cap_min1; h->high_thresh = new_high; h->low_thresh = new_low; h->num_nodes = h->size; /* no inactive nodes right now, only active */ return STD_SUCCESS;}inline static size_t sizeof_stdkvp(const stdhash_it *it) { STD_CONSTRUCT_CHECK(IS_IT_INITED(it)); return stdhash_kvp_size(it->hash);}/******************* stdhash_it interface **************************************/inline void *stdhash_it_key(const stdhash_it *it) { STD_CONSTRUCT_CHECK(IS_IT_INITED(it) && IS_HASH_INITED(it->hash)); STD_BOUNDS_CHECK(IS_VALID_IT(it)); return NKEY(*it->node_pos);}inline void *stdhash_it_val(const stdhash_it *it) { STD_CONSTRUCT_CHECK(IS_IT_INITED(it) && IS_HASH_INITED(it->hash)); STD_BOUNDS_CHECK(IS_VALID_IT(it)); return NVAL(*it->node_pos);}inline stdkvp *stdhash_it_kvp(const stdhash_it *it) { STD_CONSTRUCT_CHECK(IS_IT_INITED(it) && IS_HASH_INITED(it->hash)); STD_BOUNDS_CHECK(IS_VALID_IT(it)); return &(*it->node_pos)->kv;}inline stdbool stdhash_it_equals(const stdhash_it *it1, const stdhash_it *it2) { STD_CONSTRUCT_CHECK(IS_IT_INITED(it1) && IS_IT_INITED(it2)); STD_CONSTRUCT_CHECK(IS_HASH_INITED(it1->hash) && IS_HASH_INITED(it2->hash)); STD_BOUNDS_CHECK(it1->hash == it2->hash && IS_LEGAL_IT(it1) && IS_LEGAL_IT(it2)); return it1->node_pos == it2->node_pos;}inline stdbool stdhash_it_is_begin(const stdhash_it *it) { STD_CONSTRUCT_CHECK(IS_IT_INITED(it) && IS_HASH_INITED(it->hash)); return it->node_pos == it->hash->begin;}inline stdbool stdhash_it_is_end(const stdhash_it *it) { STD_CONSTRUCT_CHECK(IS_IT_INITED(it) && IS_HASH_INITED(it->hash)); return it->node_pos == it->hash->table_end;}inline stdhash_it *stdhash_it_seek_begin(stdhash_it *it) { STD_CONSTRUCT_CHECK(IS_IT_INITED(it) && IS_HASH_INITED(it->hash)); it->node_pos = it->hash->begin; return it;}inline stdhash_it *stdhash_it_seek_end(stdhash_it *it) { STD_CONSTRUCT_CHECK(IS_IT_INITED(it) && IS_HASH_INITED(it->hash)); it->node_pos = it->hash->table_end; return it;}inline stdhash_it *stdhash_it_next(stdhash_it *it) { STD_CONSTRUCT_CHECK(IS_IT_INITED(it) && IS_HASH_INITED(it->hash)); STD_BOUNDS_CHECK(IS_VALID_IT(it)); it->node_pos = next(it->hash, it->node_pos); return it;}inline stdhash_it *stdhash_it_advance(stdhash_it *it, size_t num_advance) { STD_CONSTRUCT_CHECK(IS_IT_INITED(it) && IS_HASH_INITED(it->hash)); STD_BOUNDS_CHECK(IS_LEGAL_IT(it)); while (num_advance--) { STD_BOUNDS_CHECK(it->node_pos != it->hash->table_end); it->node_pos = next(it->hash, it->node_pos);
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -