HashTables.pptx
http:
algs4.cs.princeton.edu
Algorithms
Robert Sedgewick | Kevin Wayne
3.4 Hash Tables
hash functions
separate chaining
linear probing
context
http:
algs4.cs.princeton.edu
Algorithms
Robert Sedgewick | Kevin Wayne
[wayne f11] include memory of separate chaining and linear probing vs. red-black BST?
Symbol table implementations: summary
Q. Can we do better?
A. Yes, but with different access to the data.
2
implementation guarantee average case ordered
ops? key
interface
search insert delete search hit insert delete
sequential search
(unordered list) N N N ½ N N ½ N equals()
binary search
(ordered a
ay) lg N N N lg N ½ N ½ N ✔ compareTo()
BST N N N 1.39 lg N 1.39 lg N √ N ✔ compareTo()
red-black BST 2 lg N 2 lg N 2 lg N 1.0 lg N 1.0 lg N 1.0 lg N ✔ compareTo()
3
Hashing: basic plan
Save items in a key-indexed table (index is a function of the key).
Hash function. Method for computing a
ay index from key.
Issues.
Computing the hash function.
Equality test: Method for checking whether two keys are equal.
Collision resolution: Algorithm and data structure
to handle two keys that hash to the same a
ay index.
Classic space-time tradeoff.
No space limitation: trivial hash function with key as index.
No time limitation: trivial collision resolution with sequential search.
Space and time limitations: hashing (the real world).
hash("times") = 3
??
0
1
2
3 "it"
4
5
hash("it") = 3
hashing at its core is a space-time tradeoff: if our keys are 32-bit integers, can use a hash table of size 2^32
http:
algs4.cs.princeton.edu
hash functions
separate chaining
linear probing
context
3.4 Hash Tables
http:
algs4.cs.princeton.edu
5
Computing the hash function
Idealistic goal. Scramble the keys uniformly to produce a table index.
Efficiently computable.
Each table index equally likely for each key.
Ex 1. Phone numbers.
Bad: first three digits.
Better: last three digits.
Ex 2. Social Security numbers.
Bad: first three digits.
Better: last three digits.
Practical challenge. Need different approach for each key type.
thoroughly researched problem,
still problematic in practical applications
573 = California, 574 = Alaska
(assigned in chronological order within geographic region)
key
table
index
Goal: hash function scrambles keys, then each index will co
espond to roughly 1/M fraction of elements
Assumes table size is M = XXXXXXXXXXdecimal digits)
6
Java’s hash code conventions
All Java classes inherit a method hashCode(), which returns a 32-bit int.
Requirement. If x.equals(y), then (x.hashCode() == y.hashCode()).
Highly desirable. If !x.equals(y), then (x.hashCode() != y.hashCode()).
Default implementation. Memory address of x.
Legal (but poor) implementation. Always return 17.
Customized implementations. Integer, Double, String, File, URL, Date, …
User-defined types. Users are on their own.
x.hashCode()
x
y.hashCode()
y
int between -2^31 and 2^31 - 1
epeated calls to hashCode() must return same value (provided no info used in equals() is changed)
Ensures hashing can be used for every object type.
Enables expert implementations for each type. This is the sweet spot for hashing.
7
Implementing hash code: integers, booleans, and doubles
public final class Intege
{
private final int value;
...
public int hashCode()
{ return value; }
}
convert to IEEE 64-bit representation;
xor most significant 32-bits
with least significant 32-bits
Warning: -0.0 and +0.0 have different hash codes
public final class Double
{
private final double value;
...
public int hashCode()
{
long bits = doubleToLongBits(value);
return (int) (bits ^ (bits
32));
}
}
public final class Boolean
{
private final boolean value;
...
public int hashCode()
{
if (value) return 1231;
else return 1237;
}
}
Java li
ary implementations
Java li
ary implementations
Warning: this means that need special care when implementing equals(). The equals() method in Double considers -0.0 and 0.0 to be not equal even though 0.0 == -0.0
Horner's method to hash string of length L: L multiplies/adds.
Equivalent to h = s[0] · 31L–1 + … + s[L – 3] · XXXXXXXXXXs[L – 2] · XXXXXXXXXXs[L – 1] · 310.
Ex.
public final class String
{
private final char[] s;
...
public int hashCode()
{
int hash = 0;
for (int i = 0; i < length(); i++)
XXXXXXXXXXhash = s[i] + (31 * hash);
return hash;
}
}
8
Implementing hash code: strings
3045982 = 99· XXXXXXXXXX· XXXXXXXXXX· XXXXXXXXXX·310
= XXXXXXXXXX· XXXXXXXXXX · XXXXXXXXXX · (99)))
(Horner's method)
ith character of s
String s = "call";
int code = s.hashCode();
char Unicode
… …
'a' 97
'b' 98
'c' 99
… ...
Java li
ary implementation
Note: reference Java implementation caches string hash codes.
this is essentially Java's implementation of hashCode() for strings since Java 1.5.
It's ok to add an int and a char. It's also ok if the result overflows since this is all well-defined in java. (two's complement integers)
Why 31? It's a Mersenne prime (one less than a power of 2) -> easy to implement with shift and subtract 1.
Performance optimization.
Cache the hash value in an instance variable.
Return cached value.
Q. What if hashCode() of string is 0?
public final class String
{
private int hash = 0;
private final char[] s;
...
public int hashCode()
{
int h = hash;
if (h != 0) return h;
for (int i = 0; i < length(); i++)
XXXXXXXXXXh = s[i] + (31 * h);
hash = h;
return h;
}
}
9
Implementing hash code: strings
eturn cached value
cache of hash code
store cache of hash code
Note: if hash value is 0, it is recomputed every time, e.g., "pollinating sandboxes"
The code is a bit ve
ose in case threads are calling hashCode() concu
ently
Another place where caching a value is useful is on 8-puzzle assignment - cache the manhattan distance in Board (or SearchNode)
Note: still immutable even though instance variable hash can change.
10
Implementing hash code: user-defined types
public final class Transaction implements Comparable
{
private final String who;
private final Date when;
private final double amount;
public Transaction(String who, Date when, double amount)
{ /* as before */ }
...
public boolean equals(Object y)
{ /* as before */ }
public int hashCode()
{
int hash = 17;
hash = 31*hash + who.hashCode();
hash = 31*hash + when.hashCode();
hash = 31*hash + ((Double) amount).hashCode();
return hash;
}
}
typically a small prime
nonzero constant
for primitive types, use hashCode()
of wrapper type
for reference types,
use hashCode()
the 17 helps reduce collisions when there are leading fields with 0s. Java often uses 1 as the constant instead of 17.
11
Hash code design
"Standard" recipe for user-defined types.
Combine each significant field using the 31x + y rule.
If field is a primitive type, use wrapper type hashCode().
If field is null, return 0.
If field is a reference type, use hashCode().
If field is an a
ay, apply to each entry.
In practice. Recipe works reasonably well; used in Java li
aries.
In theory. Keys are bitstring; "universal" hash functions exist.
Basic rule. Need to use the whole key to compute hash code;
consult an expert for state-of-the-art hash codes.
or use A
ays.deepHashCode()
applies rule recursively
used in Java li
aries such as A
ays.deepHashCode() and String
Note: Java makes it difficult to implement universal hashing and tabular hashing because there is only one hash function (from hashCode())
Hash code. An int between -231 and XXXXXXXXXX.
Hash function. An int between 0 and M - 1 (for use as a
ay index).
12
Modular hashing
typically a prime or power of 2
private int hash(Key key)
{ return key.hashCode() % M; }
ug
private int hash(Key key)
{ return Math.abs(key.hashCode()) % M; }
private int hash(Key key)
{ return (key.hashCode() & 0x7fffffff) % M; }
co
ect
1-in-a-billion bug
hashCode() of "polygenelu
icants" is -231
x.hashCode()
x
hash(x)
int between -2^32 and 2^31 - 1
Can't use h % M for index since h can be negative so % can return negative number. Plausible fix |h| % M doesn't work since |h| can be negative if h = -2^31. Should also count 1 bit-whacking operation in work to hash string of length W.
13
Uniform hashing assumption
Uniform hashing assumption. Each key is equally likely to hash to an integer between 0 and M - 1.
Bins and balls. Throw balls uniformly at random into M bins.
Birthday problem. Expect two balls in the same bin after ~ π M / 2 tosses.
Coupon collector. Expect every bin has ≥ 1 ball after ~ M ln M tosses.
Load balancing. After M tosses, expect most loaded bin has
Q ( log M / log log
M ) balls.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Ex. After 23 people enter a room, expect two with same birthday.
14
Uniform hashing assumption
Uniform hashing assumption. Each key is equally likely to hash to an integer between 0 and M - 1.
Bins and balls. Throw balls uniformly at random into M