Lecture Notes for CS 512, Algorithm Design
Hashing - 23 February 2000
- a dictionary with constant average-time searching behavior
- direct addressing
- use the object itself as an index into an array
- worst case constant access
- fast but big
- objects and pointers
- hash tables
- reduce size requirements by mapping the large universe into a smaller
universe
- the hash function maps from the large key set to the small key set
- collisions: two or more large keys mapping to the same smaller key
- minimize collisions by being uniformly random
- handle collisions when they arise
- chaining
- the hash table
T
contains pointers to linked lists
- item
x
is stored in the linked list pointed to by
T[h(x.key)]
- insertion is constant, searching and deleting is proportional to list
length
- analysis
- in the worst case the hash function h() maps all keys to the same
value
- inserting and deleting are Theta(n)
- let h() be simple uniform - every key is map to all candidates
with uniform probability
- ignore hash computations and hash table access
- any key can end up on any linked list
- with k keys being stored in s slots, expect k/s
items per slot
- the load factor a = k/s
- unsuccessful searches are Theta(1 + a) - the list search time
plus the hash computation time
- successful searches are Theta(1 + a) - the average time for all
possible successful searches
- hash functions
- two tasks: convert the object to a number, map the number
- converting to a number
- mapping numbers
- good hash functions distribute keys evenly over the slots
- division - h(k) = k mod m for some m
- m shouldn't be a power of 2, powers of 10, 2k - 1
- m should be prime and away from powers of two
- multiplication - h(k) = floor(m(kA mod 1))
- m a power of two
- A = (root(5) - 1)/2 = 0.618033
- the leftmost p bits of the fractional part
- universal - random selection from sets of hash functions
- reducing worst-case behavior to an accident
- a universal set of hash functions - the probability of collision
between two objects is inversely proportional to the slot count
- fix r and let x = <x0, x1, ..., xr>, xi < m, and a = <a0, a1, ..., ar>, 0< ai < m
- ha(x) = sum(i = 0 to r, aixi)
mod m
- H = union(a, ha), has mr + 1 elements
- open addressing
- get rid of chains, store all objects in the table itself
- chain storage goes to table storage
- probe on collisions
- add a probe sequence number to the hash function: h(k,
1), h(k, 2), ..., h(k, m - 1)
- inserting and searching is easy, deleting is hard
- uniform hashing - uniform distribution of permutations to keys
- linear probing
- h(k, i) = h(k) + i mod m
- only m index sequences are generated
- primary clustering - an empty slot proceeded by i full slots
will be filled with probability (i + 1)/m
- quadratic probing
- h(k, i) = (h(k)' + c1i + c2i2) mod m
- still only m sequences
- secondary clustering - spread the probe sequence out
- double hashing
- h(k, i) = (h1(k) + ih2(k)) mod
m
- h2(k) relatively prime to m
- around m2 sequences generated
- expected times
- unsuccessful search - 1/(1 - a) probes
- successful search -
- insertion - 1/(1 - a) probes
This page last modified on 1 March 2000.