Lecture notes for CS 512, Algorithm Design

Lecture Notes for CS 512, Algorithm Design

Hashing - 23 February 2000

a dictionary with constant average-time searching behavior
direct addressing
1. use the object itself as an index into an array
2. worst case constant access
3. fast but big
4. objects and pointers
hash tables
1. reduce size requirements by mapping the large universe into a smaller universe
2. the hash function maps from the large key set to the small key set
3. collisions: two or more large keys mapping to the same smaller key
4. minimize collisions by being uniformly random
5. handle collisions when they arise
chaining
1. the hash table T contains pointers to linked lists
2. item x is stored in the linked list pointed to by T[h(x.key)]
3. insertion is constant, searching and deleting is proportional to list length
4. analysis
  1. in the worst case the hash function h() maps all keys to the same value
    1. inserting and deleting are Theta(n)
  2. let h() be simple uniform - every key is map to all candidates with uniform probability
  3. ignore hash computations and hash table access
  4. any key can end up on any linked list
  5. with k keys being stored in s slots, expect k/s items per slot
    1. the load factor a = k/s
  6. unsuccessful searches are Theta(1 + a) - the list search time plus the hash computation time
  7. successful searches are Theta(1 + a) - the average time for all possible successful searches
hash functions
1. two tasks: convert the object to a number, map the number
2. converting to a number
3. mapping numbers
  1. good hash functions distribute keys evenly over the slots
  2. division - h(k) = k mod m for some m
    1. m shouldn't be a power of 2, powers of 10, 2^k - 1
    2. m should be prime and away from powers of two
  3. multiplication - h(k) = floor(m(kA mod 1))
    1. m a power of two
    2. A = (root(5) - 1)/2 = 0.618033
    3. the leftmost p bits of the fractional part
  4. universal - random selection from sets of hash functions
    1. reducing worst-case behavior to an accident
    2. a universal set of hash functions - the probability of collision between two objects is inversely proportional to the slot count
    3. fix r and let x = <x₀, x₁, ..., x_r>, x_i < m, and a = <a₀, a₁, ..., a_r>, 0< a_i < m
    4. h_a(x) = sum(i = 0 to r, a_ix_i) mod m
    5. H = union(a, h_a), has m^{r + 1} elements
open addressing
1. get rid of chains, store all objects in the table itself
  1. chain storage goes to table storage
2. probe on collisions
  1. add a probe sequence number to the hash function: h(k, 1), h(k, 2), ..., h(k, m - 1)
  2. inserting and searching is easy, deleting is hard
  3. uniform hashing - uniform distribution of permutations to keys
  4. linear probing
    1. h(k, i) = h(k) + i mod m
    2. only m index sequences are generated
    3. primary clustering - an empty slot proceeded by i full slots will be filled with probability (i + 1)/m
  5. quadratic probing
    1. h(k, i) = (h(k)' + c₁i + c₂i²) mod m
    2. still only m sequences
    3. secondary clustering - spread the probe sequence out
  6. double hashing
    1. h(k, i) = (h₁(k) + ih₂(k)) mod m
    2. h₂(k) relatively prime to m
    3. around m² sequences generated
3. expected times
  1. unsuccessful search - 1/(1 - a) probes
  2. successful search -
  3. insertion - 1/(1 - a) probes

This page last modified on 1 March 2000.