Homework 3 - Heaps and Hashing

CS 512, Algorithm Design, Spring 2000


  1. A colleague of yours has invented the double-binary heap, a variant of a binary heap that can find both the minimum and maximum of a set of numbers in constant time. Your colleague's invention is based on the following observation: in a regular binary heap, the lower half of the heap is unordered because those elements are the smallest values in the heap. Therefore the elements can be re-arranged into a second binary heap without effecting the original binary heap. In the second binary heap, the parent's value is no larger than its children's value. Knowing the number of elements in the heap, I can find the root, and so the smallest element, of the second heap in constant time.

    Has your colleague invented a useful data structure or not? Explain your answer.


    Unfortunately no, your colleague has not invented a useful heap data structure. The problem is that the bottom level of a heap may be unordered, but it's not random: each element i in the bottom level must obey the heap property h[parent(i)] >= h[i]. Re-arranging the bottom half of the heap to form a second heap may violate the heap property in the top half.

    Take, for example, the max-heap 9, 5, 8, 3, 4, 7, 6. The last four elements in the max-heap can be re-arranged to form the min-heap 7, 6, 4, 3 (with the minimum heap top at the right this time). However, the combination of the two heaps 9, 5, 8, 7, 6, 4, 3 violates the max-heap property because h[parent(4)] = h(2) = 5 >= h[4] = 7 is false.

    This is not to say that it wouldn't be possible to form a double-binary heap. All that needs to be done is to store the heap in descending order; for example, 9, 8, 7, 6, 5, 4, 3 is a double-binary heap. However, maintaining a sorted list in the presence of insertions and (perhaps) deletions requires O(n) work (although you can find the insertion point with O(log n) work, you may have to move O(n) elements to do the insertions), so the O(log n) heap behavior is lost.


  2. True or false: Every tree T of a binomial heap has one node at level h, where h is the height of T. Justify your answer.


    Every tree of a binomial heap is a binomial tree, so we need to answer the question for binomial trees.

    The binomial tree B0 has one node at level 0. Let's assume the binomial tree Bi has one node at level i and consider the binomial tree Bi + 1. Bi + 1 comprises the two binomial trees Bi, one of which is the left-most child of the other. By our assumption, the leftmost-child binomial tree has a single node at level i, which becomes a node at level i + 1 in Bi + 1. Ignoring its left-most child, the parent binomial tree has height i and so has no nodes at level i + 1. From this Bi + 1 has one node at level i + 1.

    The binomial tree Bi has one node at level i and so does every tree in a binomial heap.


  3. A colleague of yours has implemented a bunch of hash tables using closed addressing; each hash table uses four bytes for pointers and 12 bytes for each linked-list element (four for the pointer and eight for the key). Disenchanted by the complexity of maintaining closed-address hash tables, your colleague would like to convert the hash tables to use open addressing. However, a hash table should be converted only if the following are true about the converted hash table: (1) its loading factor is no larger than the loading factor of the unconverted hash table and (2) it uses no more space than did the unconverted hash table. Your colleague has come to you to for help in determining when it is a hash table should be converted. Explain the technique you've created to determine when a hash table should be converted. You may assume uniform hashing.


    The load factor on the close-addressed table Tc is

    ac = nc/mc
    where nc is the number of elements stored in Tc and mc is the number of slots in Tc. The amount of space used by Tc is
    sc = 12nc + 4mc

    Using similar notation, the load factor on the open-addressed table To is

    ao = no/mo
    Because the To needs to hold the elements of Tc, no = nc. The amount of space used by To is
    so = 8mo

    The trick to converting from closed- to open-addressed hash tables is the load factor: ac can be greater than 1, while ao can be at most one. If the load factor on the converted table can't be made at most 1, then the close-addressed table can't be converted. Because both hash tables must store the same number of entries, the smallest ao occurs with the largest number of slots possible. This implies that

    so = sc
    or
    8mo = 12nc + 4mc
    or
    mo = (3nc + mc)/2
    With that slot count, the load factor for To is
    nc/(3nc + mc)/2 <= 1
    or
    2nc <= 3nc + mc
    or
    -mc <= nc
    This is always going to be true, which means that whenever Tc's load factor is at least 1, it should be converted to an open-addressed table having a load factor of at most 1. In retrospect, this is obvious because it takes at least 12 bytes to store an entry in in Tc, while it only takes 8 bytes to store it in To. Using Tc's space for To will always provide enough room to store all Tc's entries in To. In fact, assuming a thinner margin of safety was acceptable, your colleague need take only 12(ceiling(nc/3)) bytes from Tc for To because every two entries in Tc uses enough extra space (the space for the pointers) to store another entry.

    What happens when ac is less than 1? The load factor on the converted table must be at most the load factor on the original table, or

    ao <= ac
    or
    nc/mo <= nc/mc
    or
    2/(mc + 3nc) <= 1/mc
    or
    2mc <= mc + 3nc
    or
    mc <= 3nc
    or
    1/3 <= ac
    That is, when ac is less than 1, Tc should be converted only when there are three times as many elements as there are slots. In retrospect, this seems, if perhaps not obvious, at least reasonable. Converting four-byte mc slots into eight-byte mo slots reduces the number of slots by half. Because the number of entries is the same in each table, the load factor for the converted open-addressed table is higher than the load factor for the original, closed-address table. Reducing ao requires more slots, which come from the space used to store the entries in Tc; to reduce ao enough, there have to be enough entries in Tc.

    To summarize, your colleague should convert any closed-address table with a load factor of at least 1/3, using all storage in the closed-address table for the open-address table. In addition, if the closed-address load factor is at least 1, the space requirements for the open-address table can be reduced to 12(ceiling(nc/3)) bytes, assuming your colleague is willing to run with a smaller margin of error.


  4. Devise a hash function hash() that accepts as input a month of the year (January, February, and so on) as a string and returns an 8-bit unsigned value. Your hash function should be collision-free, that is, if hash(m1) = hash(m2), then m1 = m2, and should run in constant time.


    Accessing string characters can be done in constant time. There are a constant number of character that can be used to distinguish among the month names. These two facts can combine to serve as the basis of a hash function.

    Each month contains at least three characters, so it makes sense to try and distinguish among the months using the first three characters. The first character alone can't distinguish among the months because, for example, three months start with 'J'. The second character isn't any good either because, for example, three months have 'u' as their second character; similarly, two months have 'r' or 'n' as their third character.

    Considering two of the three initial characters may work. The first and second characters don't distinguish between "June" and "July". The first and third characters don't distinguish between "June" and "January". Fortunately, the second and third characters distinguish between all twelve months:

    jan
    mar
    may
    oct
    feb
    dec
    sep
    nov
    apr
    aug
    jul
    jun

    The hash function could be derived from the ASCII value of the characters themselves, but the characters represents a range of 25 values from 0 ('a' - 'a') to 24 ('y' - 'a'), which requires 5 bits, resulting in a 10-bit has value, which is two bits larger than it should be. To get an 8-bit hash value, the characters will have to be mapped to something other than their ASCII values.

    There are 6 unique second characters, which takes three bits to represent (using binary to represent the character values):

    a000
    c001
    e010
    o011
    p100
    u101
    There are ten third characters, so four bits are needed to give them all values (the characters 'c' and 'p' are also second characters; their values reflect the three-bit values assigned above):
    b0000
    c 001
    g0010
    l0011
    n0101
    p 100
    r0110
    t0111
    v1000
    y1001

    A hash function using these character mappings

    //            a b c d e f g h i j k l m n o p q r s t u v w x y z
    int c2i[] = { 0 0 1 0 2 0 2 0 0 0 0 3 0 5 3 4 0 6 0 7 5 8 0 0 9 0 }
    
    int hash(char m[])
      return c2i[m[1] - 'a']*4 + c2i[m[2] - 'a']
    
    produces the month mappings
    jan 5
    mar 6
    may 9
    oct 23
    feb 32
    dec 33
    sep 36
    nov 56
    apr 70
    aug 82
    jul 83
    jun 85

    The hash table can be shortened up a bit by noting that it isn't necessary to distinguish all the third characters from each other; only the third characters with the same second character need to be distinguished. Because at most three third characters have the same second character, the third characters can be mapped into two bits (the values of 'o' and 'p' have been swapped to give 'p' a value with two significant bits when used as a third character; the value of 'c' already has at most two significant bits):

    a000
    c001
    e010
    o100
    p011
    u101
    b 00
    c001
    g 10
    l 01
    n 00
    p011
    r 01
    t 00
    v 00
    y 10
    A hash function using this character mapping
    //            a b c d e f g h i j k l m n o p q r s t u v w x y z
    int c2i[] = { 0 0 1 0 2 0 2 0 0 0 0 1 0 0 4 3 0 1 0 0 5 0 0 0 2 0 }
    
    int hash(char m[])
      return c2i[m[1] - 'a']*4 + c2i[m[2] - 'a']
    
    produces the month mapping
    jan 0
    mar 1
    may 2
    oct 4
    feb 8
    dec 9
    sep 11
    apr 13
    nov 16
    jun 20
    jul 21
    aug 22

    Hash functions that have no collisions are known as perfect hash functions.



This page last modified on 12 March 2000.