CS 512 Homework 3

Homework 3 - Heaps and Hashing

CS 512, Algorithm Design, Spring 2000

A colleague of yours has invented the double-binary heap, a variant of a binary heap that can find both the minimum and maximum of a set of numbers in constant time. Your colleague's invention is based on the following observation: in a regular binary heap, the lower half of the heap is unordered because those elements are the smallest values in the heap. Therefore the elements can be re-arranged into a second binary heap without affecting the original binary heap. In the second binary heap, the parent's value is no larger than its children's values. Knowing the number of elements in the heap, I can find the root, and so the smallest element, of the second heap in constant time.
Has your colleague invented a useful data structure or not? Explain your answer.
Unfortunately no, your colleague has not invented a useful heap data structure. The problem is that the bottom level of a heap may be unordered, but it's not random: each element i in the bottom level must obey the heap property h[parent(i)] >= h[i]. Re-arranging the bottom half of the heap to form a second heap may violate the heap property in the top half.
Take, for example, the max-heap 9, 5, 8, 3, 4, 7, 6. The last four elements in the max-heap can be re-arranged to form the min-heap 7, 6, 4, 3 (with the min-heap's root at the right this time). However, the combination of the two heaps 9, 5, 8, 7, 6, 4, 3 violates the max-heap property because h[parent(4)] = h[2] = 5 >= h[4] = 7 is false.
This is not to say that it wouldn't be possible to form a double-binary heap. All that needs to be done is to store the heap in descending order; for example, 9, 8, 7, 6, 5, 4, 3 is a double-binary heap. However, maintaining a sorted list in the presence of insertions and (perhaps) deletions requires O(n) work (although you can find the insertion point with O(log n) work, you may have to move O(n) elements to do the insertions), so the O(log n) heap behavior is lost.
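The counterexample and the sorted arrangement can be checked mechanically. The following is a minimal sketch in C, assuming 1-based arrays (index 0 unused) to match the h[parent(i)] notation above; the names is_max_heap and tail_is_min_heap are illustrative and not part of the assignment.

```c
#include <stddef.h>

/* Returns 1 if h[1..n] (1-based, h[0] unused) satisfies the max-heap
   property h[parent(i)] >= h[i] for every i > 1, and 0 otherwise. */
int is_max_heap(const int h[], size_t n) {
    for (size_t i = 2; i <= n; i++) {
        if (h[i / 2] < h[i]) {
            return 0;
        }
    }
    return 1;
}

/* Returns 1 if the last k elements of h[1..n], read from right to left
   (so the root sits at h[n]), satisfy the min-heap property.
   Element j of the reversed view is h[n + 1 - j]. Assumes k <= n. */
int tail_is_min_heap(const int h[], size_t n, size_t k) {
    for (size_t j = 2; j <= k; j++) {
        if (h[n + 1 - j / 2] > h[n + 1 - j]) {
            return 0;
        }
    }
    return 1;
}
```

Under these assumptions, with n = 7 and k = 4, the original heap 9, 5, 8, 3, 4, 7, 6 passes is_max_heap but fails tail_is_min_heap; the rearranged 9, 5, 8, 7, 6, 4, 3 passes tail_is_min_heap but fails is_max_heap; and the sorted 9, 8, 7, 6, 5, 4, 3 passes both.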
True or false: Every tree T of a binomial heap has one node at level h, where h is the height of T. Justify your answer.
True. Every tree of a binomial heap is a binomial tree, so it suffices to answer the question for binomial trees.
The proof is by induction on i. The binomial tree B₀ has height 0 and one node at level 0. Assume the binomial tree B_i has height i and one node at level i, and consider the binomial tree B_{i + 1}. B_{i + 1} comprises two copies of B_i, one of which is the left-most child of the other. By assumption, the left-most-child binomial tree has a single node at level i, which becomes a node at level i + 1 in B_{i + 1}. Ignoring its left-most child, the parent binomial tree has height i and so has no nodes at level i + 1. It follows that B_{i + 1} has height i + 1 and exactly one node at level i + 1.
The binomial tree B_i has one node at level i and so does every tree in a binomial heap.
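The inductive step can also be run as a computation. The sketch below, in C, counts the nodes at each level of B_k by repeating the construction "attach one B_i as the left-most child of another B_i"; the function name binomial_level_counts is hypothetical, and k is assumed small enough that the counts fit in an unsigned long.

```c
/* Fills count[0..k] with the number of nodes at each level of the
   binomial tree B_k. The caller must supply room for k + 1 entries.
   Level d of B_{i+1} holds the level-d nodes of the parent copy of B_i
   plus the level-(d-1) nodes of the child copy of B_i. */
void binomial_level_counts(unsigned k, unsigned long count[]) {
    count[0] = 1;                     /* B_0: a single node at level 0 */
    for (unsigned i = 0; i < k; i++) {
        count[i + 1] = 0;             /* the parent copy has no level i+1 */
        /* Walk downward so count[d - 1] is still the B_i value. */
        for (unsigned d = i + 1; d >= 1; d--) {
            count[d] += count[d - 1];
        }
    }
}
```

For k = 4 this fills count with 1, 4, 6, 4, 1, so the deepest level holds a single node, as the induction argues.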
A colleague of yours has implemented a bunch of hash tables using closed addressing; each hash table uses four bytes for pointers and 12 bytes for each linked-list element (four for the pointer and eight for the key). Disenchanted by the complexity of maintaining closed-address hash tables, your colleague would like to convert the hash tables to use open addressing. However, a hash table should be converted only if the following are true about the converted hash table: (1) its load factor is no larger than the load factor of the unconverted hash table and (2) it uses no more space than did the unconverted hash table. Your colleague has come to you for help in determining when a hash table should be converted. Explain the technique you've created to determine when a hash table should be converted. You may assume uniform hashing.
The load factor on the closed-address table T_c is
a_c = n_c/m_c
where n_c is the number of elements stored in T_c and m_c is the number of slots in T_c. The amount of space used by T_c is
s_c = 12n_c + 4m_c

Using similar notation, the load factor on the open-addressed table T_o is
a_o = n_o/m_o
Because T_o needs to hold the elements of T_c, n_o = n_c. Each slot of T_o holds an eight-byte key, so the amount of space used by T_o is
s_o = 8m_o

The trick to converting from closed- to open-addressed hash tables is the load factor: a_c can be greater than 1, while a_o can be at most 1. If the load factor on the converted table can't be made at most 1, then the closed-address table can't be converted. Because both hash tables must store the same number of entries, the smallest a_o occurs with the largest number of slots possible, that is, when T_o uses all the space that T_c uses. This implies that
s_o = s_c
or
8m_o = 12n_c + 4m_c
or
m_o = (3n_c + m_c)/2
With that slot count, the load factor for T_o is at most 1 when
n_c/((3n_c + m_c)/2) <= 1
or
2n_c <= 3n_c + m_c
or
-m_c <= n_c
This is always true, so a load factor of at most 1 can always be reached within T_c's space. Whenever T_c's load factor is at least 1, then, a_o <= 1 <= a_c and the table should be converted to an open-addressed table having a load factor of at most 1. In retrospect, this is obvious because it takes at least 12 bytes to store an entry in T_c, while it takes only 8 bytes to store it in T_o. Using T_c's space for T_o will always provide enough room to store all of T_c's entries in T_o. In fact, assuming a thinner margin of safety is acceptable, your colleague need take only 12(ceiling(2n_c/3)) bytes from T_c for T_o (at least the 8n_c bytes that n_c open-addressed slots require), because every two entries in T_c use enough extra space (the space for the pointers) to store another entry.
What happens when a_c is less than 1? The load factor on the converted table must be at most the load factor on the original table, or
a_o <= a_c
or
n_c/m_o <= n_c/m_c
or
2/(m_c + 3n_c) <= 1/m_c
or
2m_c <= m_c + 3n_c
or
m_c <= 3n_c
or
1/3 <= a_c
That is, when a_c is less than 1, T_c should be converted only when it holds at least one-third as many elements as it has slots. In retrospect, this seems, if perhaps not obvious, at least reasonable. Converting four-byte m_c slots into eight-byte m_o slots reduces the number of slots by half. Because the number of entries is the same in each table, the load factor for the converted open-addressed table is then higher than the load factor for the original, closed-address table. Reducing a_o requires more slots, which come from the space used to store the entries in T_c; to reduce a_o enough, there have to be enough entries in T_c.
To summarize, your colleague should convert any closed-address table with a load factor of at least 1/3, using all storage in the closed-address table for the open-address table. In addition, if the closed-address load factor is at least 1, the space requirements for the open-address table can be reduced to 12(ceiling(2n_c/3)) bytes, which is just enough for n_c eight-byte slots, assuming your colleague is willing to run with a smaller margin of error.
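The decision rule can be sketched as a small C function. This is only an illustration under the space model above (4 bytes per closed-address slot, 12 bytes per list element, 8 bytes per open-address slot); the function name open_slots_if_convertible and the convention of returning 0 for "do not convert" are assumptions made for the example.

```c
#include <stddef.h>

enum {
    SLOT_PTR_BYTES = 4,    /* one pointer per closed-address slot */
    LIST_ELEM_BYTES = 12,  /* 4-byte pointer + 8-byte key         */
    OPEN_SLOT_BYTES = 8    /* one 8-byte key per open slot        */
};

/* Given n_c entries in m_c closed-address slots, returns the number of
   open-address slots m_o to allocate, or 0 if the table should not be
   converted. m_o is the largest slot count that fits in the space s_c
   the closed-address table already uses. */
size_t open_slots_if_convertible(size_t n_c, size_t m_c) {
    size_t s_c = LIST_ELEM_BYTES * n_c + SLOT_PTR_BYTES * m_c;
    size_t m_o = s_c / OPEN_SLOT_BYTES;   /* floor((3n_c + m_c)/2) */

    /* a_o <= a_c  <=>  n_c/m_o <= n_c/m_c  <=>  m_o >= m_c,
       which holds exactly when m_c <= 3n_c, i.e. a_c >= 1/3.
       m_o >= n_c (so a_o <= 1) holds automatically. */
    if (n_c == 0 || m_o < m_c) {
        return 0;
    }
    return m_o;
}
```

For instance, 1 entry in 3 slots (a_c = 1/3) yields 3 open slots, while 1 entry in 4 slots (a_c = 1/4) yields 0, matching the 1/3 threshold derived above.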
Devise a hash function hash() that accepts as input a month of the year (January, February, and so on) as a string and returns an 8-bit unsigned value. Your hash function should be collision-free, that is, if hash(m₁) = hash(m₂), then m₁ = m₂, and should run in constant time.
Accessing string characters can be done in constant time. A constant number of characters suffices to distinguish among the month names. These two facts combine to serve as the basis of a hash function.
Each month contains at least three characters, so it makes sense to try and distinguish among the months using the first three characters. The first character alone can't distinguish among the months because, for example, three months start with 'J'. The second character isn't any good either because, for example, three months have 'u' as their second character; similarly, two months have 'r' or 'n' as their third character.
Considering two of the three initial characters may work. The first and second characters don't distinguish between "June" and "July". The first and third characters don't distinguish between "June" and "January". Fortunately, the second and third characters distinguish between all twelve months:

month  2nd  3rd
jan    a    n
mar    a    r
may    a    y
oct    c    t
feb    e    b
dec    e    c
sep    e    p
nov    o    v
apr    p    r
aug    u    g
jul    u    l
jun    u    n
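The claims about which character positions distinguish the months can be checked by brute force. The following C sketch compares every pair of month names at two chosen positions; the table MONTHS and the function positions_distinguish are names introduced for this example.

```c
#include <stddef.h>

static const char *const MONTHS[12] = {
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December"
};

/* Returns 1 if the characters at 0-based positions p and q, taken as a
   pair, differ for every two distinct months, and 0 otherwise.
   Assumes p and q are at most 2, since "May" has only three characters. */
int positions_distinguish(size_t p, size_t q) {
    for (size_t i = 0; i < 12; i++) {
        for (size_t j = i + 1; j < 12; j++) {
            if (MONTHS[i][p] == MONTHS[j][p] &&
                MONTHS[i][q] == MONTHS[j][q]) {
                return 0;   /* months i and j collide on this pair */
            }
        }
    }
    return 1;
}
```

With these names, positions_distinguish(0, 1) and positions_distinguish(0, 2) return 0 (June collides with July and with January, respectively), while positions_distinguish(1, 2) returns 1.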

The hash function could be derived from the ASCII values of the characters themselves, but the characters span a range of 25 values from 0 ('a' - 'a') to 24 ('y' - 'a'), which requires 5 bits per character, resulting in a 10-bit hash value, which is two bits larger than it should be. To get an 8-bit hash value, the characters will have to be mapped to something other than their ASCII values.
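To see the overflow concretely, here is a small C sketch of the direct approach, packing each character's offset from 'a' into five bits. The name naive_hash is hypothetical, and the function assumes its argument has at least three characters with lowercase ASCII letters in the second and third positions.

```c
/* Packs the second and third characters into 5 bits each:
   the second character in bits 5..9, the third in bits 0..4. */
unsigned naive_hash(const char m[]) {
    return (unsigned)(m[1] - 'a') * 32u + (unsigned)(m[2] - 'a');
}
```

Under these assumptions naive_hash("June") is 20*32 + 13 = 653, which does not fit in the 0 to 255 range of an 8-bit unsigned value.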
There are 6 unique second characters, which takes three bits to represent (using binary to represent the character values):

a  000
c  001
e  010
o  011
p  100
u  101

There are ten third characters, so four bits are needed to give them all values (the characters 'c' and 'p' are also second characters; their values reflect the three-bit values assigned above):

b  0000
c  001
g  0010
l  0011
n  0101
p  100
r  0110
t  0111
v  1000
y  1001

A hash function using these character mappings
```
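//            a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
int c2i[] = { 0, 0, 1, 0, 2, 0, 2, 0, 0, 0, 0, 3, 0, 5, 3, 4, 0, 6, 0, 7, 5, 8, 0, 0, 9, 0 };

// The third character needs four bits, so the second character's value
// is shifted left by four bits (multiplied by 16).
unsigned char hash(const char m[]) {
  return c2i[m[1] - 'a']*16 + c2i[m[2] - 'a'];
}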
```
produces the month mappings

jan  5
mar  6
may  9
oct  23
feb  32
dec  33
sep  36
nov  56
apr  70
aug  82
jul  83
jun  85

The hash values can be made smaller by noting that it isn't necessary to distinguish all the third characters from each other; only the third characters that follow the same second character need to be distinguished. Because at most three third characters share a second character, the third characters can be mapped into two bits (the values of 'o' and 'p' have been swapped to give 'p' a value with two significant bits when used as a third character; the value of 'c' already has at most two significant bits):

Second characters:
a  000
c  001
e  010
o  100
p  011
u  101

Third characters:
b  00
c  001
g  10
l  01
n  00
p  011
r  01
t  00
v  00
y  10

A hash function using this character mapping
```
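//            a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
int c2i[] = { 0, 0, 1, 0, 2, 0, 2, 0, 0, 0, 0, 1, 0, 0, 4, 3, 0, 1, 0, 0, 5, 0, 0, 0, 2, 0 };

// The third character now needs only two bits, so the second
// character's value is shifted left by two bits (multiplied by 4).
unsigned char hash(const char m[]) {
  return c2i[m[1] - 'a']*4 + c2i[m[2] - 'a'];
}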
```
produces the month mapping

jan  0
mar  1
may  2
oct  4
feb  8
dec  9
sep  11
apr  13
nov  16
jun  20
jul  21
aug  22

Hash functions that have no collisions are known as perfect hash functions.
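Whether a candidate function really is perfect over the twelve month names can be confirmed by hashing each name and watching for repeats. The sketch below is self-contained C that copies the two-bit character mapping from above; the names C2I, MONTHS, month_hash, and month_hash_is_perfect are chosen for this example only.

```c
#include <stddef.h>

/*                                     a  b  c  d  e  f  g  h  i  j  k  l  m */
static const unsigned char C2I[26] = { 0, 0, 1, 0, 2, 0, 2, 0, 0, 0, 0, 1, 0,
/*                                     n  o  p  q  r  s  t  u  v  w  x  y  z */
                                       0, 4, 3, 0, 1, 0, 0, 5, 0, 0, 0, 2, 0 };

static const char *const MONTHS[12] = {
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December"
};

/* Second character in bits 2..4, third character in bits 0..1.
   Assumes m is one of the month names, so m[1] and m[2] are lowercase. */
unsigned char month_hash(const char m[]) {
    return (unsigned char)(C2I[m[1] - 'a'] * 4 + C2I[m[2] - 'a']);
}

/* Returns 1 if no two month names share a hash value, 0 otherwise. */
int month_hash_is_perfect(void) {
    int seen[256] = { 0 };
    for (size_t i = 0; i < 12; i++) {
        unsigned char h = month_hash(MONTHS[i]);
        if (seen[h]) {
            return 0;
        }
        seen[h] = 1;
    }
    return 1;
}
```

The check runs over a fixed set of twelve names, so it takes constant time, and the same loop could be pointed at any other candidate mapping.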

This page last modified on 12 March 2000.
