Example Solution for Programming Assignment 2, Advanced Programming I

Advanced Programming I, Spring 2006

Programming Assignment 2 - An Example Solution


Table of Contents

Main

Command-line Options

The only command-line option is -s, which determines whether or not the words are sorted before the character assignments are done (s for "slow"). The default is to do the sorting. This option is not part of the assignment; it was added to illustrate the effects of sorting.

Input

Because error handling is simple, just throw an execption if anything goes wrong during input and print the exception before exiting. Read a word from the given input stream; return the list of all words read in the order read and ending with the empty string. cnt is the number of words that have been read already. Throw a string exception if anything untoward happens.

This is an example of how you can use recursion to read an arbitrary number of words from input and assign them to an array. The array is created after all the words have been read; in the mean time, each word is stored in an activation record on the run-time stack.

Check the given word for validity; throw a string exception if the word is invalid. To be valid a word should consist only of alphabetic characters. Return the given word in lower case.

Searching for Character Values

Assign and check encodings for the characters associated with the word at the given index in the given word list. Return true iff the given word list contains no hash-value collisions under the given character encodings.

The Encoding

An instance of the encoding class represents a mapping from characters to values; the values have type encoding::code_type. An encoding instance behaves as an array indexed by either characters 'a' through 'z' or unsigned integers 0 through 25 via the overloaded operator []; The correspondence between char and unsigned indices is the obvious one: 'a' corresponds to 0, 'b' corresponds to 1, and so on.

Utility Procedures

The input words are treated as a set, which means there should be no duplicates. As each word is input, compare it against the list of words already input. If a duplicate is found on the list, the duplicate is removed from the list by sliding all the words (and the empty word marker) to the right of the duplidate left by one array slot, compressing out the duplicate.

remove_duplicates() assumes the word list contains no duplicates and returns immediately after a duplicate for word is removed.

Now we come to the whole point of the assignment: the sort used to speed-up the brute-force search for a non-colliding character encoding. Assuming that freq[i] contains the letter frequency for words[i], a simple selection sort will do to order words. freq[i] is the sum of the frequencies of the first and last letters of words[i]. freq[] is easiest computed using the encoding f to hold the character frequencies; that is, f[c] is the number of times the character c appears as a first or last letter in words.

Note that using an encoding to represent character frequencies is bad programming, because it makes an assumption that encoding values can be big. For the encoding implementation used here, "big" is 255, which means that words contains at least 128 words, 127 of which starts and ends with the same character. Finding character values in such a case is pointless, but supposed six months from now the encoding implementation is optimized to save space by using four bits for each character value, exploiting the maximum value limit of 15. Now the character-value generator is broken for a list of sixteen words all of which start with the same character, which is a word list the generator should be able to handle.

And that, except for some small housekeeping, is all that needs to be done to sort words in decreasing feqeuency order. The next sort is a bit tricker. The goal is to check a word for collisions as soon as its first and last letters have recevied values. The word list, being ordered by decreasing letter frequency, doesn't support that goal.

For example, suppose the word list starts with "egg" and "elf" and ends with "fog". The first three letters to get values are 'e', 'g', and 'f', at which point "fog" should be checked for collisions against "egg" and "elf". However, "fog" appears at the end of the list and won't be checked until last, which is not the goal.

To reach the goal, the word list has to be re-ordered so that words can be tested as soon as they have a hash value. The re-ordering can be done by a modified form of selection sort. One index, i, scans left to right through the word list, from high-frequency to low-frequency words. At each point of the scan, some set of letters - namely the first and last letters of the words in word-list[0..i] - have received non-colliding values.

References

Minimal Perfect Hash Functions Made Simple by Richard Cichelli, Communications of the ACM, January 1980, pages 17-19.

Index


This page last modified on 8 March 2006.