CS 503 - Advanced Programming I Programming Assignment 2

Programming Assignment 2 - Generating Hash Values

Advanced Programming I, Spring 2006

Due Date

This assignment is due by 5:30 p.m. on Tuesday, 21 February.

See the assignment turn-in page (last modified on 18 January 2006) for instructions on turning in your assignment.

Background

A hash function is a function that accepts a string and produces an unsigned integer computed from the contents of the string. This assignment is interested in the hash function

const unsigned char_values[26] = { /* ??? */ }

unsigned hash_function(const std::string & str) {

  assert(not str.empty());

  const char 
    first_char = str[0],
    last_char = str[str.size() - 1];

  assert(isalpha(fist_char) and isalpha(last_char));

  return str.size()
         + char_values[tolower(first_char)] 
	 + char_values[tolower(last_char)];
  }

Two strings are said to collide if they are different strings but produce the same hash value; that is

s1 ≠ s2

but

hash_function(s1) = hash_function(s2)

Because letter-case is ignored, "Hello" does not collide with "hello": they are the same string.

The Problem

The problem to be dealt with is: what are the values in char_values?

Write a program that reads a set of words from std-in and writes to std-out the values to be stored in char_values; fifteen is the largest value that can be assigned to a letter. The hash function defined by the output values should not collide for any pair of the input strings; that is, if

s1 ≠ s2

are strings given as input and hash_function is using the character values output, then

hash_function(s1) ≠ hash_function(s2)

In addition, your program should be efficient, which means it should take an amount of time that is within a factor of two (on the high side) of the amount of time taken by my program to find a set of values for the same input.

Given the definition of hash_function(), it may be impossible to generate non-colliding character values for some inputs. For example, the strings "boat" and "boot" will always collide, even though they are different, because the hash function views them as being identical. If your program can't find a suitable set of character values, it should print a brief, informative error message to std-err and exit with no further output.

Input

Input are words read from std-in until eof. Each word should contain only the letters a through z of either case; letter case is ignored. Words containing any non-letter characters are invalid; if your program reads an invalid word it should print a brief, informative error message to std-err and exit with no further processing.

Output

If your program can find a non-colliding assignment of values to characters, it should output them to std-out one per line starting with the value assigned to a and ending with the value assigned to z.

Brief, informative error messages should be sent to std-err, preceded by the characters "! " (an exclamation point followed by a space).

Testing

gen-words can be used to generate a set of words that can be used as input to your program; gen-words can be found in the public class directory /export/home/class/CS-503/a2. gen-words accepts an optional command-line argument which specifies the maximum number of words to generate. For example

$ ./gen-words 5

will generate sets that contain at most five words. Because gen-words generates a new set of words each time it's run, you may find it convenient to capture gen-words output in a file, which can then be repeatedly input into your program:

$ ./gen-words > words

$ ./gen-cvals < words

gen-cvals, my solution to the problem (compiled for Linux), can also be found in the same directory.

You can use the Unix time command to determine your program's running time:

$ time ./gen-cvals < words
5
1
[ blah blah blah ]
0
0
    4.29s real     4.29s user     0.01s system

$

Use the real (total) time for comparisons; in this case, the program took 4.3 seconds to find the letter encodings. See the time man page for more details.

check-hash in the same directory can be used to check your program's output. It reads letter codes from std-in and accepts the name of a file containing words on the command line. If it detects anything wrong, including a collision between the given words using the letter codes, check-hash prints an error message to std-err; if there are no problems, there is no output. An example use:

$ ./gen-words > words

$ ./gen-cvals < words | ./check-hash words

$

This page last modified on 15 February 2006.