Programming Assignment 3 - End-to-End Compression

Computer Algorithms II, Fall 2008

Due Date

This assignment is due by 5:30 p.m. on Tuesday, 18 November.

See the assignment turn-in page (last modified on 16 October 2007) for instructions on turning in your assignment.

Background

If the tail of one string (its suffix) matches the head of another string (its prefix), then it’s possible to save some space by overlapping the two strings so that they share their common part.

For example, the strings “kissing” and “singer” can be combined into “kissinger”, saving four bytes. The two strings can also be combined into “singerkissing”, but there’s no compression if they’re combined that way. The combined string can then be combined with other strings; for example, “kissinger” can be combined with “gerbil” to from “kissingerbil”.

Overlapping common prefixes and suffixes to save space is known as end-to-end compression. End-To-End compression is an example of lossless compression, because uncompressing the result returns exactly the original input.

Uncompressing an end-to-end-compressed string requires further information to determine where each component string is in compressed string; such further information is known as the uncompression key. Although the uncompression key can take may forms, for this assignment it will be kept simple: it’s a sequence of non-negative integer pairs s e, where the ith pair in the sequence indicates the ith string starts s characters from the start of the previous (i - 1)th string and extends e characters beyond the end of the previous string. Note that characters in a string are numbered starting from one for the left-most character in the string; the left-most character in a string is not zero, as it would be in C++. If the compressed string contains n component strings, the uncompression key will contain n pairs.

Because the first string doesn’t have a previous string, it has to be treated as a special case in the uncompression key. The first string starts at character 0 (which is the special case) and extends the length of the first string (which is not the special case).

To continue the example, given the compressed string “kissingerbil”, the uncompression key is 0 7 4 2 4 3:

s: 0 4 4

k i s s i n g

s i n g e r

g e r b i l

e: 7 2 3

The uncompression key for the compressed string “singerkissing” is 0 6 7 7; the seventh character of “singer” is the one after the last character.

Because the uncompression key is limited to non-negative integers, some potential overlappings are not allowed. For example, given the strings baaaccc and aaa, the shorter string overlaps the longer string, but it extends -3 past the end of the longer string, and so cannot be expressed by an uncompression key. The longer string also overlaps the shorter string, but the longer string starts at character 0 of the shorter string and also can’t be represented by an uncompression key.

However, if the longer string was aaabccc, then the shorter string still doesn’t overlap the longer one, but the longer one does overlap the shorter one, starting at character 1 in the sorter string and extending 4 characters past the shorter string.

The Problem

Write a tractable program that accepts a set of strings and applies end-to-end compression to them. Your program should achieve the maximum possible compression.

Input

Input will be a sequence of strings from std-in; each string (except possibly the last) ends in a newline, which is not part of the string. The last string input need not end with a newline; it is not an error if the last string does not end in a newline. Any strings that consist only of space characters or are empty should be ignored; an empty string is a string with no characters.

Output

Output goes to std-out and consists of two parts. The first part is the string resulting from end-to-end compression. The string should be output in lines of 70 characters each (except for the last string output, which may contain less than 70 characters). Each line output ends in a newline.

The second part of output is the uncompression key, and it follows last combined-string line is output separated from it by an empty line, which is a line that contains no characters (the line-ending newline character isn’t part of the line). Adjacent numbers in the compression key should be separated by at least one space character; space characters in excess of those required to separate adjacent numbers are ignored.

Testing

The program gen-strings writes a set of strings to std-out. The command format is

./gen-strings [-cn]

where -cn is an option that causes gen-strings to output n strings; the default is to generate a random number of strings.

Because gen-strings randomly generates a new set of strings each time it’s run, you may find it more convenient to send the output of gen-strings to a file and then take input from the file:

$ ./gen-strings > in

$ ./e2e-compress < in

Your implementation of end-to-end compression should not exploit any extra properties of the input data created by gen-strings; I’ll be generating different input data when I test your assignments.

The program e2e-uncompress reads a compressed string and an uncompression key from std-in and writes to std-out the uncompressed string set. At the moment e2e-uncompress only checks for a sensible uncompression key and determines if the key can be used to uncompress the strings; it does not check for maximal compression (I’ll add that as soon as I figure out how).

Both gen-strings and e2e-uncompress can be found in the assignment directory /export/home/class/cs-306/a3 .

Probably the easiest way to check your code is to do something like

$ ./gen-strings | sort > in

$ ./e2e-compress < in | ./e2e-uncompress | sort > out

$ diff in out

If diff doesn’t complain, then everything should be ok (subject to the warning above about not checking for maximal compression). You can find my answer to the assignment in the same directory; see os-os, where os is either solaris or linux. This solution provably produces maximal end-to-end compressions, but it is dog slow. Remember, the objective of this assignment is to obtain maximal end-to-end compressions; the objective is not to faithfully reproduce the behavior of my solution. If my solution’s wrong and you copy the error, you’re going to lose points.

This page last modified on 20 October 2008.