See the assignment turn-in page (last modified on 9 February 2004) for instructions on turning in your assignment.
For example, the strings "kissing" and "singer" can be combined into "kissinger", saving four bytes. The two strings can also be combined into "singerkissing", but there's no compression if they're combined that way. The combined string can then be combined with other strings; for example, "kissinger" can be combined with "gerbil" to from "kissingerbil".
Overlapping common prefixes and suffixes to save space is known as ouroboros compression, after the image of the snake biting its tail. Ouroboros compression is an example of lossless compression, because uncompressing the result returns exactly the original input.
Uncompressing an ouroboros-compressed string requires further information to determine where each component string is in compressed string; such further information is known as the uncompression key. Although the uncompression key can take may forms, for this assignment it will be kept simple: it's a sequence of non-negative integer pairs s e, where the ith pair in the sequence indicates the ith string starts s characters from the start of the previous ((i - 1)th) string and extends e characters beyond the end of the previous string. Note that characters in a string are numbered starting from one for the left-most character in the string; the left-most character in a string is not zero, as it would be in C++. If the compressed string contains n component strings, the uncompression key will contain n pairs.
Because the first string doesn't have a previous string, it has to be treated as a special case in the uncompression key. The first string starts at character 0 (which is the special case) and extends the length of the first string (which is not the special case).
To continue the example, given the compressed string "kissingerbil", the uncompression key is 0 7 4 2 4 3:
s: | 0 | 4 | 4 | |||||||||
k | i | s | s | i | n | g | ||||||
s | i | n | g | e | r | |||||||
g | e | r | b | i | l | |||||||
e: | 7 | 2 | 3 |
The uncompression key for the compressed string "singerkissing" is 0 6 7 7; the seventh character of "singer" is the one after the last character.
Because the uncompression key is limited to non-negative integers, some
potential overlappings are not allowed. For example, given the strings
baaaccc
and aaa
, the shorter string overlaps the longer string, but
it extends -3 past the end of the longer string, and so cannot be expressed by
an uncompression key. The longer string also overlaps the shorter string, but
the longer string starts at character 0 of the shorter string and also can't be
represented by an uncompression key.
However, if the longer string was aaabccc
, then the shorter string still
doesn't overlap the longer one, but the longer one does overlap the shorter
one, starting at character 1 in the sorter string and extending 4 characters
past the shorter string.
The second part of output is the uncompression key, and it follows last combined-string line is output separated from it by an empty line, which is a line that contains no characters (the line-ending newline character isn't part of the line). Adjacent numbers in the compression key should be separated by at least one space character; space characters in excess of those required to separate adjacent numbers are ignored.
gen-strings
writes a number set of strings
to std-out. The command format is
./gen-strings
[-c
n]
where -c
n is an option that causes gen-strings
to output n
strings; the default is to generate a random number of strings.
Because gen-strings
randomly generates a new set of strings each time it's
run, you may find it more convenient to send the output of gen-strings
to
a file and then take input from the file:
$ ./gen-strings > in $ ./oc < in
Your implementation of ouroboros compression should not exploit any extra
properties of the input data created by gen-strings
; I'll be generating
different input data when I test your assignments.
The program ouroboros-uncompress
reads a
compressed string and an uncompression key from std-in and writes to std-out
the uncompressed string set. At the moment ouroboros-uncompress
only
checks for a sensible uncompression key and determines if the key can be used
to uncompress the strings; it does not check for maximal compression (I'll add
that as soon as I figure out how).
Both gen-strings
and ouroboros-uncompress
can be found in the
assignment directory
/export/home/class/cs-509/pa2
Probably the easiest way to check your code is to do something like
$ ./gen-strings | sort > in $ ./oc < in | ./ouroboros-uncompress | sort > out $ diff in out
If diff
doesn't complain, then everything should be ok (subject to the
warning above about not checking for maximal compression).
You can find my answer to the assignment in the same directory; see
os-os
, where os is either solaris
or linux
. This
solution provably produces maximal ouroboros compressions, but it is dog slow.
Remember, the objective of this assignment is to obtain maximal ouroboros
compressions; the objective is not to faithfully reproduce the behavior of my
solution. If my solution's wrong and you copy the error, you're going to lose
points.
This page last modified on 8 April 2004.