CS 503 - Advanced Programming I Programming Assignment 3

Programming Assignment 3 - Concordances

Advanced Programming I, Spring 2007


Due Date

This assignment is due by 5:30 p.m. on Tuesday, 27 February.

See the assignment turn-in page (last modified on 14 January 2006) for instructions on turning in your assignment.

Background

A concordance determines which words appear on which pages in a text.

The Problem

Write a program that generates a concordance of a text and then answers queries about the text. The name of the file containing the text to be analyzed is given on the command line; the queries are read from std-in and the responses are written to std-out. There are two kinds of queries: list all the words common to a set of pages, and list all pages common to a set of words.

Here is an example session:

$ concord verne.txt
? the an in
100 200 300
? 100 200 300
afternoon an in the
? laser

?
The question mark is the query prompt. The first query asks for all pages containing all three of the words "the", "an", and "in"; there are three of them. The second query asks for all the words in common on pages 100, 200 and 300; there are four of them. The third query asks for a word that doesn't exist.

Input

There are two kinds of inputs: the text file-name given on the command line and the queries read from std-in.

The Text

Pages in input are delimited by lines of the form "PAGE n", where n is a positive integer; case is not significant (that is "PAGE" is the same as "page" is the same as "PaGe"). A page delimiter will be the only text on a line; if the line containing what looks like a page delimiter also contains other text, that line is not a page delimiter. The following is an example of a page delimiter:
combination of their different operations.

PAGE 5

In every other art and manufacture, the effects of the division of
while the following is not a page delimiter:
"Turn to PAGE 5?"
The page delimiters are not part of the text and should not be part of the concordance (that is, if the text itself doesn't contain the word "page", then a query for "page" should return no pages).

A page delimiter marks the start of a page; the text for a page follows the page delimiter. Any text before the first page (PAGE 1) should be ignored.

Words in text are maximal sequences of one or more letters in either case; for example the text "it's" consists of the two words "it" and "s". Words need not be stemmed or otherwise transformed.

Queries

There are two formats for queries, corresponding to the two types of queries. A sequence of one or more words, in the same format as described for the text, is a query for pages containing all the words. A sequence of one or more positive integers is a query the words common to all the pages. The following shows a query of each kind in the order described:
lucy scotland virago

10 20 30
Input read from std-in not in either of the described formats should produce a brief error message to std-out (not std-err).

Output

The prompt "? " (that is, a question mark followed by a space) should be output to std-out each time the program is ready for a query.

There are two forms of output corresponding to the two types of queries. The result of a page query should be a list of page numbers in increasing order. For example

? lucy scotland
23 128 240
The result of a word query should be a list of words in increasing order; each word listed should appear only once. For example
? 100 200 300
an in the
Error messages should be brief and informative, and written to std-out (not std-err) preceded by "! " (that is, an exclamation point followed by a space). For example
? it's
! Input garbled: "it's" not understood.
? 10 the
! input garbled: "the" not understood.

Testing

You can use the files smith.txt, trollope.txt, and darwin.txt as input texts for your program. The files can be found in the public class directory /export/home/class/cs-503/a3.


This page last modified on 11 March 2006.