Programming Assignment 5 - Concept Maps

Advanced Programming I, Spring 2006

Due Date

This assignment is due by 5:30 p.m. on Thursday, 6 April.

See the assignment turn-in page (last modified on 18 January 2006) for instructions on turning in your assignment.

A concept map is a graph in which nodes represent ideas or concepts and edges between nodes represent relations between concepts. In this assignment the concepts are going to be words and two words are going to be related if they are n-neighbors. If two words are n-neighbors, they will have an edge between them in the concept graph.

The following steps build a concept graph G from a seed word w:

Add w as a vertex to G; there are no edges.
Find the set N(G), the set of all words that are n-neighbors with any word in G.
Find the set N(G) = N(G) - G', the set of all words that are n-neighbors of any word in G but are not in G.
If N(G)' is empty, stop.
Find the word w' in N(G)' that has the largest number of links to words in G; w' may not be unique, in which case any one of the words will do.
Add w' to G; only the links from w' to words in G should be added to G.
Go to step 2.

The Problem

Write a program that automatically generates a concept map from text. The name of the file containing the text is given on the command line; the specifications for the concept graphs to construct are are read from std-in; the program displays the concept graphs by calling the show-graphs() procedure described below.

The program should accept one or two command-line arguments. If there is one command-line argument, it should be taken as the name of the file containing the text to be analyzed. If there are two command-line arguments, the first should be taken as the name of a file containing stop words and the second should be taken as the name of the file containing the text to be analyzed.

For example, in the command

$ cgraph wealth.txt

wealth.txt is the name of the file containing the text to be analyzed. In the command line

$ cgraph ignore origins.txt

ignore is the name of the file containing stop words and origins.txt is the name of the file containing the text to be analyzed.

The concept graph is weighed; the edge between two nodes has a weight equal to the non-negative integer n, where n is the size of the largest n-neighborhood containing both words associated with the two nodes. For example, if "Frank" and "Lucy" appeared as 3-neighbors and 2-neighbors, then the edge between the nodes associated with "Frank" and "Lucy" would have a weight of 3, the size of the largest n-neighborhood containing both words.

The nodes of the concept graph are also weighted. A node's weight is the average weight of all its edges. If a node has n edges, then its weight would be the floating-point number s/n where s is the sum of the weights on all the edges associated with the node. For example, if a node has three edges of weights 3, 3, and 1, then the node's weight is (3 + 3 + 1)/3 = 2.3.

Node and edge weights give a measure of importance to words. High edge weights indicate the associated words frequently appear together; high node weights indicate words that are heavily connected to other words. The seed node used to start the concept graph any of the nodes having the highest weighting. Similarly, the next node added to the graph is any of the candidate nodes having the highest weighting.

Input

There are three kinds of inputs: the text to be analyzed, the concept-graph specifications read from std-in, and stop words to use.

The Text

The format of the input text is identical to that of Assignment 3.

Concept-Graph Specifications

A concept-graph specification has the form

start end min

start and end are a valid range of page numbers, start ≤ end; the pages start and min are part of the page range. min is a positive integer indicating the minimum number of times a word should appear on a page in the given range to be considered important; that is, each word appearing in the graph must appear in, at least, an min-neighborhood. For example, the specification

? 1 100 3

specifies that a concept graph should be constructed for all words appearing at least three times on any page in the range 1 through 100.

Stop Words

The stop-word file is the same as it was for Assignment 4.

Output

The prompt "? " (that is, a question mark followed by a space) should be output to std-out each time the program is ready for a query.

Under non-error conditions, the program outputs an n-neighbor tree by calling the procedure

void show_graph(const graph_node * const);

passing a pointer to a node of the graph. The file show-graph.cc in the /export/home/class/CS-503/a5 assignment directory contains the code for show_graph(). This file should be compiled and linked in to your code.

The definition for graph-node used by show_graph() can be found in the file graph-node.h, which can also be found in the assignment directory. You may extend this definition any way you see fit, but you should not change the declarations for the existing member functions (graph_node::neighbor_count(), graph_node::get_neighbor(), and graph_node:: get_word()).

Error messages should be brief and informative, and written to std-out (not std-err) preceded by "! " (that is, an exclamation point followed by a space).

Testing

You can continue to use the Assignment 3 test files for this assignment; they can be found in the Assignment 3 directory. The Assignment 4 stop file, found in the Assignment 4 directory, can also be used for this assignment.

References

A Graph Model for Unsupervised Lexical Acquisition by Dominic Widdows and Beate Dorow, 19th International Conference on Computational Linguistics, August 2003.

This page last modified on 13 April 2006.