Programming Assignment 4a - Web Page Concordance

Computer Algorithms II, Fall 2007


Due Date

This assignment is due by 5:30 p.m. on Tuesday, 27 November.

See the assignment turn-in page (last modified on 14 January 2006) for instructions on turning in your assignment.

Background

A concordance for a set of web pages is a list of words appearing in any web page in the set. Associated with each word in the concordance is the list of web pages in which the word appeared.

Given a web-page concordance, a query on the concordance is a list of words. The result of the query is the list of web pages that each contain every word in the query list. If no web page contains all the words in the query, the query result is empty (that is, contains no web pages).
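One way to picture the definitions above is as a map from each word to the set of pages containing it, with a query answered by intersecting those sets. The following is only a sketch under that assumption; the names Concordance, add_word, and answer_query are hypothetical and are not part of the assignment's interface, and treating an empty query list as having no results is likewise an assumption.

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical representation: each word maps to the (sorted) set of URLs
// of the pages in which the word appears.
using Concordance = std::map<std::string, std::set<std::string>>;

// Record that `word` appears in the page at `url`.
void add_word(Concordance & conc, const std::string & word,
              const std::string & url) {
  conc[word].insert(url);
}

// Return the pages that contain every word in `words`.
// Assumption: an empty query list yields an empty result.
std::set<std::string> answer_query(const Concordance & conc,
                                   const std::vector<std::string> & words) {
  std::set<std::string> result;
  bool first = true;
  for (const std::string & w : words) {
    Concordance::const_iterator it = conc.find(w);
    if (it == conc.end()) return std::set<std::string>();  // word never seen
    if (first) {
      result = it->second;
      first = false;
    } else {
      // Keep only the pages that also contain this word.
      std::set<std::string> common;
      std::set_intersection(result.begin(), result.end(),
                            it->second.begin(), it->second.end(),
                            std::inserter(common, common.begin()));
      result.swap(common);
    }
    if (result.empty()) break;  // no page can match any more
  }
  return result;
}
```

Other representations (hash tables, sorted vectors, tries) may suit the course material better; the sketch is meant only to make the query semantics concrete.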

The Problem

Write code that maintains a concordance for a set of web pages and is able to answer queries on the concordance.

Your code will be part of a web-browser proxy. The interface between your code and the proxy is the procedure

extern void wbproxy(const std::string & url, std::string & document);

defined in /export/home/class/cs-306/a4a/wbproxy.h.

Input

Input to your code will be either

  1. a web page to be added to the concordance or

  2. a query to be answered.

If the input to your code is a web page, then the first argument to wbproxy() is the page's URL and the second argument contains the page itself. If the input to your code is a query, then the first argument is the query itself and the second argument is empty (that is, document = "").
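The two cases might be told apart inside wbproxy() roughly as follows. This is a sketch, not the required structure: add_page and answer are hypothetical helpers, stubbed here so the fragment compiles by itself, and relying only on an empty document to detect a query is an assumption (checking the URL for the query prefix as well would be more defensive).

```cpp
#include <string>

// Hypothetical helpers, stubbed so that the sketch is self-contained.
static int pages_added = 0;

static void add_page(const std::string & /*url*/,
                     const std::string & /*page*/) {
  ++pages_added;  // a real version would update the concordance
}

static std::string answer(const std::string & /*query_url*/) {
  return "<p>No results.</p>";  // a real version would run the query
}

void wbproxy(const std::string & url, std::string & document) {
  if (document.empty()) {
    // Query: url holds the query; the answer is returned in document.
    document = answer(url);
  } else {
    // Web page: url is the page's URL and document is the page itself.
    add_page(url, document);
  }
}
```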

A query URL has the form

http://wpconc/word1,word2,...,wordn

where word1 through wordn are the words in the query. For example, the query

http://wpconc/quicksort,optimization,smallest

returns a page containing a list of all pages that contain the three words “quicksort,” “optimization,” and “smallest.”
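Extracting the words from a query URL of this form amounts to removing the http://wpconc/ prefix and splitting the remainder at commas. A minimal sketch follows; the names kQueryPrefix, is_query, and query_words are hypothetical, and skipping empty words (as in "a,,b") is an assumption the assignment does not address.

```cpp
#include <string>
#include <vector>

// The prefix is taken from the query URL form given in the assignment.
const std::string kQueryPrefix = "http://wpconc/";

// True if `url` begins with the query prefix.
bool is_query(const std::string & url) {
  return url.compare(0, kQueryPrefix.size(), kQueryPrefix) == 0;
}

// Split the part of `url` after the prefix at commas.
// Returns an empty vector if `url` is not a query URL.
std::vector<std::string> query_words(const std::string & url) {
  std::vector<std::string> words;
  if (!is_query(url)) return words;
  std::string::size_type start = kQueryPrefix.size();
  while (start <= url.size()) {
    std::string::size_type comma = url.find(',', start);
    if (comma == std::string::npos) comma = url.size();
    if (comma > start)  // skip empty words (an assumption)
      words.push_back(url.substr(start, comma - start));
    start = comma + 1;
  }
  return words;
}
```

Applied to the example query above, query_words() yields the three words quicksort, optimization, and smallest, in that order.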

Output

If the input to your code was a web page to be added to the concordance, no output is expected from your code.

If the input to your code was a query, your code should create the contents for a web page containing the answers, if any, to the query. The contents should be the HTML text that goes between the <body> and </body> tags in an HTML document. The content need not contain a <body> tag, nor any of the tags that usually precede the <body> tag in an HTML document; these will be added by the proxy. Similarly, the content need not contain a </body> tag nor any of the tags that usually follow the </body> tag.

The web-page contents returned can be simple; for example, just a list of web pages satisfying the query. The web page can be made more useful, but don't tart it up too much (for example, no JavaScript).

The contents are stored in the second argument to wbproxy(). Don't forget to set the size field correctly; the size should not include the null byte at the end of the content string.
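As an illustration of how simple the returned contents can be, the sketch below builds a body fragment holding an unordered list of links, one per matching page. The names html_escape and result_body are hypothetical, and the markup and the wording of the "no match" message are assumptions rather than requirements.

```cpp
#include <set>
#include <string>

// Escape the characters that are special in HTML text and attribute values.
std::string html_escape(const std::string & s) {
  std::string out;
  for (std::string::size_type i = 0; i < s.size(); ++i) {
    switch (s[i]) {
      case '&': out += "&amp;"; break;
      case '<': out += "&lt;"; break;
      case '>': out += "&gt;"; break;
      case '"': out += "&quot;"; break;
      default: out += s[i];
    }
  }
  return out;
}

// Build the text that goes between <body> and </body>; the proxy is
// expected to supply the surrounding tags.
std::string result_body(const std::set<std::string> & pages) {
  if (pages.empty()) return "<p>No pages match the query.</p>\n";
  std::string body = "<ul>\n";
  for (std::set<std::string>::const_iterator it = pages.begin();
       it != pages.end(); ++it) {
    const std::string url = html_escape(*it);
    body += "<li><a href=\"" + url + "\">" + url + "</a></li>\n";
  }
  body += "</ul>\n";
  return body;
}
```

Assigning the returned string to the second argument of wbproxy() would then deliver the contents to the proxy; note that std::string::size() never counts a terminating null byte.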

Building

The library libwbproxy.a in the assignment directory contains the code implementing a browser proxy. You link your code with libwbproxy.a to create the browser-proxy executable. The makefile in the assignment directory shows the commands needed to create the executable.

Running

It is easiest to have two windows open when using the proxy. Run the proxy in one of the windows; assuming your proxy is called wpconc, type

$ ./wpconc

The proxy normally runs without producing any output to std-out or std-err.

In the other window, run a browser connected to the proxy; this forces all traffic to and from the browser to pass through the proxy. The easiest way to do this is to use a text-only browser, such as lynx, set to use wpconc as a proxy:

$ http_proxy=http://localhost:10306 lynx

The proxy details may vary from one browser to another; check the browser's documentation for more information.

Ports

The value 10306 in the lynx example is the TCP port used by wpconc for communication with a browser. TCP ports are system-wide, which means only one program can use a particular port at a time. This could be a problem if several people are working on the same system, such as rockhopper. In such cases, the first person to run wpconc will get port 10306 and everybody else will get an error message:

$ ./wpconc 
Can't bind to 10306 port:  Address already in use.

$

If this happens, use the -p option to your proxy to specify an alternative port. Use the port 10xxx, where xxx are the last three digits of your student ID. For example, if the last three digits of your student ID are 123, use port 10123:

$ ./wpconc -p 10123

Don't forget to specify the alternative port when starting a browser:

$ http_proxy=http://localhost:10123 lynx


This page last modified on 10 November 2007.