Programming Assignment 3a - Word Scraping

Computer Algorithms I, Fall 2003

Due Date

This assignment is due by 5:00 p.m. on Friday, 17 October.

See the assignment turn-in page (last modified on 3 September 2003) for instructions on turning in your assignment.

Background

A Web scraper is a program that extracts information from Web pages. Web scrapers are handy gadgets to have; I run scrapers that delete advertisements and font manipulation tags from Web pages.

In this assignment you'll write a simple scraper that extracts words from HTML pages.

The Problem

Write a word scraper with the prototype

void mitm(const std::string &, const resource &);

(declared in mitm.h, available in the assignment directory /export/home/class/cs-305/pa3a).

Your implementation of mitm() should print to standard output all the words found in the body of the HTML page passed in as the resource, one word per line. The body of an HTML page is defined as all text appearing between the body tag (<body>) and the end-body tag (</body>). If the resource doesn't contain an HTML page, or if the HTML page doesn't contain a body, your routine should output nothing.
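As a minimal sketch of the body rule above, the following assumes the page text is already available as a std::string; how to get that text out of a resource is determined by mitm.h and is not shown here. The helper name extract_body is hypothetical. The sketch also makes two assumptions the assignment does not spell out: tag names are matched without regard to case, and a page with a body tag but no end-body tag is treated as having no body.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of mitm.h): returns the text between the
// body tag and the end-body tag, or an empty string if either is missing.
std::string extract_body(const std::string &page) {
    // Work on a lower-case copy so <BODY>, <Body>, and <body> all match.
    std::string lower(page);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    // Find "<body" followed by '>' or whitespace, so that a tag with
    // attributes matches but a longer tag name does not.
    std::string::size_type open = 0;
    for (;;) {
        open = lower.find("<body", open);
        if (open == std::string::npos) return "";
        const std::string::size_type next = open + 5;
        if (next < lower.size() &&
            (lower[next] == '>' ||
             std::isspace(static_cast<unsigned char>(lower[next]))))
            break;
        open = next;
    }

    // The body text starts just after the '>' that ends the body tag.
    // Assumption: no '>' appears inside an attribute value.
    std::string::size_type start = lower.find('>', open);
    if (start == std::string::npos) return "";
    ++start;

    // The body text ends where the end-body tag begins.
    const std::string::size_type close = lower.find("</body", start);
    if (close == std::string::npos) return "";

    return page.substr(start, close - start);
}
```

An empty result covers both "not an HTML page" and "no body", and in either case the routine should output nothing.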

A word is defined as a maximal sequence of letters of either case; a letter is any character for which isalpha() returns true. For example, O'Reilly in the document would be output as

O
Reilly

by your program. Your word scraper should ignore everything within HTML tags.
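The word rule and the tag rule can be sketched together as a single pass over the body text. This is one possible approach, not the required one; the helper name extract_words is hypothetical, and the sketch assumes that a tag is everything from a '<' up to the next '>' and that a tag ends any word in progress.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not part of mitm.h): collects maximal runs of
// letters from text, skipping everything from '<' through the next '>'.
std::vector<std::string> extract_words(const std::string &text) {
    std::vector<std::string> words;
    std::string current;   // letters of the word being built, if any
    bool in_tag = false;   // true while between '<' and '>'

    for (unsigned char c : text) {
        if (in_tag) {
            // Ignore tag contents; only '>' changes state.
            if (c == '>') in_tag = false;
            continue;
        }
        if (c != '<' && std::isalpha(c)) {
            current += static_cast<char>(c);
            continue;
        }
        // Any non-letter (including '<') ends the current word.
        if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
        if (c == '<') in_tag = true;
    }
    // Flush a word that runs to the end of the text.
    if (!current.empty()) words.push_back(current);
    return words;
}
```

With this sketch, O'Reilly yields the two words O and Reilly, matching the example above. Each returned word would then be written on its own line.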

The assignment directory /export/home/class/cs-305/pa3a contains everything else you'll need to implement the assignment:

main.cc - The main routine.
mitm.h - Defines resource.
ip-utils.h, ip-utils.cc - IP utilities.
Makefile - A makefile to compile your scraper.

As always, it is not necessary that you understand these files to do your assignment. You may do whatever you want with these files to help you implement and test your assignment, but your code should not rely on any modifications you make to them. If you submit any of these files, they are deleted, and your code is compiled with the versions from the assignment directory.

You may add whatever other files you feel are necessary to implement this assignment.

Input

The only input your routine will get is from the mitm() parameter list; the string argument isn't used for this assignment. Your code should not expect or require any other input.

Output

The only output your routine should produce is the words found in the body of the resources heading to the browser. Each word should be written to standard output, one word per line, left-justified.

Running the Word Scraper

To test your word scraper, you should set your browser to use it as a proxy. How you do that depends on your browser. The examples here use lynx; other browsers are about the same, differing only in the way you specify the proxy.

By default, your word scraper listens on port 10305; you can change the port to p using the -p p option. The legal values for p are in the range 1024 <= p < 65536. Ports are a per-system resource; if you attempt to use an already allocated port, you'll get the error message

Can't bind to 10305 port:  Address already in use.

and you'll have to pick another port.

The easiest way to run your word scraper is to open two windows on the same system. In one window run your word scraper. In the other window run lynx with the http_proxy environment variable set to your word scraper:

$ http_proxy=http://localhost:10305 lynx

Make sure you start your word scraper before you start lynx. If you run your word scraper and browser on two different systems, http_proxy must contain the fully qualified domain name of the system on which your word scraper is running:

$ http_proxy=http://rockhopper.monmouth.edu:10305 lynx

You can find os-word-scraper, my solution to this assignment, in the assignment directory; os is one of linux or solaris. Remember, the objective of this assignment is to correctly print words in HTML documents; it is not to faithfully reproduce the behavior of my solution. If my solution's wrong and you copy the error, you're going to lose points.

This page last modified on 15 October 2003.