See the assignment turn-in page (last modified on 3 September 2003) for instructions on turning in your assignment.
In this assignment you'll write a simple scraper that extracts words from HTML pages.
To do so, implement the procedure
void mitm(const std::string &, const resource &);
(declared in mitm.h, available in the assignment directory /export/home/class/cs-305/pa3a).
Your implementation of mitm() should print to std-out all the words found in the body of the HTML page passed in as the resource, one word per line. The body of an HTML page is defined as all text appearing between the body and end-body tags. If the resource doesn't contain an HTML page, or if the HTML page doesn't contain a body, your routine should output nothing.
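The body rule above can be sketched roughly as follows. This is only an illustration under stated assumptions: body_text() is a hypothetical helper, and it works on a plain std::string because the layout of resource (defined in mitm.h) isn't shown here. It also assumes that a page missing either tag counts as having no body, and it would be fooled by any other tag whose name merely starts with "body".

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: returns the text between the first <body ...> tag and
// the next </body> tag, or an empty string if either tag is missing.
// `html` stands in for the page text carried by the resource (an assumption).
std::string body_text(const std::string &html) {
  // Work on a lower-cased copy so the tag search ignores case.
  std::string lower(html);
  for (std::string::size_type i = 0; i < lower.size(); ++i)
    lower[i] = static_cast<char>(
        std::tolower(static_cast<unsigned char>(lower[i])));

  // The opening tag may carry attributes, so skip ahead to its closing '>'.
  const std::string::size_type open = lower.find("<body");
  if (open == std::string::npos) return "";
  std::string::size_type start = lower.find('>', open);
  if (start == std::string::npos) return "";
  ++start;

  // The body ends where the end-body tag begins.
  const std::string::size_type end = lower.find("</body", start);
  if (end == std::string::npos) return "";

  return html.substr(start, end - start);
}
```

An empty result covers both "no HTML page" and "no body", so a caller can simply print nothing in that case.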
A word is defined as a maximal sequence of letters of either case; a letter is any character for which isalpha() returns true. For example, O'Reilly in the document would be output as the two lines
O
Reilly
by your program. Your word scraper should ignore everything within HTML tags.
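One way to picture the word rule is a single pass over the body text with a flag that tracks whether the scan is inside a tag. The sketch below is an assumption-laden illustration, not the reference solution: words_in() is a hypothetical helper, it takes a plain std::string rather than a resource, it treats a tag as a word boundary, and it does not handle a '>' character appearing inside an attribute value or an HTML comment.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: collects maximal runs of isalpha() characters that lie
// outside HTML tags, in the order they appear in `text`.
std::vector<std::string> words_in(const std::string &text) {
  std::vector<std::string> words;
  std::string current;   // letters of the word being built
  bool in_tag = false;   // true between '<' and the matching '>'

  for (std::string::size_type i = 0; i < text.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(text[i]);
    if (in_tag) {
      // Everything inside a tag is ignored.
      if (c == '>') in_tag = false;
      continue;
    }
    if (c == '<') in_tag = true;
    if (!in_tag && std::isalpha(c)) {
      current += text[i];
    } else if (!current.empty()) {
      // Any non-letter (or the start of a tag) ends the current word.
      words.push_back(current);
      current.clear();
    }
  }
  if (!current.empty()) words.push_back(current);
  return words;
}
```

Writing each element of the returned vector to std::cout followed by a newline gives the one-word-per-line output described above.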
The assignment directory /export/home/class/cs-305/pa3a contains everything else you'll need to implement the assignment:
main.cc - The main routine.
mitm.h - Defines resource.
ip-utils.h, ip-utils.cc - IP utilities.
Makefile - A makefile to compile your scraper.
As always, it is not necessary that you understand these files to do your assignment. You may do whatever you want with these files to help you implement and test your assignment, but your code should not rely on any modifications you make to these files. These files are deleted if you submit them, and your code is compiled with the code from the assignment directory.
You may add whatever other files you feel are necessary to implement this assignment.
Your code receives its input through the mitm() parameter list; the string argument isn't used for this assignment. Your code should not expect or require any other input.
By default, your word scraper listens on port 10305; you can change the port to p using the -p p option. The legal values for p are in the range 1024 <= p < 65536. Ports are a per-system resource; if you attempt to use an already allocated port, you'll get an error message
Can't bind to 10305 port: Address already in use.
and you'll have to pick another port.
The easiest way to run your word scraper is to open two windows on the same system. In one window run your word scraper. In the other window run lynx with the http_proxy environment variable set to your word scraper:
$ http_proxy=http://localhost:10305 lynx
Make sure you run your word scraper before you run lynx. If you run your word scraper and browser on two different systems, the http proxy must contain the fully qualified domain name of the system on which your word scraper is running:
$ http_proxy=http://rockhopper.monmouth.edu:10305 lynx
You can find os-word-scraper, my solution to this assignment, in the assignment directory; os is one of linux or solaris. Remember, the objective of this assignment is to correctly print words in HTML documents; it is not to faithfully reproduce the behavior of my solution. If my solution's wrong and you copy the error, you're going to lose points.
This page last modified on 15 October 2003.