Programming Assignment 1a - Retrieving Web Documents

Computer Algorithms I, Fall 2003


Due Date

This assignment is due by 5:00 p.m. on Thursday, 11 September.

See the assignment turn-in page (last modified on 3 September 2003) for instructions on turning in your assignment.

Background

A Web spider (or Web crawler or Web robot) is a program that crawls a particular URL by retrieving the associated document, extracting the URLs contained in the document, and then recursively crawling the extracted URLs. Spiders are most frequently used to retrieve URLs and documents for search engines.
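The crawl itself is just a graph traversal. Here is a minimal sketch of the loop in C++, written as an iterative worklist rather than literal recursion, with hypothetical retrieve() and extract_urls() helpers standing in for the retrieval and extraction steps (neither helper exists in the assignment directory; they are placeholders for illustration only):

    #include <iostream>
    #include <queue>
    #include <set>
    #include <string>
    #include <vector>

    // Hypothetical helpers, for illustration only.  A real spider
    // would fetch the document over HTTP and scan it for URLs.
    std::string retrieve(const std::string &url) { return ""; }
    std::vector<std::string> extract_urls(const std::string &doc) { return {}; }

    // Crawl every URL reachable from start_url.  The seen set keeps
    // the spider from visiting a URL twice, and so from looping
    // forever on pages that link to each other.
    void crawl(const std::string &start_url) {
        std::set<std::string> seen;
        std::queue<std::string> pending;
        seen.insert(start_url);
        pending.push(start_url);
        while (!pending.empty()) {
            std::string url = pending.front();
            pending.pop();
            std::string doc = retrieve(url);
            for (const std::string &u : extract_urls(doc))
                if (seen.insert(u).second)   // true only if u is new
                    pending.push(u);
        }
    }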

This is the first of four assignments leading up to a spider of more modest abilities: it will crawl the Computer Science Department's web site, printing the URLs of all the syllabi it manages to find.

The Problem

Write a program that accepts a URL as its single command-line argument, retrieves the associated document, and writes the document to standard output. If the program is run with zero or more than one command-line argument, it should print an informative error message and exit; it should do the same if any errors occur while retrieving the document.

You can use get() to retrieve the document. get() is defined in get.h and get.cc, which can be found in the assignment directory

/export/home/class/cs-305/pa1a
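
For concreteness, here is a sketch of the overall program structure. The signature of get() shown below is an assumption made for illustration, not the actual interface; check get.h for the real declaration before writing any code:

    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include "get.h"

    int main(int argc, char *argv[]) {
        // Insist on exactly one command-line argument: the URL.
        if (argc != 2) {
            std::cerr << "usage: " << argv[0] << " url" << std::endl;
            return EXIT_FAILURE;
        }

        // Assumed interface: get() fills in the document associated
        // with the URL and reports success or failure.  The real
        // declaration in get.h may well differ.
        std::string document;
        if (!get(argv[1], document)) {
            std::cerr << argv[0] << ": can't retrieve " << argv[1] << std::endl;
            return EXIT_FAILURE;
        }

        std::cout << document;
        return EXIT_SUCCESS;
    }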

Be careful if you copy these files to a more convenient location; you are responsible for making sure you are using the most recent versions. If you do copy the files, don't change them; get.h and get.cc will be deleted if you turn them in.

You can find get-page, my solution to this assignment, in the assignment directory. Remember, the objective of this assignment is to retrieve the documents associated with URLs correctly; it is not to faithfully reproduce the behavior of my solution. If my solution's wrong and you copy the error, you're going to lose points.


This page last modified on 9 September 2003.