Programming Assignment 1b - Link Filtering

Computer Algorithms I, Fall 2003


Due Date

This assignment is due by 5:00 p.m. on Thursday, 18 September.

See the assignment turn-in page (last modified on 3 November 2003) for instructions on turning in your assignment.

Background

A link is a mechanism by which an HTML document can refer to another resource. A link is described by an anchor tag, which has (for our purposes) the syntax

< a href = "URL" >

where URL is the reference to the external resource.

Outside of strings, letter case is ignored in tags. Also outside of strings, space characters beyond those needed to separate the parts of a tag are ignored.

The href = part of the tag is called an attribute; anchor tags can have many attributes, but the only one we're interested in is the href attribute, whose value, as shown above, is a string giving the URL of a resource. The href attribute is optional: an anchor tag need not have one.

Other HTML tags also have an href attribute (the link tag, for example), but the only href attributes we're interested in for this assignment are those associated with anchor tags.
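The rules above can be illustrated with a small sketch. It is not part of the assignment files and is only one possible approach: anchor_href(), skip_spaces() and read_name() are hypothetical names, and the sketch assumes the caller has already isolated the text between the < and > of a single tag. It compares tag and attribute names without regard to case, skips extra space characters, leaves the URL string untouched, and rejects tags other than anchor tags.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Advance i past any space characters in s.
static void skip_spaces(const std::string& s, std::size_t& i) {
    while (i < s.size() && std::isspace(static_cast<unsigned char>(s[i]))) ++i;
}

// Read a run of letters and digits starting at i; return it in lower case,
// since letter case is ignored outside of strings.
static std::string read_name(const std::string& s, std::size_t& i) {
    std::string name;
    while (i < s.size() && std::isalnum(static_cast<unsigned char>(s[i]))) {
        name += static_cast<char>(std::tolower(static_cast<unsigned char>(s[i])));
        ++i;
    }
    return name;
}

// Hypothetical helper: tag is the text between '<' and '>'.
// Returns true and stores the URL in url if tag is an anchor tag
// with a double-quoted href attribute; otherwise returns false.
bool anchor_href(const std::string& tag, std::string& url) {
    std::size_t i = 0;
    skip_spaces(tag, i);
    if (read_name(tag, i) != "a") return false;      // some other tag

    while (i < tag.size()) {
        skip_spaces(tag, i);
        const std::size_t before = i;
        const std::string attr = read_name(tag, i);
        skip_spaces(tag, i);

        std::string value;
        bool has_value = false;
        if (i < tag.size() && tag[i] == '=') {
            ++i;
            skip_spaces(tag, i);
            if (i < tag.size() && tag[i] == '"') {
                const std::size_t close = tag.find('"', i + 1);
                if (close == std::string::npos) return false;  // unterminated string
                value = tag.substr(i + 1, close - i - 1);      // case preserved
                i = close + 1;
                has_value = true;
            }
        }
        if (attr == "href" && has_value) {
            url = value;
            return true;
        }
        if (i == before) ++i;   // no progress: step over an unexpected character
    }
    return false;               // anchor tag without an href attribute
}
```

With this sketch, the tag bodies a href="x" and A   HREF = "x" both yield x, while a name="top" and link href="style.css" yield nothing.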

The Problem

Write a program that retrieves a resource via HTTP and writes to std-out the links made by the resource. That is, your program should write to std-out the URLs given in the href attributes of the anchor tags contained in the resource.

The command line to your program should contain a single URL; the href URLs in the associated resource should be written to std-out one per line. The resource associated with the command-line URL need not be an HTML document; if it isn't, chances are good that it won't contain any anchor tags, in which case your program should print nothing.

The only thing your program has to do is find the href URLs and print them to std-out. It doesn't have to check the URLs for validity, or figure out what kind of URLs they are, or retrieve the resources associated with the URLs; all that will come later.
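One way to start on the finding step is to split the resource text into tags first. The sketch below is an illustration only, not the required design: tag_bodies() is a hypothetical name, and the sketch assumes the whole resource is already held in a std::string. It returns the text between each < and its matching >, treating a > inside a double-quoted string as part of the string. It makes no attempt to handle HTML comments or other special cases. Each returned body could then be checked to see whether it is an anchor tag with an href attribute.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: return the text between '<' and '>' for each tag in doc.
// A '>' that appears inside a double-quoted string does not end the tag.
std::vector<std::string> tag_bodies(const std::string& doc) {
    std::vector<std::string> tags;
    std::size_t i = 0;
    while (i < doc.size()) {
        if (doc[i] != '<') {        // ordinary text: skip it
            ++i;
            continue;
        }
        std::string body;
        bool in_string = false;     // true while between double quotes
        bool closed = false;        // true once the closing '>' is found
        std::size_t j = i + 1;
        for (; j < doc.size(); ++j) {
            const char c = doc[j];
            if (c == '"') in_string = !in_string;
            if (c == '>' && !in_string) {
                closed = true;
                break;
            }
            body += c;
        }
        if (closed) {
            tags.push_back(body);
            i = j + 1;              // resume after the closing '>'
        } else {
            ++i;                    // stray '<': treat it as ordinary text
        }
    }
    return tags;
}
```

A resource with no tags at all, such as a plain-text file, produces an empty list, which matches the requirement that the program print nothing in that case.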

You can use get() to get the resource associated with a URL. get() is defined in get.h and get.cc, which can be found in the assignment directory

/export/home/class/cs-305/pa1b

Be careful if you copy these files to a more convenient location; you are responsible for making sure you are using the most recent version of these files. If you do copy the files, don't change them; any copies of get.h and get.cc that you turn in are deleted.

You can find os-print-urls, my solution to this assignment, in the assignment directory; os is one of linux or solaris. Remember, the objective of this assignment is to correctly filter links in HTML documents; it is not to faithfully reproduce the behavior of my solution. If my solution's wrong and you copy the error, you're going to lose points.


This page last modified on 11 September 2003.