Programming Assignment 2a - URL Rewriting

Computer Algorithms I, Fall 2003


Due Date

This assignment is due by 5:00 p.m. on Thursday, 25 September.

See the assignment turn-in page (last modified on 3 September 2003) for instructions on turning in your assignment.

Background

An absolute URL has the format

[Figure: absolute URL format]

The port portion of the address is optional and defaults to :80 if omitted. (The protocol portion is not optional; get() assumes it's http: and adds it to the URL if omitted.)
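The defaults described above can be sketched as a small splitting routine. This is only an illustration: UrlParts and split_absolute_url are hypothetical names, not the data structure or procedure declared in get.h, and the sketch assumes the host is everything between the // and the next slash.

```cpp
#include <string>

// Hypothetical structure for illustration; NOT the one declared in get.h.
struct UrlParts {
    std::string protocol;   // for example "http:"
    std::string host;       // for example "www.monmouth.edu"
    std::string port;       // for example ":80"
    std::string directory;  // for example "/path/visitors.asp"; may be empty
};

// Splits an absolute URL into its parts, filling in the defaults
// "http:" for a missing protocol and ":80" for a missing port.
UrlParts split_absolute_url(std::string url) {
    UrlParts parts;

    // Treat "//" as the protocol separator only if it contains the
    // first slash in the URL (so "host/a//b" is not misread).
    const std::string::size_type marker = url.find("//");
    if (marker != std::string::npos && marker == url.find('/')) {
        parts.protocol = url.substr(0, marker);
        url.erase(0, marker + 2);
    }
    if (parts.protocol.empty()) {
        parts.protocol = "http:";
    }

    // Everything up to the next slash is host plus optional port.
    const std::string::size_type slash = url.find('/');
    const std::string authority = url.substr(0, slash);
    if (slash != std::string::npos) {
        parts.directory = url.substr(slash);
    }

    const std::string::size_type colon = authority.find(':');
    parts.host = authority.substr(0, colon);
    parts.port = (colon == std::string::npos) ? ":80" : authority.substr(colon);
    return parts;
}
```

Calling split_absolute_url("http://www.monmouth.edu/path/visitors.asp") under these assumptions yields the protocol http:, the host www.monmouth.edu, the port :80 and the directory /path/visitors.asp.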

The directory part of the URL is similar to the Unix directory structure: a sequence of names delimited by slash characters. The final, right-most name is the name of the resource to retrieve, while all preceding names are directory names. For example, the URL

http://www.monmouth.edu/path/visitors.asp

refers to the visitors.asp resource in the path directory. The left-most slash in a directory is known as the root of the server's resource directory.

A directory that ends in a slash refers to the default resource stored in the associated directory. The actual document returned depends on the Web server used and how it's configured; the default is to return the resource index.html if it exists or an error otherwise. This is also what happens when the directory is omitted from the URL; the server returns the default resource in its root directory (or an error if otherwise configured).

A resource name may contain a hash character # possibly followed by more text, such as in

http://www.w3.org/Consortium/Legal/ipr-notice#Copyright

This is called a name reference and is used to refer to a particular location within the resource given to the left of the hash. Stripping the hash and everything to the right produces a URL to the resource.
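Stripping a name reference is a one-line string operation. The following is a minimal sketch; the function name strip_name_reference is made up for this illustration and is not part of get.h.

```cpp
#include <string>

// Returns the URL with the hash character and everything to its right
// removed. A URL without a hash is returned unchanged.
// (Hypothetical helper; the name is not taken from get.h.)
std::string strip_name_reference(const std::string& url) {
    const std::string::size_type hash = url.find('#');
    return (hash == std::string::npos) ? url : url.substr(0, hash);
}
```

Applied to the example above, it would turn http://www.w3.org/Consortium/Legal/ipr-notice#Copyright into http://www.w3.org/Consortium/Legal/ipr-notice.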

A relative URL is a URL consisting of only a directory. For example, the resource with URL http://www.w3.org/Overview.html contains the HREF URL /Help/siteindex. The context for a relative URL comes from the resource containing the URL. In the previous example, the protocol and host for /Help/siteindex come from the protocol and host for its containing document Overview.html; that is, the protocol is http: and the host is www.w3.org.

Similar to Unix directories, a URL directory need not start with a slash. If the URL directory doesn't start with a slash, it is interpreted relative to the location of its containing resource. For example, the resource

http://www.w3.org/Style/CSS/Overview.html

contains the images/Ark-CSS HREF URL. Because it doesn't start with a slash, its location is interpreted relative to the containing resource; that is, the absolute URL is

http://www.w3.org/Style/CSS/images/Ark-CSS

On the other hand,

http://www.w3.org/Style/CSS/Overview.html

also contains the /TR/WD-positioning HREF URL. Because it does start with a slash, its location is interpreted relative to the server's root; that is, the absolute URL is

http://www.w3.org/TR/WD-positioning
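The two cases above (leading slash or not) might be handled along the following lines. This is a sketch under stated assumptions: combine_directories is a hypothetical helper, it works only on the directory parts of the two URLs, and it leaves the protocol and host to be copied from the containing resource separately.

```cpp
#include <string>

// Combines the directory of the containing resource (for example
// "/Style/CSS/Overview.html") with the directory of an HREF URL.
// (Hypothetical helper; the name is not taken from get.h.)
std::string combine_directories(const std::string& containing_dir,
                                const std::string& href_dir) {
    // A leading slash means "relative to the server's root":
    // the containing resource's directory plays no part.
    if (!href_dir.empty() && href_dir[0] == '/') {
        return href_dir;
    }

    // Otherwise drop the resource name (everything after the last
    // slash) and append the HREF directory to what remains. An empty
    // containing directory is treated as the server's root.
    const std::string::size_type slash = containing_dir.rfind('/');
    const std::string parent = (slash == std::string::npos)
                                   ? "/"
                                   : containing_dir.substr(0, slash + 1);
    return parent + href_dir;
}
```

With the containing directory /Style/CSS/Overview.html, the HREF images/Ark-CSS would become /Style/CSS/images/Ark-CSS, while /TR/WD-positioning would be returned as is.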

Also similar to Unix, directories may contain references to the .. directory, which moves one directory up towards the root. To continue the example, the resource

http://www.w3.org/Style/CSS/Overview.html

contains the HREF URL

../../Graphics/SVG/Test/BE-ImpStatus-20011026.html

Because it doesn't start with a slash, the HREF URL's location is interpreted relative to its containing resource's location, or

http://www.w3.org/Style/CSS

The first two directories are .., which moves the location two directories closer to the server's root, skipping the directories CSS and Style; the practical effect of this is to move directly under the server's root. From there, the resource BE-ImpStatus-20011026.html can be found by following the Graphics, SVG and Test directories:

http://www.w3.org/Graphics/SVG/Test/BE-ImpStatus-20011026.html

Any URL that attempts to move above the server's root is invalid and should be ignored.
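One way to handle .. directories, including the invalid case, is to keep the directory names on a stack and pop one name for every .. encountered. The sketch below assumes the directory has already been made root-relative (it starts with a slash); collapse_parent_references is a hypothetical name and is not part of get.h.

```cpp
#include <string>
#include <vector>

// Removes ".." names from a directory that starts at the server's root,
// such as "/Style/CSS/../../Graphics/SVG/Test/x.html".
// Returns false if the directory tries to move above the root; otherwise
// returns true and stores the collapsed directory in `out`.
// (Hypothetical helper; the name is not taken from get.h.)
bool collapse_parent_references(const std::string& dir, std::string& out) {
    std::vector<std::string> names;  // names kept so far, root first

    std::string::size_type start = 0;
    while (start <= dir.size()) {
        std::string::size_type end = dir.find('/', start);
        if (end == std::string::npos) {
            end = dir.size();
        }
        const std::string name = dir.substr(start, end - start);
        if (name == "..") {
            if (names.empty()) {
                return false;  // would move above the server's root
            }
            names.pop_back();
        } else if (!name.empty()) {
            names.push_back(name);
        }
        start = end + 1;
    }

    // Rebuild the directory from the surviving names.
    out = "/";
    for (std::vector<std::string>::size_type i = 0; i < names.size(); ++i) {
        if (i > 0) {
            out += '/';
        }
        out += names[i];
    }
    // Keep a trailing slash so a default-resource URL stays one.
    if (!names.empty() && dir[dir.size() - 1] == '/') {
        out += '/';
    }
    return true;
}
```

For the example above, /Style/CSS/../../Graphics/SVG/Test/BE-ImpStatus-20011026.html would collapse to /Graphics/SVG/Test/BE-ImpStatus-20011026.html, while a directory such as /Style/../../x would be reported as invalid.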

The Problem

Revise your program for Assignment 1b to rewrite the URLs it finds as absolute URLs before outputting them.

The command line to your revised program remains the same: a single URL. The output of your revised program also remains the same: HREF URLs in the associated resource should be written to std-out one per line, only now the URLs will be absolute URLs. Remember that the resource associated with the command-line URL need not be an HTML document.

Your program still doesn't have to check the URLs for validity, or figure out what kind of URLs they are, or retrieve the resources associated with the URLs; all that will come next.

A data structure and procedure have been provided to help you parse URLs; for more details, see get.h and get.cc, which can be found in the assignment directory

/export/home/class/cs-305/pa2a

Be careful if you copy these files to a more convenient location; you are responsible for making sure you are using the most recent version of these files. If you do copy the files, you shouldn't change them; get.h and get.cc are deleted if you turn them in.

You can find os-print-urls, my solution to this assignment, in the assignment directory; os is one of linux or solaris. Remember, the objective of this assignment is to correctly print links in HTML documents; it is not to faithfully reproduce the behavior of my solution. If my solution's wrong and you copy the error, you're going to lose points.


This page last modified on 25 September 2003.