See the assignment turn-in page (last modified on 3 September 2003) for instructions on turning in your assignment.
The port portion of the address is optional and defaults to :80 if omitted. (The protocol portion is not optional; get() assumes it's http: and adds it to the URL if omitted.)
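The two defaults above can be sketched as follows. This is only an illustration: add_default_protocol() and port_of() are hypothetical names, not part of get.h or get.cc, and the sketch assumes the port, if present, is a plain decimal number.

```cpp
#include <string>

// Hypothetical helper (not from get.h): prepend "http://" when the URL
// carries no protocol, as the text says get() does.
std::string add_default_protocol(const std::string& url) {
    if (url.find("://") != std::string::npos) return url;
    return "http://" + url;
}

// Hypothetical helper: given the "host" or "host:port" part of an address,
// return the port, falling back to 80 when it is omitted.
// Assumes any port present is a valid decimal number (std::stoi throws otherwise).
int port_of(const std::string& host_and_port) {
    std::string::size_type colon = host_and_port.rfind(':');
    if (colon == std::string::npos || colon + 1 == host_and_port.size())
        return 80;
    return std::stoi(host_and_port.substr(colon + 1));
}
```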
The directory part of the URL is similar to the Unix directory structure: a sequence of names delimited by slash characters. The final, right-most name is the name of the resource to retrieve, while all preceding names are directory names. For example, the URL
http://www.monmouth.edu/path/visitors.asp
refers to the visitors.asp resource in the path directory. The left-most slash in a directory is known as the root of the server's resource directory.
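Separating the directory names from the resource name amounts to splitting at the right-most slash. A minimal sketch, assuming the directory part has already been isolated from the rest of the URL; split_path() is a hypothetical name, not something provided in get.h:

```cpp
#include <string>
#include <utility>

// Hypothetical helper: split a URL directory such as "/path/visitors.asp"
// into its directory names (kept with the trailing slash) and the final
// resource name. The resource name is empty when the directory ends in a slash.
std::pair<std::string, std::string> split_path(const std::string& path) {
    std::string::size_type slash = path.rfind('/');
    if (slash == std::string::npos)
        return std::make_pair(std::string(), path);   // no directories at all
    return std::make_pair(path.substr(0, slash + 1), path.substr(slash + 1));
}
```

For /path/visitors.asp this gives /path/ as the directory and visitors.asp as the resource.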
A directory that ends in a slash refers to the default resource stored in the associated directory. The actual document returned depends on the Web server used and how it's configured; the default is to return the resource index.html if it exists or an error otherwise. This is also what happens when the directory is omitted from the URL; the server returns the default resource in its root directory (or an error if otherwise configured).
A resource name may contain a hash character #, possibly followed by more text, such as in
http://www.w3.org/Consortium/Legal/ipr-notice#Copyright
This is called a name reference and is used to refer to a particular location within the resource given to the left of the hash. Stripping the hash and everything to the right produces a URL to the resource.
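Stripping a name reference takes one line with std::string. The function name below is made up for illustration and is not part of the provided code:

```cpp
#include <string>

// Hypothetical helper: drop the hash and everything to its right.
// When there is no hash, find() returns npos and substr() keeps the whole URL.
std::string strip_name_reference(const std::string& url) {
    return url.substr(0, url.find('#'));
}
```

Applied to the example above, this leaves http://www.w3.org/Consortium/Legal/ipr-notice.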
A relative URL is a URL consisting of only a directory. For example, the resource with URL http://www.w3.org/Overview.html contains the HREF URL /Help/siteindex. The context for a relative URL comes from the resource containing the URL. In the previous example, the protocol and host for /Help/siteindex come from the protocol and host for its containing document Overview.html; that is, the protocol is http: and the host is www.w3.org.
Similar to Unix directories, a URL directory need not start with a slash. If the URL directory doesn't start with a slash, it is interpreted relative to the location of its containing resource. For example, the resource
http://www.w3.org/Style/CSS/Overview.html
contains the images/Ark-CSS HREF URL. Because it doesn't start with a slash, its location is interpreted relative to the containing resource; that is, the absolute URL is
http://www.w3.org/Style/CSS/images/Ark-CSS
On the other hand,
http://www.w3.org/Style/CSS/Overview.html
also contains the /TR/WD-positioning HREF URL. Because it does start with a slash, its location is interpreted relative to the server's root; that is, the absolute URL is
http://www.w3.org/TR/WD-positioning
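The two cases above (leading slash or not) can be sketched as a single function. This is one possible approach, not the required one: resolve() is a hypothetical name, it works on plain strings rather than the data structure in get.h, and it assumes the HREF URL is already known to be relative. It does not handle .. names.

```cpp
#include <string>

// Hypothetical helper: turn a relative HREF URL into an absolute URL using
// the absolute URL of the containing resource as context.
std::string resolve(const std::string& base, const std::string& href) {
    // Locate the end of "protocol://host[:port]" in the containing URL.
    std::string::size_type scheme_end = base.find("://");
    std::string::size_type host_start =
        (scheme_end == std::string::npos) ? 0 : scheme_end + 3;
    std::string::size_type path_start = base.find('/', host_start);

    // Protocol and host come from the containing resource.
    std::string site = base.substr(0, path_start);
    // An omitted directory is treated as the server's root.
    std::string path =
        (path_start == std::string::npos) ? "/" : base.substr(path_start);

    // Leading slash: interpret relative to the server's root.
    if (!href.empty() && href[0] == '/')
        return site + href;

    // Otherwise: interpret relative to the containing resource's directory,
    // that is, everything up to and including the right-most slash.
    return site + path.substr(0, path.rfind('/') + 1) + href;
}
```

With http://www.w3.org/Style/CSS/Overview.html as the containing resource, images/Ark-CSS and /TR/WD-positioning resolve to the two absolute URLs shown above.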
Also similar to Unix, directories may contain references to the .. directory, which moves one directory up towards the root. To continue the example, the resource
http://www.w3.org/Style/CSS/Overview.html
contains the HREF URL
../../Graphics/SVG/Test/BE-ImpStatus-20011026.html
Because it doesn't start with a slash, the HREF URL's location is interpreted relative to its containing resource's location, or
http://www.w3.org/Style/CSS
The first two directories are .., which move the location two directories closer to the server's root, skipping the directories CSS and Style; the practical effect of this is to move directly under the server's root. From there, the resource BE-ImpStatus-20011026.html can be found by following the Graphics, SVG and Test directories:
http://www.w3.org/Graphics/SVG/Test/BE-ImpStatus-20011026.html
Any URL that attempts to move above the server's root is invalid and should be ignored.
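One way to handle .. names is to walk the directory left to right while keeping a stack of the names seen so far. The sketch below is an assumption-laden illustration, not the provided code: collapse_parent_refs() is a hypothetical name, it expects a directory that starts at the server's root, and it ignores empty names produced by doubled slashes.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: remove ".." names from a directory that starts at the
// server's root. Returns false when the directory tries to move above the
// root (the URL is invalid); otherwise stores the result in out.
bool collapse_parent_refs(const std::string& path, std::string& out) {
    std::vector<std::string> names;   // names kept so far, root first
    std::string::size_type pos = 0;
    while (pos < path.size()) {
        std::string::size_type next = path.find('/', pos);
        if (next == std::string::npos) next = path.size();
        std::string name = path.substr(pos, next - pos);
        if (name == "..") {
            if (names.empty()) return false;   // would move above the root
            names.pop_back();                  // move one directory up
        } else if (!name.empty()) {
            names.push_back(name);
        }
        pos = next + 1;
    }
    out.clear();
    for (std::vector<std::string>::size_type i = 0; i < names.size(); ++i)
        out += "/" + names[i];
    // Keep a trailing slash so a default-resource reference stays one.
    if (names.empty() || path[path.size() - 1] == '/') out += "/";
    return true;
}
```

For the example above, /Style/CSS/../../Graphics/SVG/Test/BE-ImpStatus-20011026.html collapses to /Graphics/SVG/Test/BE-ImpStatus-20011026.html, while a directory such as /Style/../../x is rejected.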
The command line to your revised program remains the same: a single URL. The output of your revised program also remains the same: HREF URLs in the associated resource should be written to std-out one per line, only now the URLs will be absolute URLs. Remember that the resource associated with the command-line URL need not be an HTML document.
Your program still doesn't have to check the URLs for validity, or figure out what kind of URLs they are, or retrieve the resources associated with the URLs; all that will come next.
A data structure and procedure have been provided to help you parse URLs; for more details, see get.h and get.cc, which can be found in the assignment directory
/export/home/class/cs-305/pa2a
Be careful if you copy these files to a more convenient location; you are responsible for making sure you are using the most recent version of these files. If you do copy the files, you shouldn't change them; get.h and get.cc are deleted if you turn them in.
You can find os-print-urls, my solution to this assignment, in the assignment directory; os is one of linux or solaris. Remember, the objective of this assignment is to correctly print links in HTML documents; it is not to faithfully reproduce the behavior of my solution. If my solution's wrong and you copy the error, you're going to lose points.
This page last modified on 25 September 2003.