Programming Assignment 1 - HTML Filtering

Advanced Programming II, Spring 2001


Due Date

This assignment is due on Tuesday, 30 January, no later than 2:00 p.m.

See the assignment turn-in page for instructions on turning in your assignment.

Background

A web page is a document containing HTML (HyperText Markup Language) text. HTML text consists of regular text interspersed with tags. Regular text is a possibly empty sequence of characters excluding the characters < and > (collectively known as angle brackets). A tag is a character sequence of the form
< white-space tag-name white-space regular-text >
where white-space is a possibly empty sequence of the white-space characters space, tab (\t), or newline (\n), and tag-name is a non-empty sequence of letters (a through z; case is not significant). The regular text contained in a tag is known as attribute text. If a tag contains attribute text, there will be at least one white-space character between the tag name and the attribute text.
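As an illustration of the tag form above, the following C sketch checks whether a complete tag carries a given tag name. The names is_tag_space and tag_name_is are hypothetical and not part of the assignment; the sketch assumes the whole tag, from < through >, is available in one null-terminated string.

```c
#include <ctype.h>
#include <stddef.h>

/* The three white-space characters named in the tag definition. */
static int is_tag_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n';
}

/* Hypothetical helper: returns 1 if the complete tag starting at tag[0]
   has the tag name `name` (compared without regard to case), else 0.
   Assumes `tag` is a null-terminated string holding the whole tag. */
int tag_name_is(const char *tag, const char *name)
{
    size_t i = 0;

    if (tag[i] != '<')
        return 0;
    i++;

    /* Skip the optional white-space between '<' and the tag name. */
    while (is_tag_space(tag[i]))
        i++;

    /* Compare the name letter by letter, ignoring case. */
    while (*name != '\0') {
        if (tolower((unsigned char)tag[i]) != tolower((unsigned char)*name))
            return 0;
        i++;
        name++;
    }

    /* The name must end here: white-space or '>' has to follow it. */
    return is_tag_space(tag[i]) || tag[i] == '>';
}
```

Under these assumptions, tag_name_is("< font\tSIZE=2>", "FONT") yields 1, while a tag such as <FONTS> is rejected because a letter follows the four matched characters.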

Web pages are stored on web servers and displayed on web browsers. A web browser contacts a web server with the name of a web page, and the web server sends the web page - that is, the HTML text making up the web page - back to the web browser. One important characteristic of this transaction is that a web browser may (and usually does) receive a web page in pieces rather than all at once; this is why web pages often appear little-by-little at a web browser. For example, a web server may send a 2,000-character web page to a web browser, but the web browser may receive the web page in 250-character pieces.

The Problem

Some web-browser users place a web filter between their web browser and web server. The web filter receives all the HTML text being sent to the browser and manipulates the HTML text as specified by the user; the manipulated text is then sent on to the web browser. Web filters usually support a wide range of manipulations; two common ones are stripping cookies from web pages and preventing images from being included in web pages.

In this assignment, you will be implementing another common web-filter manipulation: modifying tags in web pages. In particular, you are to write a subroutine that disables font tags in HTML text. A font tag is disabled by replacing any attribute text in the tag with space characters. For example, the subroutine would change the font tag

<FONT FACE="Tahoma" COLOR="#000099">
into the font tag
<FONT                              >
Note that the disabled font tag has the same number of characters as the original font tag, but all the characters in the attribute text are replaced by space characters.
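The length-preserving replacement can be sketched in C as follows. This is only an illustration: blank_attribute_text and is_tag_space are hypothetical names, the sketch assumes it is handed one complete, null-terminated tag, and it does not check that the tag is a font tag. It also leaves the white-space right after the tag name untouched, which is one possible reading of the tag form.

```c
#include <ctype.h>

/* The three white-space characters named in the tag definition. */
static int is_tag_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n';
}

/* Hypothetical helper: overwrites the attribute text of one complete
   tag with spaces, in place. The tag keeps its original length. */
void blank_attribute_text(char *tag)
{
    char *p = tag;

    if (*p != '<')
        return;
    p++;

    while (is_tag_space(*p))            /* white-space before the name */
        p++;
    while (isalpha((unsigned char)*p))  /* the tag name itself */
        p++;
    while (is_tag_space(*p))            /* white-space after the name */
        p++;

    /* Everything from here up to '>' is attribute text. */
    while (*p != '\0' && *p != '>')
        *p++ = ' ';
}
```

Because each attribute character is overwritten rather than removed, no other text in the buffer has to move.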

The subroutine should be called disable_font and have the prototype

void disable_font(char * text, unsigned char_cnt)
The text parameter is a pointer to the HTML text to process. The char_cnt parameter gives the count of characters in the HTML text pointed to by text; the end-of-string character is not included in the count.

disable_font() should change the HTML text in place; that is, it should modify the contents of the string pointed to by text. For example, if text pointed to the string

"<P><FONT FACE='Tahoma'><FONT COLOR='#000099'> If you
teach 400 or 500 level courses, I ask that you please remind your
students in those classes to check their email account for this information.
This would be of great assistance to my efforts and will help many students
with their career planning.</FONT></FONT><FONT FACE='Tahoma'>
<FONT COLOR='#000099'></FONT></FONT>"
after the call to
disable_font(text, strlen(text))
text would point to the string
"<P><FONT              ><FONT                > If you
teach 400 or 500 level courses, I ask that you please remind your
students in those classes to check their email account for this information.
This would be of great assistance to my efforts and will help many students
with their career planning.</FONT></FONT><FONT              >
<FONT                ></FONT></FONT>"
Note that because the HTML text for a page may be delivered to the browser in pieces, a font tag may be split across two or more pieces. disable_font() must be able to recognize and disable font tags that have been split across pieces of HTML text.
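
One possible way to cope with split tags is a small state machine whose state is kept in static variables, so that it survives from one call to the next. The C sketch below is a minimal illustration under stated assumptions, not a reference solution: it assumes disable_font() is called once per piece, in arrival order, for a single page at a time, and that the input follows the tag form given in the Background section. The state names and is_ws are hypothetical.

```c
#include <ctype.h>

/* Where the scanner is within the HTML text (hypothetical names). */
enum scan_state {
    IN_TEXT,      /* outside any tag */
    TAG_START,    /* seen '<', skipping white-space before the name */
    IN_NAME,      /* matching the letters of "font" */
    FONT_WS,      /* font tag: white-space after the name */
    FONT_ATTR,    /* font tag: inside the attribute text */
    OTHER_TAG     /* some tag that is not a font tag */
};

static int is_ws(char c)
{
    return c == ' ' || c == '\t' || c == '\n';
}

void disable_font(char *text, unsigned char_cnt)
{
    /* Static state persists between calls, so a tag that is split
       across pieces resumes where the previous piece stopped. */
    static enum scan_state st = IN_TEXT;
    static unsigned matched = 0;         /* letters of "font" seen */
    static const char name[] = "font";
    unsigned i;

    for (i = 0; i < char_cnt; i++) {
        char c = text[i];

        if (c == '>') {                  /* '>' ends whatever tag we are in */
            st = IN_TEXT;
            continue;
        }
        switch (st) {
        case IN_TEXT:
            if (c == '<')
                st = TAG_START;
            break;
        case TAG_START:
            if (is_ws(c))
                break;
            if (tolower((unsigned char)c) == name[0]) {
                st = IN_NAME;
                matched = 1;
            } else {
                st = OTHER_TAG;
            }
            break;
        case IN_NAME:
            if (matched < 4 && tolower((unsigned char)c) == name[matched])
                matched++;
            else if (matched == 4 && is_ws(c))
                st = FONT_WS;
            else
                st = OTHER_TAG;          /* e.g. <FORM> or <FONTS> */
            break;
        case FONT_WS:
            if (is_ws(c))
                break;
            st = FONT_ATTR;
            text[i] = ' ';               /* first attribute character */
            break;
        case FONT_ATTR:
            text[i] = ' ';
            break;
        case OTHER_TAG:
            break;
        }
    }
}
```

With this design, processing a page in one call and processing the same page in arbitrary consecutive pieces should give the same result. The cost of the static state is that the sketch can only follow one page at a time.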


This page last modified on 22 January 2001.