CS 176 - An Answer for Design Test 3

CS 176, Introduction to Computer Science II

Summer 2001 - An Answer for Design Test 3

This problem involves keeping track of statistics on a file. Do the easiest part first - open a file and then close it:

ifstream ins(filename)
if !ins
  couldn't open filename
  exit

// process the file

ins.close()

Because the file's opened for read, and because there isn't much that can be done if there's a problem, this design doesn't check to see if the close succeeded. This would not be acceptable if the file was opened for write, because an error on close could mean lost data.

Now to process the file. The problem statement describes the statistics to be collected mostly in terms of lines (the largest, the smallest, and the total number), the remaining statistic (the total number of characters) can be determined from the lines.

The problem's orientation towards lines suggests that handling the input in terms of lines might be a good approach to take. Using wishful thinking, the algorithm designer's most powerful design tool, let's assume we know how to set line_size to the number of characters in the current input line, or to -1 if there's no more lines. The file processing algorithm follows immediately:

// process the file

   min_line_size = a big number
   max_line_size = 0
   line_count = 0
   char_count = 0

   while true

     // find the line size

     if line_size < 0
       break
     char_count += line_size
     line_count++
     if min_line_size > line_size
       min_line_size = line_size
     if max_line_size < line_size
       max_line_size = line_size

Finding the line size involves reading characters until a newline or end-of-file:

// find the line size

   line_size = 0
   while true
     if (ch = ins.get()) == EOF
       line_size = -1
       break
     else if !ins.good()
       an error on read
       exit
     else
       line_size++
       if ch == newline
	 break

If the line doesn't end in a newline, it's not a line and doesn't have any size. The chances of an error on read are slim, but if an error does occur, it's important to catch it. Not catching a read error would cause the program to misbehave in undesirable ways, such as running forever (no eof) or producing ridiculous results that none-the-less look legitimate from outside the program.

This end the design of the algorithm, or it would if the design was correct. Unfortunately, this design has an error; what is it?

The problem occurs in keeping track of the number of characters in the file. A line that doesn't end in a newline may not be a line, but it does contain characters which should be counted as part of the file.

Once the error is pointed out, it's easily fixed by adding the current line size to the character count before wiping out the line size with -1:

line_size = 0
while true
  if (ch = ins.get()) == EOF
    char_cnt += line_size
    line_size = -1
    break
  else if !ins.good()
    an error on read
    exit
  else
    line_size++
    if ch == newline
      break

With this correction we're done, but it's worth while taking a look at how strings might help solve this problem. The most likely reason to use strings is to simplify reading in a line, which can be done with one call to getline():

getline(ins, str)
line_size = str.length()
if (ch = ins.get()) == EOF
  char_cnt += line_size
  line_size = -1
else if !ins.good()
  an error on read
  exit

getline() does simplify the design by eliminating the while loop, but this design has an error. What is the error?

The problem with getline() is that it throws away the newline rather than storing it in the string. The line size is off by one because it doesn't include the newline. "Well", you may be saying to yourself, "that's easy enough to fix, just add one to the line size to compensate for the missing newline:"

getline(ins, str)
line_size = str.length() + 1
if (ch = ins.get()) == EOF
  char_cnt += line_size
  line_size = -1
else if !ins.good()
  an error on read
  exit

Adding one does compensate for missing newlines, but this design is still wrong. Why?

Automatically adding one to the string size may over compensate because the last line in the file need not end in a newline. If the last line doesn't end in a newline, adding one produces incorrect statistics.

Fixing this new problem requires a deeper understanding of how getline() works. getline() expects each string to end in a newline. A string that doesn't end in a newline is not formatted correctly, and getline() produces a format error when it tries to read such a string. If getline() produces a format error, then good() will return false because an error occurred. This suggests that line_size should only be increased by one when no format error occurred:

getline(ins, str)
line_size = str.length()
if ins.good()
  line_size++
if (ch = ins.get()) == EOF
  char_cnt += line_size
  line_size = -1
else if !ins.good()
  an error on read
  exit

Unfortunately, this design is still wrong because good() can return false for a number of reasons (including eof), not only because a string didn't end in a newline.

Although it is possible to fix-up the design so it works correctly with strings, we haven't covered the parts of C++ necessary to make the fix. Rather than battle on, we'll let the problem drop here. However, there are two important lessons you should take away from this attempt:

Make sure you really understand a C++ feature before you use it. Errors caused by features you think you understand but don't are particularly nasty to find because you don't suspect the cause. And C++ is filled with features that are more complicated to understand than they initially appear to be.
Make sure your simplification really is. Your simplification may start out simple, but if you have to keep adjusting it to fix little problems, you'll find your initial simplification lost in a thicket of special cases. Keep your eye on the original design, and don't be afraid to bail out if the new design is becoming more complicated than the original.

This page last modified on 27 July 2001.