Programming Assignment 2 - Symbolic Execution

Advanced Programming II, Spring 2003

Due Date

This assignment is due by 2:00 p.m. on Tuesday, 18 February.

See the assignment turn-in page (last modified on 21 January 2003) for instructions on turning in your assignment.

Symbolic execution is the process by which one program simulates the execution of another program. Symbolic execution is similar to interpretation; they both use the same mechanism to execute code. The difference is that symbolic execution manipulates symbolic values instead of the integers and characters manipulated by an interpreter.

Symbolic execution is an important technique in program analysis. For example, compilers use symbolic execution to determine when an expression is using an uninitialized variable. Each variable in a program can hold either of the symbolic values "defined" or "undefined". When a variable is created, it holds "undefined"; when assigned a value, a variable holds the value "defined". If the compiler finds an expression containing a variable holding "undefined", then the program is incorrectly using an uninitialized variable.

Here is an example, with the results of symbolic execution appearing in the comments:

int * ip;	// ip = "undefined"
int i;		// i = "undefined"

i = 3;		// i = "defined"
*ip = i;	// error - ip undefined

The compiler finds the variable ip is undefined in the expression *ip. Notice that neither the types of the variables nor the actual values assigned to the variables matter; the only thing that matters is whether the variables hold "undefined" or "defined".

The Problem

Write a program that uses symbolic execution to determine if two instruction blocks perform the same computations. The program should read the two blocks from std-in and write to std-out either nothing if the two blocks perform identical computations or the two different values computed by the blocks for each memory location.

Computational Model

Your program should simulate a simple CPU-memory computer. The CPU contains eight registers named r0 through r7; memory is addressed by identifiers in C++ syntax (see Deitel and Deitel, page 26), which are assumed to be different from the register names.

The CPU executes executes a sequence of instructions starting at the beginning of the sequence and stopping at the end of the same sequence; instructions are executed in strict sequential order with no jumping or gotos.

An instruction is either a move instruction or an operator instructions. A move instruction has the form

move dst src

and moves the value stored in src to the location given by dst; the effect of a move instruction is

dst = src

dst can be either a memory-address identifier or a register name; src can be either a memory-address identifier, a register name, or an integer constant. However, both dst and src cannot be memory-address identifiers.

An operator instruction has the form

opr dst src

where opr is any of the standard C++ binary operators multiplication, division, modulus, addition, subtraction, and the bitwise operators left shift, right shift, AND, exclusive OR, and inclusive OR (See Deitel and Deitel Appendix A for the operators). dst must be a register name; src can be a register name or an integer constant. The effect of an operator instruction is to compute

dst = dst op src

Two blocks are said to perform the same computation if and only if, when started from the same memory configuration, they both produce the same memory configuration at the end of their execution. Notice that the final contents of the CPU memory registers do not matter when determining if two computations compute the same values. Notice also that this property must hold for any initial memory configuration.

For example, the two blocks

move r0  a
move r1  a
+    r0 r1
move  a r0

move r0  a
+    r0 r0
move  a r0

always produce the same memory configuration a = aval + aval, where aval is the initial value stored in the memory location a. Notice that the difference in values stored in register r1 by the two blocks does not matter when determining if the blocks produce the same computation.

Because your program is computing on symbolic values, it need not worry about matters such as precision and overflow.

Input

The input will be a pair of instruction blocks. Each block can be proceeded or followed by zero or more blank lines (that is, a line containing zero or more space characters as defined by std::isspace(), excluding the line-ending newline); at least one blank line separates the two blocks. Any non-space character occurring after the second block should result in an input error, which prints an informative error message to std-err and immediately terminates the program.

An instruction block consists of one or more instructions. Each instruction appears in a single line, and a single line contains exactly one instruction.

Any syntactically incorrect input should immediately terminate execution with an informative error message written to std-err.

Output

If the two instruction blocks perform the identical computations, the program should produce no output.

If the two blocks perform different computations for one or more memory-resident variables, the program should write the following error message to std-out for any one of the incorrect variables:

Block 1 computes value <value₁> for variable <variable>.
Block 2 computes value <value₂> for variable <variable>.

where <value_i> is the value computed by block i and <variable> is one of the memory-location identifiers with incorrectly computed values. Any syntactically incorrect input should immediately terminate execution with an informative error message written to std-err.

Testing

The gen-code program in /export/home/class/cs-509/pa2 generates blocks of code you can use to test your program. The command format is

gen-code [ -d ]

By default, gen-code writes to std-out two code blocks that compute identical values. The -d command-line option causes gen-code to write two code blocks that compute different values.

There are a few points you should keep in mind when using gen-code:

It generates random code; you shouldn't try to make too much sense of the output. Also, you should probably send gen-code's output to a file so you can test your program with repeatable input.
gen-code creates good output by generating a random block of code and then optimizing it. Because optimization is tricky (and because I wrote the optimizer), it may happen that the optimizations are incorrect and that optimized block computes different values then does the unoptimized block.
If your program indicates the good input from gen-code computes different values, your program may be correct and you should check the input to verify your program's result.
There's a similar, though less likely, problem with bad output produced by the -d command-line option. gen-code produces bad output by generating two random blocks of code; your program should almost always indicate an error in the bad output from gen-code.
However, if gen-code manages to generate two identical code blocks (which could happen, as they say), then your program will indicate no errors. Again, you should verify the input to your program to see if your program's correct.
Occasionally gen-code will create a random expression that's too big to compile. When that happens, you'll get a "No more free registers." error message. Just run gen-code again.

This page last modified on 11 February 2003.