CS 503 - Advanced Programming I Programming Assignment 6

Programming Assignment 6 - Nearest-Neighbor Hashing

Advanced Programming I, Spring 2007


Due Date

This assignment is due by 5:30 p.m. on Friday, 20 April.

See the assignment turn-in page (last modified on 14 January 2006) for instructions on turning in your assignment.

Background

Given a query point p and a set of points P, the nearest-neighbor problem involves finding the subset of points from P that are nearest to the query point p.

Unlike general hash functions, which should be designed to minimize collisions, nearest-neighbor hash functions are designed to encourage collisions of a particular kind: points that are close to one another should collide and end up in the same bucket. Nearest-neighbor queries are answered by hashing the query point p and examining the points in the same bucket as p. Points that collide with the p are candidates for being p's nearest neighbors.

In general, nearest-neighbor hash functions suffer from two problems: points that are far apart collide and points that are near don't collide (general-purpose hash functions only suffer from the first problem). The solution to both these problems is to use multiple hash tables, each with its own hash function. Queries are answered by hashing the query point in each hash table and using the resulting collisions to determine the candidate neighbor set. For this assignment, if the point set contains d-dimensional points, then queries should use d hash tables.

The Problem

Write a program that answers nearest-neighbor queries using hashing.

Input

The input comes from std-in and has two parts. The first part defines the point set P to be queried and the second part contains nearest-neighbor queries. At least one blank line separates the first and second part.

The Point Set

The point set is input as a sequence of integers. The first integer read gives the dimension d of the points; d must be positive. Following d will be zero or more groups of d integers, each group defining a point in P.

Adjacent integers are separated by at least one space character. The total number of integers in the first part should be n*d + 1 where n is the number of points read.

If d is not positive or the last point read is incomplete (less than d coordinates read), the program should print an informative error message and stop without further processing.

Neighbor Queries

A neighbor query consists of d + 1 integers. The first integer read, k, should be non-negative and indicates the maximum number of nearest neighbors to return. The next d integers are the coordinates of the query point.

Each neighbor query fits on a single line, and a single line contains at most one neighbor query. If a neighbor query doesn't contain d + 1 integers, or if k is negative, the program should print an informative error message and continue reading neighbor queries.

Output

Unless there's an error, the program should not produce any output while processing the point set. A neighbor query should produce up to k points as output, where k was given by the query. Each point should be output in one line; one line should contain a single point.

Error messages should be proceeded by a "! " (that's a bang followed by a space) and be written to std-err.

Testing

You can use gen-input, available in the class assignment directory /export/home/class/cs-503/a6, to generate input for your program. When run, it writes an randomly generated point set std-out, followed by some randomly generated neighbor queries. You can use it with your program like so:
$ ./gen-input | ./find-neighbors
Because gen-input creates new input each time it's run, you might find it more convenient to save the input in a file to provide your program with consistent input:
$ ./gen-input > input

$ ./find-neighbors < input

References

Similarity Search in High Dimensions via Hashing by Aristides Gionis, Piotr Indyk, and Rajeev Motwani. In Proceedings of the25th International Conference on Very Large Data Bases, pages 518-529, Edinburgh, Scotland, 7-10 September 1999.

Low Latency Photon Mapping with Block Hashing by Vincent Ma and Michael McCool in Proceedings of the ACM SIGGRAPH-EUROGRAPHICS Conference on Graphics Hardware, Saarbrucken, Germany, 01–02 September 2002.


This page last modified on 21 April 2006.