Project 3: Clustering

Start date: 22 March. Due: 31 March, at the beginning of class.

The goal of this project is to choose and evaluate clustering algorithms. I suggest using the algorithms available in Weka, although you may implement your own if you wish.

Use datasets from the UCI Machine Learning Repository. Choose one from each column of the following table:

One from this column	One from this column
Arrhythmia	Abalone
Image Segmentation	Auto-MPG
Isolet	Housing
Nursery

If you wish to use other datasets in place of these, please give me a pointer to or description of the datasets and I'll let you know if that is okay (and which column it would count as).

What you need to do for this project is:

Choose two datasets.
Determine how you will measure the quality of the clusters produced.
Choose two algorithms to compare.
Set up and run a comparison experiment, obtaining the quality measures you determined above. You may find that some algorithms cannot be meaningfully applied to some datasets. If so, you can explain why in lieu of the experiment. However, saying "The data has continuous values; the algorithm only applies to nominal values" isn't good enough: you should instead discretize the continuous attributes. "Not applicable" is only valid if there is no reasonable way of preprocessing the data to make the algorithm apply. Each dataset and algorithm you choose must be used at least once.
Explain which algorithm you would use for what types of data and why.
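
As an illustration of the discretization step mentioned above, here is a minimal sketch of equal-width binning, which turns a continuous attribute into a nominal one. The function name, the bin labels, and the sample values are all hypothetical and not taken from any of the listed datasets; Weka's unsupervised Discretize filter is the more likely tool in practice, and other binning schemes (for example equal-frequency) may suit your data better.

```python
def equal_width_bins(values, n_bins=3):
    """Map each continuous value to a nominal label such as 'bin0'.

    Hypothetical helper: splits the range [min, max] into n_bins
    intervals of equal width.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: everything falls into a single bin.
        return ["bin0"] * len(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        labels.append(f"bin{idx}")
    return labels


# Made-up values for illustration only.
sample = [1.0, 2.5, 4.0, 7.5, 9.0, 10.0]
print(equal_width_bins(sample, n_bins=3))
```

The number of bins is itself a parameter choice, so it belongs in the discussion of how you prepared the data.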

Project Report

The project report should contain the following:

Description of how you measured the cluster quality (this will include a brief overview of the datasets).
Discussion of each of the four experiments (each of the two algorithms applied to each of the two datasets), consisting of:
1. How you prepared the data
2. Parameters chosen for the algorithm
3. Experimental result summary.
For each, you should include a brief discussion of why you made the decisions you did.
Conclusions: General discussion of the appropriate conditions for use of each algorithm. You may instead want to frame this as a discussion of the appropriate type of algorithm for a general category of data (probably a more difficult task, but also more interesting).
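
To make the idea of a cluster quality measure concrete, here is a rough sketch of two common candidates: purity (which assumes a nominal class attribute that was held out from clustering) and the within-cluster sum of squared errors (which assumes numeric attributes). These are examples only, not the required measures; the function names and the toy data are hypothetical, and you are expected to choose and justify measures that fit your own datasets.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, class_labels):
    """Fraction of instances whose class is the majority class of their cluster."""
    by_cluster = defaultdict(Counter)
    for c, y in zip(cluster_ids, class_labels):
        by_cluster[c][y] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


def sse(points, cluster_ids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    members = defaultdict(list)
    for p, c in zip(points, cluster_ids):
        members[c].append(p)
    total = 0.0
    for pts in members.values():
        # Centroid is the per-dimension mean of the cluster's points.
        centroid = [sum(dim) / len(pts) for dim in zip(*pts)]
        total += sum((x - m) ** 2 for p in pts for x, m in zip(p, centroid))
    return total


# Toy data for illustration only: four 2-D points in two clusters.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
clusters = [0, 0, 1, 1]
classes = ["a", "a", "b", "a"]
print(purity(clusters, classes))  # 0.75
print(sse(points, clusters))      # 4.0
```

Higher purity and lower SSE both suggest tighter clusters, but SSE tends to fall as the number of clusters grows, so comparisons are only meaningful at a fixed number of clusters.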

You should also include the output from your sample runs.

Scoring

Scoring will be based on:

Appropriateness and correctness of cluster quality measurement (2 points)
Experiment and discussion (1 point each)
Knowledge displayed in conclusions (2 points)

Particularly good discussions of any of the experiments may earn more than 1 point.

Turning in the project

Electronic submission is preferred. Please use the turnin command (on mentor.ics.purdue.edu, turnin -c cs490d -p proj3 directoryname). If that doesn't work, you can tar/zip your files and email them to clifton_nospam@cs_nojunk.purdue.edu . PDF is the safest format for capturing non-text content. Hard copy is also acceptable; please hand it in at the beginning of class.