Assignment 1: Data Set Selection/Preparation

Start date 17 January; due 24 January at the beginning of class.

Your task for this assignment is to identify and characterize a data set. It would be best if you have some domain experience, as this will help with data preparation. Answer the following questions about the data:

What the data is about.
What type of benefit you might hope to get from data mining.
What type of data mining (classification, clustering, etc.) you think would be relevant.
For each, illustrate with an example, e.g., if you think clustering is relevant, describe what you think a likely cluster might contain and what the real-world meaning would be.
Name one type of data mining that you think would not be relevant, and describe briefly why not.
Discuss data quality issues. For each attribute:
1. Are there problems with the data?
2. What might be an appropriate response to the quality issues?
For at least two attributes, discuss data preprocessing, and give an example showing how it would be done and what the outcome would be on a small subset of the data.
1. What would an appropriate smoothing or generalization technique be?
2. What is an appropriate normalization or data reduction technique?
The goal of this question is to do something equivalent to Han questions 3.3 and 3.5, but on a data set of your own choice. Keep in mind that you should show something quantitative, but also try to keep it easy to grade.
You should be able to figure out why I gave you a choice of smoothing or generalization for the first, and normalization or data reduction for the second...
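As a rough illustration of the kind of small, quantitative example the preprocessing question asks for, the sketch below applies smoothing by bin means (equal-depth bins) and two common normalizations (min-max and z-score) to a handful of values. The `ages` list and all function names are hypothetical, made up for this illustration; they are not taken from the textbook or from any particular data set, and other techniques may suit your data better.

```python
from statistics import mean, pstdev


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-depth bins of `bin_size`,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale the values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


def z_score_normalize(values):
    """Center on the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical sample of one numeric attribute (illustration only).
ages = [4, 8, 15, 21, 21, 24, 25, 28, 34]

print("bin means:", smooth_by_bin_means(ages, 3))
print("min-max:  ", [round(v, 3) for v in min_max_normalize(ages)])
print("z-score:  ", [round(v, 3) for v in z_score_normalize(ages)])
```

With bins of size 3, the sorted sample splits into (4, 8, 15), (21, 21, 24), and (25, 28, 34), so the smoothed values are the bin means 9, 22, and 29. Showing a before-and-after on a few rows like this keeps the answer quantitative while staying easy to grade.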

Turning in assignment

Electronic submission is preferred. Please email to clifton_nospam@cs_nojunk.purdue.edu or (preferably) use the turnin command (on mentor.ics.purdue.edu, turnin -c cs490d -p asn1 filename). PDF is the safest format for capturing non-text content. Hard copy is acceptable; please hand it in at the beginning of class.