Project 1: Association Rules

Start date: 6 February. Due: 18 February, at the beginning of class.

Your task for this project is to identify and perform an association rule mining task. This involves:

Selecting an appropriate data set
Preparing and preprocessing the data
Finding rules, including appropriate parameter setting
Determining which of the resulting rules are interesting
Figuring out how the interesting rules could be useful
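To make the "finding rules, including appropriate parameter setting" step concrete, here is a minimal Python sketch of a level-wise (Apriori-style) search on a toy data set. It is not WEKA's implementation. The transactions, the function names, and the thresholds (minimum support 0.6, minimum confidence 0.9) are all assumptions chosen for illustration. It shows how the two parameters control which itemsets and rules survive.

```python
from itertools import combinations

# Hypothetical toy market-basket data, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: build size-k candidates only from frequent (k-1)-itemsets."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items
             if support([i], transactions) >= min_support]
    frequent = {}
    while level:
        for s in level:
            frequent[s] = support(s, transactions)
        size = len(level[0]) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Keep a candidate only if all of its (size-1)-subsets are frequent
        # and the candidate itself meets the support threshold.
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent
                        for sub in combinations(c, size - 1))
                 and support(c, transactions) >= min_support]
    return frequent


def rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules meeting min_confidence."""
    out = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = supp / frequent[lhs]  # confidence = supp(lhs + rhs) / supp(lhs)
                if conf >= min_confidence:
                    out.append((lhs, itemset - lhs, supp, conf))
    return out


found = frequent_itemsets(transactions, min_support=0.6)
strong = rules(found, min_confidence=0.9)
for lhs, rhs, supp, conf in strong:
    print(sorted(lhs), "=>", sorted(rhs), f"support={supp:.2f}", f"confidence={conf:.2f}")
```

On this toy data, lowering either threshold produces more rules. Judging which of them are interesting is still up to you.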

While you are on your own to select an appropriate data set, I will point you to one easy source: The UCI Machine Learning Repository. This contains many data sets, not all of which are appropriate for association rules, so you'll need to do some thinking. You are also welcome to identify data from other sources, especially those that you find personally of interest.

Project Report

The project report should contain the following:

Objectives: What is the domain, and what are the potential benefits to be derived from association rule mining? Keep this high level: the objective is not "find patterns" but what would improve because of the use of the patterns.
Data set description: What is in the data, and what preprocessing was done to make it amenable for association rule mining. Where choices were made (e.g., parameter settings for discretization, or decisions to ignore an attribute), describe your reasoning behind the choices.
Rule mining process: Parameter settings, choice of algorithm, and the time required. (If you choose to implement something other than the WEKA-provided Apriori, you can earn extra credit, but I don't expect it.)
Resulting rules: Summary (number of rules, general description), and a selection of those you would show to a client.
Recommendations: What should the client do because of the rules discovered?
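As an illustration of the kind of preprocessing choice the data set description should justify, the following Python sketch discretizes a numeric attribute into equal-width bins, so that each value becomes a nominal item usable in association rules. The attribute name, the sample values, and the choice of three bins are hypothetical. Equal-width binning is only one option, and the function name is an assumption, not part of WEKA.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to the index (0-based) of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    indices = []
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything falls in one bin
        else:
            # Clamp so that the maximum value lands in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
        indices.append(idx)
    return indices


# Hypothetical numeric attribute.
ages = [18, 22, 25, 31, 40, 47, 58, 63]
bins = equal_width_bins(ages, n_bins=3)

# Turn each (attribute, bin) pair into a nominal item such as "age=bin1of3".
items = [f"age=bin{b + 1}of3" for b in bins]
print(items)
```

The number of bins is exactly the sort of parameter setting whose reasoning belongs in the report: too few bins hide structure, while too many leave each item with too little support.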

Also turn in (likely as a separate plain-text file) a complete listing of the rules found, and instructions (preferably machine-readable/executable) for recreating your results. WEKA provides several ways to do this, from command-line scripts to Explorer - your call.

If you iterate over different attribute sets / parameter settings / etc., only turn in the rule list and scripts for your final iteration. You should include a description of the iterations, and why you needed to make changes from your initial choices, in the project report.

Scoring

Scoring will be based on:

Your reasoning behind choice of data set (1-2 points)
Preparation and preprocessing (1-2 points)
Rule generation (1 point)
Choice of interesting rules (1 point)
Evaluation / use of rules (1-2 points)
Overall quality of report, including readability/clarity (1-2 points)

Extra points will be given for making the problem more challenging, provided you do so appropriately: there is no extra credit for doing something the hard way when an easy way is available. Examples could include implementing an algorithm other than Apriori that you think will be faster than Apriori on your data, or accessing data directly from a database (JDBC) rather than from comma-separated value (CSV) or ARFF files. For full extra credit, such challenges should be documented in a way that enables the rest of the class to make use of what you've done, e.g., simple instructions for connecting directly to a campus-accessible database.
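As a rough sketch of what "accessing data directly from a database" can look like, the Python example below reads transactions from a table instead of a file. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in. It is not JDBC and does not involve WEKA, and the table name, column names, and rows are hypothetical. The point is only the shape of the task: query (basket, item) rows and group them into one item set per transaction.

```python
import sqlite3
from collections import defaultdict

# In-memory database standing in for a real campus-accessible server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (basket_id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO purchases (basket_id, item) VALUES (?, ?)",
    [
        (1, "bread"), (1, "milk"),
        (2, "bread"), (2, "diapers"), (2, "beer"),
        (3, "milk"), (3, "diapers"), (3, "beer"),
    ],
)

# Group the one-row-per-item table into one set of items per basket.
baskets = defaultdict(set)
query = "SELECT basket_id, item FROM purchases ORDER BY basket_id"
for basket_id, item in conn.execute(query):
    baskets[basket_id].add(item)
conn.close()

transactions = [baskets[k] for k in sorted(baskets)]
print(transactions)
```

Instructions written for classmates would additionally need the real connection details (driver, host, database name, credentials policy), which are specific to the database you choose.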

Turning in the project

Electronic submission is preferred. Please use the turnin command (on mentor.ics.purdue.edu, turnin -c cs490d -p proj1 directoryname). If that doesn't work, you can tar/zip your files and email them to clifton_nospam@cs_nojunk.purdue.edu . PDF is the safest format for capturing non-text content. Hard copy is also acceptable; please hand it in at the beginning of class.