CS590D/STAT598M Data Mining
Fall 2007: Project information
The semester project is a significant undertaking that will allow you to experience the entire process of data mining. You will choose a dataset and a task, apply one or more models/algorithms to the data, and evaluate the modeling results.
Choose an area (data, model, or algorithm) that is interesting to you, with a project scope that is likely to be doable in a semester. The only broad restriction on topic choice is that it must be a data mining application. It is not necessary to design/code your own data mining algorithm, but you can if you choose. If you choose to use existing software and modeling techniques, you will need to compare more than one model in your evaluation and explore reasons why one model performs best.
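As a rough illustration of what comparing more than one model can look like, here is a minimal sketch that scores two classifiers with k-fold cross-validation. Everything in it is an assumption made for the example: the synthetic two-class data, the two models (a majority-class baseline and 1-nearest-neighbor), and the helper names `make_data` and `cross_val_accuracy`. A real project would substitute its own data, models, and evaluation measure.

```python
import random
from collections import Counter


def make_data(n, seed=0):
    """Synthetic two-class data: two Gaussian blobs in 2-D (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 2.0
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, label))
    rng.shuffle(data)
    return data


def majority_classifier(train):
    """Baseline: always predict the most common training label."""
    most_common = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: most_common


def nearest_neighbor_classifier(train):
    """1-nearest-neighbor using squared Euclidean distance."""
    def predict(x):
        nearest = min(
            train,
            key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2,
        )
        return nearest[1]
    return predict


def cross_val_accuracy(fit, data, k=5):
    """Mean held-out accuracy over k folds (fold = index modulo k)."""
    scores = []
    for fold in range(k):
        test = data[fold::k]
        train = [p for i, p in enumerate(data) if i % k != fold]
        predict = fit(train)
        correct = sum(predict(x) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / k


data = make_data(200)
results = {
    "majority baseline": cross_val_accuracy(majority_classifier, data),
    "1-nearest neighbor": cross_val_accuracy(nearest_neighbor_classifier, data),
}
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.3f}")
```

Both models are scored on the same folds, so the difference in accuracy reflects the models rather than the split. Explaining why one model wins (here, the baseline ignores the features entirely) is the part the report should focus on.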
Here are some ideas for projects. They are starting points for inspiration and need to be fleshed out with more detail. If none of them interests you, feel free to propose your own topic.
- Analyze a dataset using existing algorithms. If you have some data that provides an interesting classification or clustering problem, modeling those data to discover patterns and/or knowledge might be a good topic. Here are some possible examples of datasets:
  - Bioinformatics data
  - Web data
  - Citation data
  - Email data
  - Marketing data
  - ...
- Extend a current model/algorithm to handle a novel data mining task. Here are some possible examples:
  - Classification of streaming data
  - Classification of structured data (e.g., sequential, relational)
  - Link analysis
  - Dynamic graph mining
  - Topic detection and tracking
  - Concept drift
  - ...
- A rigorous comparison of the performance of data mining algorithms on real-world data while varying the properties of the data (e.g., size), the task (e.g., classification vs. density estimation), and/or algorithm settings (e.g., number of clusters).
- Comparison of different learning techniques within a single modeling framework (e.g., decision trees). What relationship is there between the characteristics of the data and the accuracy/efficiency of the model?
- An analysis of feature construction and its impact on model performance.
- ...
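For the comparison-style ideas in the list above, one common experimental design is to hold the test set fixed and vary a single property of the data, such as training-set size. The sketch below is only a toy under stated assumptions: the 1-D Gaussian data, the nearest-centroid classifier, the chosen sizes, and the names `sample` and `fit_nearest_centroid` are all hypothetical and chosen for brevity.

```python
import random


def sample(n, rng):
    """Draw n labeled 1-D points: class 0 ~ N(0, 1), class 1 ~ N(2, 1) (toy data)."""
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]


def fit_nearest_centroid(train):
    """Return a predictor that assigns each point to the closer class mean."""
    means = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        means[label] = sum(values) / len(values)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


rng = random.Random(42)
test_set = sample(2000, rng)  # fixed test set shared by every run

learning_curve = {}
for n_train in (4, 16, 64, 256, 1024):
    accuracies = []
    for _ in range(20):  # repeat each size to average out sampling noise
        predict = fit_nearest_centroid(sample(n_train, rng))
        correct = sum(predict(x) == y for x, y in test_set)
        accuracies.append(correct / len(test_set))
    learning_curve[n_train] = sum(accuracies) / len(accuracies)

for n_train, accuracy in learning_curve.items():
    print(f"n_train={n_train:5d}  mean accuracy={accuracy:.3f}")
```

Repeating each setting several times and reporting the mean (ideally with its spread) is what makes such a comparison rigorous rather than anecdotal.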
Proposal: Due Sept 24
Before a project is undertaken, the key idea must be approved by the instructor. Send the instructor a few paragraphs by email describing (briefly):
- The project's goals and/or hypotheses,
- A description of the data that you will use,
- A list of the algorithms that you will develop or analyze.
Final report: Due Dec 7
The final report should be 6-8 pages long and include the following sections:
- Introduction
- Algorithm/data description
- Methodology
- Results
- Discussion
- Related Work