CS 57300: Data Mining

Final Project

A significant portion of the course will be a self-directed final team project. The goal is to give you hands-on experience in data mining, let you practice the techniques and algorithms learned in the course in real-life data mining scenarios, and, to the extent possible, work on an open research problem. Students should work on the final project in teams of 2-4 people. The number of participants in a project, and team organization, will be considered in the evaluation of the project.

Project Topics

Projects are relatively open: they must be centered around some type of data mining (computer-learned or -discovered models or patterns in the data), but the specific topic and data to be used are up to you. You may pick any topic, including something related to (or furthering) your research or that of your team members. Within a topic, some of the things you could do are:

Application of Existing Data Mining Algorithms. You could choose an interesting dataset (or multiple datasets) and apply at least three different data mining algorithms (not limited to the ones that are covered in this course). Make thorough comparisons among different algorithms in terms of their formulations and assumptions, parameter tuning procedures, and the performance. Reflect on the new domain knowledge that you obtain after applying these data mining algorithms (e.g., do you get any insight about the data?).
Develop New Data Mining Algorithms for Specific Problems. Identify an unconventional domain where data mining has not been widely applied yet (but can be), or a challenging scenario where existing data mining algorithms fall short. Design data mining algorithms for the specific domain/scenario that you have identified. If you work on applying data mining algorithms to an unconventional domain to solve a specific problem, show how the performance of your algorithms compares with the state-of-the-art methods for solving that problem (which make limited use of data mining techniques). If you work on designing data mining algorithms for a particularly challenging scenario, demonstrate how the performance of your algorithms compares with existing data mining algorithms on that scenario. Performance can be multi-dimensional, including accuracy, computational efficiency, understandability, etc.
Address a problem that requires multiple types of data mining. Identify a problem where a solution requires a sequence of applications of different data mining tasks. In many ways this is similar to developing a new data mining algorithm, but it can be addressed through a novel application of existing algorithms.

This is not an exhaustive list; it is meant to give you some ideas.
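For the first option above (applying and comparing several algorithms on one dataset), the following is a minimal sketch of what a like-for-like comparison could look like, using k-fold cross-validation so that every algorithm is scored on the same held-out folds. Everything in it is an assumption made for illustration: the two-cluster toy data is synthetic, the function names are hypothetical, and the three classifiers (majority baseline, 1-nearest-neighbor, nearest centroid) were chosen only because they fit in a few lines of standard-library Python. A real project would use a real dataset and more careful tuning.

```python
import random
from collections import Counter
from statistics import mean


def make_toy_data(n_per_class=60, seed=0):
    """Two synthetic Gaussian blobs in 2-D; a stand-in for a real dataset."""
    rng = random.Random(seed)
    data = []
    for label, (cx, cy) in enumerate([(0.0, 0.0), (3.0, 3.0)]):
        for _ in range(n_per_class):
            data.append(((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label))
    rng.shuffle(data)
    return data


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def majority_fit(train):
    """Baseline: always predict the most frequent training label."""
    most_common = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: most_common


def one_nn_fit(train):
    """1-nearest-neighbor: predict the label of the closest training point."""
    return lambda x: min(train, key=lambda row: sq_dist(row[0], x))[1]


def centroid_fit(train):
    """Nearest centroid: predict the class whose mean point is closest."""
    by_label = {}
    for x, y in train:
        by_label.setdefault(y, []).append(x)
    centroids = {
        y: tuple(mean(col) for col in zip(*xs)) for y, xs in by_label.items()
    }
    return lambda x: min(centroids, key=lambda y: sq_dist(centroids[y], x))


def cross_val_accuracy(fit, data, k=5):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        predict = fit(train)
        correct = sum(1 for x, y in test if predict(x) == y)
        scores.append(correct / len(test))
    return mean(scores)


data = make_toy_data()
algorithms = [
    ("majority", majority_fit),
    ("one_nn", one_nn_fit),
    ("nearest_centroid", centroid_fit),
]
results = {name: cross_val_accuracy(fit, data) for name, fit in algorithms}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Including a trivial baseline such as the majority classifier is a deliberate choice here: it gives the comparison a floor, which makes it easier to judge whether the other algorithms have learned anything from the data.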

Data Source

You will need to identify the dataset(s) to be used for your project. Most interesting would be original datasets, or ones not previously used in data mining efforts. If you use a well-known machine learning / data mining dataset, such as one from the UCI Machine Learning Repository or a Kaggle competition, you are expected to do something different from the standard problems that the dataset is used for. For example, the Iris dataset is generally used for classifying species based on flower measurements; if you wish to use it, you would need to find a new problem (e.g., value to florists, although I don't know how you'd do that with just the Iris dataset). Another option is to combine multiple datasets to address a problem (e.g., using the Electric Power Consumption dataset to augment the data in the Kaggle House Prices competition; again, I don't think this is feasible, but you should get the idea).

Part 1: Proposal, due February 21

The first part is simply to form your team, identify a general problem you want to address, and (perhaps most challenging) determine where you will get the data. For this part, you will need to turn in a report containing:

An overview of the problem you want to solve
Why the problem is interesting - who cares about it? How is it being solved today? Why could the solution be better?
Where you will get data. Include documentation or analysis of copyright law that shows that your proposed use of the data (for the course project) is allowed.
Risk management: What will you do if you aren't able to make progress on the problem with the given data?
Plan of activities and timeline

Part 2: Data Exploration / Formal Problem definition, due March 28

The second part is data exploration/analysis and a formal definition of the problem from a data mining perspective. This includes performing, and describing what you have done for, each of the following:

Literature survey: Identify and discuss at least three research papers that address the same or a similar problem. Briefly summarize what they have done and how what you are doing is different.
Data loading/cleaning
Results of initial data exploration / feature selection
Discussion of the specific data mining task (classification, regression, clustering, pattern discovery, ...) - input to be used, output, etc.
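As an illustration of the data loading/cleaning and initial exploration items above, here is a minimal sketch using only the Python standard library. The CSV content, column names, and helper functions are all made up for this example, and the cleaning choices (dropping duplicate ids, dropping rows with unparseable numeric fields, normalizing label case) are assumptions, not requirements. Whatever choices you make, the report should state them and how many records they affected.

```python
import csv
import io
from statistics import mean, median

# Made-up CSV content standing in for a real file; columns are hypothetical.
RAW = """id,age,income,label
1,34,52000,yes
2,,48000,no
3,29,not available,yes
4,41,61000,no
4,41,61000,no
5,52,75000,YES
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def to_float(value):
    """Return value as a float, or None if it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def clean(rows, numeric_cols=("age", "income")):
    """Drop duplicate ids and rows with bad numeric fields; lowercase labels.

    Returns the cleaned rows and the number of rows dropped.
    """
    seen, cleaned, dropped = set(), [], 0
    for row in rows:
        if row["id"] in seen:
            dropped += 1
            continue
        seen.add(row["id"])
        parsed = {col: to_float(row[col]) for col in numeric_cols}
        if any(v is None for v in parsed.values()):
            dropped += 1
            continue
        label = row["label"].strip().lower()
        cleaned.append({"id": row["id"], **parsed, "label": label})
    return cleaned, dropped


def summarize(rows, col):
    """Basic descriptive statistics for one numeric column."""
    values = [r[col] for r in rows]
    return {
        "n": len(values),
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
    }


rows = load_rows(RAW)
cleaned, dropped = clean(rows)
print(f"kept {len(cleaned)} rows, dropped {dropped}")
for col in ("age", "income"):
    print(col, summarize(cleaned, col))
```

Dropping incomplete rows is the simplest option but can bias the data if values are not missing at random; imputation is a common alternative and would be a reasonable thing to discuss in this part of the report.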

Part 3: Final Report and Presentation, due April 25

The third part requires that you perform the data mining task you have determined, analyze the results, and discuss how they address the general problem. This involves:

How you have solved the task (algorithms used, or describe new ones developed) and discussion of outcomes
Formally analyze outcomes - are they robust? How well can you expect them to generalize to the future or other domains (as appropriate for the problem)? How well do your methods perform?
Do the insights you have gained address the original problem? Do you have reason to believe that this could improve the state-of-the-art for that problem?
Discuss the contribution of each team member to the project. If the team does not come to agreement on this portion of the writeup, there will be an opportunity for you to submit an individual addendum separate from the team document.

Each part should be submitted through Gradescope as a PDF. Part 1 is limited to 2 pages, Part 2 to 5 pages, and Part 3 to 10 pages. While not required, we encourage you to follow the Springer LNCS format; this will ensure you have sufficient space given the page limits.

For Part 3, you will also need to prepare and submit (as a PDF in Gradescope) a four-slide presentation that captures, in 7 minutes, the problem, key ideas, outcomes, and challenges. You will make a final presentation in class. While not strictly required, we encourage you to share the presentation responsibilities among all team members (this is not always easy, but it gives a great impression when you manage to do it well).

More details will be provided as the term progresses.