CS 57300: Data Mining

Final Project

A significant portion of the course will be a self-directed final team project. The goal is to provide you with an opportunity to get hands-on experience in data mining, practice the techniques and algorithms learned in the course in real-life data mining scenarios, and, to the extent possible, work on an open research problem. Students should work on the final project in teams of 2-4 people. The number of participants in a project, and team organization, will be considered in the evaluation of the project.

Project Topics

Projects are relatively open - they must be centered around some type of data mining (computer-learned or -discovered models or patterns in the data), but the specific topic and data to be used are up to you. You may pick any topic, including something related to (or furthering) your research or that of team members. Within a topic, some of the things you could do are:

This is not an exhaustive list, but it is meant to give you some ideas.

Data Source

You will need to identify the dataset(s) to be used for your project. Most interesting would be original datasets, or ones not previously used in data mining efforts. If you use a well-known machine learning / data mining dataset, such as from the UCI Machine Learning Repository or a Kaggle competition, you are expected to do something different from the standard problems that the datasets are used for. For example, the Iris dataset is generally used for classifying species based on flower measurements; if you wish to use it, you would need to find a new problem (e.g., value to florists, although I don't know how you'd do that with just the Iris dataset). Another option could be putting multiple datasets together to address a problem (e.g., using the Electric Power Consumption dataset to augment the data in the Kaggle House Prices competition; again, I don't think this is feasible, but you should get the idea).
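As a rough illustration of what "putting multiple datasets together" can involve, here is a minimal sketch of a left join on a shared key. It assumes both datasets have a common identifier (here a hypothetical `region` column), which real datasets may well lack; the rows, the column names, and the `left_join` helper are all invented for illustration and are not taken from the datasets named above.

```python
def left_join(primary, secondary, key):
    """Attach fields from `secondary` to each row of `primary` that shares `key`."""
    # Index the secondary rows by key; assumes each key appears at most once there.
    lookup = {row[key]: row for row in secondary}
    joined = []
    for row in primary:
        extra = lookup.get(row[key], {})
        # Fields from the primary row win if both rows define the same name.
        joined.append({**extra, **row})
    return joined


# Hypothetical toy rows; the column names are invented for illustration.
houses = [
    {"region": "A", "sale_price": 210000},
    {"region": "B", "sale_price": 340000},
    {"region": "C", "sale_price": 185000},
]
power = [
    {"region": "A", "avg_kwh": 29.5},
    {"region": "B", "avg_kwh": 41.2},
]

combined = left_join(houses, power, "region")

# Rows with no match keep only their original fields; counting them early
# shows how much of the primary dataset the second one actually covers.
unmatched = [row["region"] for row in combined if "avg_kwh" not in row]
```

Checking the unmatched rows before any modeling is a quick way to judge whether a combination is feasible at all.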

Part 1: Proposal, due February 21

The first part is simply to form your team, identify a general problem you want to address, and (perhaps most challenging) where you will get the data. For this part, you will need to turn in a report containing:

  1. An overview of the problem you want to solve
  2. Why the problem is interesting - who cares about it? How is it being solved today? Why could the solution be better?
  3. Where you will get data. Include documentation or analysis of copyright law that shows that your proposed use of the data (for the course project) is allowed.
  4. Risk management: What will you do if you aren't able to make progress on the problem with the given data?
  5. Plan of activities and timeline

Part 2: Data Exploration / Formal Problem definition, due March 28

The second part is data exploration/analysis, and formal definition of the problem from a data mining perspective. This includes carrying out, and describing what you did for, each of the following:

  1. Literature survey: Identify and discuss at least three research papers that address the same or a similar problem. Briefly summarize what they have done and how what you are doing is different.
  2. Data loading/cleaning
  3. Results of initial data exploration / feature selection
  4. Discussion of the specific data mining task (classification, regression, clustering, pattern discovery, ...) - input to be used, output, etc.

Part 3: Final Report and Presentation, due April 25

The third part requires that you perform the data mining task you have determined, analyze the results, and discuss how they address the general problem. This involves:

  1. How you solved the task (algorithms used, or a description of new ones developed) and a discussion of outcomes
  2. Formally analyze outcomes - are they robust? How well can you expect them to generalize to the future or other domains (as appropriate for the problem)? How well do your methods perform?
  3. Do the insights you have gained address the original problem? Do you have reason to believe that this could improve the state-of-the-art for that problem?
  4. Discuss the contribution of each team member to the project. If the team does not come to agreement on this portion of the writeup, there will be an opportunity for you to submit an individual addendum separate from the team document.
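For item 2, one common (but by no means required) way to probe robustness is k-fold cross-validation: hold out each fold in turn, fit on the rest, and look at the spread of the scores. The sketch below is a minimal, standard-library-only illustration. The majority-class "model", the toy labels, and the function names are assumptions made for the example; your own task will need a real model and an evaluation measure suited to it.

```python
import random
from collections import Counter
from statistics import mean, stdev


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(labels, k=5, seed=0):
    """Per-fold accuracy of a majority-class baseline (a stand-in for a real model)."""
    folds = k_fold_indices(len(labels), k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [labels[i] for i in range(len(labels)) if i not in held]
        # "Training" here just means finding the most common training label.
        prediction = Counter(train).most_common(1)[0][0]
        correct = sum(labels[i] == prediction for i in held_out)
        scores.append(correct / len(held_out))
    return scores


# Hypothetical labels, made up for illustration.
labels = ["spam"] * 14 + ["ham"] * 6
scores = cross_validate(labels, k=5)

# Report the spread across folds, not just the average.
summary = (mean(scores), stdev(scores))
```

A large standard deviation across folds suggests the result depends heavily on which examples were held out. Comparing against a trivial baseline like this one also shows how much a real model adds.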

Each part should be submitted through Gradescope as a PDF. Part 1 is limited to 2 pages, Part 2 to 5 pages, and Part 3 to 10 pages. While not required, we encourage you to follow the Springer LNCS format; this will ensure you have sufficient space given the page limits.

For Part 3, you will also need to prepare and submit (as a PDF in Gradescope) a four-slide presentation that captures, in 7 minutes, the problem, key ideas, outcomes, and challenges. You will make a final presentation in class. While not strictly required, we encourage you to share the presentation responsibilities among all team members (this is not always easy, but it gives a great impression when you manage to do it well).

More details will be provided as the term progresses.
