
Course Topics

This course will be an introduction to data mining. Topics will range from statistics to machine learning to databases, with a focus on analysis of large data sets. Expect at least one project involving real data to which you will be the first to apply data mining techniques.

The course will be based on Introduction to Data Mining developed under National Science Foundation funding at the Illinois Institute of Technology. See their web site to get a better idea of what the course will be like.

Administrivia

Please send questions to the course newsgroup purdue.class.cs490d. This should be used for most questions. If you have something you don't want made public, send it to clifton_nospam@cs_nojunk.purdue.edu . We will also be using WebCT for recording and distributing grades.

Course Methodology

The course will be taught through lectures, with class participation expected and encouraged. There will be frequent reading assignments to supplement the lectures. The workload will include both written assignments and programming projects. Projects will be primarily individual, and self-contained. (This won't be a compilers/OS style project course.)

Prerequisites

CS348 or CS448 (concurrent registration in 448 okay), or permission of instructor.

It appears that the Computer Technologies program is accepting this as a database management selective. CPT 372 should provide reasonable database programming background. Some Java programming experience/ability helps, but you can probably learn that on your own if you have a reasonable programming background. The other thing is a solid mathematical background. Some statistical background will help (e.g., STAT 225 is sufficient), but a good mathematical background is likely to be enough even without that course. I can assume any CS students have had discrete math, which gives the appropriate level of mathematical maturity. If you haven't had something beyond calculus, you might want to discuss it with me. See information on non-CS students registering for CS courses.

Text

Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor. Morgan Kaufmann Publishers, August 2000. 550 pages. ISBN 1-55860-489-8.

You might also find the following useful (it is the companion book to WEKA, which will be used for course projects):
Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, October 1999. 371 pages. ISBN 1-55860-552-5.

Evaluation/Grading:

The exact mix of projects, written homeworks, presentations, etc. is yet to be determined. However, at this point I expect there will be several programming projects. During weeks when you are not working on a project, there will be analytical written homework problems, with a mix of mathematical work (e.g., complexity of an algorithm) and case studies (e.g., discuss different approaches to a real-world data analysis problem).

Evaluation will be a subjective process (see my grading standards); however, it will be based primarily on your understanding of the material as evidenced in:

Midterm Exam (20%)
Final Exam (27%)
Written assignments, projects, paper reviews (25%)
Final Project (25%)
Evaluation by the instructor, based on in-class contributions, discussions, and overall performance (3%)

Exams will be open note / open book. To avoid a disparity between resources available to different students, electronic aids are not permitted.

Projects and written work will be evaluated on a ten point scale:

10: Exceptional work. So good that it makes up for substandard work elsewhere in the course. These will be rare, and for many homeworks/problems a perfect score will correspond to an 8.
8: This corresponds to an A grade.
6: This corresponds to a B grade.
4: This corresponds to a C grade.
2: Not really good enough, but something.
0: Missing work, or so bad that you needn't have bothered.

Late work will be penalized 1 point per day (24 hour period). This penalty will apply except in case of documented emergency (e.g., medical emergency), or by prior arrangement if doing the work in advance is impossible due to fault of the instructor (e.g., you are going to a conference and ask to start the project early, but I don't have it ready yet.)
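
As a rough illustration of the late penalty on the ten point scale, here is a minimal Python sketch. The function name `late_adjusted_score` is hypothetical. Two details are assumptions, not stated policy: that any started 24 hour period counts as a full day, and that a score cannot drop below 0. Documented emergencies and prior arrangements are handled by the instructor and are not modeled here.

```python
import math


def late_adjusted_score(raw_score: float, hours_late: float) -> float:
    """Apply a 1-point-per-24-hour-period late penalty to a ten point score."""
    if hours_late <= 0:
        return raw_score  # on time: no penalty
    # Assumption: a partially elapsed 24 hour period counts as a full day.
    days_late = math.ceil(hours_late / 24)
    # Assumption: the penalized score is floored at 0.
    return max(0.0, raw_score - days_late)
```

Under these assumptions, work that would have earned an 8 and is turned in 30 hours late is treated as two days late and scores a 6.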

It is likely there will be a final project based on data describing research and education opportunities and needs in the state in the transportation, distribution, and logistics sector. This will be open-ended: you will be given data, and asked to use the knowledge gained in the class to learn something interesting. The final result will be a set of presentation slides describing what you found and how you found it. Particularly good projects are likely to get some exposure (with your name attached, of course) to the Central Indiana Corporate Partnership, a group of top executives of the state's leading companies.

One last reason to take the course can be found at the kdnuggets web site.

Policy on Intellectual Honesty

Please read the departmental academic integrity policy above. This will be followed unless I provide written documentation of exceptions. In particular, I encourage interaction: you should feel free to discuss the course with other students. However, unless otherwise noted, work turned in should reflect your own efforts and knowledge.

For example, if you are discussing an assignment with another student, and you feel you know the material better than the other student, think of yourself as a teacher. Your goal is to make sure that after your discussion, the student is capable of doing similar assignments independently; their turned-in assignment should reflect this capability. If you need to work through details, try to work on a related, but different, problem.

If you feel you may have overstepped these bounds, or are not sure, please come talk to me or note on what you turn in that it represents collaborative effort (the same holds for information obtained from other sources that you feel may cause what you turn in to not reflect your true ability.) If I feel you have gone beyond acceptable limits, I will let you know, and if necessary we will find an alternative way of ensuring you know the material. Help you receive in such a borderline case, if cited and not part of a pattern of egregious behavior, is not in my opinion academic dishonesty, and will at most result in a requirement that you demonstrate your ability in some alternate manner.

Course Outline (numbers correspond to week):

Introduction: What is data mining? What makes it a new and unique discipline? Relationship between Data Warehousing, On-line Analytical Processing, and Data Mining.
Reading: Han Chapter 1 through 1.3.
Overview: Data mining tasks - Clustering, Classification, Rule learning, etc.
Reading: Han, rest of Chapter 1. Intro Slides (PDF)
Assignment 1 (due 1/23).
Data Warehousing. Slides (PDF)
Reading: skim Chapter 2.
Data mining process: Data preparation/cleansing, task identification. Slides (PDF)
Reading: Han Chapter 3, skim Chapter 2.
Assignment 2 (due 2/4), Solutions.
Association Rule mining Slides (PDF)
Reading: Chapter 6 through 6.2.3.
Introduction to WEKA: Slides (PDF).
Project 1 (due 2/18)
Association rules - different algorithm types
Reading: 6.2.4, 6.3-6.7.
Classification/Prediction Slides (PDF)
Reading: 7.1, 7.2, 7.4
Classification - tree-based approaches, Neural Networks, etc.
Reading: 7.3, 7.5-7.10.
Project 2 (due 3/5)
Clustering basics Slides (PDF). Reading: 8.1-8.3
- Clustering - statistical approaches. Reading: 8.4-8.6
- Clustering - Neural-net and other approaches. Reading: 8.7-8.8
Midterm Review Slides (PDF) Midterm covers through classification (not clustering).
Midterm - March 10, in class, open book/notes (Exam and Solutions).
Project 3 (due 3/31)
More on process - CRISP-DM. Slides (PDF)
Reading: CRISP-DM Process Model. Han 10.2, Skim Appendix A and Chapter 5.
Preparation for final project (self-directed analysis of logistics/transportation industry data - a chance to get involved in a new research center.)
March 29: Guest Lecture from Final Project customer.
Text Mining Slides (PDF)
Time Series Mining Slides (PDF)
Mining Data Streams
Multi-Relational Data Mining (PDF). Reading: Saso Dzeroski, Multi-Relational Data Mining: An Introduction, SIGKDD Explorations 5(1), pp. 1-16, July 2003.
Data Mining for Fraud Detection (PDF)
ILP / Decision Rules (PDF). Reading: Tertius, Ripper
Review (PDF)
Project discussion. Presentations (PDF) and Project Reports for which permission to distribute has been given.
Monday
1. Travis Cole
2. Muhammad Nasir
3. Rebecca Holding
Wednesday
1. Victor Leal
2. Michael Hilligoss
3. Adam Welborn
Friday
1. Ravan Carter
2. William Read
3. Ryan Nicoletti
Final Project due last day of class.

Final exam, Monday, May 3, 15:20-17:20, REC 103.

CS 490D: Introduction to Data Mining

MWF 11:30-12:20

REC 103

Chris Clifton