CS 590D: Data Mining

TR 09:00-10:15

REC 121

Chris Clifton

Email: clifton_nospam@cs_nojunk.purdue.edu

Course Topics (jump to outline)

Data Mining has emerged at the confluence of machine learning, statistics, and databases as a technique for discovering summary knowledge in large datasets. This course introduces students to the process and main techniques of data mining, including association rule learning; classification approaches such as inductive inference of decision trees and neural network learning; clustering techniques; and research topics such as inductive logic programming / multi-relational data mining and time series mining.

The emphasis will be on algorithmic issues and data mining from a data management and machine learning viewpoint. It is anticipated that students interested in additional study of data mining will benefit from taking offerings in statistics such as Stat 598M or Stat 695A. The course is probably not appropriate for students who have taken ECE 632.

Administrivia

Please send questions to the course newsgroup purdue.class.cs590d. This should be used for most questions. If you have something you don't want made public, send it to clifton_nospam@cs_nojunk.purdue.edu. Critical announcements will be made via the course mailing list. We will be using WebCT Vista for recording and distributing grades.

For now, Professor Clifton will not have regular office hours. Feel free to drop by anytime, or send email with some suggested times to schedule an appointment. You can also try H.323/T.120 desktop videoconferencing (e.g., SunForum, Microsoft NetMeeting.) You can try opening an H.323 connection to blitz.cs.purdue.edu - send email if there is no response, and I'll start it up if I'm in.

Prerequisites

Undergraduate-level expertise in databases, algorithms, and statistics; Java programming experience. Students without this background should discuss their preparation with the instructor.

Students from outside Computer Science should send me email explaining why they feel they meet the prerequisites, or come talk to me. Once I've confirmed that you meet the prerequisites, I'll send email; you can then follow the information on non-CS students registering for CS courses to register.

Text

Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2006. ISBN 0-321-32136-7.

This will be supplemented with readings from the current research literature.

You might also find the following useful if you find the on-line documentation hard to follow (it is the companion book to WEKA, which will be used for course projects):
Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann Publishers, June 2005. 560 pages. ISBN 0-12-088407-0.
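
For orientation, here is a minimal sketch of driving WEKA from its Java API rather than its GUI. This is an illustrative example only: it assumes a recent WEKA 3 release on the classpath and a hypothetical ARFF file named weather.arff; the J48 decision tree and 10-fold cross-validation are just example choices, not course requirements.

  import weka.classifiers.Classifier;
  import weka.classifiers.Evaluation;
  import weka.classifiers.trees.J48;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class WekaSketch {
      public static void main(String[] args) throws Exception {
          // Load a dataset in WEKA's ARFF format (weather.arff is a hypothetical file name).
          Instances data = DataSource.read("weather.arff");
          // Tell WEKA which attribute is the class label; here, the last one.
          data.setClassIndex(data.numAttributes() - 1);

          // Build a C4.5-style decision tree (WEKA's J48 implementation) on the full dataset.
          Classifier tree = new J48();
          tree.buildClassifier(data);
          System.out.println(tree);

          // Estimate accuracy with 10-fold cross-validation and print a summary.
          Evaluation eval = new Evaluation(data);
          eval.crossValidateModel(new J48(), data, 10, new java.util.Random(1));
          System.out.println(eval.toSummaryString());
      }
  }

Everything shown here can also be done interactively through the WEKA Explorer GUI.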

Evaluation/Grading:

Evaluation will be a subjective process (see my grading standards); however, it will be based primarily on your understanding of the material as evidenced in the exams, projects and written work, paper presentations, and written reviews described below.

Exams will be open note / open book. To avoid a disparity between resources available to different students, electronic aids are not permitted.

Projects and written work will be evaluated on a ten point scale:

10: Exceptional work. So good that it makes up for substandard work elsewhere in the course. These will be rare, and for many homeworks/problems a perfect score will correspond to an 8.
8: This corresponds to an A grade.
6: This corresponds to a B grade.
4: This corresponds to a C grade.
2: Not really good enough, but something.
0: Missing work, or so bad that you needn't have bothered.

Late work will be penalized 1 point per day (24 hour period). This penalty will apply except in case of documented emergency (e.g., medical emergency), or by prior arrangement if doing the work in advance is impossible due to fault of the instructor (e.g., you are going to a conference and ask to start the project early, but I don't have it ready yet.)

Presentation of papers

Each student will be expected to read and present a paper from the research literature. You should view this as if you were presenting the paper at a conference - be prepared to answer detailed technical questions. However, you do not need to be an advocate for the paper - if you feel the work has problems, feel free to critique it. You are encouraged to meet with me before the presentation to go over your preparation/materials.

Presentations should be prepared for display on a projector. If you make them web-accessible or place them in your ITAP account, they will be accessible on the built-in machine. If you choose to use your own machine, the projector works best at XGA (1024x768) resolution.

Presentations will be scored with roughly equal weight on how well you demonstrate your knowledge of the paper - not just the details, but also its overall importance/contributions - and how well you communicate that knowledge to the class.

Written reviews

Each student will review two papers and write a report (as if reviewing a journal article). Read the following for suggestions on how to review a paper:

The review form is based on the IEEE Transactions on Knowledge and Data Engineering review form. The real IEEE form is an electronic submission - see here for an example of what it really looks like. I prefer that you email a text result (the "submit" button won't work); you can use the text-only version I have created.

Reviews are due at the beginning of the class when the reviewed paper is being presented. The hope is that if you review a paper, you will be ready to contribute to / enliven the discussion of the paper.

Reviews will be scored primarily on your demonstration of understanding of the material in the paper and its importance/impact on data mining. A secondary criterion will be the value of the review to an editor (in deciding if the paper is worthy of publication) and to the author (to improve it). Don't be afraid to criticize a paper - if you find a critical flaw in a published paper (and it really is a flaw), then you've demonstrated better understanding of the material than the reviewers who decided it should be published, and your review certainly would have been valuable to the editor.

Email submission of reviews is preferred (to clifton_nospam@cs_nojunk.purdue.edu), but hard copy is acceptable, if you prefer.

Policy on Intellectual Honesty

Please read the departmental academic integrity policy above. This will be followed unless I provide written documentation of exceptions. In particular, I encourage interaction: you should feel free to discuss the course with other students. However, unless otherwise noted work turned in should reflect your own efforts and knowledge.

For example, if you are discussing an assignment with another student, and you feel you know the material better than the other student, think of yourself as a teacher. Your goal is to make sure that after your discussion, the student is capable of doing similar assignments independently; their turned-in assignment should reflect this capability. If you need to work through details, try to work on a related, but different, problem.

If you feel you may have overstepped these bounds, or are not sure, please come talk to me, or note on what you turn in that it represents collaborative effort (the same holds for information obtained from other sources that you feel may cause what you turn in not to reflect your true ability). If I feel you have gone beyond acceptable limits, I will let you know, and if necessary we will find an alternative way of ensuring you know the material. Help you receive in such a borderline case, if cited and not part of a pattern of egregious behavior, is not in my opinion academic dishonesty, and will at most result in a requirement that you demonstrate your ability in some alternate manner.

Course Outline (numbers correspond to week):

Note: Material after the break is from Spring 2005, and is representative. You can expect it to be different.

  1. Introduction: What is data mining? What makes it a new and unique discipline? Relationship between Data Warehousing, On-line Analytical Processing, and Data Mining.
    Data mining tasks - Clustering, Classification, Rule learning, etc.
    Reading: Tan, Chapter 1.
    Intro Slides (PDF)
  2. Data mining process: Data preparation/cleansing, task identification. Slides (PDF)
    Reading: Tan Chapter 2.
    Assignment 1 (due 1/27).
    Introduction to WEKA: Slides (PDF).
    Reading: Tan Chapter 3.
    January 19: Guest Lecture, Prof. Sunil Prabhakar, Data Warehousing / Data Cubes.
  3. Association Rule mining Slides (PDF)
    Assignment 2 (due 2/2).
  4. Classification Slides (PDF) Project 1 (due 2/23)
  5. Classification Prediction: Regression, Neural Networks.
    Reading: Tan 5.4, Appendix C.
  6. Clustering Slides (PDF). Reading: Tan 8.1-8.3, 8.5. Assignment 3 (due 3/7)
  7. Anomaly Detection
    Reading: Tan Chapter 10
  8. March 9: Midterm. Slides (PDF).
    Assignment 4 (due 3/10)
    Midterm - March 10, 19:00-20:30, CS G066. Open book/notes. (Exam and Solutions)
  9. Drop date is 3/20.
    1. More on process - CRISP-DM. Slides (PDF)
      Reading: Process Model, Hard Hats for Data Miners: Myths and Pitfalls of Data Mining.
    2. Daniel Harris presents:
      Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava, Web Mining: Information and Pattern Discovery on the World Wide Web, In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), November 1997.
  10. Text Mining and use of Data Mining in Information Retrieval. Presentations:
    1. Chris Clifton presents:
      Chris Clifton, Robert Cooley and Jason Rennie, TopCat: Data Mining for Topic Identification in a Text Corpus, Transactions on Knowledge and Data Engineering 16(8), IEEE Computer Society Press, Los Alamitos, CA, August, 2004. (Slides.)
    2. Carolyn Kraft presents:
      Helena Ahonen, Oskari Heinonen, Mika Klemettinen, and A. Inkeri Verkamo, Mining in the phrasal frontier, Principles of Knowledge Discovery in Databases Conference, Trondheim, Norway, June 1997. Lecture Notes in Computer Science, Springer Verlag, 1997.
  11. Cost/Utility Based Data Mining. Presentations:
    1. Daniel Harris presents:
      P. D. Turney. Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of Artificial Intelligence Research, 2:369-409, 1995.
    2. Daniel Harris presents:
      Melville, P.; Saar-Tsechansky, M.; Provost, F.; Mooney, R.; An Expected Utility Approach to Active Feature-Value Acquisition, Fifth IEEE International Conference on Data Mining 27-30 Nov. 2005 Page(s):745 - 748
    3. Chris Clifton presents:
      J. Kleinberg, C. Papadimitriou, and P. Raghavan. A microeconomic view of data mining. Data Mining and Knowledge Discovery, 2:311-324, 1998.
  12. Data Mining for Intrusion Detection.
    Chris Clifton will introduce with Charles Elkan, KDD Cup '99
    1. Tom Schneider presents:
      Wenke Lee and Sal Stolfo, A Framework for Constructing Features and Models for Intrusion Detection Systems, ACM Transactions on Information and System Security 3(4) (November 2000).
    2. Kejun Mei presents:
      Daniel Barbará, Ningning Wu, Julia Couto, and Sushil Jajodia, ADAM: A Testbed for Exploring the use of Data Mining in Intrusion Detection, ACM SIGMOD Record 30(4) (December 2001) SPECIAL ISSUE: Special section on data mining for intrusion detection and threat analysis, pp. 15-24.
  13. Earth/Atmospheric Science:
    1. Chris Clifton presents:
      Chris Clifton, Change Detection in Overhead Imagery using Neural Networks, International Journal of Applied Intelligence 18(2), Kluwer Academic Publishers, Dordrecht, The Netherlands, March 2003.
    2. Carolyn Kraft presents:
      Paul E. Stolorz and Christopher Dean, Quakefinder: A Scalable Data Mining System for Detecting Earthquakes from Space, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, pp. 208-213.
  14. Earth/Atmospheric science: Preview of the Second NASA Data Mining Workshop: Issues and Applications in Earth Science, to be held May 23-24, 2006, Pasadena, CA.
    1. An Operational Pixel Classifier for the Multi-angle Imaging SpectroRadiometer (MISR) Using Support Vector Machines, Speaker: Michael Garay; Authors: Dominic Mazzoni, Michael Garay, and Roger Davies; Jet Propulsion Laboratory
      and
      Recent HARVIST Results: Classifying Crops from Remote Sensing Data, Speaker: Kiri Wagstaff; Authors: Kiri Wagstaff and Dominic Mazzoni; Jet Propulsion Laboratory
    2. Kejun Mei presents:
      Spatiotemporal Data Mining for Monitoring Ocean Objects, Speaker: Yang Cai; Authors: Yang Cai, Karl Fu, Daniel Chung, Richard Stumpf, Timothy Wynne, and Mitchell Tomlison; Carnegie Mellon University
    3. Clustering Spatio-Temporal Patterns using Levelwise Search Presenter: Raj Bhatnagar; Authors: Abhishek Sharma and Raj Bhatnagar; University of Cincinnati
    4. Predicting Forest Stand Height and Canopy Cover from LANDSAT and LIDAR Data Using Decision Trees, Presenter: Saso Dzeroski; Authors: Saso Dzeroski, Andrej Kobler, Valentin Gjorgjioski, and Pance Panov; Jozef Stefan Institute, Ljubljana, Slovenia
    5. Unraveling the Dominant Influences on the Evolution of Land-Surface Variables using Data Mining, Speaker: Praveen Kumar; Authors: Praveen Kumar, Peter Bajcsy, Amanda B. White, Vikas Mehra, David Tcheng, David Clutter, Wei-Wen Feng, Pratyush Sinha, and Richard Robertson; University of Illinois Urbana
    6. Sensory Stream Data Mining on Chip, Presenter: Yang Cai; Authors: Yang Cai and Yong X. Hu; Carnegie Mellon University
    7. A Hybrid Object-based/Pixel-based Classification Approach to Detect Geophysical Phenomena, Speaker: Rahul Ramachandran; Authors: Xiang Li, Rahul Ramachandran, Sara Graves, and Sunil Movva; University of Alabama in Huntsville
  15. To come: Privacy, Collaborative Filtering, Streams.

Prior Year Papers

Final Project due date: April 30, 2005 (official last day of classes). If you'd like to give a demo as part of your project report, we can schedule it during the last week of classes (if you are ready), or during finals week. The report/writeup is due on 4/30.

