CS 590D: Data Mining

TR 09:00-10:15

REC 121

Chris Clifton

Email: clifton_nospam@cs_nojunk.purdue.edu

Course Topics (jump to outline)

Data Mining has emerged at the confluence of machine learning, statistics, and databases as a technique for discovering summary knowledge in large datasets. This course introduces students to the process and main techniques of data mining, including association rule learning; classification approaches such as inductive inference of decision trees and neural network learning; clustering techniques; and research topics such as inductive logic programming / multi-relational data mining and time series mining.

The emphasis will be on algorithmic issues and data mining from a data management and machine learning viewpoint. It is anticipated that students interested in additional study of data mining will benefit from taking offerings in statistics such as Stat 598M or Stat 695A. The course is probably not appropriate for students who have taken ECE 632.

Administrivia

Please send questions to the course newsgroup purdue.class.cs590d. This should be used for most questions. If you have something you don't want made public, send it to clifton_nospam@cs_nojunk.purdue.edu. Critical announcements will be made via the course mailing list. We will be using WebCT Vista for recording and distributing grades.

For now, Professor Clifton will not have regular office hours. Feel free to drop by anytime, or send email with some suggested times to schedule an appointment. You can also try H.323/T.120 desktop videoconferencing (e.g., SunForum, Microsoft NetMeeting.) You can try opening an H.323 connection to blitz.cs.purdue.edu - send email if there is no response, and I'll start it up if I'm in.

Prerequisites

Undergraduate-level expertise in databases, algorithms, and statistics; Java programming experience. Students without this background should discuss their preparation with the instructor.

Students from outside Computer Science should send me email explaining why they feel they meet the prerequisites, or come talk to me. Once I've confirmed that you meet the prerequisites, I'll send email; you can then follow the information on non-CS students registering for CS courses to register.

Text

Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2006. ISBN 0-321-32136-7.

This will be supplemented with readings from the current research literature.

You might also find the following useful if you find the on-line documentation hard to follow (it is the companion book to WEKA, which will be used for course projects):
Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann Publishers, June 2005. 560 pages. ISBN 0-12-088407-0.
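
For orientation, here is a minimal sketch of driving WEKA from its Java API rather than its GUI. This is an illustrative example only: it assumes a recent WEKA 3 release on the classpath and a hypothetical ARFF file named weather.arff; the J48 decision tree and 10-fold cross-validation are just example choices, not course requirements.

  import weka.classifiers.Classifier;
  import weka.classifiers.Evaluation;
  import weka.classifiers.trees.J48;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class WekaSketch {
      public static void main(String[] args) throws Exception {
          // Load a dataset in WEKA's ARFF format (weather.arff is a hypothetical file name).
          Instances data = DataSource.read("weather.arff");
          // Tell WEKA which attribute is the class label; here, the last one.
          data.setClassIndex(data.numAttributes() - 1);

          // Build a C4.5-style decision tree (WEKA's J48 implementation) on the full dataset.
          Classifier tree = new J48();
          tree.buildClassifier(data);
          System.out.println(tree);

          // Estimate accuracy with 10-fold cross-validation and print a summary.
          Evaluation eval = new Evaluation(data);
          eval.crossValidateModel(new J48(), data, 10, new java.util.Random(1));
          System.out.println(eval.toSummaryString());
      }
  }

Everything shown here can also be done interactively through the WEKA Explorer GUI.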

Evaluation/Grading:

Evaluation will be a subjective process (see my grading standards); however, it will be based primarily on your understanding of the material as evidenced in the exams, projects and written work, paper presentations, and written reviews described below.

Exams will be open note / open book. To avoid a disparity between resources available to different students, electronic aids are not permitted.

Projects and written work will be evaluated on a ten point scale:

10: Exceptional work. So good that it makes up for substandard work elsewhere in the course. These will be rare, and for many homeworks/problems a perfect score will correspond to an 8.
8: This corresponds to an A grade.
6: This corresponds to a B grade.
4: This corresponds to a C grade.
2: Not really good enough, but something.
0: Missing work, or so bad that you needn't have bothered.

Late work will be penalized 1 point per day (24 hour period). This penalty will apply except in case of documented emergency (e.g., medical emergency), or by prior arrangement if doing the work in advance is impossible due to fault of the instructor (e.g., you are going to a conference and ask to start the project early, but I don't have it ready yet.)

Presentation of papers

Each student will be expected to read and present a paper from the research literature. You should view this as if you were presenting the paper at a conference - be prepared to answer detailed technical questions. However, you do not need to be an advocate for the paper - if you feel the work has problems, feel free to critique it. You are encouraged to meet with me before the presentation to go over your preparation/materials.

Presentations should be prepared for display on a projector. If you make them web-accessible or place them in your ITAP account, they will be accessible on the built-in machine. If you choose to use your own machine, the projector works best at XGA (1024x768) resolution.

Presentations will be scored with roughly equal weight on how well you demonstrate your knowledge of the paper - not just the details, but also its overall importance/contributions - and how well you communicate that knowledge to the class.

Written reviews

Each student will review two papers and write a report (as if reviewing a journal article). Read the following for suggestions on how to review a paper:

The review form is based on the IEEE Transactions on Knowledge and Data Engineering review form. The real IEEE form is an electronic submission - see here for an example of what it really looks like. I prefer that you email a text result (the "submit" button won't work); you can use the text-only version I have created.

Reviews are due at the beginning of the class when the reviewed paper is being presented. The hope is that if you review a paper, you will be ready to contribute to / enliven the discussion of the paper.

Reviews will be scored primarily on your demonstration of understanding of the material in the paper and its importance/impact on data mining. A secondary criterion will be the value of the review to an editor (in deciding if the paper is worthy of publication) and to the author (to improve it). Don't be afraid to criticize a paper - if you find a critical flaw in a published paper (and it really is a flaw), then you've demonstrated better understanding of the material than the reviewers who decided it should be published, and your review certainly would have been valuable to the editor.

Email submission of reviews is preferred (to clifton_nospam@cs_nojunk.purdue.edu), but hard copy is acceptable, if you prefer.

Policy on Intellectual Honesty

Please read the departmental academic integrity policy above. This will be followed unless I provide written documentation of exceptions. In particular, I encourage interaction: you should feel free to discuss the course with other students. However, unless otherwise noted work turned in should reflect your own efforts and knowledge.

For example, if you are discussing an assignment with another student, and you feel you know the material better than the other student, think of yourself as a teacher. Your goal is to make sure that after your discussion, the student is capable of doing similar assignments independently; their turned-in assignment should reflect this capability. If you need to work through details, try to work on a related, but different, problem.

If you feel you may have overstepped these bounds, or are not sure, please come talk to me, or note on what you turn in that it represents collaborative effort (the same holds for information obtained from other sources that you feel may cause what you turn in not to reflect your true ability). If I feel you have gone beyond acceptable limits, I will let you know, and if necessary we will find an alternative way of ensuring you know the material. Help you receive in such a borderline case, if cited and not part of a pattern of egregious behavior, is not in my opinion academic dishonesty, and will at most result in a requirement that you demonstrate your ability in some alternate manner.

Course Outline (numbers correspond to week):

Note: Material after the break is from Spring 2005, and is representative. You can expect it to be different.

  1. Introduction: What is data mining? What makes it a new and unique discipline? Relationship between Data Warehousing, On-line Analytical Processing, and Data Mining.
    Data mining tasks - Clustering, Classification, Rule learning, etc.
    Reading: Tan, Chapter 1.
    Intro Slides (PDF)
  2. Data mining process: Data preparation/cleansing, task identification. Slides (PDF)
    Reading: Tan Chapter 2.
    Assignment 1 (due 1/27).
    Introduction to WEKA: Slides (PDF).
    Reading: Tan Chapter 3.
    January 19: Guest Lecture, Prof. Sunil Prabhakar, Data Warehousing / Data Cubes.
  3. Association Rule mining Slides (PDF)
    Assignment 2 (due 2/2).
  4. Classification Slides (PDF) Project 1 (due 2/23)
  5. Classification Prediction: Regression, Neural Networks.
    Reading: Tan 5.4, Appendix C.
  6. Clustering Slides (PDF). Reading: Tan 8.1-8.3, 8.5. Assignment 3 (due 3/7)
  7. Anomaly Detection
    Reading: Tan Chapter 10
  8. March 9: Midterm. Slides (PDF).
    Assignment 4 (due 3/10)
    Midterm - March 10, 19:00-20:30, CS G066. Open book/notes. (Exam and Solutions)
  9. Drop date is 3/20.
    1. More on process - CRISP-DM. Slides (PDF)
      Reading: Process Model, Hard Hats for Data Miners: Myths and Pitfalls of Data Mining.
    2. Daniel Harris presents:
      Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava, Web Mining: Information and Pattern Discovery on the World Wide Web, In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), November 1997.
  10. Text Mining and use of Data Mining in Information Retrieval. Presentations:
    1. Chris Clifton presents:
      Chris Clifton, Robert Cooley and Jason Rennie, TopCat: Data Mining for Topic Identification in a Text Corpus, Transactions on Knowledge and Data Engineering 16(8), IEEE Computer Society Press, Los Alamitos, CA, August, 2004. (Slides.)
    2. Carolyn Kraft presents:
      Helena Ahonen, Oskari Heinonen, Mika Klemettinen, and A. Inkeri Verkamo, Mining in the phrasal frontier, Principles of Knowledge Discovery in Databases Conference, Trondheim, Norway, June 1997. Lecture Notes in Computer Science, Springer Verlag, 1997.
  11. Cost/Utility Based Data Mining. Presentations:
    1. Daniel Harris presents:
      P. D. Turney. Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of Artificial Intelligence Research, 2:369-409, 1995.
    2. Daniel Harris presents:
      Melville, P.; Saar-Tsechansky, M.; Provost, F.; Mooney, R.; An Expected Utility Approach to Active Feature-Value Acquisition, Fifth IEEE International Conference on Data Mining 27-30 Nov. 2005 Page(s):745 - 748
    3. Chris Clifton presents:
      J. Kleinberg, C. Papadimitriou, and P. Raghavan. A microeconomic view of data mining. Data Mining and Knowledge Discovery, 2:311-324, 1998.
  12. Data Mining for Intrusion Detection.
    Chris Clifton will introduce with Charles Elkan, KDD Cup '99
    1. Tom Schneider presents:
      Wenke Lee and Sal Stolfo, A Framework for Constructing Features and Models for Intrusion Detection Systems, ACM Transactions on Information and System Security 3(4) (November 2000).
    2. Kejun Mei presents:
      Daniel Barbará, Ningning Wu, Julia Couto, and Sushil Jajodia, ADAM: A Testbed for Exploring the use of Data Mining in Intrusion Detection, ACM SIGMOD Record 30(4) (December 2001) SPECIAL ISSUE: Special section on data mining for intrusion detection and threat analysis, pp. 15-24.
  13. Earth/Atmospheric Science:
    1. Chris Clifton presents:
      Chris Clifton, Change Detection in Overhead Imagery using Neural Networks, International Journal of Applied Intelligence 18(2), Kluwer Academic Publishers, Dordrecht, The Netherlands, March 2003.
    2. Carolyn Kraft presents:
      Paul E. Stolorz and Christopher Dean, Quakefinder: A Scalable Data Mining System for Detecting Earthquakes from Space, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, pp. 208-213.
  14. Earth/Atmospheric science: Preview of the Second NASA Data Mining Workshop: Issues and Applications in Earth Science, to be held May 23-24, 2006, Pasadena, CA.
    1. An Operational Pixel Classifier for the Multi-angle Imaging SpectroRadiometer (MISR) Using Support Vector Machines, Speaker: Michael Garay; Authors: Dominic Mazzoni, Michael Garay, and Roger Davies; Jet Propulsion Laboratory
      and
      Recent HARVIST Results: Classifying Crops from Remote Sensing Data, Speaker: Kiri Wagstaff; Authors: Kiri Wagstaff and Dominic Mazzoni; Jet Propulsion Laboratory
    2. Kejun Mei presents:
      Spatiotemporal Data Mining for Monitoring Ocean Objects, Speaker: Yang Cai; Authors: Yang Cai, Karl Fu, Daniel Chung, Richard Stumpf, Timothy Wynne, and Mitchell Tomlison; Carnegie Mellon University
    3. Clustering Spatio-Temporal Patterns using Levelwise Search Presenter: Raj Bhatnagar; Authors: Abhishek Sharma and Raj Bhatnagar; University of Cincinnati
    4. Predicting Forest Stand Height and Canopy Cover from LANDSAT and LIDAR Data Using Decision Trees, Presenter: Saso Dzeroski; Authors: Saso Dzeroski, Andrej Kobler, Valentin Gjorgjioski, and Pance Panov; Jozef Stefan Institute, Ljubljana, Slovenia
    5. Unraveling the Dominant Influences on the Evolution of Land-Surface Variables using Data Mining, Speaker: Praveen Kumar; Authors: Praveen Kumar, Peter Bajcsy, Amanda B. White, Vikas Mehra, David Tcheng, David Clutter, Wei-Wen Feng, Pratyush Sinha, and Richard Robertson; University of Illinois Urbana
    6. Sensory Stream Data Mining on Chip, Presenter: Yang Cai; Authors: Yang Cai and Yong X. Hu; Carnegie Mellon University
    7. A Hybrid Object-based/Pixel-based Classification Approach to Detect Geophysical Phenomena, Speaker: Rahul Ramachandran; Authors: Xiang Li, Rahul Ramachandran, Sara Graves, and Sunil Movva; University of Alabama in Huntsville
  15. To come: Privacy, Collaborative Filtering, Streams.

Prior Year Papers

Final Project due date: April 30, 2005 (official last day of classes). If you'd like to give a demo as part of your project report, we can schedule it during the last week of classes (if you are ready), or during finals week. The report/writeup is due on 4/30.

