CS590D/STAT598M Fall 2007 Time: MW 11:30-12:45pm Location: CIVL 2113
Professor Jennifer Neville
Lawson 2142D neville[at]cs.purdue.edu 6-9387
Office hours: MW 1-2pm or by appointment
Yi Fang
Lawson B132 fangy[at]cs.purdue.edu 6-9444
Office hours: TTh 2-3pm or by appointment
Data Mining has emerged at the confluence of artificial intelligence, statistics, and databases as a technique for automatically discovering summary knowledge in large datasets. This course introduces students to the process and main techniques in data mining, including classification, clustering, and pattern mining approaches. Data mining systems and applications will also be covered, along with selected topics in current research.
STAT 516 or an equivalent introductory statistics course, CS 381 or an equivalent course that covers basic programming skills (e.g., STAT 598G), or permission of instructor.
D. Hand, H. Mannila, P. Smyth (2001). Principles of Data Mining. MIT Press.
Introduction (1 week)
What is data mining? Overview of the data mining process and associated tasks.
Data (1 week)
Types of data: attributes, instances. Data preparation: data cleaning, feature construction.
Basic Statistics (1 week)
Populations and samples. Statistical inference. Measures of significance. Statistical power. Exploratory data analysis.
Predictive Modeling (3 weeks)
Classification problem formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., naive Bayes, decision trees, regression). Evaluation: metrics, cross-validation, learning curves.
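As an illustration of the evaluation topics listed for this unit, the following is a minimal sketch of k-fold cross-validation. It is not course material: the majority-class "model", the function names, and the toy labels are all assumptions chosen to keep the example self-contained.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def majority_class(train_labels):
    """Hypothetical baseline model: always predict the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]


def cross_validated_accuracy(labels, k):
    """Average held-out accuracy of the majority-class baseline over k folds."""
    n = len(labels)
    scores = []
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        train = [labels[i] for i in range(n) if i not in held_out]
        prediction = majority_class(train)
        correct = sum(1 for i in test_idx if labels[i] == prediction)
        scores.append(correct / len(test_idx))
    return sum(scores) / k


# Toy labels (made up for illustration): 7 "ham", 3 "spam".
labels = ["spam", "ham", "ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
cv_accuracy = cross_validated_accuracy(labels, k=5)
print(f"5-fold accuracy of the baseline: {cv_accuracy:.2f}")
```

Each fold is held out once while the baseline is fit on the remaining labels. A real classifier such as naive Bayes or a decision tree would take the place of majority_class, and the same loop would apply.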
Descriptive Modeling (3 weeks)
Clustering problem formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., k-means, neural networks, spectral clustering). Evaluation: metrics, subjective assessment.
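As an illustration of how representation, scoring function, and search fit together in clustering, the following is a minimal sketch of k-means (Lloyd's algorithm) on 2-D points. It is not course material: starting the centroids at the first k points, the fixed iteration count, and the toy data are assumptions made to keep the example short and deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iterations=10):
    """Lloyd's algorithm; centroids start at the first k points (an assumption)."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [squared_distance(p, c) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its members.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, clusters


def within_cluster_ss(centroids, clusters):
    """Scoring function: total squared distance from points to their centroid."""
    return sum(
        squared_distance(p, c)
        for c, members in zip(centroids, clusters)
        for p in members
    )


# Toy data (made up for illustration): two well-separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
score = within_cluster_ss(centroids, clusters)
print(centroids, score)
```

Here the representation is a set of centroids, the scoring function is the within-cluster sum of squares, and the search is the alternating assign/update loop. The result depends on the starting centroids, so practical implementations usually try several initializations.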
Pattern Mining (2 weeks)
Pattern detection formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., association rules, graph mining, anomaly detection). Evaluation: metrics, interestingness, understandability.
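As an illustration of the metrics used to evaluate association rules, the following is a minimal sketch that computes support and confidence over a small set of transactions. It is not course material: the function names and the market-basket data are assumptions made up for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Toy market-basket data (made up for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule-mining algorithm would search over candidate itemsets and keep the rules whose support and confidence exceed chosen thresholds. This sketch only scores a single, hand-picked rule.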
Data Mining Systems and Applications (2 weeks)
Process standardization. System issues (e.g., visualization, scalability). Example systems (e.g., SKICAT, fraud detection). Myths and pitfalls of data mining.
Current Research Topics (2 weeks)
Examples: text mining, web mining, utility-based data mining, privacy-preserving data mining, intrusion detection, earth/atmospheric science data, high-dimensional data, streaming data, structured data, biological data.