CS573 Fall 2012 Time: TTh 1:30-2:45pm Location: BIOCHEM 105
Professor Jennifer Neville
Lawson 2142D neville[at]cs.purdue.edu 6-9387
Office hours: By appointment
Teaching assistants: Philip Ritchey, Jaewoo Lee
Email: cs573-ta@cs.purdue.edu
Office hours (Philip): Wed 10-11am, LWSN B116C (#03)
Office hours (Jaewoo): Fri 1-2pm, LWSN 2149 (#14)
Data mining has emerged at the confluence of artificial intelligence, statistics, and databases as a technique for automatically discovering summary knowledge in large datasets. This course introduces students to the process and main techniques in data mining, including classification, clustering, and pattern mining approaches. Data mining systems and applications will also be covered, along with selected topics in current research.
STAT 516 or an equivalent introductory statistics course; CS 381 or an equivalent course that covers basic programming skills (e.g., STAT 598G); or permission of instructor.
D. Hand, H. Mannila, P. Smyth (2001). Principles of Data Mining. MIT Press. Available online as e-book.
Introduction (1 week)
What is data mining? Overview of the data mining process and associated tasks. Example systems (e.g., SKICAT, fraud detection).
Background and Basics (1 week)
Types of data: attributes, instances. Populations and samples. Random variables and distributions. Statistical inference.
Exploratory Data Analysis (2 weeks)
Data cleaning and preprocessing. Sampling. Feature construction and discovery. Visualization methods. Hypothesis testing.
Predictive Modeling (3 weeks)
Classification problem formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., naive Bayes, decision trees, regression). Evaluation: metrics, cross-validation, learning curves.
Understanding and Extending Model Performance (1 week)
Error analysis. Feature selection. Ensemble techniques. Statistical learning theory.
Descriptive Modeling (3 weeks)
Clustering problem formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., k-means, neural networks, spectral clustering). Evaluation: metrics, subjective assessment.
Pattern Mining (3 weeks)
Pattern detection formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., association rules, graph mining, anomaly detection). Evaluation: metrics, interestingness, understandability.
Current Research Topics (as time permits)
Examples: text mining, web mining, utility-based data mining, privacy-preserving data mining, earth/atmospheric science data, high-dimensional data, streaming data, structured data, biological data, social network data.