Data Mining

CS573 • Fall 2012 • Time: TTh 1:30-2:45pm • Location: BIOCHEM 105

Schedule


Instructor

Professor Jennifer Neville
Lawson 2142D • neville[at]cs.purdue.edu • 6-9387
Office hours: By appointment

Teaching assistants

Philip Ritchey, Jaewoo Lee
Email: cs573-ta@cs.purdue.edu
Office hours (Philip): Wed 10-11am, LWSN B116C (#03)
Office hours (Jaewoo): Fri 1-2pm, LWSN 2149 (#14)

Description

Data mining has emerged at the confluence of artificial intelligence, statistics, and databases as a technique for automatically discovering summary knowledge in large datasets. This course introduces students to the process and main techniques of data mining, including classification, clustering, and pattern mining approaches. Data mining systems and applications will also be covered, along with selected topics in current research.

Prerequisites

STAT 516 or an equivalent introductory statistics course; CS 381 or an equivalent course that covers basic programming skills (e.g., STAT 598G); or permission of the instructor.

Text

D. Hand, H. Mannila, P. Smyth (2001). Principles of Data Mining. MIT Press. Available online as an e-book.

Assignments

Grading

Grades will be posted on Blackboard.

Late policy

Each student is allowed four extension days, which can be applied to any combination of assignments during the semester without penalty. Use of a partial day counts as a full day. Once the extension days are used up, a late penalty of 10% per day will be applied. No assignment will be accepted more than 5 days late.
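
The arithmetic of the late policy can be illustrated with a short sketch. This is only an illustration under stated assumptions, not an official grading tool: the function name `late_result` is hypothetical, remaining extension days are assumed to be applied automatically before any penalty, the 10% is assumed to be a fraction of the assignment's total points, and the 5-day cutoff is assumed to count all late days, including those covered by extensions. Check with the instructor if any of these matter for your case.

```python
import math

EXTENSION_DAYS = 4      # free extension days per student per semester (from the policy)
PENALTY_PER_DAY = 0.10  # penalty per late day not covered by an extension (from the policy)
MAX_DAYS_LATE = 5       # nothing is accepted beyond this many days late (from the policy)


def late_result(hours_late, extension_days_left=EXTENSION_DAYS):
    """Return (accepted, penalty_fraction, extension_days_left_after).

    Hypothetical helper. Assumes extension days are used before any
    penalty applies and that the cutoff counts extension-covered days.
    """
    if hours_late <= 0:
        return True, 0.0, extension_days_left

    # A partial day counts as a full day.
    days_late = math.ceil(hours_late / 24)

    if days_late > MAX_DAYS_LATE:
        # Not accepted; no extension days are consumed (assumption).
        return False, None, extension_days_left

    used = min(days_late, extension_days_left)
    penalized_days = days_late - used
    penalty = round(penalized_days * PENALTY_PER_DAY, 2)
    return True, penalty, extension_days_left - used


if __name__ == "__main__":
    # 30 hours late with all four extension days still available:
    # counts as 2 days, both covered by extensions.
    print(late_result(30))
    # 30 hours late with one extension day left:
    # 1 day covered, 1 day penalized.
    print(late_result(30, extension_days_left=1))
```

Under these assumptions, a submission that is 30 hours late uses two extension days and incurs no penalty if the student still has at least two days available.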


Course outline

Introduction (1 week)
What is data mining? Overview of the data mining process and associated tasks. Example systems (e.g., SKICAT, fraud detection).

Background and basics (1 week)
Types of data: attributes, instances. Populations and samples. Random variables and distributions. Statistical inference.

Exploratory data analysis (2 weeks)
Data cleaning and preprocessing. Sampling. Feature construction and discovery. Visualization methods. Hypothesis testing.

Predictive modeling (3 weeks)
Classification problem formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., naive Bayes, decision trees, regression). Evaluation: metrics, cross-validation, learning curves.

Understanding and extending model performance (1 week)
Error analysis. Feature selection. Ensemble techniques. Statistical learning theory.

Descriptive modeling (3 weeks)
Clustering problem formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., k-means, neural networks, spectral clustering). Evaluation: metrics, subjective assessment.

Pattern mining (3 weeks)
Pattern detection formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., association rules, graph mining, anomaly detection). Evaluation: metrics, interestingness, understandability.

Current research topics (as time permits)
Examples: text mining, web mining, utility-based data mining, privacy-preserving data mining, earth/atmospheric science data, high-dimensional data, streaming data, structured data, biological data, social network data.