Data Mining

CS590D/STAT598M • Fall 2007 • Time: MW 11:30-12:45pm • Location: CIVL 2113



Instructor

Professor Jennifer Neville
Lawson 2142D • neville[at]cs.purdue.edu • 6-9387
Office hours: MW 1-2pm or by appointment

Teaching assistant

Yi Fang
Lawson B132 • fangy[at]cs.purdue.edu • 6-9444
Office hours: TTh 2-3pm or by appointment

Description

Data Mining has emerged at the confluence of artificial intelligence, statistics, and databases as a technique for automatically discovering summary knowledge in large datasets. This course introduces students to the process and main techniques in data mining, including classification, clustering, and pattern mining approaches. Data mining systems and applications will also be covered, along with selected topics in current research.

Prerequisites

STAT 516 or an equivalent introductory statistics course; CS 381 or an equivalent course that covers basic programming skills (e.g., STAT 598G); or permission of the instructor.

Text

D. Hand, H. Mannila, P. Smyth (2001). Principles of Data Mining. MIT Press.

Assignments

Grading

Grades will be posted on WebCT Vista.

Late policy

Assignments will be accepted up to 5 days late with a penalty of 10% per day. No assignment will be accepted more than 5 days late.


Course outline

Introduction (1 week)
What is data mining? Overview of the data mining process and associated tasks.

Data (1 week)
Types of data: attributes, instances. Data preparation: data cleaning, feature construction.

Basic Statistics (1 week)
Populations and samples. Statistical inference. Measures of significance. Statistical power. Exploratory data analysis.

Predictive Modeling (3 weeks)
Classification problem formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., naive Bayes, decision trees, regression). Evaluation: metrics, cross-validation, learning curves.
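
As an illustration of the evaluation topic, k-fold cross-validation can be sketched in a few lines of Python. This is a minimal sketch, not course material: the function names (k_fold_indices, cross_validate, fit_majority, predict_majority) are hypothetical, the folds are contiguous and unshuffled, and the "model" is a majority-class baseline chosen only to keep the example self-contained.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size.

    Assumes n >= k so that no fold is empty.
    """
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(xs, ys, k, fit, predict):
    """Return the mean held-out accuracy over k folds."""
    accuracies = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        # Train on everything outside the current fold.
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        # Score on the held-out fold only.
        correct = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


def fit_majority(train_x, train_y):
    """Hypothetical baseline: remember the most common training label."""
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, x):
    """Predict the remembered label regardless of the input."""
    return model
```

Any classifier can be substituted by passing a different fit/predict pair. In practice the instances would usually be shuffled before splitting, which this sketch omits.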

Descriptive Modeling (3 weeks)
Clustering problem formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., k-means, neural networks, spectral clustering). Evaluation: metrics, subjective assessment.
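
To make the representation/scoring/search view concrete for k-means, here is a minimal sketch of Lloyd's algorithm in Python. It is an illustration under stated assumptions, not the course's implementation: points are tuples of floats, the score is squared Euclidean distance, and the centroids are initialized to the first k points (a simplification; random or k-means++ initialization is more common). The names sq_dist and kmeans are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm; returns (centroids, assignment).

    Assumption: centroids start at the first k points, so the result
    is deterministic but depends on the input order.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignment
```

For example, kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], 2) separates the two points near the origin from the two points near (10, 10).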

Pattern Mining (2 weeks)
Pattern detection formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., association rules, graph mining, anomaly detection). Evaluation: metrics, interestingness, understandability.
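
For association rules, two standard evaluation metrics are support and confidence. The sketch below computes both over a list of transactions. It is a minimal illustration with hypothetical function names and made-up example data, and it does not include the candidate search (e.g., Apriori) that a full miner would need.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(itemset <= set(t) for t in transactions)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) for the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction;
    otherwise the division below raises ZeroDivisionError.
    """
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / support(transactions, antecedent)


# Illustrative data only (not from the course).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(baskets, {"bread", "milk"})          # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"milk"})  # 2 of 3 bread baskets
```

A rule is typically reported only when both values exceed user-chosen thresholds, which is where the interestingness criteria mentioned above come in.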

Data Mining Systems and Applications (2 weeks)
Process standardization. System issues (e.g., visualization, scalability). Example systems (e.g., SKICAT, fraud detection). Myths and pitfalls of data mining.

Current Research Topics (2 weeks)
Examples: text mining, web mining, utility-based data mining, privacy-preserving data mining, intrusion detection, earth/atmospheric science data, high-dimensional data, streaming data, structured data, biological data.