Data Mining

CS573/STAT598M • Spring 2009 • Time: TTh 12:00-1:15pm • Location: Lawson 1106


Instructor

Professor Jennifer Neville
Lawson 2142D • neville[at]cs.purdue.edu • 6-9387
Office hours: By appointment

Teaching assistant

Rongjing Xiang
Lawson B116 • rxiang[at]cs.purdue.edu
Office hours: WF 11am-12pm, Lawson B116B

Description

Data mining has emerged at the confluence of artificial intelligence, statistics, and databases as a technique for automatically discovering summary knowledge in large datasets. This course introduces students to the process and main techniques of data mining, including classification, clustering, and pattern mining approaches. Data mining systems and applications will also be covered, along with selected topics in current research.

Prerequisites

STAT 516 or an equivalent introductory statistics course, CS 381 or an equivalent course that covers basic programming skills (e.g., STAT 598G), or permission of instructor.

Text

D. Hand, H. Mannila, P. Smyth (2001). Principles of Data Mining. MIT Press.

Assignments

Grading

Grades will be posted on Blackboard.

Late policy

Each person will be allowed four extension days, which can be applied to any combination of assignments during the semester without penalty. Use of a partial day will be counted as a full day. After the extension days are used up, a late penalty of 10% per day will be assigned. No assignment will be accepted more than 5 days late.
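
The arithmetic of the late policy can be sketched as below. This is only an illustration, not an official grading tool: the function name `late_penalty` and its parameters are hypothetical, and it assumes that the 5-day cutoff counts total days late (including any extension days used) and that the 10% penalty is taken from the assignment's total score. Neither assumption is stated in the policy itself, so check with the instructor in any borderline case.

```python
import math

PENALTY_PERCENT_PER_DAY = 10  # from the policy: 10% per day
MAX_DAYS_LATE = 5             # from the policy: not accepted after 5 days


def late_penalty(hours_late, extension_days_left):
    """Return (penalty_percent, extension_days_used), or None if not accepted.

    hours_late: hours past the deadline (hypothetical input format).
    extension_days_left: how many of the four free days remain unused.
    """
    # A partial day counts as a full day, so round up.
    days_late = math.ceil(hours_late / 24) if hours_late > 0 else 0

    # Assumption: the cutoff applies to total days late, extensions included.
    if days_late > MAX_DAYS_LATE:
        return None

    # Spend remaining extension days first; penalize only the rest.
    days_used = min(days_late, extension_days_left)
    penalized_days = days_late - days_used
    return (PENALTY_PERCENT_PER_DAY * penalized_days, days_used)
```

Under these assumptions, an assignment turned in 49 hours late by a student with one extension day left counts as 3 days late, uses the remaining extension day, and loses 20%.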


Course outline

Introduction (1 week)
What is data mining? Overview of the data mining process and associated tasks.

Data (1 week)
Types of data: attributes, instances. Data preparation: data cleaning, feature construction.

Basic Statistics (1 week)
Populations and samples. Statistical inference. Measures of significance. Statistical power. Exploratory data analysis.

Predictive Modeling (3 weeks)
Classification problem formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., naive Bayes, decision trees, regression). Evaluation: metrics, cross-validation, learning curves.

Descriptive Modeling (3 weeks)
Clustering problem formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., k-means, neural networks, spectral clustering). Evaluation: metrics, subjective assessment.

Pattern Mining (3 weeks)
Pattern detection formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., association rules, graph mining, anomaly detection). Evaluation: metrics, interestingness, understandability.

Data Mining Systems and Applications (1 week)
Process standardization. System issues (e.g., visualization, scalability). Example systems (e.g., SKICAT, fraud detection). Myths and pitfalls of data mining.

Current Research Topics (as time permits)
Examples: text mining, web mining, utility-based data mining, privacy-preserving data mining, intrusion detection, earth/atmospheric science data, high-dimensional data, streaming data, structured data, biological data.