CS57300 • Spring 2017 • Time: TTh 3:00-4:15pm • Location: WANG 2599 (Overflow room: 2563)

Schedule • Piazza • Blackboard

Professor Jennifer Neville

Lawson 2142D • neville[at]cs.purdue.edu • 6-9387

Office hours: Friday 11-12

Rohit Rangan, Sait Celebi, Youhan Fang

Email: cs573-ta@cs.purdue.edu

Office hours: Monday and Thursday 5-6pm, HAAS G50. Distance section: Saturday 4pm ET via WebEx.

Questions: We will use Piazza for class questions and discussion. Please post your questions on Piazza instead of emailing the TA list.

Email: cs390dm-ta [at] cs.purdue.edu

Data Mining has emerged at the confluence of artificial intelligence, statistics, and databases as a technique for automatically discovering summary knowledge in large datasets. This course introduces students to the process and main techniques in data mining, including classification, clustering, and pattern mining approaches. Data mining systems and applications will also be covered, along with selected topics in current research.

- Identify key elements of data mining systems and the knowledge discovery process
- Understand how algorithmic elements interact to impact performance
- Recognize various types of data mining tasks
- Implement and apply basic algorithms and standard models
- Understand how to evaluate performance, as well as formulate and test hypotheses

STAT 516 or an equivalent introductory statistics course; CS 381 or an equivalent course that covers basic programming skills (e.g., STAT 598G); or permission of instructor.

The primary text for the class is listed below. Additional reading materials will be distributed as necessary. Reading assignments will be posted on the schedule; please check it regularly.

- D. Hand, H. Mannila, P. Smyth (2001). *Principles of Data Mining*. MIT Press.

There will be five homework assignments, which will be posted on the schedule. Homework assignments should be submitted on Blackboard or on data.cs.purdue.edu using Turnin. Details will be provided in the assignments. Programming assignments should be written in Python, unless otherwise noted.

In general, questions about the details of homework assignments should be posted on Piazza, and will be answered by the TAs and instructor.

There will be one midterm and a comprehensive final exam. Exams will be closed book and closed notes.

- Homework: 55%
- Midterm: 20%
- Final exam: 25%
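As a quick illustration of how the weights above combine, here is a minimal sketch. It assumes each component is expressed as a percentage from 0 to 100; the function name `course_grade` and the sample scores are hypothetical, and nothing here reflects how final letter grades are assigned.

```python
# Component weights taken from the list above.
WEIGHTS = {"homework": 0.55, "midterm": 0.20, "final": 0.25}


def course_grade(scores):
    """Weighted average of component percentages (each assumed to be 0-100)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = course_grade({"homework": 90.0, "midterm": 80.0, "final": 70.0})
print(round(example, 2))
```

Because homework carries more than half of the weight, a change of ten points in the homework average moves the overall percentage by 5.5 points.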

Assignments are to be submitted by the due date listed. Each person will be allowed **four** days of extensions, which can be applied to any combination of assignments during the semester without penalty. After that, a late penalty of 10% per day will be assigned. Use of a partial day will be counted as a full day. Use of extension days must be stated explicitly in the late submission (either directly in the submission header or in an accompanying email to the TA list); otherwise late penalties will apply. Extensions cannot be used after the final day of classes (i.e., Apr 29). Extension days cannot be rearranged after they are applied to a submission. Use them wisely!

Assignments will NOT BE accepted if they are more than five days late. Additional extensions will be granted only due to serious and documented medical or family emergencies.
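The arithmetic of the late policy can be sketched as follows. This is only an illustration under stated assumptions, not an official calculator: it assumes lateness is measured in hours and rounded up to whole days, that the five-day cutoff counts all days late whether or not extension days cover them, and that the 10% penalty is applied to the earned score. The name `late_outcome` and its parameters are hypothetical.

```python
import math

PENALTY_PER_DAY = 0.10  # penalty for each late day not covered by an extension
MAX_DAYS_LATE = 5       # later submissions are not accepted


def late_outcome(hours_late, extension_days_available):
    """Return (credit_factor, extension_days_spent), or None if not accepted.

    Assumption: a partial day counts as a full day, so hours are rounded up.
    Assumption: the five-day cutoff applies regardless of extension days.
    """
    days_late = max(0, math.ceil(hours_late / 24))
    if days_late > MAX_DAYS_LATE:
        return None
    covered = min(days_late, extension_days_available)
    penalized = days_late - covered
    factor = max(0.0, 1.0 - PENALTY_PER_DAY * penalized)
    return factor, covered


# 30 hours late counts as 2 days; with 1 extension day left, 1 day is penalized.
print(late_outcome(30, 1))
```

In this sketch, a student with all four extension days remaining who submits 30 hours late spends two of them and keeps full credit, while a submission 121 hours late (six days after rounding up) is not accepted.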

**Introduction** (1 week)

What is data mining? Overview of the process and associated tasks. Example applications.

**Background and basics** (1.5 weeks)

Types of data: attributes, instances. Populations and samples. Random variables and distributions. Statistical inference. R and Python.

**Predictive Modeling** (2 weeks)

Classification problem formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., perceptron, naive Bayes, decision trees, nearest neighbor).

**Exploratory data analysis** (2.5 weeks)

Data cleaning and preprocessing. Sampling. Feature construction and discovery. Anomaly detection. Visualization methods. Hypothesis testing.

**Advanced Predictive Modeling** (3 weeks)

Overview of more advanced algorithms (e.g., support vector machines, deep learning, ensembles, latent variable models). Evaluation: metrics, cross-validation, learning curves, error analysis.

**Descriptive Modeling** (2.5 weeks)

Clustering problem formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., k-means, hierarchical clustering, spectral clustering) and advanced algorithms (e.g., co-clustering, collaborative filtering). Evaluation: metrics, subjective assessment.

**Pattern Mining** (2 weeks)

Pattern detection formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., association rules, anomaly detection). Evaluation: metrics, interestingness, understandability.