CS 37300: Data Mining and Machine Learning

Semester: Fall 2019, also offered in Spring 2021 and Fall 2018
Time and place: Tuesday and Thursday, 4.30pm-5.45pm, Electrical Engineering Building 170
Instructor: Jean Honorio, Lawson Building 2142-J (Please send an e-mail for appointments)
TAs: Jiayi Liu, e-mail: liu2861 at purdue.edu, Office hours: Friday 10am-noon, HAAS G50
Susheel Suresh, e-mail: suresh43 at purdue.edu, Office hours: Wednesday 3pm-5pm, HAAS G50
Vinith Budde, e-mail: budde at purdue.edu, Office hours: Monday 3pm-5pm, HAAS G50

This course introduces students to the field of data mining and machine learning, which sits at the interface between statistics and computer science. The field focuses on developing algorithms that automatically discover patterns in large datasets and learn models of them. The course covers the process and main techniques of data mining and machine learning, including exploratory data analysis, predictive modeling, descriptive modeling, and evaluation.

In particular, topics in supervised learning include linear and non-linear classifiers, anomaly detection, rating, ranking, and model selection. Topics in unsupervised learning include clustering and mixture models. Topics in probabilistic modeling include maximum likelihood estimation and the naive Bayes classifier. Topics in non-parametric methods include nearest neighbors and classification trees.

Prerequisites

CS 18200 and CS 25100, plus one of STAT 35000, STAT 35500 or STAT 51100 (the statistics course may be taken concurrently).

Textbooks

There is no official textbook for this class. I will post slides and pointers to reading materials. Recommended books for further reading include:

Principles of Data Mining by David Hand, Heikki Mannila and Padhraic Smyth. (Free with Purdue ID)
A Course in Machine Learning by Hal Daumé III. (Free)
Pattern Recognition and Machine Learning by Christopher M. Bishop.
Machine Learning by Tom Mitchell.
Pattern Classification, 2nd Edition by Richard O. Duda, Peter E. Hart, David G. Stork.

Assignments

There will be up to eight homeworks, one midterm exam, one final exam and one project (dates posted on the schedule). The homeworks are to be done individually (when programming is required, we will use Python). The project is to be done in groups of 3 students.

For the project, you will write a half-page project plan (around 1-2 weeks before the midterm), a 2-4 page preliminary results report (around 1-2 weeks after the midterm) and a 4-8 page final results report (around 1-2 weeks before the final exam). Neither I nor the TAs will provide any help regarding programming-related issues.

Grading

In-class quizzes/attendance: 5%
Homeworks: 35%
Project: 10%
Midterm exam: 25%
Final exam: 25%

Late policy

Assignments are to be submitted by the due date listed. Assignments will not be accepted if they are even one minute late.

Academic Honesty

Please read the departmental academic integrity policy here. This will be followed unless we provide written documentation of exceptions. We encourage you to interact amongst yourselves: you may discuss and obtain help with basic concepts covered in lectures and with homework specifications (but not solutions). However, unless otherwise noted, work turned in should reflect your own efforts and knowledge. Sharing or copying solutions is unacceptable and could result in failure. You are expected to take reasonable precautions to prevent others from using your work.

Additional course policies

Please read the general course policies here.

Schedule

Date Topic (Tentative) Notes
Tue, Aug 20 Lecture 1: introduction
Python
Thu, Aug 22 Lecture 2: probability review (joint, marginal and conditional probabilities)
Tue, Aug 27 Lecture 3: statistics review (independence, maximum likelihood estimation)
Thu, Aug 29 Lecture 4: linear algebra review
(iClicker: attendance)
Linear algebra in Python
Homework 1: due on Sep 5, at end of lecture
Tue, Sep 3 Lecture 5: elements of data mining and machine learning algorithms
Thu, Sep 5 Lecture 6: linear classification, perceptron
(iClicker: quiz 1)
Homework 1 due
Homework 1 solution
Tue, Sep 10 Lecture 7: perceptron (convergence), support vector machines (introduction)
(iClicker: attendance)
Homework 2: due on Sep 17, 11.59pm EST
Thu, Sep 12 Lecture 8: generative probabilistic modeling, maximum likelihood estimation, classification
(iClicker: attendance)
Tue, Sep 17 Lecture 9: generative probabilistic classification (naive Bayes), non-parametric methods (nearest neighbors)
(iClicker: attendance)
Homework 2 due
Thu, Sep 19 Lecture 10: non-parametric methods (classification trees)
(iClicker: quiz 2)
Homework 3: due on Sep 26, 11.59pm EST
Tue, Sep 24 Case Study 1
(iClicker: attendance)
Thu, Sep 26 Lecture 11: performance measures, cross-validation, statistical hypothesis testing
Homework 3 due
Tue, Oct 1 Lecture 12: model selection and generalization (VC dimension)
(iClicker: attendance)
Homework 4: due on Oct 10, 11.59pm EST
Thu, Oct 3 Case Study 2
(iClicker: attendance)
Tue, Oct 8 OCTOBER BREAK
Thu, Oct 10 Lecture 13: dimensionality reduction, principal component analysis (PCA)
Homework 4 due
Tue, Oct 15 MIDTERM (lectures 1 to 12, all case studies)
4.30pm-5.45pm, Electrical Engineering Building 170
Homework 5: due on Oct 22, 11.59pm EST
Thu, Oct 17 Midterm solution
(iClicker: attendance)
Tue, Oct 22 Lecture 14: nonlinear feature mappings, kernels, kernel perceptron, kernel support vector machines
(iClicker: attendance)
Homework 5 due
Thu, Oct 24 Lecture 15: ensemble methods: bagging, boosting, bias/variance tradeoff
(iClicker: attendance)
Homework 6: due on Oct 31, 11.59pm EST
Tue, Oct 29 Case Study 3
(iClicker: attendance)
Project plan due (see Assignments for details) [Word] or [Latex] format
Thu, Oct 31 Lecture 16: clustering, k-means, hierarchical clustering
(iClicker: attendance)
Homework 6 due
Tue, Nov 5 Lecture 17: clustering, mixture models, expectation-maximization (EM) algorithm
(iClicker: attendance)
Homework 7: due on Nov 12, 11.59pm EST
Thu, Nov 7 Lecture 18: anomaly detection, one-class support vector machines
(iClicker: attendance)
Tue, Nov 12 Lecture 19: Bayesian networks (independence)
(iClicker: attendance)
Homework 7 due
Thu, Nov 14 Lecture 20: pattern discovery, association rules, frequent itemsets
(iClicker: attendance)
Preliminary project report, due on Nov 16, 11.59pm EST
Tue, Nov 19 Lecture 21: feature selection (univariate/multivariate, filter/wrapper/embedded methods, L1-norm regularization)
(iClicker: attendance)
Thu, Nov 21 Lecture 22: data quality, preprocessing, visualization, distances
(iClicker: attendance)
Tue, Nov 26 FINAL EXAM (lectures 13 to 21, all case studies)
4.30pm-5.45pm, Electrical Engineering Building 170
Thu, Nov 28 THANKSGIVING VACATION
Tue, Dec 3 Final exam solution
Final project report, due on Dec 3, 11.59pm EST
Thu, Dec 5