CS 37300: Data Mining and Machine Learning

Semester: Spring 2021 (also offered in Fall 2019 and Fall 2018)
Time and place: Monday, Wednesday and Friday, 10.30am-11.20am EST
Instructor: Jean Honorio (Please send an e-mail for appointments)
TAs: Kevin Bello, email: kbellome at purdue.edu, Office hours: Monday 1pm-3pm EST
Prerit Gupta, email: gupta596 at purdue.edu, Office hours: Friday 2pm-4pm EST
Chuyang Ke, email: cke at purdue.edu, Office hours: Tuesday 2pm-4pm EST
Jin Son, email: son74 at purdue.edu, Office hours: Thursday 3pm-5pm EST
Anxhelo Xhebraj, email: axhebraj at purdue.edu, Office hours: Wednesday noon-2pm EST
Kaiyuan Zhang, email: zhan4057 at purdue.edu, Office hours: Tuesday 10am-noon EST

This course introduces students to the field of data mining and machine learning, which sits at the interface between statistics and computer science and focuses on developing algorithms that automatically discover patterns in, and learn models of, large datasets. Students will learn the overall process and the main techniques of data mining and machine learning, including exploratory data analysis, predictive modeling, descriptive modeling, and evaluation.

In particular, topics in supervised learning include linear and non-linear classifiers, anomaly detection, rating, ranking, and model selection. Topics in unsupervised learning include clustering and mixture models. Topics in probabilistic modeling include maximum likelihood estimation and the naive Bayes classifier. Topics in non-parametric methods include nearest neighbors and classification trees.
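As a small taste of the kind of algorithm covered in the supervised-learning portion of the course, the sketch below implements the classic perceptron update rule in Python (the language used for the programming assignments). The toy dataset and number of epochs are made up for illustration; this is only a sketch, not starter code for any homework or project.

import numpy as np

def train_perceptron(X, y, epochs=10):
    """Minimal perceptron: X is an (n, d) array of examples, y holds labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            # Update the weights only on misclassified examples.
            if y_i * (np.dot(w, x_i) + b) <= 0:
                w += y_i * x_i
                b += y_i
    return w, b

# Toy linearly separable data (illustrative only).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))  # predictions should match y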

Prerequisites

CS 18200 and CS 25100 and concurrently (STAT 35000 or STAT 35500 or STAT 51100).

Textbooks

There is no official textbook for this class. I will post slides and pointers to reading materials. Recommended books for further reading include:

Principles of Data Mining by David Hand, Heikki Mannila and Padhraic Smyth. (Free with Purdue ID)
A Course in Machine Learning by Hal Daumé III. (Free)
Pattern Recognition and Machine Learning by Christopher M. Bishop.
Machine Learning by Tom Mitchell.
Pattern Classification, 2nd Edition by Richard O. Duda, Peter E. Hart, David G. Stork.

Assignments

There will be up to eight homeworks, one midterm exam, one final exam and one project (dates posted on the schedule). The homeworks are to be done individually (when programming is required, we will use Python). The project is to be done in groups of 3 students.

For the project, you will write a half-page project plan (around 1-2 weeks before the midterm), a 2-4 page preliminary results report (around 1-2 weeks after the midterm), and a 4-8 page final results report (around 1-2 weeks before the final exam). Note that neither I nor the TAs will provide any help with programming-related issues.

Grading

Quizzes/attendance: 5%
Homeworks: 35%
Project: 10%
Midterm exam: 25%
Final exam: 25%

Late policy

Assignments are to be submitted by the listed due date. Late assignments will not be accepted, even if they are only one minute late.

Academic Honesty

Please read the departmental academic integrity policy here. It will be followed unless we provide written documentation of exceptions. We encourage you to interact amongst yourselves: you may discuss and obtain help with the basic concepts covered in lectures and with the homework specifications (but not the solutions). However, unless otherwise noted, the work you turn in should reflect your own efforts and knowledge. Sharing or copying solutions is unacceptable and could result in failure. You are expected to take reasonable precautions to prevent others from using your work.

Additional course policies

Please read the general course policies here.

Schedule

Date Topic (Tentative) Notes
Wed, Jan 20 Lecture 1: introduction Python
Fri, Jan 22 Lecture 2: probability review (joint, marginal and conditional probabilities)
Mon, Jan 25     (lecture continues)
Lecture 3: statistics review (independence, maximum likelihood estimation)
Wed, Jan 27     (lecture continues)
Fri, Jan 29 Lecture 4: linear algebra review Linear algebra in Python
Homework 1: due on Feb 5, 11.59pm EST
Mon, Feb 1 Lecture 5: elements of data mining and machine learning algorithms
Wed, Feb 3     (lecture continues)
Lecture 6: linear classification, perceptron
Fri, Feb 5 Homework 1 due
Mon, Feb 8     (lecture continues)
Wed, Feb 10     (lecture continues)
Lecture 7: perceptron (convergence), support vector machines (introduction)
Fri, Feb 12     (lecture continues) Homework 2: due on Feb 19, 11.59pm EST
Mon, Feb 15     (lecture continues)
Wed, Feb 17 READING DAY
Fri, Feb 19 Lecture 8: generative probabilistic modeling, maximum likelihood estimation, classification Homework 2 due
Mon, Feb 22     (lecture continues)
Lecture 9: generative probabilistic classification (naive Bayes), non-parametric methods (nearest neighbors)
Homework 3: due on Mar 1, 11.59pm EST
Wed, Feb 24     (lecture continues)
Lecture 10: non-parametric methods (classification trees)
Fri, Feb 26     (lecture continues)
Mon, Mar 1 Case Study 1 Homework 3 due
Wed, Mar 3     (lecture continues)
Lecture 11: performance measures, cross-validation, statistical hypothesis testing
Fri, Mar 5     (lecture continues)
Lecture 12: model selection and generalization (VC dimension)
Homework 4: due on Mar 12, 11.59pm EST
Mon, Mar 8     (lecture continues)
Wed, Mar 10 Case Study 2
Fri, Mar 12     (lecture continues)
Lecture 13: dimensionality reduction, principal component analysis (PCA)
Homework 4 due
Mon, Mar 15     (lecture continues)
Wed, Mar 17 MIDTERM (lectures 1 to 12, all case studies) Start: Wednesday March 17, 10.30am EST
End: Thursday March 18, 10.30am EST
Fri, Mar 19     (lecture continues)
    (midterm solution)
Homework 5: due on Mar 26, 11.59pm EST
Mon, Mar 22 Lecture 14: nonlinear feature mappings, kernels, kernel perceptron, kernel support vector machines
Wed, Mar 24     (lecture continues)
Fri, Mar 26 Lecture 15: ensemble methods: bagging, boosting, bias/variance tradeoff Homework 5 due
Homework 6: due on Apr 2, 11.59pm EST
Mon, Mar 29     (lecture continues)
Case Study 3
Project plan due ([Word] or [LaTeX] format)
Wed, Mar 31     (lecture continues)
Fri, Apr 2 Homework 6 due
Mon, Apr 5 Lecture 16: clustering, k-means, hierarchical clustering Homework 7: due on Apr 12, 11.59pm EST
Wed, Apr 7     (lecture continues)
Lecture 17: clustering, mixture models, expectation-maximization (EM) algorithm
Fri, Apr 9     (lecture continues)
Lecture 18: anomaly detection, one-class support vector machines
Mon, Apr 12     (lecture continues) Homework 7 due
Wed, Apr 14 Lecture 19: Bayesian networks (independence)
Fri, Apr 16     (lecture continues)
Lecture 20: pattern discovery, association rules, frequent itemsets
Preliminary project report, due on Apr 16, 11.59pm EST
Mon, Apr 19     (lecture continues)
Lecture 21: feature selection (univariate/multivariate, filter/wrapper/embedded methods, L1-norm regularization)
Wed, Apr 21     (lecture continues)
Fri, Apr 23     (lecture continues)
Lecture 22: data quality, preprocessing, visualization, distances
Mon, Apr 26 FINAL EXAM (lectures 13 to 21, all case studies) Start: Monday April 26, 10.30am EST
End: Tuesday April 27, 10.30am EST
Wed, Apr 28     (lecture continues)
Fri, Apr 30     (final exam solution) Final project report, due on Apr 30, 11.59pm EST