CS 37300: Data Mining and Machine Learning

Semester: Fall 2018 (also offered in Fall 2019 and Spring 2021)
Time and place: Tuesday and Thursday, 1.30pm-2.45pm, Mathematical Sciences Building 175
Instructor: Jean Honorio, Lawson Building 2142-J (Please send an e-mail for appointments)
TAs: Hao Ding, e-mail: ding209 at purdue.edu, Office hours: Friday 2pm-4pm, HAAS G50
Ruijiu Mao, e-mail: mao95 at purdue.edu, Office hours: Thursday 11am-1pm, HAAS G50
Md Nasim, e-mail: mnasim at purdue.edu, Office hours: Wednesday 2pm-4pm, HAAS G50
Susheel Suresh, e-mail: suresh43 at purdue.edu, Office hours: Tuesday 3pm-5pm, HAAS G50

Machine learning offers a new paradigm of computing: computer systems that learn to perform tasks by finding patterns in data, rather than by running code specifically written by a human programmer to accomplish the task. The most common machine-learning scenario requires a human teacher to annotate data (identify the relevant phenomena that occur in the data) and then use a machine-learning algorithm to generalize from these examples. Generalization is at the heart of machine learning: how can the machine go beyond the provided set of examples and make predictions about new data? In this class we will study different machine-learning scenarios, examine several algorithms, analyze their performance, and learn the theory behind them.

Topics in supervised learning include: linear and non-linear classifiers, anomaly detection, rating, ranking, and model selection. Topics in unsupervised learning include: clustering and mixture models. Topics in probabilistic modeling include: maximum likelihood estimation and the naive Bayes classifier. Topics in non-parametric methods include: nearest neighbors and classification trees.

Prerequisites

CS 18200 and CS 25100 and concurrently (STAT 35000 or STAT 35500 or STAT 51100).

Textbooks

There is no official textbook for this class. I will post slides and pointers to reading materials. Recommended books for further reading include the following (* freely available online):

* The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
* A Course in Machine Learning by Hal Daumé III.
Pattern Classification, 2nd Edition by Richard O. Duda, Peter E. Hart, and David G. Stork.
Pattern Recognition and Machine Learning by Christopher M. Bishop.
Machine Learning by Tom Mitchell.
Probabilistic Graphical Models by Daphne Koller and Nir Friedman.

Assignments

There will be up to eight homeworks, one midterm exam, one final exam, and one project (dates posted on the schedule). The homeworks are to be done individually, in Python. The project is to be done in groups of 4 students.

For the project, you will write a half-page project plan (around 1-2 weeks before the midterm), a 2-4 page preliminary results report (around 1-2 weeks after the midterm), and a 4-8 page final results report (around 1-2 weeks before the final exam). Neither I nor the TAs will provide any help regarding programming-related issues.

Grading

Homeworks: 40%
Midterm exam: 20%
Final exam: 20%
Project: 20%

Late policy

Assignments are to be submitted by the due date listed. Late assignments will not be accepted, even if they are only one minute late.

Academic Honesty

Please read the departmental academic integrity policy here. This policy will be followed unless we provide written documentation of exceptions. We encourage you to interact among yourselves: you may discuss and obtain help with the basic concepts covered in lectures and with the homework specifications (but not the solutions). However, unless otherwise noted, work turned in should reflect your own efforts and knowledge. Sharing or copying solutions is unacceptable and could result in failure. You are expected to take reasonable precautions to prevent others from using your work.

Additional course policies

Please read the general course policies here.

Schedule

Date Topic (Tentative) Notes
Tue, Aug 21 Lecture 0: linear algebra review
Notes: [1]
Python and Linear algebra in Python
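As a quick taste of the tools involved, here is a minimal NumPy sketch of the kind of operations the review covers (illustrative only, not official course material; the matrix and vector values are made up):

    import numpy as np

    # Toy matrix and vector (made-up values, for illustration only).
    A = np.array([[2.0, 0.0], [1.0, 3.0]])
    x = np.array([1.0, -1.0])

    y = A @ x                            # matrix-vector product
    G = A.T @ A                          # Gram matrix
    norm_x = np.linalg.norm(x)           # Euclidean norm
    A_inv = np.linalg.inv(A)             # inverse (A is nonsingular here)
    eigvals, eigvecs = np.linalg.eig(G)  # eigenvalues/eigenvectors of G

    print(y, norm_x, eigvals)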
Thu, Aug 23 Lecture 1: perceptron (introduction)
Notes: [1]
Tue, Aug 28 Lecture 2: perceptron (convergence), support vector machines (introduction)
Notes: [1]
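For readers skimming the schedule, a minimal sketch of the perceptron update from Lectures 1-2 (a toy illustration with made-up data and no bias term, not the official course code):

    import numpy as np

    def perceptron(X, y, epochs=100):
        # X: (n, d) features; y: labels in {-1, +1}. Returns weight vector w.
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * (w @ xi) <= 0:  # misclassified (or on the boundary)
                    w += yi * xi        # perceptron update
                    mistakes += 1
            if mistakes == 0:           # no mistakes in a full pass: done
                break
        return w

    # Toy linearly separable data (made-up values).
    X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1, 1, -1, -1])
    print(np.sign(X @ perceptron(X, y)))  # should match y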
Thu, Aug 30 Lecture 3: nonlinear feature mappings, kernels (introduction), kernel perceptron
Homework 1: due on Sep 4, 11.59pm EST
Tue, Sep 4 Lecture 4: SVM with kernels
Notes: [1]
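To make the kernel idea concrete, here is a minimal sketch of computing a Gaussian (RBF) kernel matrix, one of the standard kernels discussed in Lectures 3-4 (illustrative only; the bandwidth gamma is arbitrary):

    import numpy as np

    def rbf_kernel(X, Z, gamma=1.0):
        # K[i, j] = exp(-gamma * ||x_i - z_j||^2), using the expansion
        # ||x - z||^2 = ||x||^2 - 2 x.z + ||z||^2 for vectorization.
        sq = (X**2).sum(1)[:, None] - 2 * X @ Z.T + (Z**2).sum(1)[None, :]
        return np.exp(-gamma * sq)

    X = np.array([[0.0, 0.0], [1.0, 1.0]])
    print(rbf_kernel(X, X, gamma=0.5))  # 1s on the diagonal; off-diagonal entries decay with distance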
Thu, Sep 6 (lecture continues)
Homework 2: due on Sep 11, 11.59pm EST
Tue, Sep 11 Lecture 5: anomaly detection (one-class SVM), multi-way classification
Notes: [1]
Thu, Sep 13 Lecture 6: rating (ordinal regression), PRank, ranking, rank SVM
Notes: [1]
Tue, Sep 18 Lecture 7: regression, feature selection (information ranking, regularization, subset selection)
Notes: [1]
Homework 3: due on Sep 23, 11.59pm EST
Thu, Sep 20 Lecture 8: ensembles and boosting
Notes: [1]
Tue, Sep 25 Lecture 9: performance measures, cross-validation, statistical hypothesis testing
Notes: [1]
Homework 4: due on Sep 30, 11.59pm EST
Thu, Sep 27 (lecture continues)
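A minimal sketch of k-fold cross-validation as covered in Lecture 9 (illustrative only; train_fn and predict_fn are hypothetical placeholders for whatever learner you plug in):

    import numpy as np

    def kfold_indices(n, k, seed=0):
        # Shuffle indices 0..n-1 and split them into k roughly equal folds.
        rng = np.random.default_rng(seed)
        return np.array_split(rng.permutation(n), k)

    def cross_val_accuracy(train_fn, predict_fn, X, y, k=5):
        # Average held-out accuracy over k folds.
        folds = kfold_indices(len(y), k)
        scores = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            model = train_fn(X[train], y[train])
            scores.append(np.mean(predict_fn(model, X[test]) == y[test]))
        return float(np.mean(scores))

    # Usage with a trivial majority-class "model" (purely illustrative).
    X = np.arange(20, dtype=float).reshape(-1, 1)
    y = np.array([0] * 12 + [1] * 8)
    train = lambda X, y: np.bincount(y).argmax()
    predict = lambda model, X: np.full(len(X), model)
    print(cross_val_accuracy(train, predict, X, y, k=5))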
Tue, Oct 2 Lecture 10: statistics review, model selection (introduction)
Homework 5: due on Oct 7, 11.59pm EST
Thu, Oct 4 Lecture 11: model selection (VC dimension)
Notes: [1]
Tue, Oct 9 OCTOBER BREAK
Thu, Oct 11 Lecture 12: dimensionality reduction, principal component analysis (PCA)
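A minimal sketch of PCA via the SVD, matching the Lecture 12 topic (illustrative only; random data stands in for a real dataset):

    import numpy as np

    def pca(X, k):
        # Project X (n, d) onto its top-k principal components.
        Xc = X - X.mean(axis=0)  # center the data
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:k].T, Vt[:k]  # projections and directions

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    Z, W = pca(X, k=2)
    print(Z.shape)  # (100, 2)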
Tue, Oct 16 MIDTERM (lectures 1 to 11) 1.30pm-2.45pm, Mathematical Sciences Building 175
Thu, Oct 18 Midterm solution (01, 02, 03)
Homework 6: due on Oct 23, 11.59pm EST
Tue, Oct 23 Case Study 1
Thu, Oct 25 Case Study 2
Tue, Oct 30 Lecture 13: probability review (joint, marginal and conditional probabilities)
Project plan due (see Assignments for details)
[Word] or [LaTeX] format
Thu, Nov 1 Lecture 14: statistics review (independence, maximum likelihood estimation)
Tue, Nov 6 Lecture 15: generative probabilistic modeling, maximum likelihood estimation, classification
Homework 7: due on Nov 13, at end of lecture
Thu, Nov 8 Lecture 16: clustering, mixture models, expectation-maximization (EM) algorithm
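A minimal sketch of the EM algorithm for a two-component 1-D Gaussian mixture, the Lecture 16 topic (illustrative only: crude initialization, made-up data, and no convergence check):

    import numpy as np

    def em_gmm_1d(x, iters=50):
        mu = np.array([x.min(), x.max()])   # crude initial means
        var = np.array([x.var(), x.var()])  # initial variances
        pi = np.array([0.5, 0.5])           # initial mixing weights
        for _ in range(iters):
            # E-step: responsibility of each component for each point.
            dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) \
                   / np.sqrt(2 * np.pi * var)
            r = dens / dens.sum(axis=1, keepdims=True)
            # M-step: re-estimate weights, means, and variances.
            nk = r.sum(axis=0)
            pi = nk / len(x)
            mu = (r * x[:, None]).sum(axis=0) / nk
            var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        return pi, mu, var

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 0.5, 200)])
    print(em_gmm_1d(x))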
Tue, Nov 13 Case Study 3
Homework 7 solution
Thu, Nov 15 Lecture 17: Bayesian networks (independence)
Refs: [1] (optional reading)
Preliminary project report, due on Nov 16, 11.59pm EST
Tue, Nov 20 Lecture 18: generative probabilistic classification (naive Bayes), non-parametric methods (nearest neighbors)
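A minimal sketch of k-nearest-neighbors classification, one of the non-parametric methods in Lecture 18 (illustrative only; brute-force distances and made-up data):

    import numpy as np

    def knn_predict(X_train, y_train, X_test, k=3):
        # Pairwise squared Euclidean distances, shape (n_test, n_train).
        d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
        nn = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbors
        # Majority vote per test point (labels assumed to be small ints).
        return np.array([np.bincount(y_train[i]).argmax() for i in nn])

    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0], [2.1, 1.9]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([[0.05, 0.1], [2.0, 2.1]])))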
Thu, Nov 22 THANKSGIVING VACATION
Tue, Nov 27 Lecture 19: non-parametric methods (classification trees)
Thu, Nov 29 FINAL EXAM (lectures 12 to 19, all case studies) 1.30pm-2.45pm, Mathematical Sciences Building 175
Final project report, due on Dec 1, 11.59pm EST
Tue, Dec 4 Final exam solution
Thu, Dec 6