CS 37300: Data Mining and Machine Learning

Semester: Spring 2021, also offered in Fall 2019 and Fall 2018
Time and place: Monday, Wednesday and Friday, 10.30am-11.20am EST
Instructor: Jean Honorio, Lawson Building 2142-J (Please send an e-mail for appointments)
TAs: Kevin Bello, email: kbellome at purdue.edu, Office hours: Monday 1pm-3pm EST
Prerit Gupta, email: gupta596 at purdue.edu, Office hours: Friday 2pm-4pm EST
Chuyang Ke, email: cke at purdue.edu, Office hours: Tuesday 2pm-4pm EST
Jin Son, email: son74 at purdue.edu, Office hours: Thursday 3pm-5pm EST
Anxhelo Xhebraj, email: axhebraj at purdue.edu, Office hours: Wednesday noon-2pm EST
Kaiyuan Zhang, email: zhan4057 at purdue.edu, Office hours: Tuesday 10am-noon EST

This course will introduce students to the field of data mining and machine learning, which sits at the interface between statistics and computer science. Data mining and machine learning focuses on developing algorithms to automatically discover patterns and learn models of large datasets. This course introduces students to the process and main techniques in data mining and machine learning, including exploratory data analysis, predictive modeling, descriptive modeling, and evaluation.

In particular, topics in supervised learning include: linear and non-linear classifiers, anomaly detection, rating, ranking, model selection. Topics in unsupervised learning include: clustering, mixture models. Topics in probabilistic modeling include: maximum likelihood estimation, naive Bayes classifier. Topics in non-parametric methods include: nearest neighbors, classification trees.

Prerequisites

CS 18200 and CS 25100, plus one of STAT 35000, STAT 35500 or STAT 51100 (the statistics course may be taken concurrently).

Textbooks

There is no official textbook for this class. I will post slides and pointers to reading materials. Recommended books for further reading include:

Principles of Data Mining by David Hand, Heikki Mannila and Padhraic Smyth. (Free with Purdue ID)
A Course in Machine Learning by Hal Daumé III. (Free)
Pattern Recognition and Machine Learning by Christopher M. Bishop.
Machine Learning by Tom Mitchell.
Pattern Classification, 2nd Edition by Richard O. Duda, Peter E. Hart, David G. Stork.

Assignments

There will be up to eight homeworks, one midterm exam, one final exam and one project (dates posted on the schedule). The homeworks are to be done individually (when programming is required, we will use Python). The project is to be done in groups of 3 students.

For the project, you will write a half-page project plan (around 1-2 weeks before the midterm), a 2-4 page preliminary results report (around 1-2 weeks after the midterm) and a 4-8 page final results report (around 1-2 weeks before the final exam). The project should include: Neither I nor the TAs will provide any help regarding programming-related issues.

Grading

Quizzes/attendance: 5%
Homeworks: 35%
Project: 10%
Midterm exam: 25%
Final exam: 25%
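
As a rough illustration of how the percentages above combine into a single course score, here is a minimal Python sketch. The component names, the `course_score` helper and the example scores are hypothetical. How individual homeworks or quizzes are aggregated within each component is not specified here, so each component is assumed to already be a percentage between 0 and 100.

```python
# Weights copied from the grading breakdown above (percent of the final grade).
# The dictionary keys are illustrative names, not official identifiers.
WEIGHTS = {
    "quizzes_attendance": 5,
    "homeworks": 35,
    "project": 10,
    "midterm": 25,
    "final": 25,
}


def course_score(scores):
    """Return the weighted course score on a 0-100 scale.

    `scores` maps each component name to a percentage in [0, 100].
    """
    # Sanity check: the listed weights cover the whole grade.
    assert sum(WEIGHTS.values()) == 100
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical student, for illustration only.
example = {
    "quizzes_attendance": 100,
    "homeworks": 90,
    "project": 80,
    "midterm": 70,
    "final": 85,
}
print(course_score(example))  # 83.25
```

In this sketch the two exams together carry half of the score, so a 10-point change on either exam moves the total by 2.5 points.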

Late policy

Assignments are to be submitted by the due date listed. Assignments will not be accepted if they are even one minute late.

Academic Honesty

Please read the departmental academic integrity policy here. This will be followed unless we provide written documentation of exceptions. We encourage you to interact amongst yourselves: you may discuss and obtain help with basic concepts covered in lectures and with homework specifications (but not solutions). However, unless otherwise noted, work turned in should reflect your own efforts and knowledge. Sharing or copying solutions is unacceptable and could result in failure. You are expected to take reasonable precautions to prevent others from using your work.

Additional course policies

Please read the general course policies here.

Schedule

Date Topic (Tentative) Notes
Wed, Jan 20 Lecture 1: introduction Python
Fri, Jan 22 Lecture 2: probability review (joint, marginal and conditional probabilities)
Mon, Jan 25     (lecture continues)
Lecture 3: statistics review (independence, maximum likelihood estimation)
Wed, Jan 27     (lecture continues)
Fri, Jan 29 Lecture 4: linear algebra review Linear algebra in Python
Homework 1: due on Feb 5, 11.59pm EST
Mon, Feb 1 Lecture 5: elements of data mining and machine learning algorithms
Wed, Feb 3     (lecture continues)
Lecture 6: linear classification, perceptron
Fri, Feb 5 Homework 1 due
Homework 1 solution
Mon, Feb 8     (lecture continues)
Wed, Feb 10     (lecture continues)
Lecture 7: perceptron (convergence), support vector machines (introduction)
Fri, Feb 12     (lecture continues) Homework 2: due on Feb 19, 11.59pm EST
Mon, Feb 15     (lecture continues)
Wed, Feb 17 READING DAY
Fri, Feb 19 Lecture 8: generative probabilistic modeling, maximum likelihood estimation, classification Homework 2 due
Mon, Feb 22     (lecture continues)
Lecture 9: generative probabilistic classification (naive Bayes), non-parametric methods (nearest neighbors)
Homework 3: due on Mar 1, 11.59pm EST
Wed, Feb 24     (lecture continues)
Lecture 10: non-parametric methods (classification trees)
Fri, Feb 26     (lecture continues)
Case Study 1