## Data Mining & Machine Learning

CS37300 • Fall 2017 • Time: MWF 1:30-2:20pm • Location: Grissom Hall 103

Schedule • Optional Textbooks: Principles of Data Mining (David J. Hand; Heikki Mannila; Padhraic Smyth, FREE with PUID), A Course in Machine Learning (Hal Daume III, FREE), Pattern Recognition and Machine Learning (Bishop), and The Elements of Statistical Learning (Hastie, Tibshirani, Friedman: FREE) • Blackboard • Piazza

### Instructor

Professor Bruno Ribeiro

Lawson 2142C • ribeiro[at]cs.purdue.edu (Please use Piazza; communicating over email will be SLOW)

Office hours: Mondays, 12noon-1pm, LWSN 2142C

Office Hour Changes:

- Change from Monday 9/4, noon to Tuesday 9/5, 10-11am
- Change from Monday 9/11, noon to Monday 9/11, 2:30-3:30pm

### Teaching assistants

Israa Al-Qassem, Trevor Bonjour, Leonardo Teixeira

Office hours:

- Tue, 2:00pm - 3:00pm (Trevor, HAAS 143)
- Thu, 4:30pm - 5:30pm (Leo, HAAS G050)
- Fri, 3:00pm - 4:00pm (Israa, HAAS G050)

Questions: We will use Piazza for class questions and discussion. Instead of sending email to the TA list, please post your questions on Piazza.

Email: Please use Piazza to communicate with TAs and Instructor.

### Computing Resources

All enrolled students have access to the Scholar cluster.
Students can log in remotely via ssh to scholar.rcac.purdue.edu or via their web browsers using the Remote Desktop web app.

For lighter tasks that don't need hours of computation, students can also start IPython Notebooks on the Scholar cluster.

### Description

This course will introduce students to the field of machine learning and data mining, which sits at the interface between statistics and computer science. Data mining and machine learning focus on developing algorithms to automatically discover patterns and learn models of large datasets. This course introduces students to the process and main techniques in data mining and machine learning, including exploratory data analysis, predictive modeling, descriptive modeling, and evaluation.

### Learning objectives

Upon completing the course, students should be able to:

- Identify key elements of data mining and machine learning algorithms
- Understand how algorithmic elements interact to impact performance
- Understand how to choose algorithms for different analysis tasks
- Analyze data in both an exploratory and targeted manner
- Implement and apply basic algorithms for supervised and unsupervised learning
- Accurately evaluate the performance of algorithms, as well as formulate and test hypotheses

### Prerequisites

Prerequisites: CS182, CS251. Concurrent prerequisite: ST350 or ST511.

### Text

The texts below are recommended but not required. Reading materials will be distributed as necessary. Reading assignments will be posted on the schedule; please check it regularly.

- Principles of Data Mining (David J. Hand; Heikki Mannila; Padhraic Smyth, FREE with PUID)
- A Course in Machine Learning by Hal Daume III is a good book with important practical guidelines
- The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman is an excellent reference book, available on the web for free at the link.
- Pattern Recognition and Machine Learning by Christopher M. Bishop is a very detailed and thorough book on the foundations of machine learning. A good textbook to buy to have as a reference for machine learning courses (not required).

### Assignments and exams

There will be five homework/programming assignments that will be posted on the schedule. Homework assignments should be submitted in class, unless otherwise noted. Programming assignments should be written in Python 3, unless otherwise noted, and should be submitted on data.cs.purdue.edu using Turnin. Details will be provided in the assignments.

In general, questions about the details of homework assignments must be posted on Piazza (either as a public or private question). There is no guarantee that emailed questions will be answered, so please use Piazza. Example solutions, when applicable, will be made available after homework is returned to students.

There will be several online quizzes as well as a midterm and comprehensive final exam. Exams will be closed book and closed notes.

### Grading

- Quizzes/participation: 10%
- Data Challenge (Kaggle Competition): +5% (extra credit)
- Homework: 45% (top 6 grades out of the 7 assignments)
- Midterm: 20%
- Final exam: 25%
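
The weights above can be combined as in the following minimal sketch. It is an illustration only, not the official grade calculation: the function name `course_grade`, the equal weighting of homework assignments, the use of 0-100 percentage scores, the treatment of a missed assignment as a score of 0, and the addition of the Kaggle extra credit as percentage points are all assumptions. The possible extra credit for a 7th assignment is not modeled, since its amount is not defined.

```python
def course_grade(quizzes, homeworks, midterm, final, kaggle_bonus=0.0):
    """Hypothetical helper: combine component scores into a course percentage.

    quizzes, midterm, final: scores in [0, 100] (assumed scale).
    homeworks: list of homework scores in [0, 100]; a missed or late
        assignment is assumed to be recorded as 0.
    kaggle_bonus: extra-credit points in [0, 5] (assumed to be additive).
    """
    # Keep only the six best homework scores and average them equally
    # (equal weighting per assignment is an assumption).
    best_six = sorted(homeworks, reverse=True)[:6]
    homework_avg = sum(best_six) / 6

    # Weights taken from the grading breakdown: 10% / 45% / 20% / 25%.
    weighted = (0.10 * quizzes
                + 0.45 * homework_avg
                + 0.20 * midterm
                + 0.25 * final)
    return weighted + kaggle_bonus


if __name__ == "__main__":
    # One missed homework (the 0) is dropped because only the top 6 count.
    scores = [100, 90, 0, 95, 85, 100, 90]
    print(round(course_grade(80, scores, 70, 80), 2))
```

In this example the missed assignment does not lower the homework average, because the six remaining scores are the ones that count.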

### Late policy

Assignments MUST be submitted by the due date listed.

**IMPORTANT: All the deadlines are 11:59PM (midnight) of the due dates; No late submissions accepted!**

Late assignments will not be graded, but only the top 6 out of the 7 assignments will count towards your grade.
If, at the time of the 7th assignment, the student has perfect grades on the 6 past assignments, the 7th assignment will count as extra credit (amount of credit to be defined).

Assignments will NOT BE accepted if they are LATE (even if just one minute late). Additional extensions (beyond one missed homework) will be granted only due to serious and documented medical or family emergencies but never after the HW solution is released.

### Academic honesty

Please read the departmental academic integrity policy. This will be followed unless we provide written documentation of exceptions. We encourage you to interact amongst yourselves: you may discuss and obtain help with basic concepts covered in lectures or the textbook, homework specification (but not solution), and program implementation (but not design). However, unless otherwise noted, work turned in should reflect your own efforts and knowledge. Sharing or copying solutions is unacceptable and could result in failure. We use copy detection software, so do not copy code and make changes (either from the Web or from other students). You are expected to take reasonable precautions to prevent others from using your work.

### Additional course policies

Please read the general course policies here.

### Resources

Stanford's CS229 Lecture Notes by Andrew Ng are a concise introduction to machine learning.

Andrew Ng's Coursera course contains excellent explanations.

Pedro Domingos's Coursera course is a more advanced course.

### Course outline

**Introduction** (1 week)

What is data mining? What is machine learning? Overview of the process and associated tasks. Example applications.
Types of data: attributes, instances. Python and libraries.

**Background and basics** (1 week)

Populations and samples. Random variables and distributions.

**Exploratory data analysis** (2 weeks)

Data cleaning and preprocessing. Sampling. Feature construction and discovery. Visualization methods. Hypothesis testing.

**Predictive Modeling** (3 weeks)

Classification problem formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., naive Bayes, decision trees, nearest neighbor, SVM, logistic regression). Evaluation: metrics, cross-validation, learning curves.

**Understanding and Extending Model Performance** (1 week)

Error analysis. Feature selection. Ensemble techniques.

**Deep Learning** (3 weeks)

Perceptron. Discriminative and generative neural network models.

**(Nov 6)** Mr. Matt Booty (Corporate Vice President of Redmond's Minecraft Team at Microsoft) will address the class.

**Descriptive Modeling** (2 weeks)

Clustering problem formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., k-means, hierarchical clustering, spectral clustering). Evaluation: metrics, subjective assessment.