CS57300  Graduate Data Mining
CS57300: Fall 2016 — Time: Tue & Thu 9:00–10:15am — Location: Physics Building 203
Schedule
Instructor
Bruno Ribeiro
LWSN 2142C (ribeiro@cs.purdue.edu)
Regular office hours: Fridays 1–2pm (LWSN 2142C or 2150)
Office hour exceptions (09/02 and 09/23): 12–1pm
When emailing about the course, please include [CS57300] in the subject line; without it, your message is unlikely to be read.
TAs
Israa AlQassem (ialqasse@purdue.edu)
Office hours: HASS G050, Wednesdays 2–3pm.
Rohit Rangan (rrangan@purdue.edu)
Office hours: HASS G050, Tuesdays 12–1pm.
Description
This is the website for CS57300 (graduate) Data Mining.
Finals
The final is scheduled for Mon 12/12, 7:00–9:00pm, PHYS 203
Learning objectives
Upon completing the course, students should be able to:
 Identify key elements of data mining systems and the knowledge discovery process
 Understand how algorithmic elements interact to impact performance
 Recognize various types of data mining tasks
 Implement and apply basic algorithms and standard models
 Understand how to evaluate performance, as well as formulate and test hypotheses
Prerequisites
STAT516 or an equivalent introductory statistics course, CS 381 or an equivalent course that covers basic programming skills (e.g., STAT 598G), or permission of instructor.
Text
The texts below are recommended but not required. Reading materials will be distributed as necessary. Reading assignments will be posted on the schedule; please check regularly.
 James, Witten, Hastie, and Tibshirani, Introduction to Statistical Learning.
 Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning.
 C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
 Michael Lavine, Introduction to Statistical Thought (introduction to statistics with plenty of R examples, free online)
Assignments and exams
There will be 5 homework assignments. The lowest homework grade will be discarded. Homework assignments should be submitted on https://ribeirowww.rcac.purdue.edu/cs57300; details will be provided in the assignments. Programming assignments should be written in Python, unless otherwise noted.
All homework is due by the designated Friday 11:59pm (Eastern) deadline; see the late policy below.
In general, questions about the details of homework assignments should be posted on Piazza, and will be answered by the TAs and instructor.
There will be one take-home midterm and a comprehensive (in-class) final exam. The final will be closed book and closed notes.
Grading
 Homework: 70%
 Final exam: 30%
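As a minimal sketch of how this weighting combines with the dropped-lowest-homework rule from the assignments section (assuming all scores are percentages on a 0–100 scale; the syllabus does not specify rounding or letter-grade cutoffs):

```python
def final_grade(homework_scores, final_exam_score):
    """Course grade: 70% homework (lowest of the 5 dropped), 30% final.

    Assumes percentage scores on a 0-100 scale; rounding and letter-grade
    mapping are not specified in the syllabus.
    """
    kept = sorted(homework_scores)[1:]      # drop the single lowest homework
    homework_avg = sum(kept) / len(kept)    # average the remaining scores
    return 0.7 * homework_avg + 0.3 * final_exam_score

# Example: homeworks 60, 70, 80, 90, 100 (the 60 is dropped) and a 90 final
# -> 0.7 * 85 + 0.3 * 90 = 86.5
print(final_grade([60, 70, 80, 90, 100], 90))
```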
Grades posting
Grades will be posted here.
Late policy
All homework is due on the designated Friday 11:59pm (Eastern) deadline. You may, however, submit ONE update of your solutions by 11:59pm (Eastern) on the following Sunday (48 hours after the deadline) without penalty. There are no homework extensions, and late homework assignments will not be accepted.
Academic honesty
Please read the departmental academic integrity policy. This policy will be followed unless we provide written documentation of exceptions. We encourage you to interact amongst yourselves: you may discuss and obtain help with basic concepts covered in lectures or the textbook, homework specifications (but not solutions), and program implementation (but not design). However, unless otherwise noted, work turned in should reflect your own efforts and knowledge. Sharing or copying solutions is unacceptable and could result in failure. We use copy-detection software, so do not copy code and make changes (whether from the Web or from other students). You are expected to take reasonable precautions to prevent others from using your work.
Additional course policies
Please read the general course policies here.
Schedule (Subject to Change)
 Course Overview:
 Background:
 08/25 Background (Population sampling. Random variables and distributions) [slides]
 08/30 Background (Linear Algebra review, Statistical inference review. Cluster Use. R and Python) [slides]
 09/01 Background (Regression & Working with Data) [slides]
 09/06 Principles of Website Functionality & Advertisement [slides]
 Find Best Hypothesis:
 09/08 Hypothesis Testing (A/B Testing) [slides]
 Chapter 18.7 (Feature Assessment and the Multiple-Testing Problem), Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning. Get it online here
 T. A. B. Snijders, Hypothesis Testing: Methodology and Limitations. [pdf]
 Kohavi R., Longbotham R., Sommerfield D., Henne R.M. Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery. 2009 Feb;18(1):140–181. [paper] [pdf].
 American Statistical Association Statement on Statistical Significance and p-values [PDF]
 09/13 Testing Multiple Hypotheses [slides]
 09/15 Multi-Armed Bandits (MAB) [slides]
 Sebastien Bubeck and Nicolo Cesa-Bianchi, Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Foundations and Trends® in Machine Learning, Vol. 5, Issue 1 (free with a PUID) link
 Lai T.L., Robbins H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics. 1985;6(1):4–22.
 P.R. Kumar and Pravin Varaiya, Stochastic Systems: Estimation, Identification, and Adaptive Control, 1986 [pdf]
 A. Mahajan, D. Teneketzis, Multi-armed bandit problems. In Foundations and Applications of Sensor Management, 2008 (pp. 121–151). Springer US.
 Olivier Chapelle and Lihong Li, An Empirical Evaluation of Thompson Sampling, NIPS 2011. pdf
 09/20 Multi-Armed Bandits (cont.) [slides]
 Classifying Items:
 09/22 Classification Tasks (Discriminative vs. Predictive) [slides] [HW2 Solution]
 Chapter 4 (Classification), Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning. Get it online here
 Chapter 4, C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006 (see Blackboard)
 09/27 Classification II [slides]
 Chapter 4 (Classification), Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning. Get it online here
 Chapter 4, C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006 (see Blackboard)
 09/29 Decision Trees I [slides]
 Chapter 9.2 (Tree-Based Methods), Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning. Get it online here
 Chapter 14.4 (Tree-based Models), C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006 (see Blackboard)
 Chapter 10 (Boosting and Additive Trees), Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning [pdf]
 10/04 Decision Trees II
 Finding Good Hypotheses:
 10/06 Methods and Criteria for Model Selection I (Measuring classification error) [slides]
 Chapter 7 (Model Assessment and Selection), Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning. Get it online here
 Analysis of ML experiments
 10/11 OCTOBER BREAK
 10/13 Methods and Criteria for Model Selection II (AIC, BIC, Cross Validation) + Feature construction for SVMs
 Chapter 7 (Model Assessment and Selection), Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning. Get it online here
 Analysis of ML experiments
 10/18 Dimensionality Reduction [slides]
 Chapter 14 (Unsupervised Learning), Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning [pdf]
 Chapter 10 (Unsupervised Learning), James, Witten, Hastie, and Tibshirani, Introduction to Statistical Learning [pdf]
 Chapter 12 (Continuous Latent Variables), C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
 Visualizing & Coping with Missing Data:
 10/20 (NO CLASS)
 10/25 Collaborative Filtering (missing values) [slides]
 10/27 Tensor Decomposition [slides]
 11/01 Clustering I [slides]
 Hastie, Tibshirani and Friedman, The Elements of Statistical Learning, Chapter 14
 Bishop, Pattern Recognition and Machine Learning, Chapter 9
 11/03 Clustering II [slides]
 Hastie, Tibshirani and Friedman, The Elements of Statistical Learning, Chapter 14
 Bishop, Pattern Recognition and Machine Learning, Chapter 9
 Find Important & Related Items:
 11/08 Link Analysis & Prediction Heuristics (PageRank, Personalized PageRank, HITS) [slides]
 11/10 Link Spam & Personalized PageRank Heuristics [slides]
 11/15 Link Analysis & Prediction Heuristics (Link prediction, Missing values) [slides]
 Tailoring Analysis to Problem:
 11/17 Latent Variable Models: Naive Bayes [slides]
 Daphne Koller and Nir Friedman, Probabilistic Graphical Models: Principles and Techniques (Ch. 3), The MIT Press, 2009
 11/22 Latent Variable Models & Inference [slides]
 Ruslan Salakhutdinov and Andriy Mnih, Probabilistic Matrix Factorization, NIPS 2007
 Ruslan Salakhutdinov and Andriy Mnih, Bayesian Probabilistic Matrix Factorization Using Markov Chain Monte Carlo, ICML 2008
 Xiong, L., Chen, X., Huang, T.-K., Schneider, J. G. & Carbonell, J. G., Temporal Collaborative Filtering with Bayesian Probabilistic Tensor Factorization, SDM 2010
 Yılmaz, K. Y., Cemgil, A. T. & Simsekli, U., Generalised Coupled Tensor Factorisation, NIPS 2011
 Chapter 3.4 (Shrinkage Methods), Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning.
 Chapter 8.2 (The Bootstrap and Maximum Likelihood Methods), Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning.
 Chapter 8.3 (Bayesian Methods), Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning.
 11/24 THANKSGIVING BREAK
 Improving Weak Models:
 11/29 Ensembles & Bagging, Random Forest, Mixture of Experts [slides]
 Chapters 8 & 10 (Model Inference and Averaging; Boosting and Additive Trees), Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning [html]
 12/01 Ensemble & Boosting, Unsupervised Methods [slides]
 Chapters 8 & 10 (Model Inference and Averaging; Boosting and Additive Trees), Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning [html]
 12/06 Deep Learning [slides]
 Chapter 11 (Neural Networks), Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning [pdf]
 Deep Learning Tutorial
 Review:
 FINAL: Mon 12/12, 7:00–9:00pm, PHYS 203