CS57300 - Graduate Data Mining

CS57300: Spring 2018 — Time: Tue & Thu 9:00a-10:15a — Location: Wetherill Lab of Chemistry 320

Instructor

Bruno Ribeiro
LWSN 2142C (use Piazza to communicate with me)
Regular office hours: Thursdays 10:30-11:30am
Office Hour Anomalies: TBD

All communication will be on Piazza (email will be SLOW at best and may go unanswered at worst).

TAs

Linjie Li. Office hours: HASS G50, Thursdays 1:30-2:30 PM.

Anoop Santhosh. Office hours: HASS G50, Monday 9:00-10:00AM.

Akhil Israni. Office hours: HASS G50, Friday 3:00-4:00PM.

Description

This is the website for CS57300 (graduate) Data Mining.

Finals

The final exam is scheduled for TBD.

Learning objectives

Upon completing the course, students should be able to:

Prerequisites

STAT 516 (or an equivalent introductory statistics course), CS 381 (or an equivalent course that covers basic programming skills, e.g., STAT 598G), or permission of the instructor.
This course has a heavy programming assignment load. All assignments are in Python. The course assumes students are proficient in Python or can learn it quickly (say, in less than one week).

There will be no waiving of prerequisites except in truly exceptional cases, where the student can PROVE mastery of linear algebra, statistics, and C++/Python programming (and can learn Python quickly without help). A skills test will be administered at a later date to determine whether a student qualifies for a waiver.

Text

The texts below are recommended but not required. Reading materials will be distributed as necessary. Reading assignments will be posted on the schedule; please check it regularly.


Assignments and exams

There will be 5 to 6 homework assignments. The lowest homework grade will be discarded. Homework assignments should be submitted on Blackboard; details will be provided in the assignments. Programming assignments should be written in Python 3, unless otherwise noted.


Questions about the details of homework assignments should be posted on Piazza, and will be answered by the TAs or instructor.

There will be one individual course project.

There will be one comprehensive (in-class) final exam. The final will be closed book and closed notes.

Grading

Grades will be posted here.

Late policy

All homework will be due on the designated deadlines. There are no homework extensions. Late homework assignments will not be accepted.

Academic honesty

Please read the departmental academic integrity policy. This will be followed unless we provide written documentation of exceptions.
  • Unless stated otherwise, each student should write up their own solutions independently. You need to indicate the names of the people you discussed a problem with; ideally you should discuss with no more than two other people.
  • NO PART OF A STUDENT'S ASSIGNMENT SHOULD BE COPIED FROM ANOTHER STUDENT (plagiarism). We encourage you to interact amongst yourselves: you may discuss and obtain help with basic concepts covered in lectures or the textbook, homework specification (but not solution), and program implementation (but not design). However, unless otherwise noted, work turned in should reflect your own efforts and knowledge. Sharing or copying solutions is unacceptable and could result in failure. We use copy-detection software, so do not copy code and make changes (either from the Web or from other students). You are expected to take reasonable precautions to prevent others from using your work.
  • Any student not following these guidelines is subject to an automatic F (final grade).
Additional course policies

Please read the general course policies here.


Schedule (Subject to Change)

    Each entry lists: date, topic (and notes), reading, and slides.
    01/09 Course Overview and Review (Course objectives; population sampling; random variables and distributions) Principles of Data Mining, Chapter 1 Intro (Lecture 1)
    01/11 Review (Linear Algebra review, Statistical Estimation, Maximum Likelihood Estimation (MLE). Using the scholar cluster. Python 3 overview) Python Resources & Cluster Use Principles of Data Mining, Chapter 2, 4 Lecture 2 (LinAlg, MLE, cluster use)
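
    In the spirit of the MLE review and Python 3 overview above, a minimal numpy sketch of maximum likelihood estimation for a Gaussian on simulated data (illustrative only, not course material):

        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.normal(loc=2.0, scale=1.5, size=1000)  # simulated sample

        # Gaussian MLE: the mean estimate is the sample mean and the
        # variance estimate is the (biased) average squared deviation.
        mu_hat = x.mean()
        sigma2_hat = ((x - mu_hat) ** 2).mean()
        print(mu_hat, sigma2_hat)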
    01/16 Review (Regression, Posteriors, & Working with Data) Notes Working with Data Principles of Data Mining, Chapter 3,4 Lecture 3
    01/18 Principles of Website Functionality & Advertisement Lecture 4
    01/23 Classification and Regression Tasks (Discriminative vs. Predictive). Assessing Accuracy
    • Chapter 4.1, 4.2 (Classification) Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning. Get it online here
    • Chapter 4, C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006 (see Blackboard)
    • Principles of Data Mining, Chapter 5, 6
    Lecture 5 (Intro, Model Score & Search, Accuracy Measures, Linear Regression, Perceptron)
    01/25 Improving Model Score & Search: Logistic Regression & SVM Classifiers
    • Chapter 12 (SVM), Chapter 4.4 Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning. Get it online here
    • Chapter 4.4 (Logistic Regression) Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning. Get it online here
    Lecture 6 (Logistic Regression, SVM)
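
    A minimal sketch of the logistic regression training step from this lecture, using batch gradient descent on the negative log-likelihood (numpy only; the toy interface is an assumption for illustration):

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def fit_logistic(X, y, lr=0.1, n_iter=1000):
            """Batch gradient descent on the logistic negative log-likelihood."""
            X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a bias column
            w = np.zeros(X.shape[1])
            for _ in range(n_iter):
                p = sigmoid(X @ w)
                w -= lr * (X.T @ (p - y)) / len(y)  # gradient is X^T (p - y)
            return w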
    01/30 Controlling Model Space: Priors and Regularization Lecture 7
    02/01 Exploratory Data Analysis / Feature Construction Lecture 8
    02/06 Descriptive modeling: Clustering [Guest Lecture] Principles of Data Mining, Chapter 8.4 Lecture 9
    02/08 Probabilistic Models: The Naive Bayes Model Lecture 10
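
    A minimal sketch of a Bernoulli naive Bayes classifier with Laplace smoothing, for binary features and 0/1 labels (numpy only, illustrative):

        import numpy as np

        def fit_bernoulli_nb(X, y, alpha=1.0):
            """X: 0/1 feature matrix, y: 0/1 label array. Returns class priors
            and per-class feature probabilities with Laplace smoothing."""
            priors, cond = {}, {}
            for c in (0, 1):
                Xc = X[y == c]
                priors[c] = len(Xc) / len(X)
                cond[c] = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
            return priors, cond

        def predict_nb(priors, cond, x):
            """Pick the class maximizing the joint log-probability."""
            score = lambda c: (np.log(priors[c])
                               + np.sum(x * np.log(cond[c])
                                        + (1 - x) * np.log(1 - cond[c])))
            return max((0, 1), key=score)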
    02/13 Decision Trees
    • Chapter 9.2, (Tree-Based Methods) Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning. Get it online here
    • Chapter 14.4, (Tree-based Models) C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006 (see Blackboard)
    Lecture 11 (Decision Trees)
    02/15 Ensemble Methods
    (Improving weak models with Boosting and unstable models with Bagging)
    Lecture 12 (Bagging and Boosting)
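
    A minimal sketch of bagging, assuming a base learner given as a fit(X, y) callable returning an object with a .predict(X) method (this interface is an illustrative assumption, not a course API):

        import numpy as np

        def bagged_predict(fit, X_train, y_train, X_test, n_models=25, seed=0):
            """Train each base model on a bootstrap resample of the training
            set and average their predictions on the test set."""
            rng = np.random.default_rng(seed)
            preds = []
            for _ in range(n_models):
                idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap
                model = fit(X_train[idx], y_train[idx])
                preds.append(model.predict(X_test))
            return np.mean(preds, axis=0)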
    02/20 Boosted Decision Trees
    • Chapter 10, (Boosting and Additive Trees) Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning. Get it online here
    Lecture 13 (Boosted Decision Trees)
    02/22 Introduction to Neural Networks A Course in Machine Learning (Chapter 4, Perceptron), Hal Daume III
    Lecture 14
    02/27 Deep Neural Networks for Classification Tasks (Feedforward Networks) Pattern Recognition and Machine Learning (Chapter 5.1), Bishop.
    Lecture 15
    03/01 Deep Neural Networks: Training (Backpropagation) Pattern Recognition and Machine Learning (Chapters 5.2-5.3), Bishop.
    Lecture 16
    03/06 Hands-on: Training a Deep Neural Network from Scratch Pattern Recognition and Machine Learning (Chapters 5.3,5.5), Bishop.
    Lecture 17
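
    In the spirit of this hands-on session, a minimal one-hidden-layer network trained with hand-written backpropagation on simulated XOR-like data (numpy only; a sketch, not the course's starter code):

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 2))
        y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)  # XOR-like labels

        # One hidden layer of 8 tanh units, sigmoid output, cross-entropy loss.
        W1 = rng.normal(scale=0.5, size=(2, 8))
        b1 = np.zeros(8)
        W2 = rng.normal(scale=0.5, size=(8, 1))
        b2 = np.zeros(1)
        lr = 0.5
        for _ in range(2000):
            H = np.tanh(X @ W1 + b1)                   # forward: hidden layer
            p = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))   # forward: output probability
            d_logit = (p - y) / len(X)                 # dLoss/dlogit for cross-entropy
            dW2, db2 = H.T @ d_logit, d_logit.sum(axis=0)
            dH = (d_logit @ W2.T) * (1.0 - H**2)       # backprop through tanh
            dW1, db1 = X.T @ dH, dH.sum(axis=0)
            W1 -= lr * dW1; b1 -= lr * db1             # gradient descent update
            W2 -= lr * dW2; b2 -= lr * db2
        print("training accuracy:", float(((p > 0.5) == y).mean()))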
    03/08 Model Evaluation: Finding Good Hypotheses (Model Selection, Measuring Classification Error: AIC, BIC, Cross Validation) Lecture 18
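
    A minimal sketch of k-fold cross-validation over a generic fit/error pair (the two callables are assumptions for illustration):

        import numpy as np

        def k_fold_error(fit, error, X, y, k=5, seed=0):
            """fit(X, y) -> model; error(model, X, y) -> scalar test error.
            Returns the error averaged over k held-out folds."""
            idx = np.random.default_rng(seed).permutation(len(X))
            folds = np.array_split(idx, k)
            errs = []
            for i in range(k):
                test = folds[i]
                train = np.concatenate([folds[j] for j in range(k) if j != i])
                errs.append(error(fit(X[train], y[train]), X[test], y[test]))
            return float(np.mean(errs))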
    03/13 Spring Break
    03/15 Spring Break
    03/20 Hypothesis Testing and Bayesian Hypothesis Testing (A/B Testing)
    • T. A. B. Snijders, Hypothesis Testing: Methodology and Limitations. [pdf]
    • Kohavi R, Longbotham R, Sommerfield D, Henne RM. Controlled experiments on the web: survey and practical guide. Data mining and knowledge discovery. 2009 Feb 1;18(1):140-81. [paper] [pdf].
    • American Statistical Association Statement On Statistical Significance And p-values [PDF]
    Lecture 19
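
    A minimal worked example of the classical two-proportion z-test behind A/B testing, using the pooled standard error under the null of equal rates (the conversion counts are made up):

        import math

        def two_proportion_z(conv_a, n_a, conv_b, n_b):
            """z statistic for comparing conversion rates of variants A and B."""
            p_a, p_b = conv_a / n_a, conv_b / n_b
            p = (conv_a + conv_b) / (n_a + n_b)        # pooled rate under H0
            se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
            return (p_b - p_a) / se

        # e.g., 120/1000 conversions for A vs 150/1000 for B (made-up numbers)
        print(two_proportion_z(120, 1000, 150, 1000))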
    03/22 Model Assessment: Testing Multiple Hypotheses Lecture 20
    03/27 Decision Under Uncertainty: Multi-armed Bandits (Thompson Sampling)
    • Thompson Sampling, an excellent tutorial by Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen
    Lecture 21
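
    A minimal sketch of Thompson sampling for Bernoulli bandits with Beta(1, 1) priors, in the spirit of the tutorial above (the pull_arm callable is a stand-in for a real reward source):

        import numpy as np

        def thompson_bernoulli(pull_arm, n_arms, n_rounds=1000, seed=0):
            """pull_arm(i) -> 0 or 1 reward. Keeps a Beta posterior per arm,
            samples from each posterior, and plays the arm with the largest draw."""
            rng = np.random.default_rng(seed)
            wins = np.ones(n_arms)    # Beta alpha parameters
            losses = np.ones(n_arms)  # Beta beta parameters
            total = 0
            for _ in range(n_rounds):
                arm = int(np.argmax(rng.beta(wins, losses)))
                r = pull_arm(arm)
                total += r
                wins[arm] += r
                losses[arm] += 1 - r
            return total, wins, losses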
    03/29 Deep Neural Networks: Classification Tasks with Structured Data (Convolutional Networks) Lecture 22
    04/03 Collaborative Filtering Tasks and Classical Solutions Lecture 23
    04/05 Deep Neural Networks: Classification Tasks with Structured Data (Graph Data) Lecture 24
    04/10 Deep Neural Networks: Prediction Tasks with Time Series Data (Latent Markov Embeddings, Intro to NCE, word2vec) Lecture 25
    04/12 Hands-on: Training Deep Learning Models See Piazza post about GPU resources Lecture 26
    04/17 Dimensionality Reduction (Classical PCA)
    • Chapter 14, (Unsupervised Learning) Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning [pdf]
    • Chapter 12, (Continuous Latent Variables) C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
    Lecture 27
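
    A minimal sketch of classical PCA via the SVD of the centered data matrix:

        import numpy as np

        def pca(X, k):
            """Return the top-k principal directions and the projected data."""
            Xc = X - X.mean(axis=0)                        # center each feature
            U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
            components = Vt[:k]                            # principal directions
            scores = Xc @ components.T                     # low-dim coordinates
            return components, scores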
    04/19 Representation Learning (Advanced Dimensionality Reduction)
    • Chapter 14, (Unsupervised Learning) Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning [pdf]
    • Chapter 12, (Continuous Latent Variables) C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
    Lecture 28
    04/24 Descriptive modeling: Clustering from a representation learning perspective Principles of Data Mining, Chapter 6.4, 9 Lecture 29
    04/26 Final review