CS39000-DM0 Spring 2013 Time: TTh 3:00-4:15pm Location: HAAS G066
Professor Jennifer Neville
Lawson 2142D neville[at]cs.purdue.edu 6-9387
Office hours: Fridays 10:00am-11:00am
Syed Naqvi
Office hours: Wednesdays 5-6pm, LWSN B116
Questions: We will use Piazza for class questions and discussion. Instead of sending email to the TA list, please post your questions on Piazza.
Email: cs390dm-ta [at] cs.purdue.edu
This course will introduce students to the field of data mining and machine learning, which sits at the interface between statistics and computer science. Data mining and machine learning focus on developing algorithms to automatically discover patterns and learn models of large datasets. This course introduces students to the process and main techniques in data mining and machine learning, including exploratory data analysis, predictive modeling, descriptive modeling, and evaluation.
Prerequisites: CS182, CS251. Concurrent prerequisite: ST350 or ST511.
D. Hand, H. Mannila, P. Smyth (2001). Principles of Data Mining. MIT Press. Available online as e-book. Additional reading materials will be distributed as necessary. Reading assignments will be posted on the schedule; please check it regularly.
There will be six homework/programming assignments that will be posted on the schedule. Homework assignments should be submitted in class, unless otherwise noted. Programming assignments should be written in Python, unless otherwise noted, and should be submitted on data.cs.purdue.edu using Turnin. Details will be provided in the assignments.
In general, questions about the details of homework assignments should be directed to the TA, though you should feel free to mail the instructor whenever you have a question. Example solutions will be made available when homework is returned to students.
There will be several in-class quizzes as well as two midterms and a comprehensive final exam. Exams and quizzes will be closed book and closed notes.
Assignments are to be submitted by the due date listed. Each person will be allowed four days of extensions which can be applied to any combination of assignments during the semester without penalty. After that, a late penalty of 15% per day will be assigned. Use of a partial day will be counted as a full day. Use of extension days must be stated explicitly in the late submission (either directly in the submission header or by accompanying email to the TA); otherwise late penalties will apply. Extensions cannot be used after the final day of classes (i.e., Apr 26). Extension days cannot be rearranged after they are applied to a submission. Use them wisely!
Assignments will NOT BE accepted if they are more than five days late. Additional extensions will be granted only due to serious and documented medical or family emergencies.
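To make the late policy arithmetic concrete, here is a minimal Python sketch of how a late score could be worked out. It is an illustration only, not the official grading script. The function name and its parameters are hypothetical, and it rests on three assumptions the policy does not spell out: the 15% is taken from the score the submission earned, the five-day limit counts all days late (including days covered by extensions), and lateness is measured in hours past the deadline. It does not check the rule about the final day of classes.

```python
import math

PENALTY_PERCENT_PER_DAY = 15   # from the policy: 15% per day
MAX_DAYS_LATE = 5              # from the policy: not accepted beyond five days
SEMESTER_EXTENSION_DAYS = 4    # from the policy: four extension days per person


def late_score(raw_score, hours_late, extension_days_declared,
               extension_days_left=SEMESTER_EXTENSION_DAYS):
    """Return (final_score, extension_days_left) for one submission.

    Hypothetical helper; the argument names are illustrative.
    """
    if hours_late <= 0:
        # Submitted on time: no penalty, no extension days consumed.
        return raw_score, extension_days_left

    # A partial day counts as a full day.
    days_late = math.ceil(hours_late / 24)

    if days_late > MAX_DAYS_LATE:
        # Assumption: a submission past the limit earns no credit.
        return 0.0, extension_days_left

    # Only explicitly declared extension days are applied, and never
    # more than the student still has or than the submission needs.
    applied = min(extension_days_declared, days_late, extension_days_left)
    penalized_days = days_late - applied

    # Assumption: the penalty is a percentage of the earned score.
    remaining_percent = max(0, 100 - PENALTY_PERCENT_PER_DAY * penalized_days)
    return raw_score * remaining_percent / 100, extension_days_left - applied
```

For example, a submission that earned 100 points, arrived 30 hours late (counted as two days), and declared one extension day would be handled as follows under these assumptions:

```python
print(late_score(100, 30, 1))  # (85.0, 3)
```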
Introduction (1 week)
What is data mining? What is machine learning? Overview of the process and associated tasks. Example applications.
Background and basics (1 week)
Types of data: attributes, instances. Populations and samples. Random variables and distributions. R and Python.
Exploratory data analysis (2 weeks)
Data cleaning and preprocessing. Sampling. Feature construction and discovery. Visualization methods. Hypothesis testing.
Predictive Modeling (3 weeks)
Classification problem formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., naive Bayes, decision trees, nearest neighbor). Evaluation: metrics, cross-validation, learning curves.
Understanding and Extending Model Performance (1 week)
Error analysis. Feature selection. Ensemble techniques.
Descriptive Modeling (3 weeks)
Clustering problem formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., k-means, hierarchical clustering, spectral clustering). Evaluation: metrics, subjective assessment.
Pattern Mining (2 weeks)
Pattern detection formulation. Algorithmic elements: representation, scoring functions, search, inference. Overview of basic algorithms (e.g., association rules, anomaly detection). Evaluation: metrics, interestingness, understandability.