CS24200 Spring 2018
Lectures: TTh 9:30-10:20am LWSN 1106
Lab: T 3:30-5:20pm HAAS G056
Professor Jennifer Neville
Lawson 2142D neville[at]purdue.edu 6-9387
Office hours: Mon 11-12pm, LWSN 2142D
Office hours: Fri 5-6pm, LWSN 2149
Questions: We will use Piazza for class questions/discussion. Instead of sending email to the TA list, please post your questions on Piazza.
Email: cs242-ta [at] cs.purdue.edu
This course provides a broad introduction to the field of data science. The course focuses on using computational methods and statistical techniques to analyze massive amounts of data and to extract knowledge. It provides an overview of foundational computational and statistical tools for data acquisition and cleaning, data manipulation, data analysis and evaluation, visualization and communication of results, data management and big data systems. The course surveys the complete data science process from data to knowledge and gives students hands-on experience with tools and methods.
Prerequisites: CS18000 and CS18200 (grade of C or better); CS38003 (grade of C or better); STAT 35500 (can be taken concurrently)
Readings will be assigned from the texts below. All texts are available online via the Purdue library. Reading assignments will be posted on the schedule; please check it regularly. Additional materials will be distributed as necessary.
There will be short exercises to be completed weekly in each lab. There will be five homework assignments and three projects that will be posted on the schedule. Assignments should be submitted online in Blackboard or via turnin on data.cs.purdue.edu. Details will be provided in the assignments. Programming projects should be written in Python 3, unless otherwise noted. In general, questions about the details of homeworks/projects should be directed to the TA on Piazza.
There will be a midterm and a comprehensive final exam. Exams will be closed book and closed notes.
Assignments are to be submitted by the due date listed. Each person will be allowed four days of extensions, which can be applied to any combination of assignments (homework/projects only) during the semester without penalty. After that, a late penalty of 15% per day will be assigned. Use of a partial day will be counted as a full day. Use of extension days must be stated explicitly at the time of the late submission (by accompanying email to the TA); otherwise late penalties will apply. Extensions cannot be used after the final day of classes (i.e., Apr 28). Extension days cannot be rearranged after they are applied to a submission. Use them wisely!
Assignments will NOT BE accepted if they are more than five days late. Additional extensions will be granted only due to serious and documented medical or family emergencies.
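As an informal illustration of the arithmetic in the late policy above (not an official grading tool), here is a minimal Python 3 sketch. The function name `late_penalty` and its parameters are hypothetical. It assumes that the 15% per day is deducted additively from the assignment's credit, and that the five-day cutoff counts total days late regardless of any extension days applied. The policy text does not spell out either point, so check with the TA if in doubt.

```python
import math

FREE_EXTENSION_DAYS = 4   # extension days per student per semester (from the policy)
PENALTY_PER_DAY = 0.15    # 15% deducted per late day not covered by an extension
MAX_DAYS_LATE = 5         # assumption: cutoff counts total days late


def late_penalty(hours_late, extension_days_applied=0):
    """Return the fraction of credit deducted, or None if the submission
    would not be accepted. Hypothetical helper for illustration only."""
    if not 0 <= extension_days_applied <= FREE_EXTENSION_DAYS:
        raise ValueError("extension days must be between 0 and 4")
    if hours_late <= 0:
        return 0.0
    # A partial day counts as a full day.
    days_late = math.ceil(hours_late / 24)
    if days_late > MAX_DAYS_LATE:
        return None
    # Only days not covered by explicitly declared extension days are penalized.
    penalized_days = max(0, days_late - extension_days_applied)
    return round(penalized_days * PENALTY_PER_DAY, 2)
```

For example, under these assumptions a submission 30 hours late counts as two days late: with two extension days declared it carries no penalty, and with none declared it loses 30% of the credit.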
Introduction (1 week)
What is Data Science? Examples, applications, and results obtained using data science techniques. Overview of the data science process.
Background and basics (2.5 weeks)
Review of Python. Using Python notebooks. Types of data and data representations. How to acquire data (e.g., crawling), how to process and parse data. Data manipulation, data wrangling, and data cleaning.
Visualization and basic statistics (2.5 weeks)
Introduction to R. Visualization principles and goals. Basic plots in R. The importance of communicating results. Visualizing distributions and relationships.
Hypothesis testing and causality (2.5 weeks)
Introduction to statistical inference, populations/samples. Overview of hypothesis testing, A/B testing, and how to draw conclusions from data. Correlation vs causation.
Similarity and clustering (2 weeks)
Definitions and examples of common similarity/distance measures. Overview of basic clustering methods and how to interpret/evaluate results. Dimensionality reduction.
Large scale analysis (1 week)
Data engineering overview. Discussion of databases and SQL, MapReduce processing, Spark, and Hadoop.
Collaborative filtering (1.5 weeks)
Recommender systems and collaborative filtering, including basic methods and applications.
Ethics (1 week)
Overview of ethical issues of privacy, fairness, and bias in data science.