Introduction to Data Science

CS/STAT24200 • Fall 2018 • Lectures: MW 11:30-12:20pm LWSN 1106 • Lab: W 3:30-5:20pm HAAS G056



Professor Jennifer Neville
Lawson 2142D • neville[at] • 6-9387
Office hours: Friday 11am-12pm

Teaching assistants

Mahak Goindani (grad TA), Neil Kulkarni (ugrad TA)
Office hours: Tuesday 5-6pm (Neil), Thursday 5-6pm (Mahak), both in HAAS G50
Questions: We will use Piazza for class questions and discussion. Please post your questions on Piazza instead of sending email to the TA list.
Email: cs242-ta [at]


This course provides a broad introduction to the field of data science. The course focuses on using computational methods and statistical techniques to analyze massive amounts of data and to extract knowledge. It provides an overview of foundational computational and statistical tools for data acquisition and cleaning, data manipulation, data analysis and evaluation, visualization and communication of results, data management and big data systems. The course surveys the complete data science process from data to knowledge and gives students hands-on experience with tools and methods.

Learning objectives

Upon completing the course, students should be able to:


Prerequisites: CS18000 and CS18200 (grade of C or better); CS38003 (grade of C or better); STAT 35500 (can be taken concurrently)


Readings will be assigned from the texts below. All texts are available online via the Purdue library. Reading assignments will be posted on the schedule; please check it regularly. Additional materials will be distributed as necessary.

Assignments and exams

There will be short exercises to be completed weekly in each lab. There will be five homework assignments and three projects, which will be posted on the schedule. Assignments should be submitted online in Blackboard or via turnin. Details will be provided in the assignments. Programming projects should be written in Python 3, unless otherwise noted. In general, questions about the details of homeworks/projects should be directed to the TA on Piazza.

There will be a midterm and a comprehensive final exam. Exams will be closed book and closed notes.


Grades will be posted on Blackboard.

Late policy

Assignments are to be submitted by the due date listed. Each person will be allowed four days of extensions, which can be applied to any combination of assignments (homework/projects only) during the semester without penalty. After that, a late penalty of 15% per day will be assigned. Use of a partial day will be counted as a full day. Use of extension days must be stated explicitly at the time of the late submission (by accompanying email to the TA); otherwise late penalties will apply. Extensions cannot be used after the final day of classes (i.e., Dec 8). Extension days cannot be rearranged after they are applied to a submission. Use them wisely!

Assignments will NOT BE accepted if they are more than five days late. Additional extensions will be granted only due to serious and documented medical or family emergencies.
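
The arithmetic of the late policy can be sketched in a few lines of Python. This is only an illustration, not an official grading tool. The function name late_penalty_percent and its parameters are hypothetical, and the sketch assumes that (a) the 15% penalty is a flat deduction per penalized day, (b) the five-day cutoff counts all days late, including days covered by extensions, and (c) the caller states how many extension days are applied to this submission. The course staff's interpretation takes precedence wherever it differs.

```python
import math

FREE_EXTENSION_DAYS = 4    # extension days allowed per person per semester
PENALTY_PER_DAY = 15       # percent deducted per penalized late day
MAX_DAYS_LATE = 5          # submissions more than five days late are not accepted


def late_penalty_percent(hours_late, extension_days_applied=0):
    """Return the percent deducted for a late submission.

    Returns None when the submission is too late to be accepted.
    Hypothetical helper: the name and signature are illustrative only.
    """
    if not 0 <= extension_days_applied <= FREE_EXTENSION_DAYS:
        raise ValueError("extension days must be between 0 and 4")
    if hours_late <= 0:
        return 0  # submitted on time

    # A partial day counts as a full day, so round up.
    days_late = math.ceil(hours_late / 24)

    # Assumption: the cutoff applies to total days late, extensions included.
    if days_late > MAX_DAYS_LATE:
        return None

    # Only days not covered by extension days are penalized.
    penalized_days = max(0, days_late - extension_days_applied)
    return min(100, PENALTY_PER_DAY * penalized_days)
```

For example, under these assumptions a submission that is 30 hours late counts as two days late: with no extension days applied the deduction is 30%, and with two extension days applied there is no deduction.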

Academic honesty

Please read the departmental academic integrity policy. This will be followed unless we provide written documentation of exceptions. We encourage you to interact amongst yourselves: you may discuss and obtain help with basic concepts covered in lectures or the textbook, homework specification (but not solution), and program implementation (but not design). However, unless otherwise noted, work turned in should reflect your own efforts and knowledge. Sharing or copying solutions is unacceptable and could result in failure. We use copy detection software, so do not copy code (either from the Web or from other students) and then make changes to it. You are expected to take reasonable precautions to prevent others from using your work.

Additional course policies

Please read the general course policies here.

Course outline

Introduction (1 week)
What is Data Science? Examples, applications, and results obtained using data science techniques. Overview of the data science process.

Background and basics (2.5 weeks)
Review of Python. Using Python notebooks. Types of data and data representations. How to acquire data (e.g., crawling), how to process and parse data. Data manipulation, data wrangling, and data cleaning.

Visualization and basic statistics (2.5 weeks)
Introduction to R. Visualization principles and goals. Basic plots in R. The importance of communicating results. Visualizing distributions and relationships.

Hypothesis testing and causality (2.5 weeks)
Introduction to statistical inference, populations/samples. Overview of hypothesis testing, A/B testing, and how to draw conclusions from data. Correlation vs causation.

Similarity and clustering (2 weeks)
Definitions and examples of common similarity/distance measures. Overview of basic clustering methods and how to interpret/evaluate results. Dimensionality reduction.

Large scale analysis (1 week)
Data engineering overview. Discussion of databases and SQL, MapReduce processing, Spark, and Hadoop.

Collaborative filtering (1.5 weeks)
Recommender systems and collaborative filtering, including basic methods and applications.

Ethics (1 week)
Overview of ethical issues of privacy, fairness, and bias in data science.