CS 24200/STAT 24200: Introduction to Data Science - Department of Computer Science - Purdue University

CS 24200/STAT 24200: Introduction to Data Science

Course Description

This course provides a broad introduction to the field of data science. The course focuses on using computational methods and statistical techniques to analyze massive amounts of data and to extract knowledge.  It provides an overview of foundational computational and statistical tools for data acquisition and cleaning, data manipulation, data analysis and evaluation, visualization and communication of results, data management and big data systems.  The course surveys the complete data science process from data to knowledge and gives students hands-on experience with tools and methods.

Course Outline

Week 1

Course overview, organization, and expectations. What is Data Science?

More about Data Science.   Where do we get data from? Use examples, applications, and results obtained using data science techniques.  Start review of Python.

Week 2

Review of Python constructs including functions; main differences from Java. Using Python (IPython notebook) to manipulate data.

Python operations most relevant to operations on data. Different data representations (tables, tuples, dictionaries, lists, matrices) and use of libraries. Types of data (categorical, ordinal, counts, real-valued).  Manipulations, extractions, and selections.

Week 3

Getting and preparing data: File I/O. Data formats: delimited values, markup languages, ad-hoc formats. Practical issues and solutions processing large datasets.

Parsing different formats (e.g., CSV, XML, HTML, JSON) with libraries and regular expressions. Brief overview of Unix and selected Unix commands, including awk, sed, and grep.
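A minimal sketch of the parsing topics above, using only the Python standard library. The sample strings and the log-line layout are hypothetical and inlined so the example runs without any files; a real course exercise may use different data and libraries.

```python
import csv
import io
import json
import re

# Hypothetical sample inputs (assumptions, not course data).
CSV_TEXT = "name,age\nAda,36\nGrace,45\n"
JSON_TEXT = '{"name": "Ada", "languages": ["Python", "R"]}'
LOG_LINE = "2019-02-15 16:35 ERROR disk full"


def parse_csv(text):
    """Return one dict per row, keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_json(text):
    """Decode a JSON document into Python dicts and lists."""
    return json.loads(text)


def parse_log_line(line):
    """Extract (date, time, level, message) from an ad-hoc log format."""
    match = re.match(r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}) (\w+) (.*)", line)
    return match.groups() if match else None


rows = parse_csv(CSV_TEXT)
record = parse_json(JSON_TEXT)
fields = parse_log_line(LOG_LINE)
```

Note that the CSV reader returns every value as a string, so numeric columns such as `age` still need an explicit conversion.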

Week 4

Introduction to R. Data manipulation in R. 

Comparing R and Python. Review of basic statistics (as needed). 

Week 5

Interpreting and exploring data through visualizations. Charts, histograms, treemaps, Matplotlib functions. Visualization principles and goals. Examples of good and bad visualizations.

The importance of communicating results. How to present statistics and identify shortcomings (lying with statistics). Simpson’s paradox, inspection paradox, and empirical hands-on description of small sample problems. Practical solutions to small sample issues (e.g. Laplace smoothing).
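As an illustration of the small-sample issue, here is a hedged sketch of add-alpha (Laplace) smoothing for an estimated rate. The function name, the default `alpha=1.0`, and the two-outcome default are assumptions chosen for the example.

```python
def laplace_smoothed_rate(successes, trials, alpha=1.0, num_outcomes=2):
    """Estimate a success rate with add-alpha (Laplace) smoothing.

    Adds `alpha` pseudo-observations to each of `num_outcomes` outcomes,
    which pulls estimates from tiny samples toward 1 / num_outcomes.
    """
    return (successes + alpha) / (trials + alpha * num_outcomes)


# 2 successes out of 2 trials: the raw rate is 1.0, the smoothed rate is 0.75.
small_sample = laplace_smoothed_rate(2, 2)

# 90 successes out of 100 trials: smoothing barely changes the estimate.
large_sample = laplace_smoothed_rate(90, 100)
```

With only two observations the smoothed estimate stays well away from 100%, while with a hundred observations the pseudo-counts have little effect.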

Week 6

Hypothesis testing.  A/B testing and the role of the null hypothesis. The caveats of p-values and alternative approaches (introduce Bayesian tools).
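One common way to analyze an A/B test is a two-proportion z-test, sketched below with the standard library only. This is an illustrative choice, not necessarily the method used in the course; the function name and the conversion counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test of the null hypothesis that A and B convert equally.

    Returns (z, p_value). Relies on the normal approximation, so it
    assumes reasonably large samples in both groups.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under the null hypothesis both groups share one pooled rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 100/1000 conversions for A, 150/1000 for B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

A small p-value says the observed difference would be unusual if the null hypothesis were true; it does not by itself say how large or important the difference is.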

Empirically describe issues with multiple hypothesis testing. Why testing multiple hypotheses often finds spurious correlations. Understand why a variety of data-driven studies often draw conclusions not supported by data (e.g., medical studies).
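A short calculation shows why testing many hypotheses tends to produce spurious findings. The sketch assumes the tests are independent and that every null hypothesis is true; the function name is hypothetical.

```python
def familywise_error_rate(alpha, num_tests):
    """Probability of at least one false positive among `num_tests`
    independent tests, each run at significance level `alpha`,
    when every null hypothesis is actually true."""
    return 1.0 - (1.0 - alpha) ** num_tests


# Chance of at least one spurious "discovery" at alpha = 0.05.
one_test = familywise_error_rate(0.05, 1)
twenty_tests = familywise_error_rate(0.05, 20)
```

With a single test the chance of a false positive is 5%, but with twenty independent tests it rises to roughly 64%.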

Week 7

Data manipulation and data wrangling: Filtering, transforming, aggregating, sorting, feature construction (1-of-K coding, normalization, combining features).
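Two of the feature-construction steps named above, 1-of-K coding and normalization, can be sketched in plain Python as follows. The function names are hypothetical, and min-max scaling is only one of several possible normalizations.

```python
def one_of_k(values):
    """Encode categorical values as 1-of-K (one-hot) vectors.

    Returns (categories, encoded), where categories are sorted so the
    column order is deterministic.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded


def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them all to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


categories, encoded = one_of_k(["red", "green", "red"])
scaled = min_max_normalize([10, 20, 30])
```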

Types and sources of errors (missing values, noisy data, integration errors, outliers, bias in data). Examples of errors arising in different applications. Data cleaning. Introduction to SQL queries.

Week 8

Similarity and Distance: definitions and examples of common measures. Introduction to clustering: k-means clustering, (spectral clustering?), hierarchical clustering.
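The following is a minimal sketch of k-means clustering (Lloyd's iteration) with Euclidean distance, written in plain Python. Seeding the centroids with the first k points is an assumption made to keep the example deterministic; practical implementations usually use random or k-means++ initialization. The sample points are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iters=100):
    """Cluster points into k groups; returns (centroids, clusters)."""
    # Assumption: seed with the first k points for reproducibility.
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # Converged: assignments can no longer change.
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups of three points each.
POINTS = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, clusters = k_means(POINTS, k=2)
```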

Understanding clustering results. Examples and applications. Dimensionality Reduction. Clustering versus classification (understanding unsupervised versus supervised learning).

Week 9

Discussion of background, tasks, and goals of Project 2.

Visualizing and presenting multi-dimensional clusters. Ties together clustering, dimensionality reduction, and visualization.

Week 10

Networks and graphs: representations and concepts. Review of relevant matrix algebra and introduction to matrix operations using numpy. Modeling and visualization (Graphviz, Cytoscape).

Characterizing networks (bipartite graphs, power-law graphs). Friends-of-friends characterization (your friends have more friends than you do, but they are not taller). Ranking, trust, and centrality metrics (degree centrality, betweenness centrality, PageRank, TrustRank).
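The friends-of-friends observation and degree centrality can both be checked on a toy graph. The sketch below stores an undirected graph as a dictionary of neighbor sets; the star-shaped example graph and all function names are hypothetical.

```python
def degrees(graph):
    """Number of neighbors of each node in an undirected graph."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def degree_centrality(graph):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    return {node: d / (n - 1) for node, d in degrees(graph).items()}


def mean_degree(graph):
    """Average number of friends per person."""
    deg = degrees(graph)
    return sum(deg.values()) / len(deg)


def mean_friend_degree(graph):
    """Average, over people, of the mean degree of their friends."""
    deg = degrees(graph)
    per_node = [
        sum(deg[f] for f in friends) / len(friends)
        for friends in graph.values()
        if friends
    ]
    return sum(per_node) / len(per_node)


# Hypothetical star graph: one hub connected to three leaves.
STAR = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
}

average_friends = mean_degree(STAR)
average_friends_of_friends = mean_friend_degree(STAR)
centrality = degree_centrality(STAR)
```

In the star graph a person has 1.5 friends on average, yet a person's friends have 2.5 friends on average, because the well-connected hub appears in everyone else's friend list.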

Week 11

Data Manipulation at Scale. Databases and the relational algebra. Using SQL.
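A small sketch of selection, grouping, and aggregation in SQL, run through Python's built-in `sqlite3` module with an in-memory database. The table name, column names, and rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical rows: (student, department, grade points).
ROWS = [
    ("ada", "CS", 4.0),
    ("grace", "CS", 3.0),
    ("alan", "STAT", 3.5),
    ("edgar", "CS", 1.0),
]


def department_summary(rows):
    """Count and average the grades of at least 2.0 per department."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE enrollment (student TEXT, dept TEXT, grade REAL)"
        )
        conn.executemany("INSERT INTO enrollment VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            """
            SELECT dept, COUNT(*), AVG(grade)
            FROM enrollment
            WHERE grade >= 2.0      -- selection (filter rows)
            GROUP BY dept           -- grouping
            ORDER BY dept
            """
        )
        return cursor.fetchall()
    finally:
        conn.close()


summary = department_summary(ROWS)
```

The `?` placeholders let the database driver handle quoting, which is safer than building SQL strings by hand.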

Parallel databases; using parallel SQL.

Week 12

MapReduce processing.  Overview of Spark, Hadoop.
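The MapReduce programming model can be imitated on a single machine to show its three phases. This is a conceptual sketch in plain Python, not the Hadoop or Spark API; the function names and input documents are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over the documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


counts = word_count(["data science", "big Data"])
```

In a real cluster the map and reduce calls run in parallel on different machines, and the framework performs the shuffle across the network.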

Introduce Project 3: Putting it all together. Given data in an SQL database: cleaning (relatively obvious, but not pre-specified), task (comparative analysis of subgroups), output (report including visualizations). Designed so that the tools and techniques used in Labs 8, 9, and 11 are appropriate and cover most of what is needed.

Week 13

Collaborative filtering. Different types and applications of collaborative filtering. Tools and simple examples of factor models (tool: Singular Value Decomposition?)
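As one simple example of collaborative filtering, the sketch below predicts a missing rating with a user-based neighborhood method and cosine similarity. It illustrates the neighborhood family rather than the factor models mentioned above; the ratings and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two users' rating dicts (item -> rating).

    Unrated items count as 0, so only co-rated items add to the dot product.
    """
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[item] * b[item] for item in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += abs(sim)
    return weighted_sum / weight_total if weight_total else None


# Hypothetical ratings on a 1-5 scale; u1 has not rated item "c".
RATINGS = {
    "u1": {"a": 5, "b": 1},
    "u2": {"a": 5, "b": 1, "c": 4},
    "u3": {"a": 1, "b": 5, "c": 2},
}

prediction = predict_rating(RATINGS, "u1", "c")
```

Because u1's tastes match u2's much more closely than u3's, the prediction lands nearer to u2's rating of 4 than to u3's rating of 2.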

Recommendation systems. Different approaches and popular applications (Netflix Prize).

Week 14

Introduction to Predictive Modeling and evaluating predictions (linear regression as running example?).
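A minimal sketch of the running example: fitting a one-variable linear regression by ordinary least squares and scoring its predictions with mean squared error. The data points and function names are hypothetical, and a real evaluation would score the model on data held out from fitting.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, xs):
    """Apply the fitted line to new inputs."""
    return [slope * x + intercept for x in xs]


def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical training data lying exactly on y = 2x + 1.
train_x, train_y = [0, 1, 2, 3], [1, 3, 5, 7]
slope, intercept = fit_line(train_x, train_y)

# Hypothetical held-out points used only for evaluation.
test_x, test_y = [4, 5], [9, 12]
error = mean_squared_error(test_y, predict(slope, intercept, test_x))
```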

Putting it all together: Data analytics. Data Mining. Machine Learning.

Week 15

Review 

Last Updated: Feb 15, 2019 4:35 PM
