CS 24200/STAT 24200: Introduction to Data Science
This course provides a broad introduction to the field of data science. The course focuses on using computational methods and statistical techniques to analyze massive amounts of data and to extract knowledge. It provides an overview of foundational computational and statistical tools for data acquisition and cleaning, data manipulation, data analysis and evaluation, visualization and communication of results, data management and big data systems. The course surveys the complete data science process from data to knowledge and gives students hands-on experience with tools and methods.
Course overview, organization, and expectations. What is Data Science?
More about Data Science. Where do we get data from? Use examples, applications, and results obtained using data science techniques. Start review of Python.
Review of Python constructs including functions; main differences from Java. Using Python (IPython notebook) to manipulate data.
Python operations most relevant to operations on data. Different data representations (tables, tuples, dictionaries, lists, matrices) and use of libraries. Types of data (categorical, ordinal, counts, real-valued). Manipulations, extractions, and selections.
Getting and preparing data: File I/O. Data formats: delimited values, markup languages, ad-hoc formats. Practical issues and solutions processing large datasets.
Parsing different formats (e.g., CSV, XML, HTML, JSON) with libraries and regular expressions. Brief overview of Unix and selected Unix commands, including awk, sed, and grep.
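As an illustration of parsing with libraries and regular expressions, here is a minimal sketch using only Python's standard library. The sample data and the log-line layout are invented for this example and are not taken from the course materials.

```python
import csv
import io
import json
import re

# Hypothetical sample inputs, invented for illustration only.
csv_text = "name,score\nalice,90\nbob,85\n"
json_text = '{"name": "carol", "score": 77}'
log_line = "2020-01-15 ERROR disk full"

# Delimited values: DictReader maps each row to the header names.
# Note that every CSV field is read as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: json.loads returns native dicts, lists, strings, and numbers.
record = json.loads(json_text)

# Ad-hoc format: a regular expression with named groups.
pattern = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.*)"
)
match = pattern.match(log_line)
fields = match.groupdict() if match else {}
```

A library parser is usually the safer choice for standard formats such as CSV and JSON; regular expressions are more suitable for one-off, line-oriented formats.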
Introduction to R. Data manipulation in R.
Comparing R and Python. Review of basic statistics (as needed).
Interpreting and exploring data through visualizations. Charts, histograms, treemaps, Matplotlib functions. Visualization principles and goals. Examples of good and bad visualization.
The importance of communicating results. How to present statistics and identify shortcomings (lying with statistics). Simpson’s paradox, inspection paradox, and empirical hands-on description of small sample problems. Practical solutions to small sample issues (e.g. Laplace smoothing).
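To make the small-sample issue concrete, the following sketch shows Laplace (add-one) smoothing for an estimated proportion. The function name and its default parameters are assumptions chosen for this example.

```python
def laplace_estimate(successes, trials, alpha=1.0, categories=2):
    """Smoothed proportion: add `alpha` pseudo-counts to each category."""
    return (successes + alpha) / (trials + alpha * categories)


# With 0 successes in 3 trials the raw estimate is exactly 0,
# which claims the event can never happen.
raw = 0 / 3
smoothed = laplace_estimate(0, 3)  # (0 + 1) / (3 + 2)
```

With more data the pseudo-counts matter less: for 300 successes in 1000 trials the smoothed estimate is 301/1002, very close to the raw 0.3.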
Hypothesis testing. A/B testing and the role of the null hypothesis. The caveats of p-values and alternative approaches (introduce Bayesian tools).
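One common way to analyze an A/B test is a two-proportion z-test under the null hypothesis that both variants share the same conversion rate. The sketch below is one possible approach, not necessarily the one used in the course; the function name and the conversion counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) using the pooled normal approximation."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under the null hypothesis both groups share one conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 100 of 1000 users convert on A, 130 of 1000 on B.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
```

The p-value is the probability of a difference at least this large if the null hypothesis were true; it is not the probability that the null hypothesis is true, which is one of the caveats mentioned above.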
Empirically describe issues with multiple hypothesis testing. Why testing multiple hypotheses often finds spurious correlations. Understand why a variety of data-driven studies often draw conclusions not supported by the data (e.g., medical studies).
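A short calculation shows why testing many hypotheses tends to produce spurious findings. This sketch assumes the tests are independent and that every null hypothesis is true; the Bonferroni correction is included as one standard remedy, which the outline does not specifically name.

```python
def family_wise_error_rate(alpha, num_tests):
    """Chance of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


alpha, num_tests = 0.05, 20

# Each test alone has a 5% false-positive rate, but across 20 tests
# the chance of at least one false positive is roughly 64%.
uncorrected = family_wise_error_rate(alpha, num_tests)

# Bonferroni correction: test each hypothesis at alpha / num_tests,
# which keeps the family-wise rate below alpha.
bonferroni = family_wise_error_rate(alpha / num_tests, num_tests)
```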
Data manipulation and data wrangling: Filtering, transforming, aggregating, sorting, feature construction (1-of-K coding, normalization, combining features).
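Two of the feature-construction steps named above, 1-of-K coding and normalization, can be sketched in plain Python. The helper names are hypothetical, and min-max scaling is only one of several common normalization schemes.

```python
def one_of_k(values):
    """Encode categorical values as 0/1 indicator vectors (1-of-K coding)."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


def min_max_normalize(values):
    """Rescale numeric values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


categories, encoded = one_of_k(["red", "green", "red"])
# categories == ["green", "red"]; encoded == [[0, 1], [1, 0], [0, 1]]
scaled = min_max_normalize([10, 20, 30])
# scaled == [0.0, 0.5, 1.0]
```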
Types and sources of errors (missing values, noisy data, integration errors, outliers, bias in data). Examples of errors arising in different applications. Data cleaning. Introduction to SQL queries.
Similarity and Distance: definitions and examples of common measures. Introduction to clustering: k-means clustering, (spectral clustering?), hierarchical clustering.
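For intuition about k-means, here is a minimal sketch of Lloyd's algorithm with squared Euclidean distance. It uses a deterministic initialization (the first k points) so that the result is reproducible, which is an assumption made for this example; practical implementations usually start from random or k-means++ centers.

```python
def kmeans(points, k, iterations=100):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    # Deterministic start for reproducibility: the first k points.
    centers = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, labels


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
```

The result depends on the starting centers, so k-means is commonly run several times from different initializations.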
Understanding clustering results. Examples and applications. Dimensionality Reduction. Clustering versus classification (understanding unsupervised versus supervised learning).
Discussion of background, tasks, and goals of Project 2.
Visualizing and presenting multi-dimensional clusters. Ties together clustering, dimensionality reduction, and visualization.
Networks and graphs: representations and concepts. Review of relevant matrix algebra and introduction to matrix operations using numpy. Modeling and visualization (Graphviz, Cytoscape).
Characterizing networks (bipartite graphs, power-law graphs). Friends-of-friends characterization (your friends have more friends than you do, but they are not taller). Ranking, trust, and centrality metrics (degree centrality, betweenness centrality, PageRank, TrustRank).
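As one example of a centrality metric, PageRank can be sketched with power iteration over an adjacency list. The graph, the damping factor of 0.85, and the function name are assumptions for illustration; the sketch also assumes every link target appears as a key in the dictionary.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration on a dict mapping each node to its outgoing links."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives the "random jump" share first.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A node without outgoing links spreads its rank evenly.
            targets = links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-node directed graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(graph)
top = max(rank, key=rank.get)  # "c", which has the most incoming links
```

Node "d" has no incoming links, so it keeps only the random-jump share of (1 - 0.85) / 4.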
Data Manipulation at Scale. Databases and the relational algebra. Using SQL.
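The relational operations of selection, projection, join, and aggregation map directly onto SQL clauses. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the tables and values are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, major TEXT);
    CREATE TABLE grades (student_id INTEGER, course TEXT, points REAL);
    INSERT INTO students VALUES (1, 'Ana', 'CS'), (2, 'Ben', 'STAT'), (3, 'Cy', 'CS');
    INSERT INTO grades VALUES (1, 'CS242', 3.7), (2, 'CS242', 3.0), (3, 'CS242', NULL);
    """
)

# Join the two tables, project the columns of interest, and aggregate
# per major. COUNT(column) and AVG(column) both skip NULL values, so the
# missing grade does not distort the average.
query = """
    SELECT s.major, COUNT(g.points) AS graded, AVG(g.points) AS mean_points
    FROM students AS s
    JOIN grades AS g ON g.student_id = s.id
    GROUP BY s.major
    ORDER BY s.major
"""
result = conn.execute(query).fetchall()
conn.close()
```

Here `result` holds one row per major, each with the number of non-missing grades and their mean.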
Parallel databases; using parallel SQL.
MapReduce processing. Overview of Spark, Hadoop.
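The MapReduce programming model can be illustrated on a single machine with the classic word-count example. This is only a sketch of the map, shuffle, and reduce phases in plain Python; it does not use the Hadoop or Spark APIs, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


documents = ["big data big systems", "data science"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts == {"big": 2, "data": 2, "systems": 1, "science": 1}
```

In a real cluster the map and reduce calls run in parallel on different machines, and the framework performs the shuffle across the network.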
Introduce Project 3: Putting it all together. Starting from data in an SQL database, students clean the data (the issues are relatively obvious but not pre-specified), carry out a comparative analysis of subgroups, and produce a report that includes visualizations. The project is designed so that the tools and techniques from Labs 8, 9, and 11 are appropriate and cover most of what is needed.
Collaborative filtering. Different types and applications of collaborative filtering. Tools and simple examples of factor models (tool: Singular Value Decomposition?)
Recommendation systems. Different approaches and popular applications (Netflix Prize).
Introduction to Predictive Modeling and evaluating predictions (linear regression as running example?).
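To illustrate fitting a predictive model and then evaluating its predictions, here is a minimal sketch of simple linear regression by ordinary least squares, scored with mean squared error on held-out points. The data and the function names are invented for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(ys_true, ys_pred):
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(ys_true, ys_pred)) / len(ys_true)


# Hypothetical data: fit on a training set, evaluate on held-out points.
train_x, train_y = [1, 2, 3, 4], [3, 5, 7, 9]
test_x, test_y = [5, 6], [11, 14]

slope, intercept = fit_line(train_x, train_y)
predictions = [slope * x + intercept for x in test_x]
test_mse = mean_squared_error(test_y, predictions)
```

Evaluating on data that was not used for fitting gives a more honest estimate of how well the model predicts new observations.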
Putting it all together: Data analytics. Data Mining. Machine Learning.