CS 49000-DSC: Data Science Capstone

Semester: Spring 2022
Time and place: Monday, Wednesday and Friday, 11.30am-12.20pm, Lawson Building 1142
Instructor: Jean Honorio (Please send an e-mail for appointments)
TAs: Nikhil Goyal, email: goyal70 at purdue.edu
Yonghan Jung, email: jung222 at purdue.edu
Jiacheng Li, email: li2829 at purdue.edu
Hasan Mahmood, email: mahmood6 at purdue.edu
Tanmaya Udupa, email: tudupa at purdue.edu

The Capstone course gives students an opportunity to integrate their accumulated knowledge, technical skills and social skills in order to identify and solve a real-world data science problem, with special emphasis on the application domain. Each capstone project is sponsored by a corporate partner or by an academic research group.

The Capstone course serves as final preparation for students entering the profession. Students will conduct a team-based project through the entire data science pipeline, following the six phases of the CRISP-DM (CRoss-Industry Standard Process for Data Mining) methodology. Students gain experience working in teams, planning projects, writing reports and giving presentations.

Learning Objectives

After successful completion of this course, a student will be able to:

In this syllabus, the term "business" refers to the corporate partner or academic research group that sponsors the capstone project.

Prerequisites

CS 37300.

Textbooks

There is no official textbook for this class. We will follow the CRISP-DM (CRoss-Industry Standard Process for Data Mining) manual. In addition, any data science book will be fine, for instance:

Grading

Team participation: Individual participation:

Projects

This semester, we thank the following academic research groups and corporate partners for sponsoring projects, and the DataMine* and the Data Science Consulting Service** for their invaluable help:
Team | TA | Meetings | Project
7 | Jiacheng Li | TA: Thursday 7:00pm-8:00pm, online | Prof. Stanley Chan, Electrical and Computer Engineering: Chan's group aims to understand the vulnerability of machine learning to adversarial image-based attacks such as color perturbations.
3 | Yonghan Jung | TA: Tuesday 2:30pm-3:30pm, online | Prof. Yiheng Feng**, Civil Engineering: Feng's group aims to use a freeway vehicle trajectory dataset to model driver behavior. This would help autonomous vehicles avoid collisions with surrounding vehicles and better plan their trajectories.
11 | Hasan Mahmood | TA: Wednesday 4:30pm-5:30pm, online | Prof. Andrew Flachs, Anthropology: Flachs' group aims to analyze hundreds of pages of interviews with farmers and farmers' market managers, to uncover adaptations to the pandemic and similarities/agreements.
1 | Jiacheng Li | TA: Friday 1:00pm-2:00pm, online | Prof. Wen Jiang**, Biological Sciences: Jiang's group aims to reconstruct the 3D structure of viruses from cryo-EM data, that is, noisy 2D projection images taken from arbitrary, unknown viewpoints.
9 | Jiacheng Li | TA: Monday 10:00am-11:00am, online | Prof. Guang Lin**, Mechanical Engineering: Physics-based simulation is computationally expensive. Lin's group aims to use machine learning to produce results similar to a simulator's, but faster.
10 | Yonghan Jung | TA: Monday 2:30pm-3:30pm, online | Prof. Sorin Matei, Communication: Matei's group aims to analyze the bombing missions conducted by the Allies in World War II, for tasks such as data reconstruction, prediction and intervention.
2 | Hasan Mahmood | TA: Wednesday 2:00pm-3:00pm, online | Prof. Eric Waltenburg, Political Science: Waltenburg's group aims to analyze opinions of the US Circuit Courts of Appeals to find out whether more complex (covering more issues) and more polarized decisions are associated with more diverse panels.
8 | Tanmaya Udupa | TA: Friday 9:30am-10:30am, MRGN 112; Expert: Monday 9:30am-10:20am, online | Cat Digital*: Caterpillar needs to categorize invoices from parts sales to customers, with improved results over existing natural language processing algorithms.
5 | Tanmaya Udupa | TA: Tuesday 11:30am-12:30pm, MRGN 212; Expert: Thursday 12:30pm-1:20pm, online | Ford Motor Company*: Ford aims to analyze voice-of-customer data (social media, surveys, etc.) by tracking topic shifts over time to observe emerging trends.
12 | Tanmaya Udupa | TA: Thursday 3:30pm-4:30pm, MRGN 206; Expert: Tuesday 3:30pm-4:20pm, online | Helmer Scientific*: Helmer would like to better predict their ability to meet their forecast commitments and determine trends that could impact product mix in manufacturing.
6 | Nikhil Goyal | TA: Friday 4:20pm-5:20pm, MRGN 212; Expert: Monday 3:30pm-4:20pm, online | Midcontinent Independent System Operator*: MISO aims to use predictive analytics to allocate system resources (CPU, RAM, server, etc.) to calculate the optimal dispatch of generation to meet electricity needs.
4 | Nikhil Goyal | TA: Friday 10:30am-11:30am, MRGN 112; Expert: Monday 11:30am-12:20pm, online | Republic Airways*: Republic Airways aims to find markers of, or deviations from, normal performance in the Fault History DataBase, Quick Access Recorder, and other telemetry, to predict future faults.

Late policy

Assignments that are submitted after the specified deadline will incur a 10% reduction in score per day late.
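
As a rough illustration of the arithmetic, the sketch below computes a late score under one possible reading of this policy. The function name late_score and its parameters are hypothetical, and the details are assumptions rather than official policy: the reduction is taken as 10% of the earned score per day, it accumulates linearly, any part of a day counts as a full day, and the score never drops below zero. Ask the instructor if the exact rule matters for your submission.

```python
import math


def late_score(raw_score: float, hours_late: float,
               penalty_per_day: float = 0.10) -> float:
    """Score after the late penalty, under the assumptions stated above."""
    if hours_late <= 0:
        # Submitted on time: no reduction.
        return raw_score
    # Assumption: any part of a day counts as a full day late.
    days_late = math.ceil(hours_late / 24)
    # Assumption: the reduction is linear in days and floored at zero.
    factor = max(0.0, 1.0 - penalty_per_day * days_late)
    return raw_score * factor


if __name__ == "__main__":
    # A report that earned 90 points and was submitted 30 hours late
    # counts as two days late under these assumptions.
    print(late_score(90, 30))
```

Under these assumptions, a submission ten or more days late receives no credit.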

Academic Honesty

Please read the departmental academic integrity policy here. This will be followed unless we provide written documentation of exceptions. We encourage you to interact amongst yourselves: you may discuss and obtain help with basic concepts covered in lectures and homework specification (but not solution). However, unless otherwise noted, work turned in should reflect your own efforts and knowledge. Sharing or copying solutions is unacceptable and could result in failure. You are expected to take reasonable precautions to prevent others from using your work.

Additional course policies

Please read the general course policies here.

Schedule

Date Topic (Tentative) Notes
Mon, Jan 10 Course introduction
CRISP-DM (CRoss-Industry Standard Process for Data Mining) methodology
Wed, Jan 12 Phase 1: Business understanding
Case Study Report 1: Business understanding (password-protected)
Fri, Jan 14      lecture continues
Mon, Jan 17 MARTIN LUTHER KING JR. DAY
Wed, Jan 19 Phase 2: Data understanding
Case Study Report 2: Data understanding (password-protected)
(attendance by iClicker)
Business understanding report, due on Wed, Feb 2, 11.59pm EST
(See Brightspace for directions)
Fri, Jan 21      lecture continues
(attendance by iClicker)
Mon, Jan 24 Phase 3: Data preparation
Case Study Report 3: Data preparation (password-protected)
(attendance by iClicker)
Wed, Jan 26      lecture continues
(attendance by iClicker)
Fri, Jan 28
Mon, Jan 31 Cross-validation
(attendance by iClicker)
Wed, Feb 2      lecture continues
Model selection
(attendance by iClicker)
Business understanding report due
Data understanding report, due on Fri, Feb 18, 11.59pm EST
(See Brightspace for directions)
Fri, Feb 4      lecture continues
(attendance by Zoom)
online lecture, see Piazza for details
Mon, Feb 7 Student presentations of Business understanding report
(attendance by iClicker)
see Piazza for details
Wed, Feb 9      presentations continue
(attendance by iClicker)
Fri, Feb 11      presentations continue
(attendance by iClicker)
Mon, Feb 14      presentations continue
(attendance by iClicker)
Wed, Feb 16      presentations continue
(only teams 5, 8 and 12)
Fri, Feb 18 Phase 4: Modeling
Case Study Report 4: Modeling (password-protected)
Case Study Code and Data (password-protected)
(attendance by iClicker)
Data understanding report due
Data preparation report, due on Fri, Mar 4, 11.59pm EST
Mon, Feb 21      lecture continues
(attendance by iClicker)
Wed, Feb 23 Student presentations of Data understanding report
(attendance by iClicker)
Fri, Feb 25      presentations continue
(attendance by iClicker)
Mon, Feb 28      presentations continue
(attendance by iClicker)
Wed, Mar 2      presentations continue
(attendance by iClicker)
Fri, Mar 4      presentations continue
(only teams 5 and 12)
Data preparation report due
Modeling report, due on Wed, Apr 6, 11.59pm EST
Mon, Mar 7      presentations continue
(only teams 4 and 8)
Wed, Mar 9 Student presentations of Data preparation report
(attendance by iClicker)
Fri, Mar 11      presentations continue
(attendance by iClicker)
Mon, Mar 14 SPRING VACATION
Wed, Mar 16 SPRING VACATION
Fri, Mar 18 SPRING VACATION
Mon, Mar 21      presentations continue
(attendance by iClicker)
Wed, Mar 23      presentations continue
(only teams 5 and 12)
Fri, Mar 25      presentations continue
(attendance by iClicker)
Mon, Mar 28      presentations continue
(only teams 4 and 6)
Wed, Mar 30 Office hours through Zoom
(Teams 1, 7 and 11 attended)
Fri, Apr 1 Office hours through Zoom
(Teams 1, 5 and 12 attended)
Mon, Apr 4 Phase 5: Evaluation
Case Study Report 5: Evaluation (password-protected)
(attendance by iClicker)
Wed, Apr 6 Phase 6: Deployment
Case Study Report 6: Deployment (password-protected)
(attendance by iClicker)
Modeling report due
Evaluation report, due on Fri, Apr 15, 11.59pm EST
Fri, Apr 8 Student presentations of Modeling report
(attendance by iClicker)
Mon, Apr 11      presentations continue
(attendance by iClicker)
Wed, Apr 13      presentations continue
(attendance by iClicker)
Fri, Apr 15      presentations continue
(attendance by iClicker)
Evaluation report due
Mon, Apr 18      presentations continue
(attendance by iClicker)
Wed, Apr 20      presentations continue
(attendance by iClicker)
Presentation slides, due on Wed, Apr 27, 11.59pm EST
Fri, Apr 22      presentations continue
(attendance by iClicker)
Mon, Apr 25      presentations continue
(attendance by iClicker)
Wed, Apr 27      presentations continue
(only teams 5 and 12)
Presentation slides due
Fri, Apr 29      presentations continue
(only teams 6 and 8)
Additional reading materials (optional): [1] and [2]