Purdue CS440: Large-scale Data Analytics
(Spring 2026)




Course Description

"Big data" has been a buzzword for a long time. Many disruptive techniques have been developed to address various aspects of big data. This course will cover the key concepts, design principles, and systems to analyze large-scale data in order to extract novel and transformative insights. Tentative topics include database fundamentals, big data storage (e.g., HDFS), big data computing frameworks (e.g., Hadoop and Spark), data warehouses, data lakes, graph analytics (e.g., Spark Graph), data streaming (e.g., Spark Streaming), large-scale machine learning (e.g., Spark MLlib), vector databases, and cloud-native data analytics.




Instructor

  • Jianguo Wang
  • Email: csjgwang@purdue.edu (note: the subject line must include "[CS440]")



Teaching Assistants

  • Shige Liu (liu3529@purdue.edu)
  • Yunan Zhang (zhan4404@purdue.edu)
  • Jiayi Liu (liu4127@purdue.edu)



Logistics

  • When: MW 3:30p-4:20p
  • Where: Wilmeth Active Learning Center 3087
  • Office hours: after class or by appointment
  • Pre-requisites: CS242, CS251, and CS373



Labs and PSOs

Labs and PSOs start in the 3rd week.
  • L01: Tuesday 1:30pm-3:20pm (LWSN B131)
  • L02: Friday 9:30am-11:20am (LWSN B131)
  • L03: Wednesday 9:30am-11:20am (LWSN B131)
  • L04: Wednesday 11:30am-1:20pm (LWSN B131)



Online communications

  • We'll use Piazza for announcements, discussions, and Q&A.
  • We'll NOT use Brightspace except for sending emails occasionally.
  • We'll use Gradescope for submitting and grading homeworks.



Textbooks (Optional)

Note that textbooks are optional and the lecture slides are self-contained.



Grading

  • Homeworks: 20% (2 * 10%)
  • Midterm exam: 25%
  • Final exam: 35%
  • Project: 20% (2 * 10%)
    • Projects are related to the labs and will be explained in the labs.
  • Extra credit: 5%
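
As a rough illustration of how the weights above combine, the sketch below computes a course total from per-component percentages. The function name, the dictionary keys, and the handling of extra credit (added directly to the weighted total, up to 5 points, with no cap on the result) are assumptions for illustration only; the syllabus does not specify how extra credit is applied.

```python
# Weights copied from the grading list (as fractions of the course total).
WEIGHTS = {
    "homeworks": 0.20,  # 2 homeworks, 10% each
    "midterm": 0.25,
    "final": 0.35,
    "project": 0.20,    # 2 projects, 10% each
}

# Assumption: extra credit is worth at most 5 percentage points.
EXTRA_CREDIT_MAX = 5.0


def course_total(scores, extra_credit=0.0):
    """Return the weighted course total in percentage points.

    scores: maps each component name in WEIGHTS to a percentage (0-100).
    extra_credit: percentage points earned as extra credit.
    Assumption: extra credit is added on top of the weighted total.
    """
    base = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    # Clamp extra credit to the range [0, EXTRA_CREDIT_MAX].
    bonus = min(max(extra_credit, 0.0), EXTRA_CREDIT_MAX)
    return base + bonus


if __name__ == "__main__":
    example = {"homeworks": 90, "midterm": 80, "final": 70, "project": 100}
    print(course_total(example, extra_credit=2.0))
```

The four regular components sum to 100%, so the extra credit is the only way the total can exceed the weighted average of the component scores under these assumptions.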



Academic Integrity and More




Schedule

Lecture (Date)  Topic

Lec 1 (01/12) Course Introduction
Lec 2 (01/14) Relational DB
Lec 3 (01/19) No class due to MLK Day
Lec 4 (01/21) No class due to conference trip
Lec 5 (01/26) SQL
Lec 6 (01/28) SQL 2
Lec 7 (02/02) Database Storage
Lec 8 (02/04) Index
Lec 9 (02/09) Query Processing
Lec 10 (02/11) Query Processing 2
Lec 11 (02/16) Transaction
Lec 12 (02/18) Concurrency Control
Lec 13 (02/23) Crash Recovery
Lec 14 (02/25) Crash Recovery 2
Lec 15 (03/02) Distributed Databases
Lec 16 (03/04) Midterm Exam (In-class)
Lec 17 (03/09) Hadoop
Lec 18 (03/11) SQL-on-Hadoop
Lec 19 (03/16) No class due to Spring break
Lec 20 (03/18) No class due to Spring break
Lec 21 (03/23) Big Data Storage
Lec 22 (03/25) Big Data Storage 2
Lec 23 (03/30) Spark Core
Lec 24 (04/01) Spark SQL
Lec 25 (04/06) Spark ML
Lec 26 (04/08) Spark Streaming
Lec 27 (04/13) Spark Graph
Lec 28 (04/15) Vector Data Analytics
Lec 29 (04/20) Vector Data Analytics 2
Lec 30 (04/22) Cloud-Native Data Analytics
Lec 31 (04/27) Cloud-Native Data Analytics 2
Lec 32 (04/29) Review