Topics in Data Mining
CS69000-DM1: Fall 2015 — Time: Tue & Thu 9:00a-10:15a — Location: Stanley Coulter Hall G032
Schedule
Instructor
Bruno Ribeiro
LWSN 2142C (ribeiro$cs.purdue.edu), where $== @
Office hours: Monday 9am-11am, Tuesday 11:30am-1:30pm
Description
This seminar will consist of readings and presentations on data mining for network analysis. Topics will be wide-ranging, including collecting network data in the wild, analyzing partially observed networks, network A/B testing, predicting user trajectories, and understanding network dynamics. Classes mix traditional lectures and student presentations. The course requires reading research articles before class, submitting paper reviews, presenting one or two articles, attending other students' presentations, and completing a final project.
Learning objectives
Learn state-of-the-art techniques in data mining, probability theory, and statistics to perform cutting-edge research in data mining & data science.
Prerequisites
Good undergraduate-level exposure to basic concepts in calculus, linear algebra, probability theory, statistics, and machine learning. In particular, students should be comfortable with probability models, p-values, and setting up and numerically solving ordinary differential equations. Students entering the class should have good programming skills and knowledge of algorithms.
Text
No textbook is suggested for this course.
Assignments and exams
There will be several paper reviews as well as student presentations. Each enrolled student must present a final project by the end of the semester, preceded by a midterm checkpoint presentation.
All attending students are required to present at least one paper in class. These requirements are subject to change without notice.
Grading
- Paper reviews/Class participation: 30%
- Paper presentation: 10%
- Midterm project milestone: 20%
- Final project: 40%
Grades posting
here.
Late policy
Reviews, projects, and their milestones must be submitted by the listed due date; no extensions will be granted.
Academic honesty
Please read the departmental academic integrity policy. This will be followed unless we provide written documentation of exceptions. We encourage you to interact amongst yourselves: you may discuss and obtain help with basic concepts covered in lectures or the textbook, homework specification (but not solution), and program implementation (but not design). However, unless otherwise noted, work turned in should reflect your own efforts and knowledge. Sharing or copying solutions is unacceptable and could result in failure. We use copy detection software, so do not copy code and make changes (either from the Web or from other students). You are expected to take reasonable precautions to prevent others from using your work.
Additional course policies
Please read the general course policies here.
- Aug 25 (Organization):
- Aug 27 (Data Collection Design):
- Veitch, D., & Tune, P. (2015). Optimal Skampling for the Flow Size Distribution. IEEE Transactions on Information Theory 2015.
- Murai, F., Ribeiro, B., Towsley, D., & Wang, P. (2013). On Set Size Distribution Estimation and the Characterization of Large Networks via Sampling. JSAC 2013.
- Presentation slides [pptx]
- Sept 1 (Obtaining the Data 1/2):
- Sept 3 (Obtaining the Data 2/2):
- Sept 8 (Mining & Predicting User Trajectories):
- Sept 10 (A/B Testing (lecture)):
- Sept 15 (MAB (lecture)):
- Sept 17 (Bandits, Context, and Network A/B Test (lecture)):
- Sept 22 (Network A/B Testing):
- (Alina) Johan Ugander, Brian Karrer, Lars Backstrom, and Jon Kleinberg (2013), Graph cluster randomization: network exposure to multiple universes, KDD 2013.
- (Gaurav) Huan Gui, Ya Xu, Anmol Bhasin, and Jiawei Han (2015), Network A/B Testing: From Sampling to Estimation, WWW 2015
- Sept 24 (Link Prediction (lecture)):
- Fundamentals of Link Prediction. Low rank decomposition, missing data & network dynamics [pptx].
- Sept 29 (Link Prediction (presentations)):
- Oct 1 (Network Epidemics (lecture)):
- Fundamentals of Network Epidemic Modeling [pdf].
- Oct 6 (Popularity Forecast (lecture)):
- A Primer on Population-level Modeling [pdf].
- Oct 8 (Forecasting Social Media Cascades (lecture)):
- Fundamentals of Forecasting & Time Series Analysis [pdf].
- Oct 15 (Forecasting Social Media Cascades (presentations)):
- (Yash) Cheng, J., Adamic, L., Dow, P. A., Kleinberg, J. M., & Leskovec, J. (2014). Can cascades be predicted? WWW 2014
- (Bruno) Bruno Ribeiro, Minh Hoang, Ambuj Singh (2015), Beyond Models: Forecasting Complex Network Processes Directly from Data, WWW 2015
- Oct 20 (Data Streaming (lecture)):
- (Lecture) Principles of Data Streaming: Sketches, Bloom filters, Sampling & Probabilistic Matching [pptx].
- Oct 22 (Data Streaming (presentations)):
- (Lecture (cont)) Principles of Data Streaming: Sketches, Bloom filters, Sampling & Probabilistic Matching [pptx].
- (Bruno) N. Ahmed, J. Neville, and R. Kompella. Network Sampling: From Static to Streaming Graphs. ACM Transactions on Knowledge Discovery from Data 2014.
- (Bruno) Zhao, P., Aggarwal, C. C., & Wang, M. (2011). gSketch: On Query Estimation in Graph Stream. Proceedings of the VLDB Endowment, 5(3), 193–204. doi:10.14778/2078331.2078335
- Oct 27 (Factorization & Network Evolution):
- A Tensor Decomposition Primer. [pdf]
To Read:
- Ermiş, B., Acar, E., & Cemgil, A. T. (2013). Link prediction in heterogeneous data via generalized coupled tensor factorization. Data Mining and Knowledge Discovery, 29(1), 203–236.
- Yılmaz, K. Y., Cemgil, A. T., & Simsekli, U. (2011). Generalised Coupled Tensor Factorisation. In Advances in Neural Information Processing Systems (pp. 2151–2159).
- Oct 29 (Project Milestone):
- Advanced Topics in Tensor Decomposition. [pdf]
- Nov 3 (Project Milestone):
- Project Milestone Presentations
- Nov 5 (Network Communities):
- Nov 10: No Class
- Nov 12 (Network Utility):
- Principles of network utility maximization. Distributed utility maximization [pdf].
- Nov 17 (Methods and Criteria for Model Selection):
- Methods and Criteria for Model Selection [pdf]:
- Compression, Minimum Description Length approaches.
- AIC, Bayes Factors, BIC, Mallows's C_p.
- Nov 19 (Summarizing Graphs):
- Nov 24 (Recommendations on Graphs (lecture)):
- (Bruno) Principles of crowdsourcing and graph-based recommendations [pdf].
- Dec 1 (Recommendations on Graphs (presentations)):
- Dec 3 (Reviews of Reviews):
- Reading papers and Reviewing papers
- Dec 8 (Final Project):
- Final Project Presentations
- Dec 10 (Final Project):
- Final Project Presentations
- Dec 12 (ONLINE):