CS59200BDS New Trends in Big Data Systems
Instructor: Walid G. Aref
Semester: Fall 2022
Tuesdays and Thursdays 1:30-2:45PM (KRAN G004)
Office Hours: Mondays 2-6PM (Over Zoom to
discuss the projects)
Contact Information: aref@purdue.edu, Room Number: HAAS 164C
Overview:
In this course, we will explore together recent trends and advances in Big Data Systems. It is assumed that the student has had at least one undergraduate or graduate database systems course. This course is project-centric. Each student will choose a semester-long project. Students can work alone or in groups. Students are expected to report regularly their progress in their project during class time and in separate meetings along with presenting research papers that directly contribute to their project. Each student is expected to present around 3 research presentations each semester. For project reporting, each group will report at least once every two weeks, and for students electing to go faster track, they can report once a week.
Students will meet with the instructor weekly or biweekly in person or over Zoom to demonstrate their progress. Each meeting will be around 15-30 minutes long based on the progress made by the group. Several projects from multiple groups may be presented in one project meeting, and hence there will an opportunity for groups to see each other’s project presentations.
We will study topics related to new memory technologies and their impact on database engines, e.g., main-memory database techniques, persistent memory hierarchies, and SSDs, their effect on query processing, data indexing, concurrency, lock-free protocols, reliability, and recovery. We will study challenges in realizing NoSQL databases, implications of new hardware technologies on database engine realization including cache hierarchies, NUMA awareness, and vectorization techniques. We will have some focus on location data and how to support location data in big data systems.
Tentative Topics To Be Covered
In Class
Based on the projects that the students will select, topics to be covered in class (via paper presentations) will cover some of the following topics
- Background Material (By the Instructor)
o Multi-dimensional Indexing
o Generalized Search Trees
o Big Data Streaming Techniques
o Micro-batching in Spark Streaming
- Modern Storage Technologies and Their Characteristics
o SSD
o Persistent/Non Volatile Memory
- Non-Uniform Memory Access (NUMA) Awareness and Load Balancing in Distributed Query Processing (Work Stealing – Pull vs. Work Sharing – Push)
- Indexing Techniques Suitable for Big Data Systems
o LSM and how to make it more efficient
o Skiplists and their use in big data systems
o Adaptive Indexing
o Latch-free Data Indexing
- NoSQL and Multi-Model Big Data Systems
- Main-memory Big Data System Techniques
- Learned Multi-Dimensional Indexing
- Column Stores vs. Row Stores
- Distributed Shared Storage for Cloud Databases
- Relaxed Notions of Concurrency
o Linearizability/Strong Consistency
o Eventual Consistency
o Snapshot Isolation
o Causal Consistency
- Multi-version Concurrency Control
- Vectorized Database Techniques
- Big Graph Data Systems
- Document Data Systems
- Query Optimization Frameworks
Tentative Project Topics:
Students can pick from the tentative list of projects below or propose their own projects (has to be approved by the instructor)
1. Studying how documents are stored and are indexed in big data systems that adopt the document model and study indexing and query processing techniques over these documents
· In this project, you will be studying the papers, documentation, and code (if open source) of real big data systems that support documents, and explore how documents are stored, indexed, and what operations are performed on them.
2. Studying Indexing Techniques in Non-Volatile Memory
· In this project, you will be studying the characteristics of the various technologies and highlights of how they work, and how these characteristics affect index design.
3. Studying SSD and Persistent Memory Storage Technologies
· In this projects, you will be studying the characteristics of SSD and various PT storage technologies, their properties, and how they affect the design of Big Data Storage and Retrieval systems.
4. Studying Main-memory Indexing Techniques
· In this project, you will be studying how indexes are realized in main-memory based big data systems, and how they differ from traditional disk-based indexes and how their concurrency control is handled.
5. Studying NUMA-aware query processing techniques
· In this project, you will be studying the NJUMA Architecture (Non-Uniform Memory Access), how it affects performance on various big data system components, e.g., query processing and indexing.
6. Studying LSM-based storage techniques in big data systems
· In this project, you will be studying LSM and what makes it popular in NoSQL and big data systems.
7. Studying the use of skiplists in big data systems
· In this project, you will be searching for and studying big data systems that use skiplists and trying to identify why they are popular, their advantages, and how they are used, e.g., as in Pebbles DB.
8. Survey of NoSQL Stores
· In this project, you will study several NoSQL stores that support key-value, document, graph, wide-column, location, etc.
9. Survey of Learned Multi-dimensional Indexes
· In this project, you will be collaborating to write a survey article about learned indexes for the multi-dimensional space.
10. Survey of Distance Oracle Techniques
· In this project, you will be collaborating to write a survey article about distance oracles.
11. Survey of Column-store Big Data Techniques
· In this project, you will be studying column-store data techniques in terms of storage, indexing, and query processing.
12. Survey of Graph Data Engines
· In this project, you will be studying and contrasting various graph data engines, their query languages, and their processing capabilities.
13. Studying Distributed Shared Memory and how it relates to Distributed Shared Storage in Cloud Databases
· In this project, you will be studying distributed shared storage in cloud setups, e.g., Amazon S3 and the effect of adding a distributed shared memory layer.
14. Studying Relaxed Notions of Concurrency for Big Data Systems
· In this project, you will be studying relaxed notions of concurrency protocols that are suitable for big data systems.
15. Studying Multi-version Concurrency Control Techniques
· In this project, you will be studying MVCC and how it is implemented in various systems, e.g., PostgreSQL, MySQL, and SQLLite and how to make MVCC more efficient and how it can support various isolation levels.
16. Studying and Comparing Multi-Model Big Data Systems
· In this project, you will be studying how multi-model databases are designed by comparing the design decisions of several multi-model big data systems and how they work, their supported data models, programming languages, storage models, cross-model optimizations.
17. Survey of Cloud Database Techniques
· In this project, you will be studying how cloud database techniques differ from regular database systems.
18. Survey of Vectorized Database Techniques
· In this project, you will be studying how database systems speed up query processing by using vectorization.
19. Write Ahead Log Techniques for Main Memory Databases
· In this project, you will be studying techniques for Write Ahead Log in main memory databases.
20.
How to Support GDPR in
Big Location Data Systems
·
General Data Protection
Regulation (GDPR)
· In this project, you will study GDPR and provide a design of how big location data systems can support the various features of GDPR.
21. Realize a Bw-tree and Bz-tree indexes
· In this project, you will realize these two indexes
22. Survey of the use of Epoch-based techniques in database systems
· In this project, you will study how Epoch-based techniques can be used in database systems, e.g., for garbage collection, along with multi-version concurrency control, and for group commit, among other usages, along with a study of the advantages and disadvantages of Epoch-based techniques.
Project Deliverables:
Each group is expected to meet with the instructor during Zoom office hours (Mondays 2-5PM) on weekly/biweekly basis to discuss project progress. By the end of the project, each group is expected to deliver a final report including the following:
Paper Presentations in Class:
Students are expected to present around 3 papers in class (in ppt) preferably but not necessarily related to their project.
Evaluation:
Students will be evaluated based on their project progress and accomplishments as well as their class presentations.
- Project (70%): Distributed as follows:
o (35%) Regular bi-weekly progress using ppt presentations
o (20%) Final Project Presentation
o (15%) Final Project Report + Documented Source Code (if applicable)
- Class Presentations (30%)
o (10% each) 3 in-class presentations
o (10%) in-class discussions
Tentative List of Papers for Project and Course
Presentations
1.
Robert D. Blumofe
and Charles E. Leiserson. 1999. Scheduling
Multithreaded Computations by Work Stealing. J. ACM 46, 5 (Sept. 1999),
720–748. https: //doi.org/10.1145/324133.324234
2.
David Chase and Yossi Lev. 2005.
Dynamic circular work-stealing deque. In Proceedings of the seventeenth annual
ACM symposium on Parallelism in algorithms and architectures. 21–28.
3.
James Dinan,
D Brian Larkins, Ponnuswamy Sadayappan,
Sriram Krishnamoorthy, and Jarek Nieplocha.
2009. Scalable work stealing. In Proceedings of the Conference on High
Performance Computing Networking, Storage and Analysis. IEEE, 1–11.
4. Nhat Minh Lê, Antoniu Pop, Albert Cohen, and Francesco Zappa Nardelli. 2013. Correct and efficient work-stealing for weak memory models. ACM SIGPLAN Notices 48, 8 (2013), 69–80.
5.
*Puya Memarzia, Suprio Ray, and
Virendra C Bhavsar. 2020. The Art of Efficient In-memory Query Processing on
NUMA Systems: A Systematic Approach. In 2020 IEEE 36th International Conference
on Data Engineering (ICDE). IEEE, 781–792.
6.
*Suprio
Ray, Catherine Higgins, Vaishnavi Anupindi, and Saransh Gautam. 2020. Enabling NUMA-aware Main Memory
Spatial Join Processing: An Experimental Study. ADMS@ VLDB (2020)
7.
https://linux.die.net/man/2/mbind
for setting the NUMA policies in Linux (and 𝑠𝑒𝑡_𝑚𝑒𝑚𝑝𝑜𝑙𝑖𝑐𝑦)
8.
*T
Li, B Chandramouli, S Madden. 2022. Performant
Almost-Latch-Free Data Structures Using Epoch Protection. Data Management on
New Hardware, 1-10. https://dl.acm.org/doi/fullHtml/10.1145/3533737.3535091
9.
*M Athanassoulis, MS Kester, LM Maas, R Stoica,
S Idreos, A Ailamaki.
Designing Access Methods: The RUM Conjecture. EDBT 2016, 461-466 https://stratos.seas.harvard.edu/files/stratos/files/rum.pdf
10. *G.Graefe, F.Halim, S.Idreos, H.Kuno, and S.Manegold. Concurrency
control for adaptive indexing. PVLDB, 5(7):656–667, 2012.
11. *G.Graefe,F.Halim,
S.Idreos,H.A.Kuno,S.Manegold,andB.Seeger.
Transactional support for adaptive indexing. VLDBJ, 23(2):303–328, 2014.
12. *J.Rao and K.A.Ross. Making B+-trees cache conscious in main memory.
In SIGMOD, 2000.
13. *Joy Arulraj, Justin Levandoski,
Umar Farooq Minhas, and Per-Ake Larson. 2018. Bztree:
A High-Performance Latch-Free Range Index for Non-Volatile Memory. Proc. VLDB
Endow. 11, 5 (Jan. 2018), 553–565. https://doi.org/10.1145/3164135.3164147
14. *Shimin Chen and Qin Jin.2015.
Persistent B+-Trees in Non-Volatile Main Memory. Proc. VLDB Endow. 8, 7 (Feb.
2015), 786–797. https://doi.org/10.14778/2752939. 2752947
15. Deukyeon
Hwang, Wook-Hee Kim, Youjip
Won, and Beomseok Nam. 2018. Endurable Transient
Inconsistency in Byte-Addressable Persistent B+-Tree. In 16th USENIX Conference
on File and Storage Technologies (FAST 18). USENIX As- sociation, Oakland, CA,
187–200. https://www.usenix.org/conference/fast18/ presentation/hwang
16. Se Kwon Lee, K. Hyun Lim, Hyunsub Song, Beomseok Nam, and Sam H. Noh. 2017. WORT: Write Optimal
Radix Tree for Persistent Memory Storage Systems. In 15th USENIX Conference on
File and Storage Technologies (FAST 17). USENIX Association, Santa Clara, CA,
257–270. https://www.usenix.org/conference/fast17/technical-
sessions/presentation/lee-se-kwon
17. *Xinjing Zhou, Lidan
Shou, Ke Chen, Wei Hu, and Gang Chen. 2019. DPTree: Differential Indexing for Persistent Memory. Proc.
VLDB Endow. 13, 4 (Dec. 2019), 421–434. https://doi.org/10.14778/3372716.3372717
18. Pengfei Zuo, Yu Hua, and Jie Wu. 2018.
Write-Optimized and High-Performance Hashing Index Scheme for Persistent
Memory. In 13th USENIX Symposium on Operating Systems Design and Implementation
(OSDI 18). USENIX Association, Carlsbad, CA, 461–476. https://www.usenix.org/conference/osdi18/presentation/zuo
19. Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, Khai Leong Yong,
and Bingsheng He. 2015. NV-Tree: Reducing Consistency
Cost for NVM-based Single Level Systems. In 13th USENIX Conference on File and
Storage Technologies (FAST 15). USENIX Association, Santa Clara, CA, 167–181. https://www.usenix.org/conference/fast15/technical-sessions/presentation/yang
20. *Jihang Liu, Shimin
Chen, and Lujun Wang. 2020. LB+Trees:
Optimizing Persistent Index Performance on 3DXPoint Memory. Proc. VLDB Endow.
13, 7 (March 2020), 1078–1090. https://doi.org/10.14778/3384345.3384355
21. Baotong Lu,
Xiangpeng Hao, Tianzheng
Wang, and Eric Lo. 2020. Dash. Pro- ceedings of the
VLDB Endowment 13, 8 (Apr 2020), 1147–1161. https://doi.org/10.
14778/3389133.3389134
22. *Shaonan Ma, Kang Chen, Shimin Chen, Mengxing Liu, Jianglang Zhu, Hongbo Kang, and Yongwei Wu. 2021. ROART: Range-query Optimized Persistent
ART. In 19th USENIX Conference on File and Storage Technologies (FAST 21).
USENIX Associ- ation,Virtual,1–16.
https://www.usenix.org/conference/fast21/presentation/ma
23. Moohyeon
Nam, Hokeun Cha,Youngri
Choi, Sam H. Noh, and Beomseok Nam. 2019.
Write-Optimized Dynamic Hashing for Persistent Memory. In 17th USENIX
Conference on File and Storage Technologies (FAST 19). USENIX Association,
Boston, MA, 31–44. https://www.usenix.org/conference/fast19/presentation/nam
24. Ismail Oukid, Johan Lasperas,
Anisoara Nica,Thomas
Willhalm, and Wolfgang Lehner. 2016. FPTree: A Hybrid SCM-DRAM Persistent and Concurrent B-Tree
for Storage Class Memory. In Proceedings of the 2016 International Conference
on Management of Data (SIGMOD ’16). Association for Computing Machinery, New
York, NY, USA, 371–386. https://doi.org/10.1145/2882903.2915251
25. *Zixuan Wang, Xiao Liu, Jian Yang,
Theodore Michailidis, Steven Swanson, and Jishen Zhao. 2020. Characterizing and Modeling Non-Volatile
Memory Systems. In 2020 53rd Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO).496–508. https://doi.org/10.1109/MICRO50266.2020.00049
29. *Ali Davoudian, Liu Chen, and Mengchi Liu. 2018. A Survey on NoSQL Stores. ACM Comput. Surv. 51, 2, Article 40
(March 2019), 43 pages. https://doi.org/10.1145/3158661
30. P. Raju, R. Kadekodi, V. Chidambaram,
and I. Abraham, “PebblesDB: Building Key-Value Stores
using Fragmented Log-Structured Merge Trees,” in Proceedings of the 26th
Symposium on Operating Systems Principles. Shanghai China: ACM, 2017, pp.
497–514.
31. Java’s ConcurrentSkipListMap. http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ConcurrentSkipListMap.html,
Apr. 2014.
32. Original skip list paper: PUGH, W. Skip lists: A probabilistic
alternative to balanced trees. Commun. ACM
33, 6 (1990), 668–676.
33. *LEHMAN, P. L., AND YAO, S. B. Efficient locking for concurrent operations
on b-trees. ACM Trans. Database Syst. 6, 4 (1981), 650–670.
34. *M.A. Bender, M.Farach-Colton,
R.Johnson, S.Mauras, T.Mayer, C.A.Phillips, and H. Xu.
Write-optimized skip lists. In PODS’17, 2017.
35. *Kersten, Timo, et al. "Everything you always wanted to know
about compiled and vectorized queries but were afraid to ask." Proceedings
of the VLDB Endowment 11.13 (2018): 2209-2222.
36. *Sepanta Zeighami, Raymond Chi-Wing Wong. Bridging the
Gap Between Theory and Practice on Insertion-Intensive Database. Proceedings
of the VLDB Endowment 2020. https://arxiv.org/pdf/2003.01064.pdf
37. *Chen Luo · Michael J. Carey. LSM-based Storage Techniques: A
Survey, VLDBJ 2019. https://arxiv.org/pdf/1812.07527.pdf
38. *The Cascades Framework for Query Optimization. http://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2017/2014/Papers/Cascades-graefe.pdf
39. *Goetz Graefe. Volcano-An Extensible and
Parallel Query Evaluation System. IEEE Transactions on Knowledge and Data
Engineering Vol. 6, No. 1, Feb. 1994.
40. *The Apache Kafka Streaming Engine. https://kafka.apache.org/ 2021.
41. *Arjun Singhvi, Aditya Akella, Maggie Anderson, Rob Cauble,
Harshad Deshmukh, Dan Gibson, Milo M. K. Martin, Amanda Strominger,
Thomas F. Wenisch, Amin Vahdat. 2021. CliqueMap: Productionizing an RMA- Based Distributed
Caching System. In ACM SIGCOMM 2021 Conference (SIGCOMM ’21), August 23–28,
2021, Virtual Event, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3452296.3472934
42. Guy Golan-Gueta, Edward Bortnikov, Eshcar Hillel, and Idit Keidar. 2015. Scaling
concurrent log-structured data stores. In Proceedings of the Tenth European
Conference on Computer Systems (EuroSys '15).
Association for Computing Machinery, New York, NY, USA, Article 32, 1–14. https://doi.org/10.1145/2741948.2741973
43. Lu, Y.; Yu, X.; Cao, L.; and Madden, S., Epoch-based Commit and Replication in
Distributed OLTP Databases (pdf). Proc. VLDB Endow., 14(5): 743–756. 2021.
44. Shatdal,
Kant, Naughton. Cache Conscious Algorithms for Relational Query Processing.”
VLDB 1994.
45. Michael Stonebraker, Chuck Bear, Ugur Çetintemel, Mitch Cherniack, Tingjian Ge, Nabil Hachem, Stavros Harizopoulos,
John Lifter, Jennie Rogers, and Stan Zdonik. One Size Fits All?
– Part 2: Benchmarking Results. CIDR 2007.
46. da Trindade, J. M. F.; Karanasos, K.; Curino, C.; Madden, S.; and Shun, J., Kaskade:
Graph Views for Efficient Graph Analytics (pdf). In 36th
IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA,
April 20-24, 2020, pages 193–204, 2020.
47. Boncz, Manegold,
Kersten. Database Architecture Optimized for the New Bottleneck: Memory
Access.” VLDB 1999.
48. Stonebraker,
M.; Madden, S.; Abadi, D. J.; Harizopoulos, S.; Hachem, N.; and Helland, P., The end of an
architectural era: it's time for a complete rewrite (pdf). In
Brodie, M. L., editor(s), Making Databases Work: the
Pragmatic Wisdom of Michael Stonebraker, pages 463–489. ACM / Morgan &
Claypool, 2019.
49. Ailamaki,
DeWitt, Hill, Skounakis. Weaving Relations for Cache
Performance. VLDB 2001.
50. Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack,
Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden,
Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran,
and Stan Zdonik. 2018. C-store: a column-oriented
DBMS. Making Databases Work: The Pragmatic Wisdom of Michael Stonebraker. Association for Computing Machinery and Morgan
& Claypool, 491–518. https://doi.org/10.1145/3226595.3226638.
51. Hankins, Patel. Data morphing: an adaptive, cache-conscious
storage technique. VLDB 2003.
52. Shao, Schindler, Schlosser, Ailamaki, and
Ganger. Clotho: Decoupling Memory Page Layout from Storage Organization. VLDB
2004.
53. Ramamurthy, DeWitt, Su. A Case For Fractured Mirrors. VLDB 2002.
54. Harizopoulos,
Liang, Abadi, Madden. Performance Tradeoffs in Read-Optimized Databases, VLDB
2006.
55. Holloway, DeWitt. Read-Optimized Databases, In-Depth, VLDB 2008.
56. Boncz, Zukowski,
Nes. DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing,
DaMoN 2008.
57. Shah, Harizopoulos, Wiener, Graefe. Fast Scans and Joins Using Flash Drives. DaMoN 2008.
58. Tsirogiannis, Harizopoulos, Shah, Wiener, Graefe.
Query Processing Techniques for Solid State Drives, SIGMOD 2009.
59. Graefe.
Efficient columnar storage in B-trees. SIGMOD Record 03/2007.
60. Jaluta, Sippu,
and Soisalon-soininen. Concurrency control and
recovery for balanced Blink trees. The VLDB Journal, 14(2):257-277, 2005.
DOI: 10.1007/s00778-004-0140-6
Course Policy:
In
this class, we adopt the following policy: http://spaf.cerias.purdue.edu/cpolicy.html
Students are
strongly advised that any act of cheating or plagiarism will result in a
score of 0 for the entire assignment (project or paper presentations) and repeat offences
will be reported to the Office of the Dean of Students and will result in
an automatic F grade.
In
the event of a major campus emergency, course requirements, deadlines, and
grading percentages are subject to changes that may be necessitated by a
revised semester calendar. If such unusual circumstances arise, students may
determine any such changes by contacting me via email (aref@purdue.edu)
Class Schedule
Date |
Topic |
Presenter |
Comments |
8/23 |
Introduction |
Walid |
|
8/25 |
Multi-Dimensional Indexing 1 |
Walid |
|
8/30 |
Multi-Dimensional Indexing 2 |
Walid |
|
9/1 |
Big Data Streaming Systems and Techniques: Concepts |
Walid |
|
9/6 |
Big Data Streaming Systems and Techniques: Query Processing |
Walid |
|
9/8 |
Big Data Streaming Systems and Techniques: Micro-batching |
Walid |
|
9/13 |
Extensible Indexing |
Walid |
|
9/15 |
a)
Scheduling Multithreaded
Computations by Work Stealing b)
Dynamic circular work-stealing
deque. |
Ruihong Ruihong |
|
9/20 |
a)
Skip lists: A probabilistic
alternative to balanced trees b)
Write-optimized skip lists |
Durga Qiyang |
|
9/22 |
a)
C-store: a column-oriented DBMS.
Making Databases Work b)
Persistent B+-Trees in
Non-Volatile Main Memory |
Pavan Darwin |
|
9/27 |
a)
Transactional support for
adaptive indexing. b)
The Apache Kafka Streaming |
Abdullah Akshay |
|
9/29 |
Concurrency control and recovery for balanced Blink trees |
Yeasir |
|
10/4 |
a)
Performant Almost-Latch-Free Data
Structures Using Epoch Protection b)
Making B+-trees cache conscious
in main memory |
Libin Qiyang |
|
10/6 |
LSM-based Storage Techniques: A Survey |
Shubham |
|
10/11 |
October Break – No Class |
|
|
10/13 |
a)
PushdownDB:
Accelerating a DBMS using S3 computation b)
Flexpushdowndb:
Hybrid pushdown and caching in a cloud DBMS |
Yunxin Yunxin |
|
10/18 |
a)
Kaskade:
Graph Views for Efficient Graph Analytics b)
Performance Tradeoffs in
Read-Optimized Databases |
Pavan Darwin |
|
10/20 |
Database Architecture Optimized for the New Bottleneck: Memory
Access |
Durga |
|
10/25 |
Fast Scans and Joins Using Flash Drives |
Abdullah |
|
10/27 |
a)
Designing Access Methods: The RUM
Conjecture b)
Data morphing: an adaptive,
cache-conscious storage technique |
Lu Akshay |
|
11/1 |
Volcano-An Extensible and Parallel Query Evaluation System |
Yeasir |
|
11/3 |
a)
Bztree:
A High-Performance Latch-Free Range Index for Non-Volatile Memory b)
Efficient columnar storage in
B-trees |
Libin Yunxin |
|
11/8 |
The Cascades Framework for Query Optimization |
Shubham |
|
11/10 |
PebblesDB:
Building Key-Value Stores using Fragmented Log-Structured Merge Trees |
Qiyang |
|
11/15 |
a)
Bridging the Gap Between Theory
and Practice on Insertion-Intensive Database b)
Read-Optimized Databases,
In-Depth |
Lu Darwin |
|
11/17 |
Query Processing Techniques for Solid State Drives |
Abdullah |
|
11/22 |
Everything you always wanted to know about compiled and
vectorized queries but were afraid to ask |
Durga |
|
11/24 |
Thanksgiving Break – No Class |
|
|
11/29 |
a)
Characterizing and Modeling
Non-Volatile Memory Systems b)
Cache Conscious Algorithms for
Relational Query Processing |
Pavan Akshay |
|
12/1 |
The end of an architectural era: it's time for a complete
rewrite |
Yeasir |
|
12/5 |
Final Project Presentations (Monday 2-5PM) |
|
|
12/6 |
a)
FPTree:
A Hybrid SCM-DRAM Persistent and Concurrent B-Tree for Storage Class Memory b)
Scaling concurrent log-structured
data stores. |
Libin Shubham |
|
12/8 |
a)
The Art of Efficient In-memory
Query Processing on NUMA Systems: a Systematic
Approach b)
Enabling NUMA-aware Main Memory
Spatial Join Processing: An Experimental Study |
Ruihong Lu |
|
12/12 |
Final Project Presentations (Monday 2-5PM) |
See schedule below |
|
12/13 |
Final Project Presentations |
See schedule below |
|
12/15 |
Final Project Presentations |
See schedule below |
|
12/17 |
Semester End |
|
|
Project Assignments
Student |
Project |
|
Abdullah |
Survey of Learned Multi-dimensional Indexes |
|
Akshay |
Survey of NoSQL Stores |
|
Darwin |
Survey of NoSQL Stores |
|
Durga |
Survey of Graph Data Engines |
|
Libin |
The BW-Graph: A Graph Structure for New Hardware Platforms |
|
Lu |
Studying NUMA-aware query processing techniques |
|
Pavan |
Studying the use of skiplists in big data systems |
|
Qiyang |
Survey of Learned Multi-dimensional Indexes |
|
Ruihong |
Studying NUMA-aware query processing techniques |
|
Shubham |
Studying LSM-based storage
techniques in big data systems |
|
Yeasir |
Studying NUMA-aware query processing techniques |
|
Yunxin |
Survey of Cloud Database Techniques |
|
Project Meetings Schedule (with Walid)
Project meetings take place every Monday over zoom from 2pm-6pm
according to the following schedule:
Meetings |
Students |
Notes |
Fast Track (Weekly) |
Libin,
Pavan, Yeasir, Yunxin |
Biweekly students will present first, and then Fast track
students next |
Biweekly (Group 1) |
Abdullah,
Akshay, Darwin, Qiyang |
Students with other time commitments will be given priority to
present first |
Biweekly (Group 2) |
Durga,
Lu, Ruihong, Shubham |
Attending and listening to other students’ weekly or biweekly
presentations is an excellent opportunity for other students to learn about
the subject as well as to provide feedback and participate in the discussions |
Final Project Presentations Schedule
Paper Assignments:
Durga |
32. Original skip list paper: PUGH, W. Skip lists: A
probabilistic alternative to balanced trees. Commun.
ACM 33, 6 (1990), 668–676. |
Durga |
35. *Kersten, Timo, et al. "Everything you
always wanted to know about compiled and vectorized queries but were afraid
to ask." Proceedings of the VLDB Endowment 11.13
(2018): 2209-2222. |
Durga |
47. Boncz, Manegold, Kersten. Database Architecture Optimized for
the New Bottleneck: Memory Access.” VLDB 1999. |
Pavan |
da Trindade, J. M. F.; Karanasos, K.; Curino, C.; Madden, S.; and Shun, J., Kaskade:
Graph Views for Efficient Graph Analytics (pdf).
In 36th IEEE International Conference on Data Engineering, ICDE 2020,
Dallas, TX, USA, April 20-24, 2020, pages 193–204, 2020. |
Pavan |
C-store: a column-oriented DBMS. Making Databases Work: The
Pragmatic Wisdom of Michael Stonebraker. |
Pavan |
Zixuan
Wang, Xiao Liu, Jian Yang, Theodore Michailidis,
Steven Swanson, and Jishen Zhao. 2020. Characterizing and Modeling Non-Volatile Memory Systems. In 2020
53rd Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO).496–508. https://doi.org/10.1109/MICRO50266.2020.00049 |
Yeasir |
60. Jaluta, Sippu,
and Soisalon-soininen. Concurrency control and
recovery for balanced Blink trees. The VLDB Journal, 14(2):257-277, 2005. |
Yeasir |
39. *Goetz Graefe. Volcano-An
Extensible and Parallel Query Evaluation System. IEEE Transactions on
Knowledge and Data Engineering Vol. 6, No. 1, Feb. 1994. |
Yeasir |
|
Darwin |
Holloway, DeWitt. Read-Optimized Databases, In-Depth, VLDB 2008. |
Darwin |
Harizopoulos,
Liang, Abadi, Madden. Performance Tradeoffs in Read-Optimized Databases, VLDB
2006. |
Darwin Libin |
*Shimin Chen and Qin Jin.2015.
Persistent B+-Trees in Non-Volatile Main Memory. Proc. VLDB Endow. 8, 7 (Feb.
2015), 786–797. https://doi.org/10.14778/2752939. 2752947 Bztree:
A High-Performance Latch-Free Range Index for Non-Volatile Memory. |
Libin |
Performant Almost-Latch-Free Data Structures Using Epoch
Protection. |
Libin |
FPTree:
A Hybrid SCM-DRAM Persistent and Concurrent B-Tree for Storage Class
Memory. |
Lu |
Enabling NUMA-aware Main Memory Spatial Join Processing: An
Experimental Study |
Lu |
Designing Access Methods: The RUM Conjecture |
Lu |
Bridging the Gap Between Theory and Practice on
Insertion-Intensive Database |
Shubham |
*Chen Luo · Michael J. Carey. LSM-based Storage Techniques: A
Survey |
Shubham |
*The Cascades Framework for Query Optimization |
Shubham |
Guy Golan-Gueta, Edward Bortnikov, Eshcar Hillel, and Idit Keidar. 2015. Scaling concurrent log-structured data stores. |
Yunxin |
Graefe. Efficient
columnar storage in B-trees. SIGMOD Record 03/2007. |
Yunxin |
|
Yunxin |
|
Qiyang |
30. PebblesDB: Building Key-Value
Stores using Fragmented Log-Structured Merge Trees |
Qiyang |
34. Write-optimized skip lists |
Qiyang |
*J.Rao and K.A.Ross. Making B+-trees cache conscious in main memory.
In SIGMOD, 2000. |
Akshay |
The
Apache Kafka Streaming
Engine. https://kafka.apache.org/ 2021. |
Akshay |
Hankins, Patel. Data morphing: an adaptive, cache-conscious
storage technique. VLDB 2003. |
Akshay |
Shatdal,
Kant, Naughton. Cache Conscious Algorithms for Relational Query
Processing.” VLDB 1994 |
Ruihong |
Scheduling Multithreaded Computations by Work Stealing |
Ruihong |
Dynamic circular work-stealing deque. |
Ruihong |
The Art of Efficient In-memory Query Processing on NUMA Systems:
a Systematic Approach, ICDE 2020 |
Abdullah |
11. *G.Graefe,
F.Halim, S.Idreos, H.A.Kuno, S.Manegold, and B.Seeger. Transactional support for adaptive
indexing. VLDBJ, 23(2):303–328, 2014 |
Abdullah |
Shah, Harizopoulos, Wiener, Graefe. Fast Scans and Joins Using Flash
Drives. DaMoN 2008. |
Abdullah |
58. Tsirogiannis, Harizopoulos, Shah, Wiener, Graefe. Query
Processing Techniques for Solid State Drives, SIGMOD 2009 |