CS59200BDS New Trends in Big Data Systems

Instructor: Walid G. Aref

Semester: Fall 2022

 

Tuesdays and Thursdays 1:30-2:45PM (KRAN G004)

Office Hours: Mondays 2-6PM (Over Zoom to discuss the projects)

Contact Information: aref@purdue.edu, Room Number: HAAS 164C

 

Overview:

In this course, we will explore together recent trends and advances in Big Data Systems. It is assumed that the student has had at least one undergraduate or graduate database systems course. This course is project-centric. Each student will choose a semester-long project. Students can work alone or in groups. Students are expected to report regularly their progress in their project during class time and in separate meetings along with presenting research papers that directly contribute to their project. Each student is expected to present around 3 research presentations each semester. For project reporting, each group will report at least once every two weeks, and for students electing to go faster track, they can report once a week.

 

Students will meet with the instructor weekly or biweekly in person or over Zoom to demonstrate their progress. Each meeting will be around 15-30 minutes long based on the progress made by the group. Several projects from multiple groups may be presented in one project meeting, and hence there will an opportunity for groups to see each other’s project presentations.

 

We will study topics related to new memory technologies and their impact on database engines, e.g., main-memory database techniques, persistent memory hierarchies, and SSDs, their effect on query processing, data indexing, concurrency, lock-free protocols, reliability, and recovery. We will study challenges in realizing NoSQL databases, implications of new hardware technologies on database engine realization including cache hierarchies, NUMA awareness, and vectorization techniques. We will have some focus on location data and how to support location data in big data systems.

 

Tentative Topics To Be Covered In Class

Based on the projects that the students will select, topics to be covered in class (via paper presentations) will cover some of the following topics

-       Background Material (By the Instructor)

o   Multi-dimensional Indexing

o   Generalized Search Trees

o   Big Data Streaming Techniques

o   Micro-batching in Spark Streaming

-       Modern Storage Technologies and Their Characteristics

o   SSD

o   Persistent/Non Volatile Memory

-       Non-Uniform Memory Access (NUMA) Awareness and Load Balancing in Distributed Query Processing (Work Stealing – Pull vs. Work Sharing – Push)

-       Indexing Techniques Suitable for Big Data Systems

o   LSM and how to make it more efficient

o   Skiplists and their use in big data systems

o   Adaptive Indexing

o   Latch-free Data Indexing

-       NoSQL and Multi-Model Big Data Systems

-       Main-memory Big Data System Techniques

-       Learned Multi-Dimensional Indexing

-       Column Stores vs. Row Stores

-       Distributed Shared Storage for Cloud Databases

-       Relaxed Notions of Concurrency

o   Linearizability/Strong Consistency

o   Eventual Consistency

o   Snapshot Isolation

o   Causal Consistency

-       Multi-version Concurrency Control

-       Vectorized Database Techniques

-       Big Graph Data Systems

-       Document Data Systems

-       Query Optimization Frameworks

 

Tentative Project Topics:

Students can pick from the tentative list of projects below or propose their own projects (has to be approved by the instructor)

1.     Studying how documents are stored and are indexed in big data systems that adopt the document model and study indexing and query processing techniques over these documents

·      In this project, you will be studying the papers, documentation, and code (if open source) of real big data systems that support documents, and explore how documents are stored, indexed, and what operations are performed on them.

 

2.     Studying Indexing Techniques in Non-Volatile Memory

·      In this project, you will be studying the characteristics of the various technologies and highlights of how they work, and how these characteristics affect index design.

 

3.     Studying SSD and Persistent Memory Storage Technologies

·      In this projects, you will be studying the characteristics of SSD and various PT storage technologies, their properties, and how they affect the design of Big Data Storage and Retrieval systems.

 

4.     Studying Main-memory Indexing Techniques

·      In this project, you will be studying how indexes are realized in main-memory based big data systems, and how they differ from traditional disk-based indexes and how their concurrency control is handled.

 

5.     Studying NUMA-aware query processing techniques

·      In this project, you will be studying the NJUMA Architecture (Non-Uniform Memory Access), how it affects performance on various big data system components, e.g., query processing and indexing.

 

6.     Studying LSM-based storage techniques in big data systems

·      In this project, you will be studying LSM and what makes it popular in NoSQL and big data systems.

 

7.     Studying the use of skiplists in big data systems

·      In this project, you will be searching for and studying big data systems that use skiplists and trying to identify why they are popular, their advantages, and how they are used, e.g., as in Pebbles DB.

 

8.     Survey of NoSQL Stores

·      In this project, you will study several NoSQL stores that support key-value, document, graph, wide-column, location, etc.

 

9.     Survey of Learned Multi-dimensional Indexes

·      In this project, you will be collaborating to write a survey article about learned indexes for the multi-dimensional space.

 

10.  Survey of Distance Oracle Techniques

·      In this project, you will be collaborating to write a survey article about distance oracles.

 

11.  Survey of Column-store Big Data Techniques

·      In this project, you will be studying column-store data techniques in terms of storage, indexing, and query processing.

 

12.  Survey of Graph Data Engines

·      In this project, you will be studying and contrasting various graph data engines, their query languages, and their processing capabilities.

 

13.  Studying Distributed Shared Memory and how it relates to Distributed Shared Storage in Cloud Databases

·      In this project, you will be studying distributed shared storage in cloud setups, e.g., Amazon S3 and the effect of adding a distributed shared memory layer.

 

14.  Studying Relaxed Notions of Concurrency for Big Data Systems

·      In this project, you will be studying relaxed notions of concurrency protocols that are suitable for big data systems.

 

15.  Studying Multi-version Concurrency Control Techniques

·      In this project, you will be studying MVCC and how it is implemented in various systems, e.g., PostgreSQL, MySQL, and SQLLite and how to make MVCC more efficient and how it can support various isolation levels.

 

16.  Studying and Comparing Multi-Model Big Data Systems

·      In this project, you will be studying how multi-model databases are designed by comparing the design decisions of several multi-model big data systems and how they work, their supported data models, programming languages, storage models, cross-model optimizations.

 

17.  Survey of Cloud Database Techniques

·      In this project, you will be studying how cloud database techniques differ from regular database systems.

 

18.  Survey of Vectorized Database Techniques

·      In this project, you will be studying how database systems speed up query processing by using vectorization.

 

19.  Write Ahead Log Techniques for Main Memory Databases

·      In this project, you will be studying techniques for Write Ahead Log in main memory databases.

 

20.  How to Support GDPR in Big Location Data Systems

·      General Data Protection Regulation (GDPR)

·      In this project, you will study GDPR and provide a design of how big location data systems can support the various features of GDPR.

 

21.  Realize a Bw-tree and Bz-tree indexes

·      In this project, you will realize these two indexes

 

22.  Survey of the use of Epoch-based techniques in database systems

·      In this project, you will study how Epoch-based techniques can be used in database systems, e.g., for garbage collection, along with multi-version concurrency control, and for group commit, among other usages, along with a study of the advantages and disadvantages of Epoch-based techniques.

 

Project Deliverables:

Each group is expected to meet with the instructor during Zoom office hours (Mondays 2-5PM) on weekly/biweekly basis to discuss project progress. By the end of the project, each group is expected to deliver a final report including the following:

    1. A detailed ppt deck of slides reflecting their project from begin to end.
    2. A report (in a conference paper format that the instructor will provide) that reflects all the details of the project and its findings, e.g., experimental setup and results, discussion, etc. You need to provide the source documents and figures and not only the pdf.
    3. Clearly documented source code (if applicable) with scripts and data sets for the instructor to be able to replicate the results

 

 

Paper Presentations in Class:

Students are expected to present around 3 papers in class (in ppt) preferably but not necessarily related to their project.

 

Evaluation:

Students will be evaluated based on their project progress and accomplishments as well as their class presentations.

-       Project (70%): Distributed as follows:

o   (35%) Regular bi-weekly progress using ppt presentations

o   (20%) Final Project Presentation

o   (15%) Final Project Report + Documented Source Code (if applicable)

 

-       Class Presentations (30%)

o   (10% each) 3 in-class presentations

o   (10%) in-class discussions

 

 

Tentative List of Papers for Project and Course Presentations

1.     Robert D. Blumofe and Charles E. Leiserson. 1999. Scheduling Multithreaded Computations by Work Stealing. J. ACM 46, 5 (Sept. 1999), 720–748. https: //doi.org/10.1145/324133.324234

 

2.     David Chase and Yossi Lev. 2005. Dynamic circular work-stealing deque. In Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures. 21–28.

 

3.     James Dinan, D Brian Larkins, Ponnuswamy Sadayappan, Sriram Krishnamoorthy, and Jarek Nieplocha. 2009. Scalable work stealing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. IEEE, 1–11.

 

4.     Nhat Minh Lê, Antoniu Pop, Albert Cohen, and Francesco Zappa Nardelli. 2013. Correct and efficient work-stealing for weak memory models. ACM SIGPLAN Notices 48, 8 (2013), 69–80.

 

5.     *Puya Memarzia, Suprio Ray, and Virendra C Bhavsar. 2020. The Art of Efficient In-memory Query Processing on NUMA Systems: A Systematic Approach. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 781–792.

 

6.     *Suprio Ray, Catherine Higgins, Vaishnavi Anupindi, and Saransh Gautam. 2020. Enabling NUMA-aware Main Memory Spatial Join Processing: An Experimental Study. ADMS@ VLDB (2020)

 

7.     https://linux.die.net/man/2/mbind for setting the NUMA policies in Linux (and 𝑠𝑒𝑡_𝑚𝑒𝑚𝑝𝑜𝑙𝑖𝑐𝑦)

 

8.     *T Li, B Chandramouli, S Madden. 2022. Performant Almost-Latch-Free Data Structures Using Epoch Protection. Data Management on New Hardware, 1-10. https://dl.acm.org/doi/fullHtml/10.1145/3533737.3535091

 

9.     *M Athanassoulis, MS Kester, LM Maas, R Stoica, S Idreos, A Ailamaki. Designing Access Methods: The RUM Conjecture. EDBT 2016, 461-466 https://stratos.seas.harvard.edu/files/stratos/files/rum.pdf

 

10.  *G.Graefe, F.Halim, S.Idreos, H.Kuno, and S.Manegold. Concurrency control for adaptive indexing. PVLDB, 5(7):656–667, 2012.

 

11.  *G.Graefe,F.Halim, S.Idreos,H.A.Kuno,S.Manegold,andB.Seeger. Transactional support for adaptive indexing. VLDBJ, 23(2):303–328, 2014.

 

12.  *J.Rao and K.A.Ross. Making B+-trees cache conscious in main memory. In SIGMOD, 2000.

 

13.  *Joy Arulraj, Justin Levandoski, Umar Farooq Minhas, and Per-Ake Larson. 2018. Bztree: A High-Performance Latch-Free Range Index for Non-Volatile Memory. Proc. VLDB Endow. 11, 5 (Jan. 2018), 553–565. https://doi.org/10.1145/3164135.3164147  

 

14.  *Shimin Chen and Qin Jin.2015. Persistent B+-Trees in Non-Volatile Main Memory. Proc. VLDB Endow. 8, 7 (Feb. 2015), 786–797. https://doi.org/10.14778/2752939. 2752947

 

15.  Deukyeon Hwang, Wook-Hee Kim, Youjip Won, and Beomseok Nam. 2018. Endurable Transient Inconsistency in Byte-Addressable Persistent B+-Tree. In 16th USENIX Conference on File and Storage Technologies (FAST 18). USENIX As- sociation, Oakland, CA, 187–200. https://www.usenix.org/conference/fast18/ presentation/hwang 

 

16.  Se Kwon Lee, K. Hyun Lim, Hyunsub Song, Beomseok Nam, and Sam H. Noh. 2017. WORT: Write Optimal Radix Tree for Persistent Memory Storage Systems. In 15th USENIX Conference on File and Storage Technologies (FAST 17). USENIX Association, Santa Clara, CA, 257–270. https://www.usenix.org/conference/fast17/technical- sessions/presentation/lee-se-kwon    

 

17.  *Xinjing Zhou, Lidan Shou, Ke Chen, Wei Hu, and Gang Chen. 2019. DPTree: Differential Indexing for Persistent Memory. Proc. VLDB Endow. 13, 4 (Dec. 2019), 421–434. https://doi.org/10.14778/3372716.3372717

 

18.  Pengfei Zuo, Yu Hua, and Jie Wu. 2018. Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 461–476. https://www.usenix.org/conference/osdi18/presentation/zuo

 

19.  Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, Khai Leong Yong, and Bingsheng He. 2015. NV-Tree: Reducing Consistency Cost for NVM-based Single Level Systems. In 13th USENIX Conference on File and Storage Technologies (FAST 15). USENIX Association, Santa Clara, CA, 167–181. https://www.usenix.org/conference/fast15/technical-sessions/presentation/yang

 

20.  *Jihang Liu, Shimin Chen, and Lujun Wang. 2020. LB+Trees: Optimizing Persistent Index Performance on 3DXPoint Memory. Proc. VLDB Endow. 13, 7 (March 2020), 1078–1090. https://doi.org/10.14778/3384345.3384355 

 

21.  Baotong Lu, Xiangpeng Hao, Tianzheng Wang, and Eric Lo. 2020. Dash. Pro- ceedings of the VLDB Endowment 13, 8 (Apr 2020), 1147–1161. https://doi.org/10. 14778/3389133.3389134 

 

22.  *Shaonan Ma, Kang Chen, Shimin Chen, Mengxing Liu, Jianglang Zhu, Hongbo Kang, and Yongwei Wu. 2021. ROART: Range-query Optimized Persistent ART. In 19th USENIX Conference on File and Storage Technologies (FAST 21). USENIX Associ- ation,Virtual,1–16. https://www.usenix.org/conference/fast21/presentation/ma 

 

23.  Moohyeon Nam, Hokeun Cha,Youngri Choi, Sam H. Noh, and Beomseok Nam. 2019. Write-Optimized Dynamic Hashing for Persistent Memory. In 17th USENIX Conference on File and Storage Technologies (FAST 19). USENIX Association, Boston, MA, 31–44. https://www.usenix.org/conference/fast19/presentation/nam 

 

24.  Ismail Oukid, Johan Lasperas, Anisoara Nica,Thomas Willhalm, and Wolfgang Lehner. 2016. FPTree: A Hybrid SCM-DRAM Persistent and Concurrent B-Tree for Storage Class Memory. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD ’16). Association for Computing Machinery, New York, NY, USA, 371–386. https://doi.org/10.1145/2882903.2915251 

 

25.  *Zixuan Wang, Xiao Liu, Jian Yang, Theodore Michailidis, Steven Swanson, and Jishen Zhao. 2020. Characterizing and Modeling Non-Volatile Memory Systems. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).496–508. https://doi.org/10.1109/MICRO50266.2020.00049

 

26.  X Yu, M Youill, M Woicik, A Ghanem, M Serafini, A Aboulnaga, Michael Stonebraker. PushdownDB: Accelerating a DBMS using S3 computation. 2020 IEEE 36th International Conference on Data Engineering (ICDE), 1802-1805, 2020.

 

27.  Y Yang, M Youill, M Woicik, Y Liu, X Yu, M Serafini, A Aboulnaga, Michael Stonebraker. Flexpushdowndb: Hybrid pushdown and caching in a cloud DBMS, Proceedings of the VLDB Endowment 14 (11), 2101-2113, 2021.

 

28.  https://www.cs.helsinki.fi/u/jilu/documents/Multi_model_Databases__A__New_Journey_to_Handle_the_Variety_of_DataFinal.pdf

 

29.  *Ali Davoudian, Liu Chen, and Mengchi Liu. 2018. A Survey on NoSQL Stores. ACM Comput. Surv. 51, 2, Article 40 (March 2019), 43 pages. https://doi.org/10.1145/3158661

 

30.  P. Raju, R. Kadekodi, V. Chidambaram, and I. Abraham, “PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees,” in Proceedings of the 26th Symposium on Operating Systems Principles. Shanghai China: ACM, 2017, pp. 497–514.

 

31.  Java’s ConcurrentSkipListMap. http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ConcurrentSkipListMap.html, Apr. 2014.

 

32.  Original skip list paper: PUGH, W. Skip lists: A probabilistic alternative to balanced trees. Commun. ACM 33, 6 (1990), 668–676.

 

33.  *LEHMAN, P. L., AND YAO, S. B. Efficient locking for concurrent operations on b-trees. ACM Trans. Database Syst. 6, 4 (1981), 650–670.

 

34.  *M.A. Bender, M.Farach-Colton, R.Johnson, S.Mauras, T.Mayer, C.A.Phillips, and H. Xu. Write-optimized skip lists. In PODS’17, 2017.

 

35.  *Kersten, Timo, et al. "Everything you always wanted to know about compiled and vectorized queries but were afraid to ask." Proceedings of the VLDB Endowment 11.13 (2018): 2209-2222.

 

36.  *Sepanta Zeighami,  Raymond Chi-Wing Wong. Bridging the Gap Between Theory and Practice on Insertion-Intensive Database. Proceedings of the VLDB Endowment 2020. https://arxiv.org/pdf/2003.01064.pdf

 

37.  *Chen Luo · Michael J. Carey. LSM-based Storage Techniques: A Survey, VLDBJ 2019. https://arxiv.org/pdf/1812.07527.pdf

 

38.  *The Cascades Framework for Query Optimization. http://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2017/2014/Papers/Cascades-graefe.pdf

 

39.  *Goetz Graefe. Volcano-An Extensible and Parallel Query Evaluation System. IEEE Transactions on Knowledge and Data Engineering Vol. 6, No. 1, Feb. 1994.

 

40.  *The Apache Kafka Streaming Engine. https://kafka.apache.org/  2021.

 

41.  *Arjun Singhvi, Aditya Akella, Maggie Anderson, Rob Cauble, Harshad Deshmukh, Dan Gibson, Milo M. K. Martin, Amanda Strominger, Thomas F. Wenisch, Amin Vahdat. 2021. CliqueMap: Productionizing an RMA- Based Distributed Caching System. In ACM SIGCOMM 2021 Conference (SIGCOMM ’21), August 23–28, 2021, Virtual Event, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3452296.3472934

 

42.  Guy Golan-Gueta, Edward Bortnikov, Eshcar Hillel, and Idit Keidar. 2015. Scaling concurrent log-structured data stores. In Proceedings of the Tenth European Conference on Computer Systems (EuroSys '15). Association for Computing Machinery, New York, NY, USA, Article 32, 1–14. https://doi.org/10.1145/2741948.2741973   

 

43.  Lu, Y.; Yu, X.; Cao, L.; and Madden, S., Epoch-based Commit and Replication in Distributed OLTP Databases (pdf).  Proc. VLDB Endow., 14(5): 743–756. 2021. 

 

44.  Shatdal, Kant, Naughton. Cache Conscious Algorithms for Relational Query Processing.” VLDB 1994.

 

45.  Michael Stonebraker, Chuck Bear, Ugur Çetintemel, Mitch Cherniack, Tingjian Ge, Nabil Hachem, Stavros Harizopoulos, John Lifter, Jennie Rogers, and Stan Zdonik. One Size Fits All? – Part 2: Benchmarking Results. CIDR 2007.

 

46.  da Trindade, J. M. F.; Karanasos, K.; Curino, C.; Madden, S.; and Shun, J., Kaskade: Graph Views for Efficient Graph Analytics (pdf).  In 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020, pages 193–204, 2020.

 

47.  Boncz, Manegold, Kersten. Database Architecture Optimized for the New Bottleneck: Memory Access.” VLDB 1999.

 

48.  Stonebraker, M.; Madden, S.; Abadi, D. J.; Harizopoulos, S.; Hachem, N.; and Helland, P., The end of an architectural era: it's time for a complete rewrite (pdf).  In Brodie, M. L., editor(s), Making Databases Work: the Pragmatic Wisdom of Michael Stonebraker, pages 463–489. ACM / Morgan & Claypool, 2019. 

 

49.  Ailamaki, DeWitt, Hill, Skounakis. Weaving Relations for Cache Performance. VLDB 2001.

 

50.  Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran, and Stan Zdonik. 2018. C-store: a column-oriented DBMS. Making Databases Work: The Pragmatic Wisdom of Michael Stonebraker. Association for Computing Machinery and Morgan & Claypool, 491–518. https://doi.org/10.1145/3226595.3226638.

 

51.  Hankins, Patel. Data morphing: an adaptive, cache-conscious storage technique. VLDB 2003.

 

52.  Shao, Schindler, Schlosser, Ailamaki, and Ganger. Clotho: Decoupling Memory Page Layout from Storage Organization. VLDB 2004.

 

53.  Ramamurthy, DeWitt, Su. A Case For Fractured Mirrors. VLDB 2002.

 

54.  Harizopoulos, Liang, Abadi, Madden. Performance Tradeoffs in Read-Optimized Databases, VLDB 2006.

 

55.  Holloway, DeWitt. Read-Optimized Databases, In-Depth, VLDB 2008.

56.  Boncz, Zukowski, Nes. DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing, DaMoN 2008.

 

57.  Shah, Harizopoulos, Wiener, Graefe. Fast Scans and Joins Using Flash Drives. DaMoN 2008.

 

58.  Tsirogiannis, Harizopoulos, Shah, Wiener, Graefe. Query Processing Techniques for Solid State Drives, SIGMOD 2009.

 

59.   Graefe. Efficient columnar storage in B-trees. SIGMOD Record 03/2007.

 

60.  Jaluta, Sippu, and Soisalon-soininen. Concurrency control and recovery for balanced Blink trees. The VLDB Journal, 14(2):257-277, 2005. DOI: 10.1007/s00778-004-0140-6

 

Course Policy:  

In this class, we adopt the following policy: http://spaf.cerias.purdue.edu/cpolicy.html

Students are strongly advised that any act of cheating or plagiarism will result in a score of 0 for the entire assignment (project or paper presentations) and repeat offences will be reported to the Office of the Dean of Students and will result in an automatic F grade.

In the event of a major campus emergency, course requirements, deadlines, and grading percentages are subject to changes that may be necessitated by a revised semester calendar. If such unusual circumstances arise, students may determine any such changes by contacting me via email (aref@purdue.edu)

 

Class Schedule

Date

Topic

Presenter

Comments

8/23

Introduction

Walid

 

8/25

Multi-Dimensional Indexing 1

Walid

 

8/30

Multi-Dimensional Indexing 2

Walid

 

9/1

Big Data Streaming Systems and Techniques: Concepts

Walid

 

9/6

Big Data Streaming Systems and Techniques: Query Processing

Walid

 

9/8

Big Data Streaming Systems and Techniques: Micro-batching

Walid

 

9/13

Extensible Indexing

Walid

 

9/15

a)     Scheduling Multithreaded Computations by Work Stealing

b)    Dynamic circular work-stealing deque.

Ruihong

 

Ruihong

 

9/20

a)     Skip lists: A probabilistic alternative to balanced trees

b)    Write-optimized skip lists

Durga

Qiyang

 

9/22

a)     C-store: a column-oriented DBMS. Making Databases Work

b)    Persistent B+-Trees in Non-Volatile Main Memory

Pavan

Darwin

 

9/27

a)     Transactional support for adaptive indexing. 

b)    The Apache Kafka Streaming

Abdullah

Akshay

 

9/29

Concurrency control and recovery for balanced Blink trees

Yeasir

 

10/4

a)     Performant Almost-Latch-Free Data Structures Using Epoch Protection

b)    Making B+-trees cache conscious in main memory

Libin

 

Qiyang

 

10/6

LSM-based Storage Techniques: A Survey

Shubham

 

10/11

October Break – No Class

 

 

10/13

a)     PushdownDB: Accelerating a DBMS using S3 computation

b)    Flexpushdowndb: Hybrid pushdown and caching in a cloud DBMS

Yunxin

 

Yunxin

 

10/18

a)     Kaskade: Graph Views for Efficient Graph Analytics

b)    Performance Tradeoffs in Read-Optimized Databases

Pavan

Darwin

 

10/20

Database Architecture Optimized for the New Bottleneck: Memory Access

Durga

 

10/25

Fast Scans and Joins Using Flash Drives

Abdullah

 

10/27

a)     Designing Access Methods: The RUM Conjecture

b)    Data morphing: an adaptive, cache-conscious storage technique

Lu

Akshay

 

11/1

Volcano-An Extensible and Parallel Query Evaluation System

Yeasir

 

11/3

a)     Bztree: A High-Performance Latch-Free Range Index for Non-Volatile Memory

b)    Efficient columnar storage in B-trees

Libin

 

Yunxin

 

11/8

The Cascades Framework for Query Optimization

Shubham

 

11/10

PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees

Qiyang

 

11/15

a)     Bridging the Gap Between Theory and Practice on Insertion-Intensive Database

b)    Read-Optimized Databases, In-Depth

Lu

 

Darwin

 

11/17

Query Processing Techniques for Solid State Drives

Abdullah

 

11/22

Everything you always wanted to know about compiled and vectorized queries but were afraid to ask

Durga

 

11/24

Thanksgiving Break – No Class

 

 

11/29

a)     Characterizing and Modeling Non-Volatile Memory Systems

b)    Cache Conscious Algorithms for Relational Query Processing

Pavan

 

Akshay

 

12/1

The end of an architectural era: it's time for a complete rewrite

Yeasir

 

12/5

Final Project Presentations (Monday 2-5PM)

 

 

12/6

a)     FPTree: A Hybrid SCM-DRAM Persistent and Concurrent B-Tree for Storage Class Memory

b)    Scaling concurrent log-structured data stores. 

Libin

 

Shubham

 

12/8

a)     The Art of Efficient In-memory Query Processing on NUMA Systems: a Systematic Approach

b)    Enabling NUMA-aware Main Memory Spatial Join Processing: An Experimental Study

Ruihong

 

 

Lu

 

12/12

Final Project Presentations (Monday 2-5PM)

See schedule below

 

12/13

Final Project Presentations

See schedule below

 

12/15

Final Project Presentations

See schedule below

 

12/17

Semester End

 

 

 

 

Project Assignments

Student

Project

 

Abdullah

Survey of Learned Multi-dimensional Indexes

 

Akshay

Survey of NoSQL Stores

 

Darwin

Survey of NoSQL Stores

 

Durga

Survey of Graph Data Engines

 

Libin

The BW-Graph: A Graph Structure for New Hardware Platforms

 

Lu

Studying NUMA-aware query processing techniques

 

Pavan

Studying the use of skiplists in big data systems

 

Qiyang

Survey of Learned Multi-dimensional Indexes

 

Ruihong

Studying NUMA-aware query processing techniques

 

Shubham

Studying LSM-based storage techniques in big data systems

 

Yeasir

Studying NUMA-aware query processing techniques

 

Yunxin

Survey of Cloud Database Techniques

 

 

 

Project Meetings Schedule (with Walid)

Project meetings take place every Monday over zoom from 2pm-6pm according to the following schedule:

 

 

 

Meetings

Students

Notes

Fast Track (Weekly)

Libin, Pavan, Yeasir, Yunxin

Biweekly students will present first, and then Fast track students next

Biweekly (Group 1)

Abdullah, Akshay, Darwin, Qiyang

Students with other time commitments will be given priority to present first

Biweekly (Group 2)

Durga, Lu, Ruihong,

Shubham

Attending and listening to other students’ weekly or biweekly presentations is an excellent opportunity for other students to learn about the subject as well as to provide feedback and participate in the discussions

 

Final Project Presentations Schedule

Mon (2-5pm)

Mon (2-5pm)

Tue (1:30-2:45)

Thur (1:30-2:45)

Dec. 5

Dec. 12

Dec. 13

Dec. 15

13:30

Yunxin

Darwin

14:00

Shubham

Yeasir

Abdullah

14:30

Libin

Akshay

Lu

15:00

15:30

Ruihong

16:00

Qiyang

16:30

Pavan

 

Paper Assignments:

Durga

 

32.  Original skip list paper: PUGH, W. Skip lists: A probabilistic alternative to balanced trees. Commun. ACM 33, 6 (1990), 668–676.

Durga

 

35.  *Kersten, Timo, et al. "Everything you always wanted to know about compiled and vectorized queries but were afraid to ask." Proceedings of the VLDB Endowment 11.13 (2018): 2209-2222.

Durga

 

47.  BonczManegold, Kersten. Database Architecture Optimized for the New Bottleneck: Memory Access.” VLDB 1999.

 

 

Pavan

 

da Trindade, J. M. F.; Karanasos, K.; Curino, C.; Madden, S.; and Shun, J., Kaskade: Graph Views for Efficient Graph Analytics (pdf).  In 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020, pages 193–204, 2020.

Pavan

 

C-store: a column-oriented DBMS. Making Databases Work: The Pragmatic Wisdom of Michael Stonebraker.

Pavan

Zixuan Wang, Xiao Liu, Jian Yang, Theodore Michailidis, Steven Swanson, and Jishen Zhao. 2020.

 

Characterizing and Modeling Non-Volatile Memory Systems. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).496–508. https://doi.org/10.1109/MICRO50266.2020.00049

Yeasir

 

60. Jaluta, Sippu, and Soisalon-soininen. Concurrency control and recovery for balanced Blink trees. The VLDB Journal, 14(2):257-277, 2005.

Yeasir

 

39.  *Goetz Graefe. Volcano-An Extensible and Parallel Query Evaluation System. IEEE Transactions on Knowledge and Data Engineering Vol. 6, No. 1, Feb. 1994.

Yeasir

 

48.  Stonebraker, M.; Madden, S.; Abadi, D. J.; Harizopoulos, S.; Hachem, N.; and Helland, P., The end of an architectural era: it's time for a complete rewrite (pdf).  In Brodie, M. L., editor(s), Making Databases Work: the Pragmatic Wisdom of Michael Stonebraker, pages 463–489. ACM / Morgan & Claypool, 2019.

Darwin

 

Holloway, DeWitt. Read-Optimized Databases, In-Depth, VLDB 2008.

Darwin

 

Harizopoulos, Liang, Abadi, Madden. Performance Tradeoffs in Read-Optimized Databases, VLDB 2006.

Darwin

 

 

 

Libin

 

*Shimin Chen and Qin Jin.2015. Persistent B+-Trees in Non-Volatile Main Memory. Proc. VLDB Endow. 8, 7 (Feb. 2015), 786–797. https://doi.org/10.14778/2752939. 2752947

 

Bztree: A High-Performance Latch-Free Range Index for Non-Volatile Memory.

Libin

 

Performant Almost-Latch-Free Data Structures Using Epoch Protection. 

 

Libin

 

FPTree: A Hybrid SCM-DRAM Persistent and Concurrent B-Tree for Storage Class Memory. 

Lu

 

Enabling NUMA-aware Main Memory Spatial Join Processing: An Experimental Study

 

Lu

 

Designing Access Methods: The RUM Conjecture

Lu

 

Bridging the Gap Between Theory and Practice on Insertion-Intensive Database

Shubham

 

*Chen Luo · Michael J. Carey. LSM-based Storage Techniques: A Survey

Shubham

 

*The Cascades Framework for Query Optimization

Shubham

Guy Golan-Gueta, Edward Bortnikov, Eshcar Hillel, and Idit Keidar. 2015.

 

Scaling concurrent log-structured data stores. 

Yunxin

 

Graefe. Efficient columnar storage in B-trees. SIGMOD Record 03/2007.

Yunxin

 

X Yu, M Youill, M Woicik, A Ghanem, M Serafini, A Aboulnaga, Michael StonebrakerPushdownDB: Accelerating a DBMS using S3 computation. 2020 IEEE 36th International Conference on Data Engineering (ICDE), 1802-1805, 2020.

Yunxin

 

Y Yang, M Youill, M Woicik, Y Liu, X Yu, M Serafini, A Aboulnaga, Michael StonebrakerFlexpushdowndb: Hybrid pushdown and caching in a cloud DBMS, Proceedings of the VLDB Endowment 14 (11), 2101-2113, 2021.

Qiyang

 

30. PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees

Qiyang

 

34. Write-optimized skip lists

Qiyang

 

*J.Rao and K.A.Ross. Making B+-trees cache conscious in main memory. In SIGMOD, 2000.

Akshay

 

The Apache Kafka Streaming Engine. https://kafka.apache.org/  2021.

Akshay

 

Hankins, Patel. Data morphing: an adaptive, cache-conscious storage technique. VLDB 2003.

Akshay

 

Shatdal, Kant, Naughton. Cache Conscious Algorithms for Relational Query Processing.” VLDB 1994

 

Ruihong

 

Scheduling Multithreaded Computations by Work Stealing

Ruihong

 

Dynamic circular work-stealing deque.

Ruihong

 

The Art of Efficient In-memory Query Processing on NUMA Systems: a Systematic Approach, ICDE 2020

Abdullah

 

11.  *G.Graefe, F.Halim, S.Idreos, H.A.Kuno, S.Manegold, and B.Seeger. Transactional support for adaptive indexing. VLDBJ, 23(2):303–328, 2014

Abdullah

 

Shah, Harizopoulos, Wiener, Graefe. Fast Scans and Joins Using Flash Drives. DaMoN 2008. 

Abdullah

 

58.  TsirogiannisHarizopoulos, Shah, Wiener, Graefe. Query Processing Techniques for Solid State Drives, SIGMOD 2009