III: Small: Native Compilation, Query Processing, and Indexing for In-memory Graph Relational Data Systems*

 

Contact Information

Walid G. Aref

Department of Computer Science

Purdue University

305 N. University Street

West Lafayette, Indiana 47907

Phone: (765) 494-1997

Fax : (765) 494-0739

Email: aref@cs.purdue.edu

URL: http://www.cs.purdue.edu/faculty/aref.html

 

This material is based upon work supported by the National Science Foundation under Grant No.  IIS-1910216.

 

Project Award Information

NSF Award Number: IIS-1910216

Duration: 8/1/2019 -- 7/31/2022

Title: Native Compilation, Query Processing, and Indexing for In-memory Graph Relational Data Systems

PI: Walid G. Aref

Co-PI: Tiark Rompf

Project Web Page: http://www.cs.purdue.edu/homes/aref/CGR

 

Project Focus:

A wide variety of applications spanning various domains have graphs as first-class citizens, e.g., communication networks, road networks, social networks, and biological networks. The nodes and edges of these graphs are often associated with descriptors, e.g., labels and properties, or more generally, attributes. Many of these applications need efficient and real-time processing of the graph data. Because relational data systems are very mature and ubiquitous, extending these systems to support graph data is a natural choice. However, there is an impedance mismatch between the relational model and the graph model at the various levels that makes extending relational systems to efficiently support graph data very challenging. This project will address this impedance mismatch and the hurdles that face graph applications that run over graph-enabled relational systems in order to function properly and efficiently. More specifically, this project will address the following research challenges: (1) The expressiveness challenge to address the mismatch between the declarative nature in querying relational data and the navigational nature in querying graph data, (2) the scalability challenge to support large amounts of graph and relational data and queries in real-time, and (3) the performance challenge to address the complexity in answering graph and relational queries and the real-time processing needs of graph applications. Addressing these challenges in the focus of this project.

 

This project addresses how to overcome the impedance mismatch between the relational and graph models by addressing the above challenges. Techniques are proposed to seamlessly and natively process large graph databases inside relational systems without negatively affecting the graph query performance. The techniques to be developed include: (1) Graph query compilation techniques: State-of-art query compilation mechanisms will be developed to mixes of graph and relational query evaluation pipelines to efficiently execute compiled query processing plans that include both graph and relational operators, (2) Graph-as-an-index: In-memory graph indexing techniques that will facilitate the navigation of the graph relational data using the graph topology. The graph indexes will efficiently support sub-graph selection based on the attribute data of both the graph nodes and edges, and performing graph operations on the selected sub-graphs. The techniques to be developed will support dynamic graphs where both the graph topology as well as the graph attributes can be updated. The introduced techniques will tolerate updates that would otherwise invalidate graph intermediate representations that are typically prepared offline to speedup graph query processing. (3) Native graph+relational query execution: Introduce graph navigation operators that operate over graph data in native mode, yet seamlessly integrate with relational algebra operators inside query evaluation pipelines. The developed query processing techniques will permit bidirectional navigation over the relational and the graph data within the same query evaluation pipeline to permit further query optimization strategies that are infeasible otherwise, and to efficiently evaluate interleaved graph and relational operators in query evaluation pipelines, and (4) Costing of the interleaved relational and graph operations for query optimization purposes.

 

2019-2020 Project Activities

2020-2021 Project Activities

2021-2022 Project Activities

 

2019-2020 Project Activities:

 

2019/2020: This year marks the first year of this project. We have studied several aspects of native compilation, query processing, and indexing for in-memory graph-relational data systems. These include compiling symbolic execution with staging and algebraic effects [OOPSLA 2020], cluster-based data partitioning and elastic scheduling [SIGMOD 2020], reinforcement-learning-based data partitioning in big data systems [aiDM 2020], shared execution for business data analytics to address scalability [SSDBM 2020], and Grid-enabled data structures for efficient query processing [SIGSPATIAL 2019].

 

Details about these research and education activities will be given below.

 

1. Prototype Realization of a cluster-based in-memory graph + relational engine.

 

The target is to realize a platform where we can investigate and test the scalability, query optimization techniques, native query processing of interleaved graph and relational predicates, query compilation of combined graph and relational queries, and graph-as-an-index. We built on top of GRFusion 1.0 that has been designed and prototyped by Walid’s group and has been demonstrated in SIGMOD 2018 and with a full paper presented in EDBT 2018. We ported GRFusion 1.0 into the most recent publicly available version of VoltDB. Then, we extended the GRFusion platform for use in conducting the project’s proposed research. More specifically, we prototyped the following features into GRFusion 2.0:

 

-       Extended SQL to support the notion of graph views that can handle graphs with multiple vertex and edge types.

-       Extended SQL to abstract graphs as streams of vertexes, edges, paths, and subgraphs

-       At the data organization and storage level, realized graph-as-an-index that represent a graph in native graph data structures

-       In addition to traditional index-to-data references (e.g., tuple identifiers that are stored in indexes to point to relational tables), in graph-as-an-index, we support two types of references:

1.     g2r references that are similar to the traditional tuple identifiers.

2.     r2g references that point from the relational side to the graph side

-       Using both reference types, one can navigate bidirectionally between the graph and the relational sides, and hence allow more flexibility during the query processing face. Mainly, in a query evaluation pipeline, one can interleave seamlessly graph and relational operators that can operate on their native environments.

 

-       Supported query operators that operate on the native graph structures and that can seamlessly interleave with the relational operators while making use of g2r and r2g pointers.

 

-       Provided a programming interface and an API to define graph operations and then offer these operations to be declaratively queried as hints from within SQL while using relational predicates as pre- or post-filters (i.e., to be applied before or after performing the graph operations) that can facilitate the graph operations

 

-       Extended SQL to declaratively support applying complex graph functions using the concept of “hints”

 

-       Extended SQL to support graph-to-graph operations. GRFusion 1.0 supported only graph-to-relational operators. In GRFusion 2.0, we can use the extended SQL to filter graphs and produce new subgraphs in a declarative fashion.

A journal version that reflects all the above extensions to GRFusion 2.0 along with their performance evaluation is in preparation.

 

Currently, we are working on the following several fronts to extend GRFusion 2.0.

-       Supporting concurrency control on memory-based graph indexes

-       Supporting graph query compilation from within GRFusion

-       Supporting graph query optimization and developing new optimization heuristics for queries with interleaved graph and relational predicates

-       Supporting efficient graph pattern-matching query predicates

 

2. Compiling Symbolic Execution with Staging and Algebraic Effects [OOPSLA 2020] Building effective symbolic execution engines poses challenges in multiple dimensions: an engine must correctly model the program semantics, provide flexibility in symbolic execution strategies, and execute them efficiently. In this research, we propose a principled approach to building correct, flexible, and efficient symbolic execution engines, directly rooted in the semantics of the underlying language in terms of a high-level definitional interpreter. The definitional interpreter induces algebraic effects to abstract over semantic variants of symbolic execution, e.g., collecting path conditions as a state effect and path exploration as a nondeterminism effect. Different handlers of these effects give rise to different symbolic execution strategies, making execution strategies orthogonal to the symbolic execution semantics, thus improving flexibility. Furthermore, by annotating the symbolic definitional interpreter with binding-times and specializing it to the input program via the first Futamura projection, we obtain a symbolic compiler, generating efficient instrumented code having the symbolic execution semantics.

 

3. Shared Execution Techniques for Business Data Analytics over Big Data Streams. [SSDBM 2020] Data Analytics require processing of large numbers of data streams and create materialized views in order to provide near real-time answers to user queries. Materializing the view of each query and refreshing it continuously as a separate query execution plan is not efficient and is not scalable. In this research, we present a global query execution plan to simultaneously support multiple queries, and minimize the number of input scans, operators, and tuples flowing between the operators. We propose shared-execution techniques for creating and maintaining materialized views in support of business data analytics queries as an example. We utilize commonalities in multiple business data analytics queries to support scalable and efficient processing of big data streams. We analyze the cost and elasticity of various shared-execution query-processing techniques in a distributed environment. This research highlights shared execution techniques for select predicates, group, and aggregate calculations. We present how global query execution plans are run in a distributed stream processing system that is built on top of cluster-based data streaming engine.

 

4. Prompt: Dynamic Data-Partitioning for Distributed Micro-batch Stream Processing Systems. [SIGMOD 2020] Micro-batching has been proposed to support real-world applications that require high-throughput processing over data streams. In micro-batching, the processing and batching of data are interleaved. The incoming data is buffered as blocks, and then is processed using parallel constructs, e.g., Map-Reduce. A micro-batch size is set to guarantee a certain response-time latency mandated by the application. Existing micro-batch stream processing systems use basic data-partitioning techniques that do not handle data skew and variable data rates. Load-awareness is necessary to maintain performance and to enhance resource utilization. Prompt is a new data partitioning scheme that leverages the characteristics of the micro-batching model. In the batching phase, a frequency-aware buffering mechanism is introduced that progressively maintains run-time statistics, and provides on-line key sorting as tuples arrive. Optimal data partitioning is NP-Hard in this context. We introduce a workload-aware greedy algorithm to partition the buffered data efficiently in the Map stage. In the processing phase, a load-aware distribution mechanism is presented that balances the size of the input to the Reduce stage without incurring inter-task communication overhead. Moreover, Prompt elastically adapts resource consumption according to workload changes.

 

5. PartLy: Learning Data Partitioning for Distributed Data Stream Processing [aiDM 2020] In this research, we expand on our well-optimized Prompt online data partitioning mechanism by using reinforcement learning to learn proper data partitioning in a micro-batched data streaming setup. We realize that data partitioning plays a critical role in data stream processing. Current data partitioning techniques use simple, static heuristics that do not incorporate feedback about the quality of the partitioning decision (i.e., fire and forget strategy). Hence, the data partitioner often repeatedly chooses the same decision. In this paper, we argue that reinforcement learning techniques can be applied to address this problem. The use of artificial neural networks can facilitate learning of efficient partitioning policies. We identify the challenges that emerge when applying machine learning techniques to the data partitioning problem for distributed data stream processing. Furthermore, we introduce PartLy, a proof-of-concept data partitioner, and present preliminary results that indicate PartLy’s potential to match the performance of state-of-the-art techniques in terms of partitioning quality, while minimizing storage and processing overheads.

 

6. Local Trend Discovery on Real-time Microblogs with Uncertain Locations in Tight Memory Environments. [GeoInformatica 2019] In this research, we developed GeoTrend+; a system approach to support scalable local trend discovery on recent microblogs, e.g., tweets, comments, online reviews, and check-ins, that come in real time. GeoTrend+ discovers top-k trending keywords in arbitrary spatial regions from recent microblogs that continuously arrive with high rates and a significant portion has uncertain geolocations. GeoTrend+ distinguishes itself from existing techniques in different aspects: (1) Discovering trends in arbitrary spatial regions, e.g., city blocks. (2) Considering both exact geolocations, e.g., accurate latitude/longitude coordinates, and uncertain geolocations, e.g., district-level or city-level, that represents a significant portion of past years microblogs. (3) Promoting recent microblogs as first-class citizens and optimizes different components to digest a continuous flow of fast data in main-memory while removing old data efficiently. (4) Providing various main-memory optimization techniques that are able to distinguish useful from useless data to effectively utilize tight memory resources while maintaining accurate query results on relatively large amounts of data. (5) Supporting various trending measures that effectively capture trending items under a variety of definitions that suit different applications. GeoTrend+ limits its scope to real-time data that is posted during the last T time units. To support its queries efficiently, GeoTrend+ employs an in-memory spatial index that is able to efficiently digest incoming data and expire data that is beyond the last T time units. The index also materializes top-k keywords in different spatial regions so that incoming queries can be processed with low latency. In peak times, the main-memory optimization techniques are employed to shed less important data to sustain high query accuracy with limited memory resources.

 

7. STAR: A Distributed Stream Warehouse System for Spatial Data. [SIGMOD 2020 Demo] Location services produce large streams of spatial and textual data. To enable spatial and textual data analytics, spatial and textual data is streamed into a data warehouse system that provides online analytics over the incomingl data. A spatial data stream warehouse system (DSWS) should efficiently ingest the incoming data, and enable analytical processing over the streamed data. Existing DSWSs are not tailored for spatial data. We introduce STAR; a distributed in-memory spatial data stream warehouse system. STAR provides low-latency and online analytics over a fast-arriving spatial data stream. STAR supports both snapshot and continuous queries that handle aggregate functions and ad hoc query constraints over spatial, textual, and temporal data attributes. STAR offers a view materialization algorithm that facilitates analytical processing over the streamed data.

 

8. SSTD: A Distributed System on Streaming Spatio-Textual Data [PVLDB 2020] Streaming spatio-textual data that contains geolocations and textual contents, e.g., geo-tagged tweets, is becoming increasingly available. Users can register continuous queries to receive up-to-date results continuously, or pose snapshot queries to receive results instantly. The large scale of spatio-textual data streams and huge amounts of queries pose great challenges to the current location-based services, and call for more efficient data management systems. In this research, we realize SSTD Streaming Spatio-Textual Data, a distributed in-memory system supporting both continuous and snapshot queries with spatial, textual, and temporal constraints over data streams. Compared with existing distributed data stream management systems, SSTD has three novel aspects: (1) It supports many types of queries over streamed spatio-textual data; (2) SSTD adopts a new workload partitioning method, termed QT (Quad-Text) tree, that utilizes the joint distribution of queries and spatio-textual data to reduce query latency and enhance system throughput. (3) To achieve load balance and robustness, we develop three new workload adjustment methods for SSTD to fit the changes in the distributions of data or queries. Extensive experiments on real-life datasets demonstrate the superior performance of SSTD.

 

9. An Investigation of Grid-enabled Tree Indexes for Spatial Query Processing. [ACM SIGSPATIAL 2019] Two-dimensional tree-based spatial indexes (e.g., the quadtree or the k-d tree) are commonly used for indexing spatial data. However, both types of spatial indexes have limitations. Although two-dimensional trees can handle skewed data, index traversal and tree maintenance can be expensive. In contrast, spatial grids have low update overhead, but is not suitable for skewed data. In this research, we investigate the augmentation of a grid into tree-based indexing for spatial query processing. For this purpose, we introduce the Grid-Enabled Tree index (GE-Tree, for short); a hybrid spatial tree structure that augments a spatial grid to two-dimensional tree indexes. In particular, we investigate the use of a grid at the leaf level of a two-dimensional tree to facilitate tree navigation and maintenance.

 

Training and Professional Development

In the context of this project, the PIs have offered research opportunities during the 2019/2020 academic year for both graduate and undergraduate students.

 

Graduate Research Opportunities: In the first-level graduate database systems course (CS-54100), and given the COVID-19 situation and the need to offer courses and project online, the PI has worked to modify the project component of CS54100. Traditionally, students would be involved in a semester-long project to realize the engine of a relational database system layer at a time; starting from the SQL parser all the way down to the disk, buffer, table, and index layers, passing by the query processor layer, and the query optimization and query rewrite layers. The nature of this project mandates a heavy lab component and heavy interaction with the teaching assistants and the course instructor.

 

Given the COVID situation, the PI has introduced a research-oriented project instead of the traditional course project. The students have been divided into groups of three (for a total of thirteen groups), and they select from a list of projects that the PI has provided. Students meet the PI biweekly over Zoom to present their findings in the project of choice via slides that they prepare, receive feedback, and go back to continue with their project. Projects span from a survey-style project to a systems-oriented project. In a systems-oriented project, typically, the students would be given a research paper as a starter to understand it, present it, and then realize it to reproduce the results in the paper. Then, they extend on the techniques in the paper as per the project’s instructions and the feedback they get from the PI. In addition to a final project presentation, the students submit a formal technical report explaining their findings.

 

Two of the groups have picked to realize a graph query compilation project, where one of the groups is now preparing to submit their results to a conference (still in preparation). They have compared their graph query compilation results against GRFusion and were able to significantly outperform GRFusion.

 

Other Graduate Research Training: For graduate students and outside of the scope of the graduate-level database systems course (CS54100), during the academic year 2019/2020, the PIs have offered several research training opportunities as graduate-level PhD research and independent study courses that involve various project-related topics, where several students got training on presenting and discussing research papers and in conducting a semester-long project. These students include Amira Mamoun (Dataset Discovery Techniques), Jaewoo Shin (Update-tolerant LSM-based Spatial Indexing), Abdullah Al Mamun (Learned Spatial Indexing), Ahmed Abdelhamid (Intelligent Data Partitioning in Micro-batched Big Data Streaming Systems), Lu Xing (Concurrency Control in Graph Data Management Systems), and Ruihong Wang (RDMA-based Big Data Systems).

 

Undergraduate Research Opportunities: The PIs have offered research opportunities for the following five undergraduate students (Two female and three male students):

 

Piyush Juneja (Graph Partitioning Algorithms), Nameer A. Qureshi (Big Spatial Data Systems), Dhanushikka Ravichandiran (Database Concurrency Control Algorithms), Hao Wu (ML-based Learned Data Indexing), Shotobhisha Sinha Ray (Big Data Systems).

 

The students worked in four groups:

 

- Group 1: Piyush Juneja worked in surveying graph partitioning algorithms.

 

- Group 2: Nameer Qureshi developed a web site for all spatio-temporal access methods and their corresponding published papers.

 

- Group 3: Shotobhisha Sinha Ray and Dhanushikka Ravichandiran studied and surveyed various Database Concurrency Control Algorithms.

 

- Group 4: Hao We surveyed the topic of learned indexes with focus on the multi-dimensional case along with one Ph.D. from Walid’s group.

 

Of mention is the survey on multi-dimensional learned indexes which has been accepted as a tutorial in the ACM SIGSPATIAL 2020 Conference. Both Hao Wu (Undergraduate) and Abdullah Al Mamun (Ph.D. student) will be presenting this tutorial along with Walid at the conference.

 

 

2020-2021 Project Activities:

In 2020/2021, we had several research and education activities for this project. In the context of query compilation for graph database query, we have studied the approach of tracking reachability sets in types to enable ownership-style reasoning for higher-order functional programs [OOPSLA 2021]. We introduced LLSC, a prototype compiler for nondeterministic parallel symbolic execution of the LLVM intermediate representation (IR) [ESEC/FSE 2021]. Walid participated in developing a visionary community view on graph processing systems [ACM CACM 2021]. We have developed Guard, an attack-resilient adaptive load-balancing in distributed streaming systems [IEEE Transactions on Dependable and Secure Computing 2021]. We have developed SWARM, a light-weight adaptivity protocol that continuously monitors location data and query workloads in distributed streaming systems [ACM Transactions of Spatial Algorithms and Systems 2021]. We developed an unbiased online sampling technique for the visualization of large spatiotemporal data [IEEE VAST 2020]. We have developed an LSM-based R-tree with update memos to support frequent updates in moving-object databases [IEEE ICDE 2021]. We developed an online workload estimation technique that relies on a probabilistic model for estimating the workload of partitions and machines in a distributed spatial data streaming system [ACM SIGSPATIAL 2020b]. We have developed and studied the performance of LocationSpark, a Spark-based query executor and optimizer to improve the query execution plan generated for spatial queries [Frontiers 2020]. We presented a tutorial on the subject of learned multi-dimensional indexes [ACM SIGSPATIAL 2020a].

 

Walid provided research training in a graduate-level database systems course at Purdue that has benefitted over 35 students, and trained seven other Ph.D. students in Walid’s group in project-related research He mentored an undergraduate student to conduct undergraduate research in topics related to this project. One Ph.D. student graduated in the topic of “Efficient Distributed Processing over Micro-Batched Data Streams” who was partially supported by this grant. Walid gave invited lectures on results of research developed under this grant. Below, we highlight each of these contributions and other project’s research, education, and training activities.

 

10. Reachability Types: Tracking Aliasing and Separation in Higher-Order Functional Programs [OOPSLA 2021] Ownership type systems, based on the idea of enforcing unique access paths, have been primarily focused on objects and top-level classes. However, existing models do not as readily reflect the finer aspects of nested lexical scopes, capturing, or escaping closures in higher-order functional programming patterns, which are increasingly adopted even in mainstream object-oriented languages. We present a new type system, 𝜆∗, which enables expressive ownership-style reasoning across higher-order functions. It tracks sharing and separation through reachability sets, and layers additional mechanisms for selectively enforcing uniqueness on top of it. Based on reachability sets, we extend the type system with an expressive flow-sensitive effect system, which enables flavors of move semantics and ownership transfer. In addition, we present several case studies and extensions, including applications to capabilities for algebraic effects, one-shot continuations, and safe parallelization. 

 

11. LLSC: A Parallel Symbolic Execution Compiler for LLVM IR [ESEC/FSE 2021] We present LLSC, a prototype compiler for nondeterministic parallel symbolic execution of the LLVM intermediate representation (IR). Given an LLVM IR program, LLSC generates code preserving the symbolic execution semantics and orchestrating solver invocations. The generated code runs efficiently, since the code has eliminated the interpretation overhead and explores multiple paths in parallel. To the best of our knowledge, LLSC is the first compiler for fork-based symbolic execution semantics that can generate parallel execution code. In this demonstration paper, we present the current development and preliminary evaluation of LLSC. The principle behind LLSC is to automatically specialize a symbolic interpreter via the 1st Futamura projection, a fundamental connection between interpreters and compilers. The symbolic interpreter is written in an expressive high-level language equipped with a multi-stage programming facility. We demonstrate the run time performance through a set of benchmark programs, showing that LLSC outperforms interpretation-based symbolic execution engines in significant ways. 

 

12. Guard: Attack-Resilient Adaptive Load Balancing in Distributed Streaming Systems [IEEE TDSC 2021] The performance of distributed streaming systems relies on how even the workload is distributed among their machines. However, data and query workloads are skewed and change rapidly. Therefore, multiple adaptive load-balancing mechanisms have been proposed in the literature to rebalance distributed streaming systems according to the changes in their workloads. This paper introduces a novel attack model that targets adaptive load-balancing mechanisms of distributed streaming systems. The attack reduces the throughput and the availability of the system by making it stay in a continuous state of rebalancing. In this research, we develop Guard, a component that detects and blocks attacks that target the adaptive load balancing of distributed streaming systems. Guard uses an unsupervised machine-learning technique to detect malicious users that are involved in the attack. Guard does not block any user unless it detects that the user is malicious. Guard does not depend on a specific application. 

 

13. SWARM: Adaptive Load Balancing in Distributed Streaming Systems for Big Spatial Data. [ACM TSAS 2021] As clearly from the scale of location services, their ubiquity, and the massive amounts of spatial data being generated in real-time, the current scale of spatial data cannot be handled using centralized systems. This has led to the development of distributed spatial streaming systems. Existing systems use static spatial partitioning to distribute the workload. In contrast, the real-time streamed spatial data follows non-uniform spatial distributions that are continuously changing over time. Distributed spatial streaming systems need to react to the changes in the distribution of spatial data and queries. This research introduces SWARM, a light-weight adaptivity protocol that continuously monitors the data and query workloads across the distributed processes of the spatial data streaming system, and redistribute and rebalance the workloads soon as performance bottlenecks get detected. SWARM is able to handle multiple query-execution and data-persistence models. A distributed streaming system can directly use SWARM to adaptively rebalance the system's workload among its machines with minimal changes to the original code of the underlying spatial application. 

 

14. A Tutorial on Learned Multi-dimensional Indexes [ACM SIGSPATIAL 2020] Recently, Machine Learning (ML, for short) has been successfully applied to database indexing. Initial experimentation on Learned Indexes has demonstrated better search performance and lower space requirements than their traditional database counterparts. Numerous attempts have been explored to extend learned indexes to the multi-dimensional space. This makes learned indexes potentially suitable for spatial databases. The goal of this tutorial is to provide up-to-date coverage of learned indexes both in the single and multi-dimensional spaces. The tutorial covers over 25 learned indexes. The tutorial navigates through the space of learned indexes through a taxonomy that helps classify the covered learned indexes both in the single and multi-dimensional spaces. 

 

15. Scalable Relational Query Processing on Big Matrix Data [Submitted 2021] Graphs can be represented by matrixes, where graph operations can be represented by operations over matrix data. The use of large-scale machine learning methods is becoming ubiquitous in many applications ranging from business intelligence to self-driving cars. These methods require a complex computation pipeline consisting of various types of operations, e.g., relational operations for pre-processing or post-processing the dataset, and matrix operations for core model computations. Many existing systems focus on efficiently processing matrix-only operations, and assume that the inputs to the relational operators are already pre-computed and are materialized as intermediate matrices. However, the input to a relational operator may be complex in machine learning pipelines, and may involve various combinations of matrix operators. Hence, it is critical to realize scalable and efficient relational query processors that directly operate on big matrix data. This research develops new efficient and scalable relational query processing techniques on big matrix data for in- memory distributed clusters. The proposed techniques leverage algebraic transformation rules to rewrite query execution plans into ones with lower computation costs. A distributed query plan optimizer exploits the sparsity-inducing property of merge functions as well as Bloom join strategies for efficiently evaluating various flavors of the join operation. Furthermore, optimized partitioning schemes for the input matrices are developed to facilitate the performance of join operations based on a cost model that minimizes the communication overhead. The proposed relational query processing techniques are prototyped in Apache Spark. 

 

16. The LSM RUM-Tree: A Log Structured Merge R-Tree for Update-intensive Spatial Workloads [ICDE 2021] Many applications require update-intensive workloads on spatial objects, e.g., social-network services and shared-riding services that track moving objects (devices). By buffering insert and delete operations in memory, the Log Structured Merge Tree (LSM) has been used widely in various systems because of its ability to handle insert-intensive workloads. While the focus on LSM has been on key-value stores and their optimizations, there is a need to study how to efficiently support LSM-based secondary indexes. We investigate the augmentation of a main-memory-based memo structure into an LSM secondary index structure to handle update-intensive workloads efficiently. We conduct this study in the context of an R-tree-based secondary index. In particular, we introduce the LSM RUM-tree that demonstrates the use of an Update Memo in an LSM-based R-tree to enhance the performance of the R-tree’s insert, delete, update, and search operations. The LSM RUM-tree introduces novel strategies to reduce the size of the Update Memo to be a light-weight in-memory structure that is suitable for handling update-intensive workloads without introducing significant overhead. 

 

17. TrioStat: Online Workload Estimation in Distributed Spatial Data Streaming Systems [ACM SIGSPATIAL 2020] The wide spread of GPS-enabled devices and the Internet of Things (IoT) has increased the amount of spatial data being generated every second. The current scale of spatial data cannot be handled using centralized systems. This has led to the development of distributed spatial data streaming systems that scale to process in real-time large amounts of streamed spatial data. The performance of distributed streaming systems relies on how even the workload is distributed among their machines. However, it is challenging to estimate the workload of each machine because spatial data and query streams are skewed and rapidly change with time and users’ interests. Moreover, a distributed spatial streaming system often does not maintain a global system workload state because it requires high network and processing overheads to be collected from the machines in the system. In this research, we introduce TrioStat; an online workload estimation technique that relies on a probabilistic model for estimating the workload of partitions and machines in a distributed spatial data streaming system. It is infeasible to collect and exchange statistics with a centralized unit because it requires high network overhead. Instead, TrioStat uses a decentralized technique to collect and maintain the required statistics in real-time locally in each machine. TrioStat enables distributed spatial data streaming systems to com- pare the workloads of machines as well as the workloads of data partitions. TrioStat requires minimal network and storage overhead. Moreover, the required storage is distributed across the system’s machines. 

 

18. STULL: Unbiased Online Sampling for Visual Exploration of Large Spatiotemporal Data [VAST 2020] Online sampling-supported visual analytics is increasingly important, as it allows users to explore large datasets with acceptable approximate answers at interactive rates. However, existing online spatiotemporal sampling techniques are often biased, as most researchers have primarily focused on reducing computational latency. Biased sampling approaches select data with unequal probabilities and produce results that do not match the exact data distribution, leading end users to incorrect interpretations. In this research, we propose a novel approach to perform unbiased online sampling of large spatiotemporal data. The proposed approach ensures the same prob- ability of selection to every point that qualifies the specifications of a user’s multidimensional query. To achieve unbiased sampling for accurate representative interactive visualizations, we design a novel data index and an associated sample retrieval plan. Our proposed sampling approach is suitable for a wide variety of visual analytics tasks, e.g., tasks that run aggregate queries of spatiotemporal data. Extensive experiments confirm the superiority of our approach over a state-of-the-art spatial online sampling technique, demonstrating that within the same computational time, data samples generated in our approach are at least 50% more accurate in representing the actual spatial distribution of the data and enable approximate visualizations to present closer visual appearances to the exact ones. 

 

19. LocationSpark: In-memory Distributed Spatial Query Processing and Optimization [Frontiers 2020] Due to the ubiquity of spatial data applications and the large amounts of spatial data that these applications generate and process, there is a pressing need for scalable spatial query processing. In this research, we develop new techniques for spatial query processing and optimization in an in-memory and distributed setup to address scalability. More specifically, we introduce new techniques for handling query skew that commonly happens in practice, and minimizes communication costs accordingly. We propose a distributed query scheduler that uses a new cost model to minimize the cost of spatial query processing. The scheduler generates query execution plans that minimize the effect of query skew. The query scheduler utilizes new spatial indexing techniques based on bitmap filters to forward queries to the appropriate local nodes. Each local computation node is responsible for optimizing and selecting its best local query execution plan based on the indexes and the nature of the spatial queries in that node. All the proposed spatial query processing and optimization techniques are prototyped inside Spark, a distributed memory-based computation system. Our prototype system is termed LocationSpark. The experimental study is based on real datasets and demonstrates that LocationSpark can enhance distributed spatial query processing by up to an order of magnitude over existing in-memory and distributed spatial systems.

 

Training and Professional Development

In the context of this project, the PIs have offered research opportunities during the 2020/2021 academic year for both graduate and undergraduate students.

 

Graduate Research Opportunities:

One Ph.D. student, Ahmed S. Abdelhamid, has graduated who was partially funded under this project. His dissertation topic is: Efficient Distributed Processing over Micro-Batched Data Streams. Student Nameer Qureshi took a graduate independent study course on topics related to this project. 

In Fall 2020, Walid revised the project component of the graduate-level database systems course (CS541) to allow for new research opportunities and training of graduate students be semester-long group research projects of three students each. Each group agrees on a project from a list of potential projects that Walid provided to them (List is given below). Projects vary in nature to match the various orientations of the students. Some projects are survey-oriented while others are research-oriented with heavy programming components. Projects cover a broad spectrum of research topics. 39 students benefitted from this opportunity and formed 13 group projects that met with Walid on weekly and biweekly bases. Two of these projects have resulted in submitted conference papers that are under review, and another paper is currently being prepared. We had some collaborators from industrial companies, e.g., Google, Uber, and Facebook, to help collaborate in the projects, and provide the students an industrial twist. Student feedback was positive as many of them appreciated this research experience and the close follow-up and feedback Walid provided on their biweekly or weekly progressions. 

Projects provided included:  (1) A survey of 50 years of database models, (2) Studying and comparing the performance of the disk page and record layouts of PostgreSQL, MySQL, and SQLite, (3) Studying and comparing the performance of the table directory structures and free-space management for PostgreSQL, MySQL, and SQLite, (4) Surveying of SSD and persistent memory storage technologies, their properties, and how they impact on query processing techniques, index design, concurrency, and recovery, (5) Query compilation techniques for graph data systems, (6) Handling bulk-loading anomalies in spatial indexing techniques (is leading to a conference paper publication), (7) Realizing an updatable compressed bitmap index, (8) Implementing different concurrency control protocols in RDMA and evaluating their tradeoffs, (9) Studying RDMA-based distributed transaction management and realizing an RDMA-based two phase commit protocol, (10), Surveying multi-model database system techniques, (11) Studying and comparing techniques for supporting high-dimensional vector databases, (12) Decoupling the storage engine from the compute engine in MySQL, (13) Realizing a new algorithm for answering k-nearest-neighbor queries based on a newly published external merge sort in spatial data systems (is leading to a conference publication), and (14) Designing and implementing various memory-based indexing techniques over RDMA that support concurrency.

There were 39 students in class that formed 13 groups. Walid met biweekly with every group for 30 minutes each over Zoom. Some groups wanted to be more productive by meeting weekly in contrast to biweekly and Walid accommodated that. The students demonstrated their weekly progress in the project in the form of slide-show presentations that the students have prepared for every meeting, and/or if need be, showed program code or performance studies and results.

For graduate students, in Fall 2020 and Spring 2021, Walid has offered several research training opportunities as graduate-level PhD research and independent study courses that involve various project-related topics, where several students got training on presenting and discussing research papers and in conducting a semester-long project. These students include Jaewoo Shin (Update-tolerant LSM-based Spatial Indexing – resulted in an ICDE 2021 paper), Abdullah Al Mamun (Learned Spatial Indexing – resulted in a conference paper submission), Ahmed Abdelhamid (Intelligent Data Partitioning in Micro-batched Big Data Streaming Systems – resulted in a conference paper submission), Lu Xing (two projects: Concurrency Control in Graph Data Management Systems, and Waves of Misery in R-trees), Ruihong Wang (RDMA-based Big Data Systems – resulted in a conference paper submission), Libin Zhou (Distance Oracles for dynamic query constraints), and Yeasir Rayhan (Learned spatial data partitioning).

 

Undergraduate Research Opportunities:

In the context of this project, Co-PI Tiark has offered research opportunities for Undergraduate Student Shangyin Tan. Tan got training in the context of this project where he contributed to the design and implementation of a symbolic execution compiler using multistage programming and algebraic effects. He was a co-author in the ESEC/FSE ’21 and the OOPSLA 2020 papers.

Walid has offered research opportunities during Summer and Fall 2020 for undergraduate students. Most notably is Undergraduate Student Hao Wu who was interested in the topic of ML-based Learned Multidimensional Data Indexing. His research in the topic materialized into a co-authored tutorial in the ACM SIGSPATIAL 2020 Conference, where co-authors Hao Wu and Graduate Student Abdullah Al-Mamun participated in the presentation of the tutorial at the conference. A survey article on the topic is currently underway.

 

2021-2022 Project Activities

 

In 2021/2022, we performed important research and education activities for this project. In the context of data streaming, in our SIGMOD 2022 paper, we have studied incrementalization of nested queries to achieve low latency using novel Relative Partial Aggregate Indexes. We investigated the presence of non-deterministic performance and “waves of misery” in update-intensive workloads over multi-dimensional data structures that store graphs. We studied how to mitigate this issue. This research was published in Proceedings of the VLDB 2021. In an ACM SIGSPATIAL 2021 conference paper, we presented STAR, a distributed in-memory data stream warehouse system, that provides low-latency and up-to-date analytical results over a fast-arriving spatial data stream. STAR was selected as one of the best papers in the Conference, and was invited in an extended form for journal publication. In our ECOOP 2022 paper, we introduce a new type system that features a stack delayed-popping strategy hence allowing variable-size data, e.g., closures, graph reachability results, to be returned from function calls as stack-allocated data from a callee. In our VLDB 2022 vision paper, we studied the impact of new hardware architectures on the design of database systems and separating memory from compute in a disaggregated setup would impact database system techniques. We studied learned multi-dimensional indexes in the context of an instance-optimized R-tree that was published in the IEEE Mobile Data Management Conference 2022.

 

20. Efficient Incrementialization of Correlated Nested Aggregate Queries using Relative Partial Aggregate Indexes [SIGMOD 2022]

Incrementalization of queries is imperative in cases where data arrives as streams and output is latency-critical and/or desired before the full data has been received. Incremental execution computes the output at a given time by reusing the previously computed outputs or maintained views rather than re-evaluating the query from scratch. There are various approaches to perform this incrementalization ranging from query-specific algorithms and data structures (e.g., DYN, AJU) to general systems (e.g., DBToaster, Materialize). DBToaster is a state-of-the-art system that comes with an appealing theoretical background based on the idea of applying Incremental View Maintenance (IVM) recursively, maintaining a hierarchy of materialized views via delta queries. However, one key limitation of this approach is its inability to efficiently incrementalize correlated nested-aggregate queries due to an inefficient delta rule for such queries. Moreover, none of the other specialized approaches have shown efficient ways to optimize such queries either. Nonetheless, these types of queries can be found in many real-world application domains (e.g., finance), for which efficient incrementalization re- mains a crucial open problem. In this work, we propose an approach to incrementalize such queries based on a novel tree-based index structure called Relative Partial Aggregate Indexes (RPAI). Our approach is asymptotically faster than other systems and shows up to 1100× speedups in workloads of practical importance. 

 

21. What If We Don’t Pop the Stack? The Return of 2nd-Class Values [ECOOP 2022]

Using a stack for managing the local state of procedures as popularized by Algol is a simple but effective way to achieve a primitive form of automatic memory management. Hence, the call stack remains the backbone of most programming language runtimes to the present day. However, the appealing simplicity of the call stack model comes at the price of strictly enforced limitations: since every function return pops the stack, it is difficult to return stack-allocated data from a callee upwards to its caller – especially variable-size data such as closures. This research introduces a solution by introducing a small tweak to the usual stack semantics. We design a type system that tracks the underlying storage mode of values, and when a function returns a stack-allocated value, we just don’t pop the stack! Instead, the stack frame is de-allocated together with a parent the next time a heap-allocated value or primitive is returned. We identify a range of use cases where this delayed-popping strategy is beneficial, ranging from closures to trait objects to other types of variable-size data. Our evaluation shows that this execution model reduces heap and GC pressure and recovers spatial locality of programs improving execution time between 10% and 25% with respect to standard execution. 

 

22. Achieving Deterministic Performance for Multi-dimensional Data Structures [VLDB 2021]

Having deterministic performance for graph data systems and data systems in general, is very important. This is especially true for data systems that support navigational services in real time over road network graphs. Waves of misery is a phenomenon where spikes of many node splits occur over short periods of time in tree indexes. Waves of misery negatively affect the performance of tree indexes in insertion-heavy workloads. Waves of misery have been first observed in the context of the B-tree, where these waves cause unpredictable index performance. In particular, the performance of search and index-update operations deteriorate when a wave of misery takes place, but is more predictable between the waves. This research investigates the presence or lack of waves of misery in multi-dimensional structures, especially ones that can embed road network graphs. Our focus here is on R-tree variants, where we study the extent of which these waves impact the performance of each R-tree variant. Interestingly, although having poorer query performance, the Linear and Quadratic R-trees are found to be more resilient to waves of misery than both the Hilbert and R*-trees. This research introduces several techniques to reduce the impact in performance of the waves of misery for the Hilbert and R*-trees. One way to eliminate waves of misery is to force node splits to take place at regular times before nodes become full to achieve deterministic performance. The other way is that upon splitting a node, do not split it evenly but rather at different node utilization factors. This allows leaf nodes not to fill at the same pace. We study the impact of two new techniques to mitigate waves of misery after the tree index has been constructed, namely Regular Elective Splits (RES, for short) and Unequal Random Splits (URS, for short). Our experimental investigation highlights the trade-offs in performance of the introduced techniques and the pros and cons of each technique.

 

23. STAR: A Cache-based Stream Warehouse System for Spatial Data [ACM SIGSPATIAL 2021]

The proliferation of mobile phones and location-based services has given rise to an explosive growth in spatial data. In order to enable spatial data analytics, spatial data needs to be streamed into a data stream warehouse system that can provide real-time analytical results over the most recent and historical spatial data in the warehouse. Existing data stream warehouse systems are not tailored for spatial data. In this paper, we introduce the STAR system. STAR is a distributed in-memory data stream warehouse system that provides low-latency and up-to-date analytical results over a fast-arriving spatial data stream. STAR supports both snapshot and continuous queries that are composed of aggregate functions and ad hoc query constraints over spatial, textual, and temporal data attributes. STAR implements a cache-based mechanism to facilitate the processing of snapshot queries that collectively utilizes the techniques of query-based caching (i.e., view materialization) and object-based caching. Moreover, to speed-up processing continuous queries, STAR proposes a novel index structure that achieves high efficiency in both object checking and result updating. Extensive experiments over real data sets demonstrate the superior performance of STAR over existing systems.

 

24. The “AI+R”-tree: An Instance-optimized Learned R-tree [IEEE MDM 2022]

The emerging class of instance-optimized systems has shown potential to achieve high performance by specializing to a specific data and query workloads. Particularly, Machine Learning (ML) techniques have been applied successfully to build various instance-optimized components (e.g., learned indexes). This research leverages ML techniques to enhance the performance of spatial indexes, particularly the R-tree, for a given data and query workloads. As the areas covered by the R-tree index nodes overlap in space, upon searching for a specific point in space, multiple paths from root to leaf may potentially be explored. In the worst case, the entire R-tree could be searched. We define and use the overlap ratio to quantify the degree of extraneous leaf node accesses required by a range query. The goal is to enhance the query performance of a traditional R-tree for high-overlap range queries as they tend to incur long running-times. We introduce a new AI-tree that transforms the search operation of an R-tree into a multi-label classification task to exclude the extraneous leaf node accesses. Then, we augment a traditional R-tree to the AI-tree to form a hybrid “AI+R”- tree. The “AI+R”-tree can automatically differentiate between the high- and low-overlap queries using a learned model. Thus, the “AI+R”-tree processes high-overlap queries using the AI- tree, and the low-overlap queries using the R-tree. Experiments on real datasets demonstrate that the “AI+R”-tree can enhance the query performance over a traditional R-tree by up to 500%.

 

25. The Case for Distributed Shared-Memory Databases with RDMA-Enabled Memory Disaggregation [VLDB 2022 Vision Paper]

Memory disaggregation (MD) allows for scalable and elastic data center design by separating compute (CPU) from memory. With MD, compute and memory are no longer coupled into the same server box. Instead, they are connected to each other via ultra-fast networking such as RDMA. MD can bring many advantages, e.g., higher memory utilization, better independent scaling (of compute and memory), and lower cost of ownership. This research makes the case that MD can fuel the next wave of innovation on database systems. We observe that MD revives the great debate of "shared what" in the database community. We envi- sion that distributed shared-memory databases (DSM-DB, for short) – that have not received much attention before – can be promising in the future with MD. We study a list of challenges and opportunities that can inspire next steps in system design making the case for DSM-DB.

 

* Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

 

Date of Last Update: December 14, 2023.