III: Small: Native Compilation, Query Processing, and
Indexing for In-memory Graph Relational Data Systems*
Contact Information
Walid G. Aref
Department of Computer Science
Purdue University
305 N. University Street
West Lafayette, Indiana 47907
Phone: (765) 494-1997
Fax : (765) 494-0739
Email: aref@cs.purdue.edu
URL: http://www.cs.purdue.edu/faculty/aref.html
This
material is based upon work supported by the National Science Foundation under
Grant No. IIS-1910216.
Project Award Information
NSF Award Number: IIS-1910216
Duration: 8/1/2019 -- 7/31/2022
Title: Native Compilation, Query Processing, and
Indexing for In-memory Graph Relational Data Systems
PI: Walid
G. Aref
Co-PI: Tiark Rompf
Project Web Page: http://www.cs.purdue.edu/homes/aref/CGR
Project Focus:
A wide variety of
applications spanning various domains have graphs as first-class citizens,
e.g., communication networks, road networks, social networks, and biological
networks. The nodes and edges of these graphs are often associated with
descriptors, e.g., labels and properties, or more generally, attributes. Many
of these applications need efficient and real-time processing of the graph
data. Because relational data systems are very mature and ubiquitous, extending
these systems to support graph data is a natural choice. However, there is an
impedance mismatch between the relational model and the graph model at the
various levels that makes extending relational systems to efficiently support
graph data very challenging. This project will
address this impedance mismatch and the hurdles that face graph applications
that run over graph-enabled relational systems in order to function properly
and efficiently. More specifically, this project will address the
following research challenges: (1) The expressiveness challenge to address the mismatch between the
declarative nature in querying relational data and the navigational nature in
querying graph data, (2) the
scalability challenge to support large amounts of graph and
relational data and queries in real-time, and (3) the performance challenge to address the complexity in answering
graph and relational queries and the real-time processing needs of graph
applications. Addressing these challenges in the focus of this project.
This project addresses how
to overcome the impedance mismatch between the relational and graph models by
addressing the above challenges. Techniques are proposed to seamlessly and
natively process large graph databases inside relational systems without negatively
affecting the graph query performance. The techniques to be developed include:
(1) Graph query compilation techniques: State-of-art query compilation
mechanisms will be developed to mixes of graph and relational query evaluation
pipelines to efficiently execute compiled query processing plans that include
both graph and relational operators, (2) Graph-as-an-index: In-memory graph
indexing techniques that will facilitate the navigation of the graph relational
data using the graph topology. The graph indexes will efficiently support
sub-graph selection based on the attribute data of both the graph nodes and
edges, and performing graph operations on the selected sub-graphs. The
techniques to be developed will support dynamic graphs where both the graph topology
as well as the graph attributes can be updated. The introduced techniques will
tolerate updates that would otherwise invalidate graph intermediate
representations that are typically prepared offline to speedup graph query
processing. (3) Native graph+relational query
execution: Introduce graph navigation operators that operate over graph data in
native mode, yet seamlessly integrate with relational algebra operators inside
query evaluation pipelines. The developed query processing techniques will permit
bidirectional navigation over the relational and the graph data within the same
query evaluation pipeline to permit further query optimization strategies that
are infeasible otherwise, and to efficiently evaluate interleaved graph and
relational operators in query evaluation pipelines, and (4) Costing of the
interleaved relational and graph operations for query optimization purposes.
2019/2020: This year marks
the first year of this project. We have studied several aspects of native
compilation, query processing, and indexing for in-memory graph-relational data
systems. These include compiling symbolic execution with staging and algebraic
effects [OOPSLA 2020], cluster-based data partitioning and elastic scheduling
[SIGMOD 2020], reinforcement-learning-based data partitioning in big data
systems [aiDM 2020], shared execution for business
data analytics to address scalability [SSDBM 2020], and Grid-enabled data
structures for efficient query processing [SIGSPATIAL 2019].
Details about these research
and education activities will be given below.
1. Prototype Realization
of a cluster-based in-memory graph + relational engine.
The target is to realize a
platform where we can investigate and test the scalability, query optimization
techniques, native query processing of interleaved graph and relational
predicates, query compilation of combined graph and relational queries, and graph-as-an-index.
We built on top of GRFusion 1.0 that has been
designed and prototyped by Walid’s group and has been demonstrated in SIGMOD
2018 and with a full paper presented in EDBT 2018. We ported GRFusion 1.0 into the most recent publicly available version
of VoltDB. Then, we extended the GRFusion
platform for use in conducting the project’s proposed research. More
specifically, we prototyped the following features into GRFusion
2.0:
-
Extended SQL to support the notion of graph views that can handle
graphs with multiple vertex and edge types.
-
Extended SQL to abstract graphs as streams of vertexes, edges, paths,
and subgraphs
-
At the data organization and storage level, realized graph-as-an-index
that represent a graph in native graph data structures
-
In addition to traditional index-to-data references (e.g., tuple
identifiers that are stored in indexes to point to relational tables), in
graph-as-an-index, we support two types of references:
1. g2r references that are similar
to the traditional tuple identifiers.
2. r2g references that point from
the relational side to the graph side
-
Using both reference types, one can navigate bidirectionally between
the graph and the relational sides, and hence allow more flexibility during the
query processing face. Mainly, in a query evaluation pipeline, one can
interleave seamlessly graph and relational operators that can operate on their
native environments.
-
Supported query operators that operate on the native graph structures
and that can seamlessly interleave with the relational operators while making
use of g2r and r2g pointers.
-
Provided a programming interface and an API to define graph operations
and then offer these operations to be declaratively queried as hints from
within SQL while using relational predicates as pre- or post-filters (i.e., to
be applied before or after performing the graph operations) that can facilitate
the graph operations
-
Extended SQL to declaratively support applying complex graph functions
using the concept of “hints”
-
Extended SQL to support graph-to-graph operations. GRFusion
1.0 supported only graph-to-relational operators. In GRFusion
2.0, we can use the extended SQL to filter graphs and produce new subgraphs in
a declarative fashion.
A journal version that
reflects all the above extensions to GRFusion 2.0
along with their performance evaluation is in preparation.
Currently, we are working on
the following several fronts to extend GRFusion 2.0.
-
Supporting concurrency control on memory-based graph indexes
-
Supporting graph query compilation from within GRFusion
-
Supporting graph query optimization and developing new optimization
heuristics for queries with interleaved graph and relational predicates
-
Supporting efficient graph pattern-matching query predicates
2. Compiling
Symbolic Execution with Staging and Algebraic Effects [OOPSLA 2020] Building effective symbolic execution engines poses
challenges in multiple dimensions: an engine must correctly model the program
semantics, provide flexibility in symbolic execution strategies, and execute
them efficiently. In this research, we propose a principled approach to
building correct, flexible, and efficient symbolic execution engines, directly
rooted in the semantics of the underlying language in terms of a high-level
definitional interpreter. The definitional interpreter induces algebraic effects
to abstract over semantic variants of symbolic execution, e.g., collecting path
conditions as a state effect and path exploration as a nondeterminism effect.
Different handlers of these effects give rise to different symbolic execution
strategies, making execution strategies orthogonal to the symbolic execution
semantics, thus improving flexibility. Furthermore, by annotating the symbolic
definitional interpreter with binding-times and specializing it to the input
program via the first Futamura projection, we obtain
a symbolic compiler, generating efficient instrumented code having the symbolic
execution semantics.
3. Shared
Execution Techniques for Business Data Analytics over Big Data Streams. [SSDBM
2020] Data Analytics require
processing of large numbers of data streams and create materialized views in
order to provide near real-time answers to user queries. Materializing the view
of each query and refreshing it continuously as a separate query execution plan
is not efficient and is not scalable. In this research, we present a global
query execution plan to simultaneously support multiple queries, and minimize
the number of input scans, operators, and tuples flowing between the operators.
We propose shared-execution techniques for creating and maintaining
materialized views in support of business data analytics queries as an example.
We utilize commonalities in multiple business data analytics queries to support
scalable and efficient processing of big data streams. We analyze the cost and
elasticity of various shared-execution query-processing techniques in a
distributed environment. This research highlights shared execution techniques
for select predicates, group, and aggregate calculations. We present how global
query execution plans are run in a distributed stream processing system that is
built on top of cluster-based data streaming engine.
4. Prompt:
Dynamic Data-Partitioning for Distributed Micro-batch Stream Processing
Systems. [SIGMOD 2020] Micro-batching has been
proposed to support real-world applications that require high-throughput
processing over data streams. In micro-batching, the processing and batching of
data are interleaved. The incoming data is buffered as blocks, and then is processed
using parallel constructs, e.g., Map-Reduce. A micro-batch size is set to
guarantee a certain response-time latency mandated by the application. Existing
micro-batch stream processing systems use basic data-partitioning techniques
that do not handle data skew and variable data rates. Load-awareness is
necessary to maintain performance and to enhance resource utilization. Prompt
is a new data partitioning scheme that leverages the characteristics of the
micro-batching model. In the batching phase, a frequency-aware buffering
mechanism is introduced that progressively maintains run-time statistics, and
provides on-line key sorting as tuples arrive. Optimal data partitioning is
NP-Hard in this context. We introduce a workload-aware greedy algorithm to partition
the buffered data efficiently in the Map stage. In the processing phase, a
load-aware distribution mechanism is presented that balances the size of the
input to the Reduce stage without incurring inter-task communication overhead.
Moreover, Prompt elastically adapts resource consumption according to workload
changes.
5. PartLy: Learning Data Partitioning for Distributed Data
Stream Processing [aiDM 2020] In this research, we expand on our well-optimized
Prompt online data partitioning mechanism by using reinforcement learning to
learn proper data partitioning in a micro-batched data streaming setup. We
realize that data partitioning plays a critical role in data stream processing.
Current data partitioning techniques use simple, static heuristics that do not
incorporate feedback about the quality of the partitioning decision (i.e., fire
and forget strategy). Hence, the data partitioner often repeatedly chooses the
same decision. In this paper, we argue that reinforcement learning techniques
can be applied to address this problem. The use of artificial neural networks
can facilitate learning of efficient partitioning policies. We identify the
challenges that emerge when applying machine learning techniques to the data
partitioning problem for distributed data stream processing. Furthermore, we
introduce PartLy, a proof-of-concept data
partitioner, and present preliminary results that indicate PartLy’s
potential to match the performance of state-of-the-art techniques in terms of
partitioning quality, while minimizing storage and processing overheads.
6. Local
Trend Discovery on Real-time Microblogs with Uncertain Locations in Tight
Memory Environments. [GeoInformatica 2019] In this research, we developed GeoTrend+;
a system approach to support scalable local trend discovery on recent
microblogs, e.g., tweets, comments, online reviews, and check-ins, that come in
real time. GeoTrend+ discovers top-k trending
keywords in arbitrary spatial regions from recent microblogs that continuously
arrive with high rates and a significant portion has uncertain geolocations. GeoTrend+ distinguishes itself from existing techniques in
different aspects: (1) Discovering trends in arbitrary spatial regions, e.g.,
city blocks. (2) Considering both exact geolocations, e.g., accurate
latitude/longitude coordinates, and uncertain geolocations, e.g.,
district-level or city-level, that represents a significant portion of past
years microblogs. (3) Promoting recent microblogs as first-class citizens and
optimizes different components to digest a continuous flow of fast data in
main-memory while removing old data efficiently. (4) Providing various
main-memory optimization techniques that are able to distinguish useful from
useless data to effectively utilize tight memory resources while maintaining
accurate query results on relatively large amounts of data. (5) Supporting
various trending measures that effectively capture trending items under a
variety of definitions that suit different applications. GeoTrend+
limits its scope to real-time data that is posted during the last T time units.
To support its queries efficiently, GeoTrend+ employs
an in-memory spatial index that is able to efficiently digest incoming data and
expire data that is beyond the last T time units. The index also materializes
top-k keywords in different spatial regions so that incoming queries can be
processed with low latency. In peak times, the main-memory optimization
techniques are employed to shed less important data to sustain high query
accuracy with limited memory resources.
7. STAR:
A Distributed Stream Warehouse System for Spatial Data. [SIGMOD 2020 Demo] Location services produce large streams of spatial
and textual data. To enable spatial and textual data analytics, spatial and
textual data is streamed into a data warehouse system that provides online
analytics over the incomingl data. A spatial data
stream warehouse system (DSWS) should efficiently ingest the incoming data, and
enable analytical processing over the streamed data. Existing DSWSs are not
tailored for spatial data. We introduce STAR; a distributed in-memory spatial
data stream warehouse system. STAR provides low-latency and online analytics
over a fast-arriving spatial data stream. STAR supports both snapshot and
continuous queries that handle aggregate functions and ad hoc query constraints
over spatial, textual, and temporal data attributes. STAR offers a view
materialization algorithm that facilitates analytical processing over the
streamed data.
8. SSTD: A
Distributed System on Streaming Spatio-Textual Data
[PVLDB 2020] Streaming spatio-textual data that contains geolocations and textual
contents, e.g., geo-tagged tweets, is becoming increasingly available. Users
can register continuous queries to receive up-to-date results continuously, or
pose snapshot queries to receive results instantly. The large scale of spatio-textual data streams and huge amounts of queries
pose great challenges to the current location-based services, and call for more
efficient data management systems. In this research, we realize SSTD Streaming Spatio-Textual Data, a distributed in-memory system
supporting both continuous and snapshot queries with spatial, textual, and
temporal constraints over data streams. Compared with existing distributed data
stream management systems, SSTD has three novel aspects: (1) It supports many
types of queries over streamed spatio-textual data;
(2) SSTD adopts a new workload partitioning method, termed QT (Quad-Text) tree,
that utilizes the joint distribution of queries and spatio-textual
data to reduce query latency and enhance system throughput. (3) To achieve load
balance and robustness, we develop three new workload adjustment methods for
SSTD to fit the changes in the distributions of data or queries. Extensive
experiments on real-life datasets demonstrate the superior performance of SSTD.
9. An
Investigation of Grid-enabled Tree Indexes for Spatial Query Processing. [ACM
SIGSPATIAL 2019] Two-dimensional
tree-based spatial indexes (e.g., the quadtree or the k-d tree) are commonly used
for indexing spatial data. However, both types of spatial indexes have
limitations. Although two-dimensional trees can handle skewed data, index
traversal and tree maintenance can be expensive. In contrast, spatial grids
have low update overhead, but is not suitable for skewed data. In this
research, we investigate the augmentation of a grid into tree-based indexing
for spatial query processing. For this purpose, we introduce the Grid-Enabled
Tree index (GE-Tree, for short); a hybrid spatial tree structure that augments
a spatial grid to two-dimensional tree indexes. In particular, we investigate
the use of a grid at the leaf level of a two-dimensional tree to facilitate
tree navigation and maintenance.
Training and Professional Development
In the context of this
project, the PIs have offered research opportunities during the 2019/2020
academic year for both graduate and undergraduate students.
Graduate Research
Opportunities: In the first-level
graduate database systems course (CS-54100), and given the COVID-19 situation
and the need to offer courses and project online, the PI has worked to modify
the project component of CS54100. Traditionally, students would be involved in
a semester-long project to realize the engine of a relational database system
layer at a time; starting from the SQL parser all the way down to the disk,
buffer, table, and index layers, passing by the query processor layer, and the
query optimization and query rewrite layers. The nature of this project
mandates a heavy lab component and heavy interaction with the teaching
assistants and the course instructor.
Given the COVID situation,
the PI has introduced a research-oriented project instead of the traditional
course project. The students have been divided into groups of three (for a
total of thirteen groups), and they select from a list of projects that the PI
has provided. Students meet the PI biweekly over Zoom to present their findings
in the project of choice via slides that they prepare, receive feedback, and go
back to continue with their project. Projects span from a survey-style project
to a systems-oriented project. In a systems-oriented project, typically, the
students would be given a research paper as a starter to understand it, present
it, and then realize it to reproduce the results in the paper. Then, they
extend on the techniques in the paper as per the project’s instructions and the
feedback they get from the PI. In addition to a final project presentation, the
students submit a formal technical report explaining their findings.
Two of the groups have picked
to realize a graph query compilation project, where one of the groups is now
preparing to submit their results to a conference (still in preparation). They
have compared their graph query compilation results against GRFusion
and were able to significantly outperform GRFusion.
Other Graduate Research
Training: For graduate students and
outside of the scope of the graduate-level database systems course (CS54100),
during the academic year 2019/2020, the PIs have offered several research
training opportunities as graduate-level PhD research and independent study
courses that involve various project-related topics, where several students got
training on presenting and discussing research papers and in conducting a
semester-long project. These students include Amira Mamoun
(Dataset Discovery Techniques), Jaewoo Shin
(Update-tolerant LSM-based Spatial Indexing), Abdullah Al Mamun (Learned
Spatial Indexing), Ahmed Abdelhamid (Intelligent Data Partitioning in
Micro-batched Big Data Streaming Systems), Lu Xing (Concurrency Control in
Graph Data Management Systems), and Ruihong Wang
(RDMA-based Big Data Systems).
Undergraduate Research
Opportunities: The PIs have offered
research opportunities for the following five undergraduate students (Two
female and three male students):
Piyush Juneja
(Graph Partitioning Algorithms), Nameer A. Qureshi
(Big Spatial Data Systems), Dhanushikka Ravichandiran (Database Concurrency Control Algorithms),
Hao Wu (ML-based Learned Data Indexing), Shotobhisha
Sinha Ray (Big Data Systems).
The students worked in four
groups:
- Group 1: Piyush Juneja worked in surveying graph partitioning algorithms.
- Group 2: Nameer Qureshi developed a web site for all spatio-temporal access methods and their corresponding
published papers.
- Group 3: Shotobhisha Sinha Ray and Dhanushikka
Ravichandiran studied and surveyed various Database
Concurrency Control Algorithms.
- Group 4: Hao We surveyed
the topic of learned indexes with focus on the multi-dimensional case along
with one Ph.D. from Walid’s group.
Of mention is the survey on
multi-dimensional learned indexes which has been accepted as a tutorial in the
ACM SIGSPATIAL 2020 Conference. Both Hao Wu (Undergraduate) and Abdullah Al
Mamun (Ph.D. student) will be presenting this tutorial along with Walid at the
conference.
In 2020/2021, we had several research and education activities
for this project. In the context of query compilation for graph database query,
we have studied the approach of tracking reachability sets in types to enable
ownership-style reasoning for higher-order functional programs [OOPSLA
2021]. We introduced LLSC, a prototype compiler for nondeterministic
parallel symbolic execution of the LLVM intermediate representation (IR)
[ESEC/FSE 2021]. Walid participated in developing a
visionary community view on graph processing systems [ACM CACM 2021]. We
have developed Guard, an attack-resilient adaptive load-balancing in
distributed streaming systems [IEEE Transactions on Dependable and Secure
Computing 2021]. We have developed SWARM, a light-weight adaptivity protocol
that continuously monitors location data and query workloads in
distributed streaming systems [ACM Transactions of Spatial Algorithms and
Systems 2021]. We developed an unbiased online sampling technique for the
visualization of large spatiotemporal data [IEEE VAST 2020]. We have developed
an LSM-based R-tree with update memos to support frequent updates in
moving-object databases [IEEE ICDE 2021]. We developed an online workload
estimation technique that relies on a probabilistic model for estimating the
workload of partitions and machines in a distributed spatial data streaming
system [ACM SIGSPATIAL 2020b]. We have developed and studied the
performance of LocationSpark, a Spark-based query
executor and optimizer to improve the query execution plan generated for
spatial queries [Frontiers 2020]. We presented a tutorial on the subject
of learned multi-dimensional indexes [ACM SIGSPATIAL 2020a].
Walid provided research training in a graduate-level
database systems course at Purdue that has benefitted over 35 students, and
trained seven other Ph.D. students in Walid’s group in project-related research
He mentored an undergraduate student to conduct undergraduate research in
topics related to this project. One Ph.D. student graduated in the topic of
“Efficient Distributed Processing over Micro-Batched Data Streams” who was
partially supported by this grant. Walid gave invited lectures on results
of research developed under this grant. Below, we highlight each of these
contributions and other project’s research, education, and training activities.
10. Reachability
Types: Tracking Aliasing and Separation in Higher-Order Functional Programs
[OOPSLA 2021] Ownership type
systems, based on the idea of enforcing unique access paths, have been
primarily focused on objects and top-level classes. However, existing models do
not as readily reflect the finer aspects of nested lexical scopes, capturing,
or escaping closures in higher-order functional programming patterns, which are
increasingly adopted even in mainstream object-oriented languages. We present a
new type system, 𝜆∗, which enables expressive ownership-style reasoning across
higher-order functions. It tracks sharing and separation through reachability
sets, and layers additional mechanisms for selectively enforcing uniqueness on
top of it. Based on reachability sets, we extend the type system with an
expressive flow-sensitive effect system, which enables flavors of move
semantics and ownership transfer. In addition, we present several case studies
and extensions, including applications to capabilities for algebraic effects,
one-shot continuations, and safe parallelization.
11. LLSC: A
Parallel Symbolic Execution Compiler for LLVM IR [ESEC/FSE 2021] We present LLSC, a prototype compiler for
nondeterministic parallel symbolic execution of the LLVM intermediate
representation (IR). Given an LLVM IR program, LLSC generates code preserving
the symbolic execution semantics and orchestrating solver invocations. The
generated code runs efficiently, since the code has eliminated the
interpretation overhead and explores multiple paths in parallel. To the best of
our knowledge, LLSC is the first compiler for fork-based symbolic execution
semantics that can generate parallel execution code. In this demonstration
paper, we present the current development and preliminary evaluation of LLSC.
The principle behind LLSC is to automatically specialize a symbolic interpreter
via the 1st Futamura projection, a fundamental
connection between interpreters and compilers. The symbolic interpreter is
written in an expressive high-level language equipped with a multi-stage
programming facility. We demonstrate the run time performance through a set of benchmark
programs, showing that LLSC outperforms interpretation-based symbolic execution
engines in significant ways.
12. Guard:
Attack-Resilient Adaptive Load Balancing in Distributed Streaming
Systems [IEEE TDSC 2021] The
performance of distributed streaming systems relies on how even the workload is
distributed among their machines. However, data and query workloads are skewed
and change rapidly. Therefore, multiple adaptive load-balancing mechanisms have
been proposed in the literature to rebalance distributed streaming systems
according to the changes in their workloads. This paper introduces a novel
attack model that targets adaptive load-balancing mechanisms of distributed
streaming systems. The attack reduces the throughput and the availability of
the system by making it stay in a continuous state of rebalancing. In this
research, we develop Guard, a component that detects and blocks attacks that
target the adaptive load balancing of distributed streaming systems. Guard uses
an unsupervised machine-learning technique to detect malicious users that are
involved in the attack. Guard does not block any user unless it detects that
the user is malicious. Guard does not depend on a specific application.
13. SWARM: Adaptive
Load Balancing in Distributed Streaming Systems for Big Spatial Data. [ACM TSAS
2021] As clearly from the
scale of location services, their ubiquity, and the massive amounts of spatial
data being generated in real-time, the current scale of spatial data cannot be
handled using centralized systems. This has led to the development of distributed
spatial streaming systems. Existing systems use static spatial partitioning to
distribute the workload. In contrast, the real-time streamed spatial data
follows non-uniform spatial distributions that are continuously changing over
time. Distributed spatial streaming systems need to react to the changes in the
distribution of spatial data and queries. This research introduces SWARM, a
light-weight adaptivity protocol that continuously monitors the data and query
workloads across the distributed processes of the spatial data streaming
system, and redistribute and rebalance the workloads soon as performance
bottlenecks get detected. SWARM is able to handle multiple query-execution and
data-persistence models. A distributed streaming system can directly use SWARM
to adaptively rebalance the system's workload among its machines with minimal
changes to the original code of the underlying spatial application.
14. A
Tutorial on Learned Multi-dimensional Indexes [ACM SIGSPATIAL 2020] Recently, Machine Learning (ML, for short) has been
successfully applied to database indexing. Initial experimentation on Learned
Indexes has demonstrated better search performance and lower space requirements
than their traditional database counterparts. Numerous attempts have been
explored to extend learned indexes to the multi-dimensional space. This makes
learned indexes potentially suitable for spatial databases. The goal of this
tutorial is to provide up-to-date coverage of learned indexes both in the
single and multi-dimensional spaces. The tutorial covers over 25 learned
indexes. The tutorial navigates through the space of learned indexes through a
taxonomy that helps classify the covered learned indexes both in the single and
multi-dimensional spaces.
15. Scalable
Relational Query Processing on Big Matrix Data [Submitted 2021] Graphs can be represented by matrixes, where graph
operations can be represented by operations over matrix data. The use of
large-scale machine learning methods is becoming ubiquitous in many
applications ranging from business intelligence to self-driving cars. These
methods require a complex computation pipeline consisting of various types of
operations, e.g., relational operations for pre-processing or post-processing
the dataset, and matrix operations for core model computations. Many existing
systems focus on efficiently processing matrix-only operations, and assume that
the inputs to the relational operators are already pre-computed and are
materialized as intermediate matrices. However, the input to a relational
operator may be complex in machine learning pipelines, and may involve various
combinations of matrix operators. Hence, it is critical to realize scalable and
efficient relational query processors that directly operate on big matrix data.
This research develops new efficient and scalable relational query processing
techniques on big matrix data for in- memory distributed clusters. The proposed
techniques leverage algebraic transformation rules to rewrite query execution
plans into ones with lower computation costs. A distributed query plan optimizer
exploits the sparsity-inducing property of merge functions as well as Bloom
join strategies for efficiently evaluating various flavors of the join operation. Furthermore, optimized partitioning schemes
for the input matrices are developed to facilitate the performance of join
operations based on a cost model that minimizes the communication overhead. The
proposed relational query processing techniques are prototyped in Apache
Spark.
16. The LSM
RUM-Tree: A Log Structured Merge R-Tree for Update-intensive Spatial Workloads
[ICDE 2021] Many
applications require update-intensive workloads on spatial objects, e.g.,
social-network services and shared-riding services that track moving objects
(devices). By buffering insert and delete operations in memory, the Log
Structured Merge Tree (LSM) has been used widely in various systems because of
its ability to handle insert-intensive workloads. While the focus on LSM has
been on key-value stores and their optimizations, there is a need to study how
to efficiently support LSM-based secondary indexes. We
investigate the augmentation of a main-memory-based memo structure into an LSM
secondary index structure to handle update-intensive workloads efficiently. We
conduct this study in the context of an R-tree-based secondary index. In
particular, we introduce the LSM RUM-tree that demonstrates the use of an
Update Memo in an LSM-based R-tree to enhance the performance of the R-tree’s
insert, delete, update, and search operations. The LSM RUM-tree introduces
novel strategies to reduce the size of the Update Memo to be a light-weight
in-memory structure that is suitable for handling update-intensive workloads
without introducing significant overhead.
17. TrioStat: Online Workload Estimation in Distributed Spatial
Data Streaming Systems [ACM
SIGSPATIAL 2020] The wide
spread of GPS-enabled devices and the Internet of Things (IoT) has increased
the amount of spatial data being generated every second. The current scale of
spatial data cannot be handled using centralized systems. This has led to the
development of distributed spatial data streaming systems that scale to process
in real-time large amounts of streamed spatial data. The performance of
distributed streaming systems relies on how even the workload is distributed
among their machines. However, it is challenging to estimate the workload of
each machine because spatial data and query streams are skewed and rapidly
change with time and users’ interests. Moreover, a distributed spatial
streaming system often does not maintain a global system workload state because
it requires high network and processing overheads to be collected from the
machines in the system. In this research, we introduce TrioStat;
an online workload estimation technique that relies on a probabilistic model
for estimating the workload of partitions and machines in a distributed spatial
data streaming system. It is infeasible to collect and exchange statistics with
a centralized unit because it requires high network overhead. Instead, TrioStat uses a decentralized technique to collect and
maintain the required statistics in real-time locally in each machine. TrioStat enables distributed spatial data streaming systems
to com- pare the workloads of machines as well as the workloads of data
partitions. TrioStat requires minimal network and
storage overhead. Moreover, the required storage is distributed across the
system’s machines.
18. STULL:
Unbiased Online Sampling for Visual Exploration of Large Spatiotemporal Data
[VAST 2020] Online
sampling-supported visual analytics is increasingly important, as it allows
users to explore large datasets with acceptable approximate answers at
interactive rates. However, existing online spatiotemporal sampling techniques
are often biased, as most researchers have primarily focused on reducing
computational latency. Biased sampling approaches select data with unequal
probabilities and produce results that do not match the exact data
distribution, leading end users to incorrect interpretations. In this research,
we propose a novel approach to perform unbiased online sampling of large
spatiotemporal data. The proposed approach ensures the same prob- ability of
selection to every point that qualifies the specifications of a user’s
multidimensional query. To achieve unbiased sampling for accurate
representative interactive visualizations, we design a novel data index and an
associated sample retrieval plan. Our proposed sampling approach is suitable
for a wide variety of visual analytics tasks, e.g., tasks that run aggregate
queries of spatiotemporal data. Extensive experiments confirm the superiority
of our approach over a state-of-the-art spatial online sampling technique,
demonstrating that within the same computational time, data samples generated
in our approach are at least 50% more accurate in representing the actual
spatial distribution of the data and enable approximate visualizations to
present closer visual appearances to the exact ones.
19. LocationSpark: In-memory Distributed Spatial Query
Processing and Optimization [Frontiers 2020] Due to the ubiquity of spatial data applications and
the large amounts of spatial data that these applications generate and process,
there is a pressing need for scalable spatial query processing. In this
research, we develop new techniques for spatial query processing and
optimization in an in-memory and distributed setup to address scalability. More
specifically, we introduce new techniques for handling query skew that commonly
happens in practice, and minimizes communication costs accordingly. We propose
a distributed query scheduler that uses a new cost model to minimize the cost
of spatial query processing. The scheduler generates query execution plans that
minimize the effect of query skew. The query scheduler utilizes new spatial
indexing techniques based on bitmap filters to forward queries to the
appropriate local nodes. Each local computation node is responsible for
optimizing and selecting its best local query execution plan based on the
indexes and the nature of the spatial queries in that node. All the proposed
spatial query processing and optimization techniques are prototyped inside
Spark, a distributed memory-based computation system. Our prototype system is
termed LocationSpark. The experimental study is based
on real datasets and demonstrates that LocationSpark
can enhance distributed spatial query processing by up to an order of magnitude
over existing in-memory and distributed spatial systems.
Training and Professional Development
In the context of this
project, the PIs have offered research opportunities during the 2020/2021
academic year for both graduate and undergraduate students.
Graduate Research Opportunities:
One Ph.D. student, Ahmed S. Abdelhamid, has graduated
who was partially funded under this project. His dissertation topic
is: Efficient Distributed Processing over Micro-Batched Data Streams.
Student Nameer Qureshi took a graduate independent
study course on topics related to this project.
In Fall 2020, Walid revised the project component of
the graduate-level database systems course (CS541) to allow for new research
opportunities and training of graduate students be semester-long group research
projects of three students each. Each group agrees on a project from a list of
potential projects that Walid provided to them (List is given below). Projects
vary in nature to match the various orientations of the students. Some projects
are survey-oriented while others are research-oriented with heavy programming
components. Projects cover a broad spectrum of research topics. 39 students
benefitted from this opportunity and formed 13 group projects that met with
Walid on weekly and biweekly bases. Two of these projects have resulted in
submitted conference papers that are under review, and another paper is
currently being prepared. We had some collaborators from industrial companies,
e.g., Google, Uber, and Facebook, to help collaborate in the projects, and
provide the students an industrial twist. Student feedback was positive as many
of them appreciated this research experience and the close follow-up and
feedback Walid provided on their biweekly or weekly progressions.
Projects provided included: (1) A survey of
50 years of database models, (2) Studying and comparing the performance of the
disk page and record layouts of PostgreSQL, MySQL, and SQLite, (3) Studying and
comparing the performance of the table directory structures and free-space
management for PostgreSQL, MySQL, and SQLite, (4) Surveying of SSD and
persistent memory storage technologies, their properties, and how they impact
on query processing techniques, index design, concurrency, and recovery, (5)
Query compilation techniques for graph data systems, (6) Handling bulk-loading
anomalies in spatial indexing techniques (is leading to a conference paper
publication), (7) Realizing an updatable compressed bitmap index, (8)
Implementing different concurrency control protocols in RDMA and evaluating
their tradeoffs, (9) Studying RDMA-based distributed transaction management and
realizing an RDMA-based two phase commit protocol, (10), Surveying multi-model
database system techniques, (11) Studying and comparing techniques for
supporting high-dimensional vector databases, (12) Decoupling the storage
engine from the compute engine in MySQL, (13) Realizing a new algorithm for
answering k-nearest-neighbor queries based on a newly published external merge
sort in spatial data systems (is leading to a conference publication), and (14)
Designing and implementing various memory-based indexing techniques over RDMA
that support concurrency.
There were 39 students in class that formed 13 groups.
Walid met biweekly with every group for 30 minutes each over Zoom.
Some groups wanted to be more productive by meeting weekly in contrast to
biweekly and Walid accommodated that. The students demonstrated their
weekly progress in the project in the form of slide-show presentations that the
students have prepared for every meeting, and/or if need be, showed
program code or performance studies and results.
For graduate students, in Fall 2020 and Spring 2021,
Walid has offered several research training opportunities as graduate-level PhD
research and independent study courses that involve various project-related
topics, where several students got training on presenting and discussing
research papers and in conducting a semester-long project. These students
include Jaewoo Shin (Update-tolerant LSM-based
Spatial Indexing – resulted in an ICDE 2021 paper), Abdullah Al Mamun (Learned
Spatial Indexing – resulted in a conference paper submission), Ahmed Abdelhamid
(Intelligent Data Partitioning in Micro-batched Big Data Streaming Systems –
resulted in a conference paper submission), Lu Xing (two projects: Concurrency
Control in Graph Data Management Systems, and Waves of Misery in R-trees), Ruihong Wang (RDMA-based Big Data Systems – resulted in a
conference paper submission), Libin Zhou (Distance
Oracles for dynamic query constraints), and Yeasir Rayhan (Learned spatial data partitioning).
Undergraduate Research Opportunities:
In the context of this project, Co-PI Tiark has offered research opportunities
for Undergraduate Student Shangyin Tan. Tan
got training in the context of this project where he contributed to the
design and implementation of a symbolic execution compiler using multistage
programming and algebraic effects. He was a co-author in the ESEC/FSE ’21
and the OOPSLA 2020 papers.
Walid has offered research opportunities during Summer
and Fall 2020 for undergraduate students. Most notably is Undergraduate Student
Hao Wu who was interested in the topic of ML-based Learned Multidimensional
Data Indexing. His research in the topic materialized into a co-authored
tutorial in the ACM SIGSPATIAL 2020 Conference, where co-authors Hao Wu and
Graduate Student Abdullah Al-Mamun participated in the presentation of the
tutorial at the conference. A survey article on the topic is currently underway.
In 2021/2022, we performed important research and education
activities for this project. In the context of data streaming, in our SIGMOD
2022 paper, we have studied incrementalization of
nested queries to achieve low latency using novel Relative Partial Aggregate
Indexes. We investigated the presence of non-deterministic performance and
“waves of misery” in update-intensive workloads over multi-dimensional data
structures that store graphs. We studied how to mitigate this issue. This
research was published in Proceedings of the VLDB 2021. In an ACM
SIGSPATIAL 2021 conference paper, we presented STAR, a distributed
in-memory data stream warehouse system, that provides low-latency and
up-to-date analytical results over a fast-arriving spatial data stream. STAR
was selected as one of the best papers in the Conference, and was invited in an
extended form for journal publication. In our ECOOP 2022 paper, we introduce a
new type system that features a stack delayed-popping strategy hence allowing variable-size
data, e.g., closures, graph reachability results, to be returned from function
calls as stack-allocated data from a callee. In our VLDB 2022 vision paper, we
studied the impact of new hardware architectures on the design of database
systems and separating memory from compute in a disaggregated setup would
impact database system techniques. We studied learned multi-dimensional indexes
in the context of an instance-optimized R-tree that was published in the IEEE
Mobile Data Management Conference 2022.
Incrementalization of
queries is imperative in cases where data arrives as streams and output is
latency-critical and/or desired before the full data has been received.
Incremental execution computes the output at a given time by reusing the
previously computed outputs or maintained views rather than re-evaluating the
query from scratch. There are various approaches to perform this incrementalization ranging from query-specific algorithms
and data structures (e.g., DYN, AJU) to general systems (e.g., DBToaster, Materialize). DBToaster
is a state-of-the-art system that comes with an appealing theoretical
background based on the idea of applying Incremental View Maintenance (IVM)
recursively, maintaining a hierarchy of materialized views via delta queries.
However, one key limitation of this approach is its inability to efficiently incrementalize correlated nested-aggregate queries due to
an inefficient delta rule for such queries. Moreover, none of the other
specialized approaches have shown efficient ways to optimize such queries
either. Nonetheless, these types of queries can be found in many real-world
application domains (e.g., finance), for which efficient incrementalization
re- mains a crucial open problem. In this work, we propose an approach to incrementalize such queries based on a novel tree-based
index structure called Relative Partial Aggregate Indexes (RPAI). Our approach
is asymptotically faster than other systems and shows up to 1100× speedups in
workloads of practical importance.
21. What If We
Don’t Pop the Stack? The Return of 2nd-Class Values [ECOOP 2022]
Using a stack for managing the local state of procedures as
popularized by Algol is a simple but effective way to achieve a primitive form
of automatic memory management. Hence, the call stack remains the backbone of
most programming language runtimes to the present day. However, the appealing
simplicity of the call stack model comes at the price of strictly enforced
limitations: since every function return pops the stack, it is difficult to
return stack-allocated data from a callee upwards to its caller – especially variable-size data
such as closures. This research introduces a solution by introducing a small
tweak to the usual stack semantics. We design a type system that tracks the
underlying storage mode of values, and when a function returns
a stack-allocated value, we just don’t pop the stack! Instead, the
stack frame is de-allocated together with a parent the next time a
heap-allocated value or primitive is returned. We identify a range of use cases
where this delayed-popping strategy is beneficial, ranging from closures to
trait objects to other types of variable-size data. Our evaluation shows that
this execution model reduces heap and GC pressure and recovers spatial locality
of programs improving execution time between 10% and 25% with respect to
standard execution.
22. Achieving
Deterministic Performance for Multi-dimensional Data Structures [VLDB 2021]
Having deterministic
performance for graph data systems and data systems in general, is very
important. This is especially true for data systems that support navigational
services in real time over road network graphs. Waves of misery is a phenomenon
where spikes of many node splits occur over short periods of time in tree
indexes. Waves of misery negatively affect the performance of tree indexes in
insertion-heavy workloads. Waves of misery have been first observed in the
context of the B-tree, where these waves cause unpredictable index performance.
In particular, the performance of search and index-update operations
deteriorate when a wave of misery takes place, but is more predictable between
the waves. This research investigates the presence or lack of waves of misery
in multi-dimensional structures, especially ones that can embed road network
graphs. Our focus here is on R-tree variants, where we study the extent of
which these waves impact the performance of each R-tree variant. Interestingly,
although having poorer query performance, the Linear and Quadratic R-trees are
found to be more resilient to waves of misery than both the Hilbert and
R*-trees. This research introduces several techniques to reduce the impact in
performance of the waves of misery for the Hilbert and R*-trees. One way to
eliminate waves of misery is to force node splits to take place at regular
times before nodes become full to achieve deterministic performance. The other
way is that upon splitting a node, do not split it evenly but rather at
different node utilization factors. This allows leaf nodes not to fill at the
same pace. We study the impact of two new techniques to mitigate waves of
misery after the tree index has been constructed, namely Regular Elective
Splits (RES, for short) and Unequal Random Splits (URS, for short). Our
experimental investigation highlights the trade-offs in performance of the
introduced techniques and the pros and cons of each technique.
23. STAR: A
Cache-based Stream Warehouse System for Spatial Data [ACM SIGSPATIAL 2021]
The proliferation of mobile
phones and location-based services has given rise to an explosive growth in
spatial data. In order to enable spatial data analytics, spatial data needs to
be streamed into a data stream warehouse system that can provide real-time
analytical results over the most recent and historical spatial data in the
warehouse. Existing data stream warehouse systems are not tailored for spatial
data. In this paper, we introduce the STAR system. STAR is
a distributed in-memory data stream warehouse system that provides low-latency
and up-to-date analytical results over a fast-arriving spatial data
stream. STAR supports both snapshot and continuous queries that are
composed of aggregate functions and ad hoc query constraints over spatial,
textual, and temporal data attributes. STAR implements a cache-based
mechanism to facilitate the processing of snapshot queries that collectively
utilizes the techniques of query-based caching (i.e., view materialization) and
object-based caching. Moreover, to speed-up processing continuous
queries, STAR proposes a novel index structure that achieves high
efficiency in both object checking and result updating. Extensive experiments
over real data sets demonstrate the superior performance of STAR over
existing systems.
24. The “AI+R”-tree: An
Instance-optimized Learned R-tree [IEEE MDM 2022]
The emerging class of
instance-optimized systems has shown potential to achieve high performance by
specializing to a specific data and query workloads. Particularly, Machine
Learning (ML) techniques have been applied successfully to build various instance-optimized
components (e.g., learned indexes). This research leverages ML techniques to
enhance the performance of spatial indexes, particularly the R-tree, for a
given data and query workloads. As the areas covered by the R-tree index nodes
overlap in space, upon searching for a specific point in space, multiple paths
from root to leaf may potentially be explored. In the worst case, the entire
R-tree could be searched. We define and use the overlap ratio to quantify the
degree of extraneous leaf node accesses required by a range query. The goal is
to enhance the query performance of a traditional R-tree for high-overlap range
queries as they tend to incur long running-times. We introduce a new AI-tree
that transforms the search operation of an R-tree into a multi-label
classification task to exclude the extraneous leaf node accesses. Then, we
augment a traditional R-tree to the AI-tree to form a hybrid “AI+R”- tree. The
“AI+R”-tree can automatically differentiate between the high- and low-overlap
queries using a learned model. Thus, the “AI+R”-tree processes high-overlap
queries using the AI- tree, and the low-overlap queries using the R-tree.
Experiments on real datasets demonstrate that the “AI+R”-tree can enhance the
query performance over a traditional R-tree by up to 500%.
Memory disaggregation (MD)
allows for scalable and elastic data center design by separating compute (CPU)
from memory. With MD, compute and memory are no longer coupled into the same
server box. Instead, they are connected to each other via ultra-fast networking
such as RDMA. MD can bring many advantages, e.g., higher memory utilization,
better independent scaling (of compute and memory), and lower cost of
ownership. This research makes the case that MD can fuel the next wave of
innovation on database systems. We observe that MD revives the great debate of
"shared what" in the database community. We envi-
sion that distributed shared-memory databases
(DSM-DB, for short) – that have not received much attention before – can
be promising in the future with MD. We study a list of challenges and
opportunities that can inspire next steps in system design making the case for
DSM-DB.
* Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Date of Last Update: December 14, 2023.