III: Small:
In-memory, Distributed, and Adaptive Spatio-textual
Query Processing
Contact
Information
Walid G. Aref
Department of
Computer Science
Purdue University
305 N. University
Street
West Lafayette,
Indiana 47907
Phone: (765)
494-1997
Fax
:
(765) 494-0739
Email:
aref@cs.purdue.edu
URL:
http://www.cs.purdue.edu/faculty/aref.html
This
material is based upon work supported by the National Science Foundation under
Grant No. III-1815796.
Project Award
Information
NSF Award Number:
III-1815796
Duration: 8/1/2018
-- 7/31/2022
Title: In-memory,
Distributed, and Adaptive Spatio-textual Query
Processing
PI: Walid G. Aref
Project Web Page:
http://www.cs.purdue.edu/homes/aref/IDAS
Project Focus:
The widespread use
of GPS-enabled smartphones along with the popularity of microblogging and
social networking, e.g., Twitter and Facebook, has resulted in producing large
amounts of text data. Typically, this text data, e.g., the tweets, are
geo-tagged by the location in which the text data has been produced. Many
applications make good use of this stream of geo-tagged text data (also termed
spatial-keyword or spatio-textual data), and provide
services to users based on the textual and the spatial components of the data.
Applications need to process large number of user queries against spatio-textual data. For example, in location-aware ad
targeting publish/subscribe systems, it is required to disseminate millions of
ads and promotions to millions of users based on the users' locations and
textual profiles. This project will address the hurdles that face these
applications and their underlying systems in order to function properly.
More specifically,
this project will address the following research challenges that face spatio-textual servers:
(1) The
scalability challenge to support large amounts of spatio-textual
data streams and queries
in real-time,
(2) The
expressiveness challenge that is exemplified in the lack of mechanisms that
adequately express complex spatio-textual queries. Querying
capabilities need to match the growing sophistication and complexity of the
continuously evolving location services, and
(3) The adaptivity
challenge,
where systems need to adapt to changes in data distribution over time.
In location
services, location-data distribution, users' interests, and hot keyword topics
change over time. A scalable spatio-textual server
needs to continuously adapt to these changes. The project will address these
scalability, expressiveness, and adaptivity challenges when processing large
amounts of queries on continuously-streamed spatio-textual
data. The project will investigate how to support spatio-textual
data and queries as first-class citizens in an in-memory distributed data
system. Scalable architectures for handling large amounts of spatio-textual data and continuous queries will be
investigated. In contrast to tailored solutions, relational-like spatio-textual building-block operators will be developed
to express extended-SQL spatio-textual queries along
with costing, algebraic transformation rules, and query optimization
techniques. To address scalability and the variation in the workload, adaptive
and frequency-aware in-memory distributed indexing and query processing
techniques will be developed to dynamically organize and process the
continuously-evolving spatio-textual data. The spatio-textual indexing and query processing techniques to
be developed will dynamically account for the changes and differences in the
frequencies of keywords within the various spatial regions to automatically
choose the best spatio-textual data organization that
optimizes the system performance.
For further
information see the project web page: https://www.cs.purdue.edu/homes/aref/IDAS
During the 2018-2019 Academic year, the first
year of this project, we started with two tutorial and survey activities. Along
with my former Ph.D. student, we published a monograph on the subject of
scalable processing of spatial keyword queries. The monograph was published
with the Morgan and Claypool Publishers in the series Synthesis Lectures on
Data Management. Also, we published a survey on spatial access methods that
appeared in the Springer GeoInformatica Journal early
in the year. We developed and prototyped new capabilities for adaptive query
processing in our spatial-keyword system, Tornado. This work was reported at
the ACM SIGSPATIAL Conference. We developed in-memory distributed spatial query
processing and optimization techniques, our LocationSpark
system. A detailed version of LocationSpark is
currently under review and is available in the ArXiv.
We developed the ExplainER prototype system to
understand and explain entity resolution classifiers with different granularity
levels of explanations. ExplainER was
demonstrated at the IEEE ICDE 2019 conference and was awarded the best demo
award at the conference. We developed a vision paper reflecting an
end-to-end human-centric data cleaning framework that was presented in the ACM
HILDA Workshop in July 2019.
Over the Fall 2018 and Spring 2019 semesters,
Walid led a group of undergraduate students to conduct undergraduate research
in topics related to this project. Details about these research and education
activities will be given below.
This year, Walid gave a keynote speech titled:
“The Dos and Don’ts of Location+X Data Management: A
Systems Perspective” at the 20th IEEE
International Conference on Mobile Data Management in Hong Kong in June 2019.
Also, Walid is the 2019 co-chair for the program committee of the 16th
International Symposium on Spatial and Temporal Databases that will take place
in Vienna later this year.
Scalable
Processing of Spatial-Keyword Queries [Monograph 2019]. Within the scope of this project, my former
student and I have developed an educational milestone for this project. We
wrote a manuscript for scalable query processing techniques for spatial keyword
data. This 110+ pages monograph was published with Morgan and Claypool in
their series Synthesis Lectures on Data Management. The monograph
addresses text data that is associated with location data and that has
become ubiquitous. A tweet is an example of this type of data, where the text
in a tweet is associated with the location where the tweet has been issued. We
use the term spatial-keyword data to refer to this type of data.
Spatial-keyword data is being generated at massive scale. Almost all online
transactions have an associated spatial-trace. The spatial trace is derived
from GPS coordinates, IP addresses, or cell-phone-tower locations. Hundreds of
millions or even billions of spatial-keyword objects are being generated daily.
Spatial-keyword data has numerous applications that require efficient processing
and management of massive amounts of spatial-keyword data. The monograph starts
by overviewing some important applications of spatial-keyword data, and
demonstrates the scale at which spatial-keyword data is being generated. Then,
it formalizes and classifies the various types of queries that execute over
spatial-keyword data. Then, it discusses important and desirable properties of
spatial-keyword query languages that are needed to express queries over
spatial-keyword data. Existing spatial-keyword query languages vary in the
types of spatial-keyword queries that they can support. There are many systems
that process spatial-keyword queries. Systems differ from each other in various
aspects, e.g., whether the system is batch-oriented or stream-based, and whether
the system is centralized or distributed. Moreover, spatial-keyword systems
vary in the types of queries that they support. Finally, systems vary in the
types of indexing techniques that they adopt. The monograph overviews the main
spatial-keyword data-management systems (SK-DMS, for short), and classifies
them according to their features. Moreover, the monograph describes the main
approaches adopted when indexing spatial-keyword data in the centralized and
distributed settings. Several case-studies of SK-DMSs are presented along with
the applications and query types that these SK-DMSs are targeted for and the
indexing techniques they utilize for processing their queries. Optimizing the
performance and the query processing of SK-DMSs still has many research
challenges and open problems. The monograph concludes with a discussion about
several important and open research-problems in the domain of scalable
spatial-keyword processing.
Adaptive
Processing of Spatial-Keyword Data Over a Distributed Streaming Cluster
(SIGSPATIAL 2018). The
widespread use of GPS-enabled smartphones along with the popularity of
micro-blogging and social networking applications, e.g., Twitter and Facebook,
has resulted in the generation of huge streams of geo-tagged textual data. Many
applications require real-time processing of these streams. For example,
location-based ad targeting systems enable advertisers to register millions of
ads to millions of users based on the users’ location and textual profile.
Existing streaming systems are either centralized or are not spatial-keyword
aware, and hence these systems cannot efficiently support the processing of
rapidly arriving spatial-keyword data streams. In this research, we introduce a
two-layered indexing scheme for the distributed processing of spatial-keyword
data streams. We realize this indexing scheme in Tornado, a distributed
spatial-keyword streaming system. The first layer, termed the routing layer, is
used to fairly distribute the workload, and furthermore, co-locate the data
objects and the corresponding queries at the same processing units. The routing
layer uses the Augmented-Grid, a novel structure that is equipped with an
efficient search algorithm for distributing the data objects and queries. The
second layer, termed the evaluation layer, resides within each processing unit
to reduce the processing overhead. The two-layered index adapts to changes in
the workload by applying a cost formula that continuously represents the
processing overhead at each processing unit. Extensive experimental evaluation
using real Twitter data indicates that Tornado achieves high scalability and
more than 2x improvement over the baseline approach in terms of the overall
system throughput.
Spatio-Temporal Access Methods: A Survey (2010 - 2017) (GeoInformatica 2019). The volume of spatio-temporal
data is growing at a rapid pace due to advances in location-aware devices,
e.g., smartphones, and the popularity of location-based services, e.g.,
navigation services. A number of spatio-temporal
access methods have been proposed to support efficient processing of queries
over the spatio-temporal data. Spatio-temporal
access methods can be classified according to the type of data being indexed
into the following categories: (1) indexes for historical spatio-temporal
data, (2) indexes for current and recent spatio-temporal
data, (3) indexes for future spatio-temporal data,
(4) indexes for past, present, and future spatio-temporal
data, (5) indexes for spatio-temporal data with
associated textual data, and (6) parallel and distributed spatio-temporal
systems and indexes. This survey is Part 3 of our two previous surveys on the
same subject that were both published in the IEEE Data Engineering Bulletin. In
this survey, we present an overview and a broad classification of the spatio-temporal access methods published between 2010 and
2017.
ExplainER: Entity Resolution Explanations (IEEE
ICDE 2019 Demo). Entity Resolution is a fundamental data cleaning and integration
problem that has received considerable attention in the past few decades. While
rule-based methods have been used in many practical scenarios and are often
easy to understand, machine-learning-based methods provide the best accuracy.
However, the state-of-the-art classifiers are very opaque. There has been some
work towards understanding and debugging the early stages of the entity
resolution pipeline, e.g. blocking and generating features (similarity scores).
However, there are no such efforts for explaining the model or its predictions.
In this demo, we propose ExplainER, a tool to
understand and explain entity resolution classifiers with different granularity
levels of explanations. Using several benchmark datasets, we will demonstrate
how ExplainER can handle different scenarios for a
variety of classifiers.
Towards
an End-to-End Human-Centric Data Cleaning Framework [HILDA 2019]. Data Cleaning refers to the process of detecting
and fixing errors in the data. Human involvement is instrumental at several
stages of this process such as providing rules or validating computed repairs.
There is a plethora of data cleaning algorithms addressing a wide range of data
errors (e.g., detecting duplicates, violations of integrity constraints, and
missing values). Many of these algorithms involve a human in the loop, however,
this latter is usually coupled to the underlying cleaning algorithms. In a real
data cleaning pipeline, several data cleaning operations are performed using
different tools. A high-level reasoning on these tools, when combined to repair
the data, has the potential to unlock useful use cases to involve humans in the
cleaning process. Additionally, we believe there is an opportunity to benefit
from recent advances in active learning methods to minimize the effort humans
have to spend to verify data items produced by tools or humans. There is
currently no end-to-end data cleaning framework that systematically involves
humans in the cleaning pipeline regardless of the underlying cleaning
algorithms. In this research, we present opportunities that this framework
could offer, and highlight key challenges that need to be addressed to realize
this vision. We present a design vision and discuss scenarios that motivate the
need for this framework to judiciously assist humans in the cleaning
process.
Undergraduate Research Projects
Undergraduate students are gaining research and
development as well as systems-oriented experience in the context of this
project. One Ph.D. student is gaining training in conducted systems-oriented
research.
In the context of this project, Walid has
offered research opportunities during the Fall 2018 and Spring 2019 for the
following eight undergraduate students:
Daniel Hu, Daksh Jotwani,
Piyush Juneja (two semesters), Aaron Nuestedter (two semesters), Harsh Patel, Peyton Puckett,
Logesh Ramadoss, and Ruoyu
Song.
The students worked in four groups:
- Group 1:
Daniel Hu and Peyton Puckett worked in the Tornado distributed in-memory
spatial keyword system along with a Ph.D. student from Walid’s group to develop
a running prototype of a demo system for Tornado. This involved introducing
Kafka between the front-end interfacing client of the system and the back-end
Apache Storm distributed streaming engine.
- Group 2:
Piyush Juneja, Aaron Nuestedter,
and Harsh Patel worked in the LIMO system along with a Ph.D. student from
Walid’s group to develop various aspects of LIMO. They introduced testing
capabilities and enforced coding conventions for Tornado, added publicly
available elevation data to the maps in the LIMO system, and developed video
documentation on how to use the LIMO system. Piyush later worked in developing
a generalized spatial-keyword index inside of PostgreSQL.
- Group 3: Ruoyu Song worked with two graduate students from Walid’s
group, A. Mamun and M. Hassan to add new graph functions into the GRFusion system.
- Group 4:
Daksh Jotwani (in Fall 2018) and Logesh Ramadoss (in Spring 2019) working with one Ph.D. from
Walid’s group to realize an ML-based Spatial Index.
To ensure continuity, a graduate student, funded
under this project, followed up along with Walid the work of each the four
groups on weekly basis, and has been continuing this research in the hopes to
finish and publish this work.
The students maintained all their
project-related activities on GitHub under PurdueDB,
where they gained excellent experience working on relatively large
systems-oriented code bases.
In Fall 2018, Walid offered a graduate-level
seminar course on various project-related topics, where five students got
training on presenting and discussing research papers and in conducting a
semester-long project.
In 2019-2020, we have studied several aspects of
in-memory, distributed, and adaptive spatio-textual
query processing systems. These include spatio-textual
data analytics [SIGMOD 2020 Demo], on-line data partitioning [SIGMOD 2020 full
paper], ML-based on-line data partitioning using Reinforcement Learning [aiDM 2020], a distributed system for spatio-textual
stream processing [PVLDB 2020], shared execution data analytics to address
scalability [SSDBM 2020], Grid/tree augmentation data structures for efficient
spatial query processing [ACM SIGSPATIAL 2020], and trend discovery in
micro-blogs over various space and time resolutions [GeoInformatica
2019]. Walid participated in a Dagstuhl Seminar in
December 2020.
Over the Fall 2019 and Spring 2020 semesters Walid led
a group of undergraduate students to conduct undergraduate research in topics
related to this project. Walid graduated one Ph.D. student in the topic of
“Attack-Resilient Adaptive Load-Balancing in Distributed Spatial Data Streaming
Systems”, and trained six other Ph.D. students in project-related research.
Details about these research and education activities
will be given below.
An
Investigation of Grid-enabled Tree Indexes for Spatial Query Processing. [ACM
SIGSPATIAL 2019] Two-dimensional tree-based spatial indexes (e.g., the
quadtree or the k-d tree) are commonly used for indexing spatial data. However,
both types of spatial indexes have limitations. Although two-dimensional trees
can handle skewed data, index traversal and tree maintenance can be expensive.
In contrast, spatial grids have low update overhead, but is not suitable for
skewed data. In this research, we investigate the augmentation of a grid into
tree-based indexing for spatial query processing. For this purpose, we
introduce the Grid-Enabled Tree index (GE-Tree, for short); a hybrid spatial
tree structure that augments a spatial grid to two-dimensional tree indexes. In
particular, we investigate the use of a grid at the leaf level of a
two-dimensional tree to facilitate tree navigation and maintenance.
STAR:
A Distributed Stream Warehouse System for Spatial Data. [SIGMOD 2020 Demo] Mobile and location-based services produce massive streams of
spatial and textual data. In order to enable spatial and textual data
analytics, spatial and textual data need to be streamed into a data stream
warehouse system that can provide real-time analytics over the incoming spatial
and textual data in the warehouse. A spatial data stream warehouse system
(DSWS) should be able to efficiently ingest the incoming data, and enable
online analytical processing (OLAP) over the streamed data. Existing DSWSs are
not tailored for spatial data. In this research, we introduce STAR; a spatial
data stream warehouse system. STAR is a distributed in-memory data stream
warehouse system that provides low-latency and up-to-date analytical results
over a fast-arriving spatial data stream. STAR supports both snapshot and
continuous queries that are composed of algebraic or holistic aggregate
functions and ad hoc query constraints over spatial, textual, and temporal data
attributes. STAR implements an effective view materialization algorithm, and
adopts a memory-efficient framework that facilitates the processing of streamed
data and the maintenance of materialized views.
Prompt: Dynamic
Data-Partitioning for Distributed Micro-batch Stream Processing Systems. [SIGMOD 2020] Advances in real-world applications require
high-throughput processing over large data streams. Micro-batching has been
proposed to support the needs of these applications. In micro-batching, the
processing and batching of the data are interleaved, where the incoming data
tuples are first buffered as data blocks, and then are processed collectively
using par- allel function constructs (e.g.,
Map-Reduce). The size of a micro-batch is set to guarantee a certain
response-time latency that is to conform to the application’s service-level
agreement. In contrast to tuple-at-a-time data stream processing,
micro-batching has the potential to sustain higher data rates. However,
existing micro-batch stream processing systems use basic data-partitioning
techniques that do not account for data skew and variable data rates.
Load-awareness is necessary to maintain performance and to enhance resource
utilization. A new data partitioning scheme, termed Prompt is presented that
leverages the characteristics of the micro-batch processing model. In the
batching phase, a frequency-aware buffering mechanism is introduced that
progressively maintains run-time statistics, and provides on-line key-based
sorting as data tuples arrive. Because achieving optimal data partitioning is NP-Hard
in this context, a workload-aware greedy algorithm is introduced that
partitions the buffered data tuples efficiently for the Map stage. In the
processing phase, a load-aware distribution mechanism is presented that
balances the size of the input to the Reduce stage without incurring inter-task
communication overhead. Moreover, Prompt elastically adapts resource
consumption according to workload changes.
PartLy: Learning Data Partitioning for Distributed Data
Stream Processing [aiDM 2020] In this research, we expand on our well-optimized Prompt online
data partitioning mechanism by using reinforcement learning to learn proper
data partitioning in a micro-batched data streaming setup. We realize that data
partitioning plays a critical role in data stream processing. Current data
partitioning techniques use simple, static heuristics that do not incorporate
feedback about the quality of the partitioning decision (i.e., fire and forget
strategy). Hence, the data partitioner often repeatedly chooses the same
decision. In this paper, we argue that reinforcement learning techniques can be
applied to address this problem. The use of artificial neural networks can
facilitate learning of efficient partitioning policies. We identify the
challenges that emerge when applying machine learning techniques to the data
partitioning problem for distributed data stream processing. Furthermore, we
introduce PartLy, a proof-of-concept data
partitioner, and present preliminary results that indicate PartLy’s
potential to match the performance of state-of-the-art techniques in terms of
partitioning quality, while minimizing storage and processing overheads.
SSTD: A
Distributed System on Streaming Spatio-Textual Data
[PVLDB 2020]. Streaming spatio-textual data that contains geolocations and textual
contents, e.g., geo-tagged tweets, is becoming increasingly available. Users
can register continuous queries to receive up-to-date results continuously, or
pose snapshot queries to receive results instantly. The large scale of spatio-textual data streams and huge amounts of queries
pose great challenges to the current location-based services, and call for more
efficient data management systems. In this research, we realize SSTD Streaming Spatio-Textual Data, a distributed in-memory system
supporting both continuous and snapshot queries with spatial, textual, and
temporal constraints over data streams. Compared with existing distributed data
stream management systems, SSTD has three novel aspects: (1) It supports many
types of queries over streamed spatio-textual
data; (2) SSTD adopts a new workload partitioning method, termed QT (Quad-Text)
tree, that utilizes the joint distribution of queries and spatio-textual
data to reduce query latency and enhance system throughput. (3) To achieve load
balance and robustness, we develop three new workload adjustment methods for
SSTD to fit the changes in the distributions of data or queries. Extensive
experiments on real-life datasets demonstrate the superior performance of SSTD.
Local
Trend Discovery on Real-time Microblogs with Uncertain Locations in Tight
Memory Environments. [GeoInformatica 2019] In this research, we developed GeoTrend+;
a system approach to support scalable local trend discovery on recent
microblogs, e.g., tweets, comments, online reviews, and check-ins, that come in
real time. GeoTrend+ discovers top-k trending
keywords in arbitrary spatial regions from recent microblogs that continuously
arrive with high rates and a significant portion has uncertain geolocations. GeoTrend+ distinguishes itself from existing techniques in
different aspects: (1) Discovering trends in arbitrary spatial regions, e.g.,
city blocks. (2) Considering both exact geolocations, e.g., accurate
latitude/longitude coordinates, and uncertain geolocations, e.g.,
district-level or city-level, that represents a significant portion of past
years microblogs. (3) Promoting recent microblogs as first-class citizens and
optimizes different components to digest a continuous flow of fast data in
main-memory while removing old data efficiently. (4) Providing various
main-memory optimization techniques that are able to distinguish useful from
useless data to effectively utilize tight memory resources while maintaining
accurate query results on relatively large amounts of data. (5) Supporting
various trending measures that effectively capture trending items under a
variety of definitions that suit different applications. GeoTrend+
limits its scope to real-time data that is posted during the last T time units.
To support its queries efficiently, GeoTrend+ employs
an in-memory spatial index that is able to efficiently digest incoming data and
expire data that is beyond the last T time units. The index also materializes
top-k keywords in different spatial regions so that incoming queries can be
processed with low latency. In peak times, the main-memory optimization
techniques are employed to shed less important data to sustain high query
accuracy with limited memory resources. Experimental results based on real data
and queries show the scalability of GeoTrend+ to
support high arrival rates and low query response time, and at least 90+% query
accuracy even under limited memory resources.
Shared Execution
Techniques for Business Data Analytics over Big Data Streams. [SSDBM 2020] Data Analytics require processing of large numbers of data streams
and create materialized views in order to provide near real-time answers to
user queries. Materializing the view of each query and refreshing it
continuously as a separate query execution plan is not efficient and is not
scalable. In this research, we present a global query execution plan to
simultaneously support multiple queries, and minimize the number of input
scans, operators, and tuples flowing between the operators. We propose shared-execution
techniques for creating and maintaining materialized views in support of
business data analytics queries as an example. We utilize commonalities in
multiple business data analytics queries to support scalable and efficient
processing of big data streams. We analyze the cost and elasticity of various
shared-execution query-processing techniques in a distributed environment. The
paper highlights shared execution techniques for select predicates, group, and
aggregate calculations. We present how global query execution plans are run in
a distributed stream processing system that is built on top of cluster-based
data streaming engine.
Training and Professional Development
1. Undergraduate Research Opportunities
In the context of this project, Walid has offered research
opportunities during the Fall 2019 and Spring 2020 for the following five
undergraduate students (Two female and three male students):
Piyush Juneja (Graph Partitioning
Algorithms), Nameer A. Qureshi (Big Spatial Data
Systems), Dhanushikka Ravichandiran
(Database Concurrency Control Algorithms), Hao Wu (ML-based Learned Data
Indexing), Shotobhisha Sinha Ray (Big Data Systems).
The students worked in four groups:
- Group 1: Piyush Juneja worked in
surveying graph partitioning algorithms.
- Group 2: Nameer Qureshi
developed a web site for all spatio-temporal access
methods and their corresponding published papers.
- Group 3: Shotobhisha Sinha Ray
and Dhanushikka Ravichandiran
studied and surveyed various Database Concurrency Control Algorithms.
- Group 4: Hao We surveyed the topic of learned indexes with
focus on the multi-dimensional case.
To ensure continuity, graduate students from Walid’s group
along with Walid, followed up the work of each of the four groups on weekly
basis (one graduate student per group), and has been continuing this research
in the hopes to finish and publish this work.
Of mention is the survey on multi-dimensional learned
indexes which has been accepted as a tutorial in the ACM SIGSPATIAL 2020
Conference. Both Hao Wu (Undergraduate) and Abdullah Al Mamun (Ph.D. student)
will be presenting this tutorial along with Walid at the conference.
Graduate Research Opportunities:
For graduate students, in Fall 2019 and Spring 2020, Walid
has offered several research training opportunities as graduate-level PhD
research and independent study courses that involve various project-related
topics, where several students got training on presenting and discussing
research papers and in conducting a semester-long project. These students
include:
·
Amira Mamoun (Dataset Discovery Techniques)
·
Jaewoo
Shin (Update-tolerant LSM-based Spatial Indexing)
·
Abdullah Al
Mamun (Learned Spatial Indexing)
·
Ahmed
Abdelhamid (Intelligent Data Partitioning in Micro-batched Big Data Streaming
Systems)
·
Lu Xing (Concurrency
Control and Query Optimization in Graph Data Management Systems)
·
Ruihong
Wang (RDMA-based Big Data Systems)
·
Serkan Uzunbaz (Shared Execution for Data Analytics over Big Data
Streams)
Walid has graduated one Ph.D.
student, Anas Daghistani, co-advised with Professor Arif Ghafoor (ECE) in the topic of “Attack-Resilient
Adaptive Load-Balancing in Distributed Spatial Data Streaming Systems”, in the
context of this project. This research is currently under review for journal
publication.
In 2020-2021, we had several
research and education activities in the context of this project. We have
addressed the scalability challenge in real-time location services by
developing SWARM, a light-weight adaptivity protocol that continuously monitors
the data and query workloads across the distributed processes of the spatial
data streaming system, and redistribute and rebalance the workloads soon as
performance bottlenecks get detected. SWARM has been published in the ACM
Transactions on Spatial Algorithms and Systems [ACM TSAS 2021]. We have
investigated the adaptation of Log Structured Merge trees to support frequent
updates in moving-object spatial databases that has been published in the IEEE
International Conference on Data Engineering [ICDE 2021]. Along with
undergraduate and graduate students, we presented a tutorial on the subject of
learned multi-dimensional indexes in the 2020 ACM SIGSPATIAL Conference [ACM
SIGSPATIAL 2020a]. We developed an online workload estimation technique that
relies on a probabilistic model for estimating the workload of partitions and
machines in a distributed spatial data streaming system [ACM SIGSPATIAL 2020b].
We developed an unbiased online sampling for the visual exploration of large
spatiotemporal data that was published in the IEEE Visualization Conference
[VAST 2020]. We have developed LocationSpark, a query executor, and an optimizer based on
Spark to improve the query execution plan generated for spatial queries. The
design and performance of LocationSpark were published
in the Frontiers in Big Data [Frontiers 2020].
Walid introduced a project-based research component in the
graduate-level database systems course at Purdue that has benefitted over 35
students. He mentored an undergraduate student to conduct undergraduate
research in topics related to this project. In 2020/2021 Walid graduated one
Ph.D. student in the topic of “Efficient Distributed Processing over
Micro-Batched Data Streams”, and trained seven other Ph.D. students in
project-related research.
Below, we highlight each of
these contributions and other project’s research, education, and training
activities.
SWARM: Adaptive
Load Balancing in Distributed Streaming Systems for Big Spatial Data. [ACM TSAS
2021] As clearly from the scale of location services, their
ubiquity, and the massive amounts of spatial data being generated in real-time,
the current scale of spatial data cannot be handled using centralized systems.
This has led to the development of distributed spatial streaming systems.
Existing systems use static spatial partitioning to distribute the workload. In
contrast, the real-time streamed spatial data follows non-uniform spatial
distributions that are continuously changing over time. Distributed spatial
streaming systems need to react to the changes in the distribution of spatial
data and queries. This research introduces SWARM, a light-weight adaptivity
protocol that continuously monitors the data and query workloads across the
distributed processes of the spatial data streaming system, and redistribute
and rebalance the workloads soon as performance bottlenecks get detected. SWARM
is able to handle multiple query-execution and data-persistence models. A
distributed streaming system can directly use SWARM to adaptively rebalance the
system's workload among its machines with minimal changes to the original code
of the underlying spatial application. Extensive experimental evaluation using
real and synthetic datasets illustrate that, on average, SWARM achieves 200%
improvement over a static grid partitioning that is determined based on
observing a limited history of the data and query workloads. Moreover, SWARM
reduces execution latency on average 4x compared with other existing
techniques.
Scalable
Relational Query Processing on Big Matrix Data [Submitted 2021] The
use of large-scale machine learning methods is becoming ubiquitous in many
applications ranging from business intelligence to self-driving cars. These
methods require a complex computation pipeline consisting of various types of
operations, e.g., relational operations for pre-processing or post-processing
the dataset, and matrix operations for core model computations. Many existing
systems focus on efficiently processing matrix-only operations, and assume that
the inputs to the relational operators are already pre-computed and are
materialized as intermediate matrices. However, the input to a relational
operator may be complex in machine learning pipelines, and may involve various
combinations of matrix operators. Hence, it is critical to realize scalable and
efficient relational query processors that directly operate on big matrix data.
This research develops new efficient and scalable relational query processing
techniques on big matrix data for in- memory distributed clusters. The proposed
techniques leverage algebraic transformation rules to rewrite query execution
plans into ones with lower computation costs. A distributed query plan
optimizer exploits the sparsity-inducing property of merge functions as well as
Bloom join strategies for efficiently evaluating various flavors of the join operation. Furthermore, optimized partitioning schemes
for the input matrices are developed to facilitate the performance of join
operations based on a cost model that minimizes the communication overhead. The
proposed relational query processing techniques are prototyped in Apache Spark.
Experiments on both real and synthetic data demonstrate that the proposed
techniques achieve up to two orders of magnitude performance improvement over
state-of-the-art systems on a wide range of applications.
The LSM
RUM-Tree: A Log Structured Merge R-Tree for Update-intensive Spatial Workloads
[ICDE 2021] Many applications require
update-intensive workloads on spatial objects, e.g., social-network services
and shared-riding services that track moving objects (devices). By buffering
insert and delete operations in memory, the Log Structured Merge Tree (LSM) has
been used widely in various systems because of its ability to handle
insert-intensive workloads. While the focus on LSM has been on key-value stores
and their optimizations, there is a need to study how to efficiently support
LSM-based secondary indexes. We investigate the augmentation of a
main-memory-based memo structure into an LSM secondary index structure to
handle update-intensive workloads efficiently. We conduct this study in the
context of an R-tree-based secondary index. In particular, we introduce the LSM
RUM-tree that demonstrates the use of an Update Memo in an LSM-based R-tree to
enhance the performance of the R-tree’s insert, delete, update, and search
operations. The LSM RUM-tree introduces novel strategies to reduce the size of
the Update Memo to be a light-weight in-memory structure that is suitable for
handling update-intensive workloads without introducing significant overhead.
A
Tutorial on Learned Multi-dimensional Indexes [ACM SIGSPATIAL 2020] Recently, Machine Learning (ML, for short) has been
successfully applied to database indexing. Initial experimentation on Learned
Indexes has demonstrated better search performance and lower space requirements
than their traditional database counterparts. Numerous attempts have been
explored to extend learned indexes to the multi-dimensional space. This makes
learned indexes potentially suitable for spatial databases. The goal of this
tutorial is to provide up-to-date coverage of learned indexes both in the
single and multi-dimensional spaces. The tutorial covers over 25 learned
indexes. The tutorial navigates through the space of learned indexes through a
taxonomy that helps classify the covered learned indexes both in the single and
multi-dimensional spaces.
TrioStat: Online Workload Estimation in Distributed Spatial
Data Streaming Systems [ACM SIGSPATIAL
2020] The wide spread of GPS-enabled devices and the Internet
of Things (IoT) has increased the amount of spatial data being generated every second.
The current scale of spatial data cannot be handled using centralized systems.
This has led to the development of distributed spatial data streaming systems
that scale to process in real-time large amounts of streamed spatial data. The
performance of distributed streaming systems relies on how even the workload is
distributed among their machines. However, it is challenging to estimate the
workload of each machine because spatial data and query streams are skewed and
rapidly change with time and users’ interests. Moreover, a distributed spatial
streaming system often does not maintain a global system workload state because
it requires high network and processing overheads to be collected from the
machines in the system. In this research, we introduce TrioStat;
an online workload estimation technique that relies on a probabilistic model
for estimating the workload of partitions and machines in a distributed spatial
data streaming system. It is infeasible to collect and exchange statistics with
a centralized unit because it requires high network overhead. Instead, TrioStat uses a decentralized technique to collect and
maintain the required statistics in real-time locally in each machine. TrioStat enables distributed spatial data streaming systems
to com- pare the workloads of machines as well as the workloads of data
partitions. TrioStat requires minimal network and
storage overhead. Moreover, the required storage is distributed across the
system’s machines.
STULL: Unbiased
Online Sampling for Visual Exploration of Large Spatiotemporal Data [VAST 2020] Online sampling-supported visual analytics is increasingly
important, as it allows users to explore large datasets with acceptable
approximate answers at interactive rates. However, existing online
spatiotemporal sampling techniques are often biased, as most re- searchers have
primarily focused on reducing computational latency. Biased sampling approaches
select data with unequal probabilities and produce results that do not match
the exact data distribution, leading end users to incorrect interpretations. In
this research, we propose a novel approach to perform unbiased online sampling
of large spatiotemporal data. The proposed approach ensures the same prob-
ability of selection to every point that qualifies the specifications of a
user’s multidimensional query. To achieve unbiased sampling for accurate
representative interactive visualizations, we design a novel data index and an
associated sample retrieval plan. Our proposed sampling approach is suitable
for a wide variety of visual analytics tasks, e.g., tasks that run aggregate
queries of spatiotemporal data. Extensive experiments confirm the superiority
of our approach over a state-of-the-art spatial online sampling technique,
demonstrating that within the same computational time, data samples generated
in our approach are at least 50% more accurate in representing the actual
spatial distribution of the data and enable approximate visualizations to
present closer visual appearances to the exact ones.
LocationSpark: In-memory Distributed Spatial Query
Processing and Optimization [Frontiers 2020] Due
to the ubiquity of spatial data applications and the large amounts of spatial
data that these applications generate and process, there is a pressing need for
scalable spatial query processing. In this research, we develop new techniques
for spatial query processing and optimization in an in-memory and distributed
setup to address scalability. More specifically, we introduce new techniques
for handling query skew that commonly happens in practice, and minimizes
communication costs accordingly. We propose a distributed query scheduler that
uses a new cost model to minimize the cost of spatial query processing. The
scheduler generates query execution plans that minimize the effect of query
skew. The query scheduler utilizes new spatial indexing techniques based on
bitmap filters to forward queries to the appropriate local nodes. Each local
computation node is responsible for optimizing and selecting its best local
query execution plan based on the indexes and the nature of the spatial queries
in that node. All the proposed spatial query processing and optimization
techniques are prototyped inside Spark, a distributed memory-based computation
system. Our prototype system is termed LocationSpark.
The experimental study is based on real datasets and demonstrates that LocationSpark can enhance distributed spatial query
processing by up to an order of magnitude over existing in-memory and
distributed spatial systems.
Training and Professional
Development
One Ph.D. student, Ahmed
S. Abdelhamid, has graduated who was partially funded under this project. His
dissertation topic is: Efficient Distributed Processing over Micro-Batched
Data Streams. Student Nameer Qureshi took a graduate
independent study course on topics related to this project.
In Fall 2020, I revised
the project component of the graduate-level database systems course (CS541) to
allow for new research opportunities and training of graduate students be
semester-long group research projects of three students each. Each group agrees
on a project from a list of potential projects that I provided to them (List is
given below). Projects vary in nature to match the various orientations of the
students. Some projects are survey-oriented while others are research-oriented
with heavy programming components. Projects cover a broad spectrum of research
topics. 39 students benefitted from this opportunity and formed 13 group
projects that met with Walid on weekly and biweekly bases. Two of these
projects have resulted in submitted conference papers that are under review,
and another paper is currently being prepared. I had some collaborators from
industrial companies, e.g., Google, Uber, and Facebook, to help collaborate in
the projects, and provide the students an industrial twist. Student feedback
was positive as many of them appreciated this research experience and the close
follow-up and feedback I provided on their biweekly or weekly
progressions.
Projects provided
included:
(1)
A survey of 50 years of database models
(2)
Studying and comparing the performance of the disk page and record layouts of
PostgreSQL, MySQL, and SQLite
(3)
Studying and comparing the performance of the table directory structures and
free-space management for PostgreSQL, MySQL, and SQLite
(4)
Surveying of SSD and persistent memory storage technologies, their properties,
and how they impact on query processing techniques, index design, concurrency,
and recovery
(5)
Query compilation techniques for graph data systems
(6)
Handling bulk-loading anomalies in spatial indexing techniques (is leading to a
conference paper publication)
(7)
Realizing an updatable compressed bitmap index
(8)
Implementing different concurrency control protocols in RDMA and evaluating
their tradeoffs
(9)
Studying RDMA-based distributed transaction management and realizing an
RDMA-based two phase commit protocol,
(10)
Surveying multi-model database system techniques
(11)
Studying and comparing techniques for supporting high-dimensional vector
databases
(12)
Decoupling the storage engine from the compute engine in MySQL
(13)
Realizing a new algorithm for answering k-nearest-neighbor queries based on a
newly published external merge sort in spatial data systems (is leading to a
conference publication)
(14)
Designing and implementing various memory-based indexing techniques over RDMA
that support concurrency.
There were 39 students
in class that formed 13 groups. I met biweekly with every group for 30 minutes
each over Zoom. Some groups wanted to be more productive by meeting weekly
in contrast to biweekly and I accommodated that. The students demonstrated
their weekly progress in the project in the form of slide-show presentations
that the students have prepared for every meeting, and/or if need be,
showed program code or performance studies and results.
For graduate students,
in Fall 2020 and Spring 2021, Walid has offered several research training
opportunities as graduate-level PhD research and independent study courses that
involve various project-related topics, where several students got training on
presenting and discussing research papers and in conducting a semester-long
project. These students include
1.
Jaewoo
Shin (Update-tolerant LSM-based Spatial Indexing – resulted in an ICDE 2021
paper)
2.
Abdullah Al Mamun (Learned
Spatial Indexing – resulted in a conference paper submission)
3.
Ahmed Abdelhamid
(Intelligent Data Partitioning in Micro-batched Big Data Streaming Systems –
resulted in a conference paper submission)
4.
Lu Xing (two projects:
Concurrency Control in Graph Data Management Systems, and Waves of Misery in
R-trees – resulted in a conference paper submission)
5.
Ruihong
Wang (RDMA-based Big Data Systems – resulted in a conference paper submission)
6.
Libin
Zhou (Distance Oracles for dynamic query constraints)
7.
Yeasir Rayhan (Learned spatial data partitioning and vectorized
spatial query processing).
Undergraduate Research
Opportunities: In the context of this project, Walid has offered research
opportunities during Summer and Fall 2020 for undergraduate students. Most
notably is Undergraduate Student Hao Wu who was interested in the topic of
ML-based Learned Multidimensional Data Indexing. His research in the topic
materialized into a co-authored tutorial in the ACM SIGSPATIAL 2020 Conference,
where co-authors Hao Wu and Graduate Student Abdullah Al-Mamun participated in
the presentation of the tutorial at the conference. A survey article on the
topic is currently underway.
In 2021/2022, the final year of the project, we
investigated supporting data analytics over fast arriving spatial data
streams. We developed STAR, a distributed in-memory data stream
warehouse system, that provides low-latency and up-to-date analytical results
over a fast-arriving spatial data stream. STAR was published in the 2021 ACM
SIGSPATIAL Conference and was selected as one of the best papers in the
Conference, and was invited in an extended form for journal publication, and
was accepted in the ACM Transactions in Spatial Algorithms and Systems. We
developed Guard, a component that detects and blocks attacks that
target the adaptive load balancing of distributed streaming systems. Guard uses
an unsupervised machine-learning technique to detect malicious users that are
involved in the attack. Guard was published in the IEEE Transactions on
Dependable and Secure Computing. We investigated the presence of
non-deterministic performance and “waves of misery” in update-intensive
workloads over several R-tree variants, and studied how to mitigate this issue.
This research was published in Proceedings of the VLDB 2022. We studied learned
multi-dimensional indexes in the context of an instance-optimized R-tree that
was published in the IEEE Mobile Data Management Conference 2022. We envisioned
how to design location data systems that have location as first-class data type
and not as an afterthought. Finally, we studied the impact of new hardware
architectures on the design of database systems and separating memory from
compute in a disaggregated setup would impact database system techniques. Our
vision paper was published in the Proceedings of the VLDB 2022.
Below, we highlight each of
these contributions and other project’s research, education, and training
activities.
STAR: A
Cache-based Stream Warehouse System for Spatial Data [SIGSPATIAL'2021], [ACM TSAS 2023]
The proliferation of mobile phones and
location-based services has given rise to an explosive growth in spatial data.
In order to enable spatial data analytics, spatial data needs to be streamed
into a data stream warehouse system that can provide real-time analytical
results over the most recent and historical spatial data in the warehouse.
Existing data stream warehouse systems are not tailored for spatial data. In
this paper, we introduce the STAR system. STAR is a
distributed in-memory data stream warehouse system that provides low-latency
and up-to-date analytical results over a fast-arriving spatial data
stream. STAR supports both snapshot and continuous queries that are
composed of aggregate functions and ad hoc query constraints over spatial,
textual, and temporal data attributes. STAR implements a cache-based
mechanism to facilitate the processing of snapshot queries that collectively
utilizes the techniques of query-based caching (i.e., view materialization) and
object-based caching. Moreover, to speed-up processing continuous
queries, STAR proposes a novel index structure that achieves high
efficiency in both object checking and result updating. Extensive experiments
over real data sets demonstrate the superior performance of STAR over
existing systems.
Guard:
Attack-Resilient Adaptive Load Balancing in Distributed Streaming Systems [IEEE
TDSC 2021] The performance of distributed streaming systems
relies on how even the workload is distributed among their machines. However,
data and query workloads are skewed and change rapidly. Therefore, multiple
adaptive load-balancing mechanisms have been proposed in the literature to
rebalance distributed streaming systems according to the changes in their
workloads. This paper introduces a novel attack model that targets adaptive
load-balancing mechanisms of distributed streaming systems. The attack reduces
the throughput and the availability of the system by making it stay in a
continuous state of rebalancing. This paper proposes Guard, a component that
detects and blocks attacks that target the adaptive load balancing of
distributed streaming systems. Guard uses an unsupervised machine-learning
technique to detect malicious users that are involved in the attack. Guard does
not block any user unless it detects that the user is malicious. Guard does not
depend on a specific application. Experimental evaluation for a high-intensity
attack illustrates that Guard improves the throughput and the availability of
the system by 85% and 86%, respectively. Moreover, Guard improves the minimum
availability that the attacker achieves by 325%.
An Experimental
Evaluation and Investigation of Waves of Misery in R-trees [VLDB 2021] Waves of misery is a phenomenon where spikes of many node splits occur
over short periods of time in tree indexes. Waves of misery negatively affect
the performance of tree indexes in insertion-heavy workloads. Waves of misery
have been first observed in the context of the B-tree, where these waves cause
unpredictable index performance. In particular, the performance of search and
index-update operations deteriorate when a wave of misery takes place, but is
more predictable between the waves. This paper investigates the presence or
lack of waves of misery in several R-tree variants, and studies the extent of
which these waves impact the performance of each variant. Interestingly,
although having poorer query performance, the Linear and Quadratic R-trees are
found to be more resilient to waves of misery than both the Hilbert and
R*-trees. This paper presents several techniques to reduce the impact in
performance of the waves of misery for the Hilbert and R*-trees. One way to
eliminate waves of misery is to force node splits to take place at regular
times before nodes become full to achieve deterministic performance. The other
way is that upon splitting a node, do not split it evenly but rather at
different node utilization factors. This allows leaf nodes not to fill at the
same pace. We study the impact of two new techniques to mitigate waves of
misery after the tree index has been constructed, namely Regular Elective
Splits (RES, for short) and Unequal Random Splits (URS, for short). Our
experimental investigation highlights the trade-offs in performance of the
introduced techniques and the pros and cons of each technique.
The “AI+R”-tree:
An Instance-optimized R-tree [MDM 2022] The
emerging class of instance-optimized systems has shown potential to achieve
high performance by specializing to a specific data and query workloads.
Particularly, Machine Learning (ML) techniques have been applied successfully
to build various instance-optimized components (e.g., learned indexes). This
paper investigates to leverage ML techniques to enhance the performance of
spatial indexes, particularly the R-tree, for a given data and query workloads.
As the areas covered by the R-tree index nodes overlap in space, upon searching
for a specific point in space, multiple paths from root to leaf may potentially
be explored. In the worst case, the entire R-tree could be searched. In this
paper, we define and use the overlap ratio to quantify the de- gree of extraneous leaf node accesses required by a range
query. The goal is to enhance the query performance of a traditional R-tree for
high-overlap range queries as they tend to incur long running-times. We
introduce a new AI-tree that transforms the search operation of an R-tree into
a multi-label classification task to exclude the extraneous leaf node accesses.
Then, we augment a traditional R-tree to the AI-tree to form a hybrid “AI+R”-
tree. The “AI+R”-tree can automatically differentiate between the high- and
low-overlap queries using a learned model. Thus, the “AI+R”-tree processes
high-overlap queries using the AI- tree, and the low-overlap queries using the
R-tree. Experiments on real datasets demonstrate that the “AI+R”-tree can
enhance the query performance over a traditional R-tree by up to 500%.
The Case for
Distributed Shared-Memory Databases with RDMA-Enabled Memory Disaggregation [VLDB’22]
Memory disaggregation (MD) allows for scalable and
elastic data center design by separating compute (CPU) from memory. With MD,
compute and memory are no longer coupled into the same server box. Instead,
they are connected to each other via ultra-fast networking such as RDMA. MD can
bring many advantages, e.g., higher memory utilization, better independent
scaling (of compute and memory), and lower cost of ownership. This paper makes
the case that MD can fuel the next wave of innovation on database systems. We
observe that MD revives the great debate of "shared what" in the
database community. We envision that distributed shared- memory databases
(DSM-DB, for short) – that have not received much attention before – can be
promising in the future with MD. We present a list of challenges and
opportunities that can inspire next steps in system design making the case for
DSM-DB.
ILX: Intelligent
"Location+X" Data Systems (Vision Paper)
Due to the ubiquity of mobile phones and
location-detection devices, location data is being generated in very large
volumes. Queries and operations that are performed on location data warrant the
use of database systems. Despite that, location data is being supported in data
systems as an afterthought. Typically, relational or NoSQL data systems that
are mostly designed with non-location data in mind get extended with spatial or
spatiotemporal indexes, some query op- erators, and higher level syntactic sugar in order to support location
data. The ubiquity of location data and location data services call for systems
that are solely designed and optimized for the efficient support of location
data. This paper envisions designing intelligent location+X
data systems, ILX for short, where location is treated as a first-class citizen
type. ILX is tailored with location data as the main data type
(location-first). Because location data is typically augmented with other data
types X, e.g., graphs, text data, click streams, annotations, etc., ILX needs
to be extensible to support other data types X along with location. This paper
envisions the main features that ILX should support, and highlights research
challenges in realizing and supporting ILX.
Training and Professional Development
During the 2021/2022 academic year, Walid has
offered several research training opportunities for graduate students as
graduate-level PhD research and independent study courses that involve various
project-related topics, where several students got training on presenting and
discussing research papers and in conducting a semester-long project. These
students include
1.
Jaewoo Shin (Concurrency Control for Update-tolerant
LSM-based Spatial and B-tee Indexing – resulted in a journal paper submission)
2.
Abdullah
Al Mamun (Learned R-tree Spatial Indexing – resulted in an MDM’22 conference
paper)
3.
Lu Xing
(two projects: Waves of Misery in R-trees – resulted in a VLDB’21 conference
paper, and adaptive tree indexing)
4.
Ruihong Wang (RDMA-based Big Data Systems – resulted in a
VLDB’22 vision paper)
5.
Libin Zhou (two projects: Lock-free Concurrency Control
for Graph Systems – resulted in a conference paper submission, and Distance
Oracles for dynamic query constraints)
6.
Yeasir Rayhan (SIMD-aware
R-tree index – resulted in a conference paper submission).
* Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Date of Last Update: December 10, 2023