III: Small: In-memory, Distributed, and Adaptive Spatio-textual Query Processing

 

Contact Information

Walid G. Aref

Department of Computer Science

Purdue University

305 N. University Street

West Lafayette, Indiana 47907

Phone: (765) 494-1997

Fax : (765) 494-0739

Email: aref@cs.purdue.edu

URL: http://www.cs.purdue.edu/faculty/aref.html

 

This material is based upon work supported by the National Science Foundation under Grant No. III-1815796.

 

Project Award Information

NSF Award Number: III-1815796

Duration: 8/1/2018 -- 7/31/2022

Title: In-memory, Distributed, and Adaptive Spatio-textual Query Processing

PI: Walid G. Aref

Project Web Page:

http://www.cs.purdue.edu/homes/aref/IDAS

 

Project Focus:

The widespread use of GPS-enabled smartphones along with the popularity of microblogging and social networking, e.g., Twitter and Facebook, has resulted in producing large amounts of text data. Typically, this text data, e.g., the tweets, are geo-tagged by the location in which the text data has been produced. Many applications make good use of this stream of geo-tagged text data (also termed spatial-keyword or spatio-textual data), and provide services to users based on the textual and the spatial components of the data. Applications need to process large number of user queries against spatio-textual data. For example, in location-aware ad targeting publish/subscribe systems, it is required to disseminate millions of ads and promotions to millions of users based on the users' locations and textual profiles. This project will address the hurdles that face these applications and their underlying systems in order to function properly.

 

More specifically, this project will address the following research challenges that face spatio-textual servers:

(1) The scalability challenge to support large amounts of spatio-textual data streams and queries

in real-time,

(2) The expressiveness challenge that is exemplified in the lack of mechanisms that adequately express complex spatio-textual queries. Querying capabilities need to match the growing sophistication and complexity of the continuously evolving location services, and

(3) The adaptivity challenge, where systems need to adapt to changes in data distribution over time.

In location services, location-data distribution, users' interests, and hot keyword topics change over time. A scalable spatio-textual server needs to continuously adapt to these changes. The project will address these scalability, expressiveness, and adaptivity challenges when processing large amounts of queries on continuously-streamed spatio-textual data. The project will investigate how to support spatio-textual data and queries as first-class citizens in an in-memory distributed data system. Scalable architectures for handling large amounts of spatio-textual data and continuous queries will be investigated. In contrast to tailored solutions, relational-like spatio-textual building-block operators will be developed to express extended-SQL spatio-textual queries along with costing, algebraic transformation rules, and query optimization techniques. To address scalability and the variation in the workload, adaptive and frequency-aware in-memory distributed indexing and query processing techniques will be developed to dynamically organize and process the continuously-evolving spatio-textual data. The spatio-textual indexing and query processing techniques to be developed will dynamically account for the changes and differences in the frequencies of keywords within the various spatial regions to automatically choose the best spatio-textual data organization that optimizes the system performance.

For further information see the project web page:  https://www.cs.purdue.edu/homes/aref/IDAS

 

2018-2019 Project Activities

2019-2020 Project Activities

2020-2021 Project Activities

2021-2022 Project Activities

 

2018-2019 Project Activities:

During the 2018-2019 Academic year, the first year of this project, we started with two tutorial and survey activities. Along with my former Ph.D. student, we published a monograph on the subject of scalable processing of spatial keyword queries. The monograph was published with the Morgan and Claypool Publishers in the series Synthesis Lectures on Data Management. Also, we published a survey on spatial access methods that appeared in the Springer GeoInformatica Journal early in the year. We developed and prototyped new capabilities for adaptive query processing in our spatial-keyword system, Tornado. This work was reported at the ACM SIGSPATIAL Conference. We developed in-memory distributed spatial query processing and optimization techniques, our LocationSpark system. A detailed version of LocationSpark is currently under review and is available in the ArXiv. We developed the ExplainER prototype system to understand and explain entity resolution classifiers with different granularity levels of explanations. ExplainER was demonstrated at the IEEE ICDE 2019 conference and was awarded the best demo award at the conference. We developed a vision paper reflecting an end-to-end human-centric data cleaning framework that was presented in the ACM HILDA Workshop in July 2019.

Over the Fall 2018 and Spring 2019 semesters, Walid led a group of undergraduate students to conduct undergraduate research in topics related to this project. Details about these research and education activities will be given below. 

This year, Walid gave a keynote speech titled: “The Dos and Don’ts of Location+X Data Management: A Systems Perspective” at the 20th IEEE International Conference on Mobile Data Management in Hong Kong in June 2019. Also, Walid is the 2019 co-chair for the program committee of the 16th International Symposium on Spatial and Temporal Databases that will take place in Vienna later this year. 

 

Scalable Processing of Spatial-Keyword Queries [Monograph 2019]. Within the scope of this project, my former student and I have developed an educational milestone for this project. We wrote a manuscript for scalable query processing techniques for spatial keyword data. This 110+ pages monograph was published with Morgan and Claypool in their series Synthesis Lectures on Data Management. The monograph addresses text data that is associated with location data and that has become ubiquitous. A tweet is an example of this type of data, where the text in a tweet is associated with the location where the tweet has been issued. We use the term spatial-keyword data to refer to this type of data. Spatial-keyword data is being generated at massive scale. Almost all online transactions have an associated spatial-trace. The spatial trace is derived from GPS coordinates, IP addresses, or cell-phone-tower locations. Hundreds of millions or even billions of spatial-keyword objects are being generated daily. Spatial-keyword data has numerous applications that require efficient processing and management of massive amounts of spatial-keyword data. The monograph starts by overviewing some important applications of spatial-keyword data, and demonstrates the scale at which spatial-keyword data is being generated. Then, it formalizes and classifies the various types of queries that execute over spatial-keyword data. Then, it discusses important and desirable properties of spatial-keyword query languages that are needed to express queries over spatial-keyword data. Existing spatial-keyword query languages vary in the types of spatial-keyword queries that they can support. There are many systems that process spatial-keyword queries. Systems differ from each other in various aspects, e.g., whether the system is batch-oriented or stream-based, and whether the system is centralized or distributed. Moreover, spatial-keyword systems vary in the types of queries that they support. Finally, systems vary in the types of indexing techniques that they adopt. The monograph overviews the main spatial-keyword data-management systems (SK-DMS, for short), and classifies them according to their features. Moreover, the monograph describes the main approaches adopted when indexing spatial-keyword data in the centralized and distributed settings. Several case-studies of SK-DMSs are presented along with the applications and query types that these SK-DMSs are targeted for and the indexing techniques they utilize for processing their queries. Optimizing the performance and the query processing of SK-DMSs still has many research challenges and open problems. The monograph concludes with a discussion about several important and open research-problems in the domain of scalable spatial-keyword processing. 

 

Adaptive Processing of Spatial-Keyword Data Over a Distributed Streaming Cluster (SIGSPATIAL 2018). The widespread use of GPS-enabled smartphones along with the popularity of micro-blogging and social networking applications, e.g., Twitter and Facebook, has resulted in the generation of huge streams of geo-tagged textual data. Many applications require real-time processing of these streams. For example, location-based ad targeting systems enable advertisers to register millions of ads to millions of users based on the users’ location and textual profile. Existing streaming systems are either centralized or are not spatial-keyword aware, and hence these systems cannot efficiently support the processing of rapidly arriving spatial-keyword data streams. In this research, we introduce a two-layered indexing scheme for the distributed processing of spatial-keyword data streams. We realize this indexing scheme in Tornado, a distributed spatial-keyword streaming system. The first layer, termed the routing layer, is used to fairly distribute the workload, and furthermore, co-locate the data objects and the corresponding queries at the same processing units. The routing layer uses the Augmented-Grid, a novel structure that is equipped with an efficient search algorithm for distributing the data objects and queries. The second layer, termed the evaluation layer, resides within each processing unit to reduce the processing overhead. The two-layered index adapts to changes in the workload by applying a cost formula that continuously represents the processing overhead at each processing unit. Extensive experimental evaluation using real Twitter data indicates that Tornado achieves high scalability and more than 2x improvement over the baseline approach in terms of the overall system throughput.

 

Spatio-Temporal Access Methods: A Survey (2010 - 2017) (GeoInformatica 2019). The volume of spatio-temporal data is growing at a rapid pace due to advances in location-aware devices, e.g., smartphones, and the popularity of location-based services, e.g., navigation services. A number of spatio-temporal access methods have been proposed to support efficient processing of queries over the spatio-temporal data. Spatio-temporal access methods can be classified according to the type of data being indexed into the following categories: (1) indexes for historical spatio-temporal data, (2) indexes for current and recent spatio-temporal data, (3) indexes for future spatio-temporal data, (4) indexes for past, present, and future spatio-temporal data, (5) indexes for spatio-temporal data with associated textual data, and (6) parallel and distributed spatio-temporal systems and indexes. This survey is Part 3 of our two previous surveys on the same subject that were both published in the IEEE Data Engineering Bulletin. In this survey, we present an overview and a broad classification of the spatio-temporal access methods published between 2010 and 2017.

 

ExplainER: Entity Resolution Explanations (IEEE ICDE 2019 Demo). Entity Resolution is a fundamental data cleaning and integration problem that has received considerable attention in the past few decades. While rule-based methods have been used in many practical scenarios and are often easy to understand, machine-learning-based methods provide the best accuracy. However, the state-of-the-art classifiers are very opaque. There has been some work towards understanding and debugging the early stages of the entity resolution pipeline, e.g. blocking and generating features (similarity scores). However, there are no such efforts for explaining the model or its predictions. In this demo, we propose ExplainER, a tool to understand and explain entity resolution classifiers with different granularity levels of explanations. Using several benchmark datasets, we will demonstrate how ExplainER can handle different scenarios for a variety of classifiers.

 

Towards an End-to-End Human-Centric Data Cleaning Framework [HILDA 2019]. Data Cleaning refers to the process of detecting and fixing errors in the data. Human involvement is instrumental at several stages of this process such as providing rules or validating computed repairs. There is a plethora of data cleaning algorithms addressing a wide range of data errors (e.g., detecting duplicates, violations of integrity constraints, and missing values). Many of these algorithms involve a human in the loop, however, this latter is usually coupled to the underlying cleaning algorithms. In a real data cleaning pipeline, several data cleaning operations are performed using different tools. A high-level reasoning on these tools, when combined to repair the data, has the potential to unlock useful use cases to involve humans in the cleaning process. Additionally, we believe there is an opportunity to benefit from recent advances in active learning methods to minimize the effort humans have to spend to verify data items produced by tools or humans. There is currently no end-to-end data cleaning framework that systematically involves humans in the cleaning pipeline regardless of the underlying cleaning algorithms. In this research, we present opportunities that this framework could offer, and highlight key challenges that need to be addressed to realize this vision. We present a design vision and discuss scenarios that motivate the need for this framework to judiciously assist humans in the cleaning process. 

Undergraduate Research Projects

Undergraduate students are gaining research and development as well as systems-oriented experience in the context of this project. One Ph.D. student is gaining training in conducted systems-oriented research. 

In the context of this project, Walid has offered research opportunities during the Fall 2018 and Spring 2019 for the following eight undergraduate students:

Daniel Hu, Daksh Jotwani, Piyush Juneja (two semesters), Aaron Nuestedter (two semesters), Harsh Patel, Peyton Puckett, Logesh Ramadoss, and Ruoyu Song.

The students worked in four groups: 

-      Group 1: Daniel Hu and Peyton Puckett worked in the Tornado distributed in-memory spatial keyword system along with a Ph.D. student from Walid’s group to develop a running prototype of a demo system for Tornado. This involved introducing Kafka between the front-end interfacing client of the system and the back-end Apache Storm distributed streaming engine.

-      Group 2: Piyush Juneja, Aaron Nuestedter, and Harsh Patel worked in the LIMO system along with a Ph.D. student from Walid’s group to develop various aspects of LIMO. They introduced testing capabilities and enforced coding conventions for Tornado, added publicly available elevation data to the maps in the LIMO system, and developed video documentation on how to use the LIMO system. Piyush later worked in developing a generalized spatial-keyword index inside of PostgreSQL.

-      Group 3: Ruoyu Song worked with two graduate students from Walid’s group, A. Mamun and M. Hassan to add new graph functions into the GRFusion system.

-      Group 4: Daksh Jotwani (in Fall 2018) and Logesh Ramadoss (in Spring 2019) working with one Ph.D. from Walid’s group to realize an ML-based Spatial Index.

To ensure continuity, a graduate student, funded under this project, followed up along with Walid the work of each the four groups on weekly basis, and has been continuing this research in the hopes to finish and publish this work. 

The students maintained all their project-related activities on GitHub under PurdueDB, where they gained excellent experience working on relatively large systems-oriented code bases.

In Fall 2018, Walid offered a graduate-level seminar course on various project-related topics, where five students got training on presenting and discussing research papers and in conducting a semester-long project.

 

2019-2020 Project Activities:

In 2019-2020, we have studied several aspects of in-memory, distributed, and adaptive spatio-textual query processing systems. These include spatio-textual data analytics [SIGMOD 2020 Demo], on-line data partitioning [SIGMOD 2020 full paper], ML-based on-line data partitioning using Reinforcement Learning [aiDM 2020], a distributed system for spatio-textual stream processing [PVLDB 2020], shared execution data analytics to address scalability [SSDBM 2020], Grid/tree augmentation data structures for efficient spatial query processing [ACM SIGSPATIAL 2020], and trend discovery in micro-blogs over various space and time resolutions [GeoInformatica 2019]. Walid participated in a Dagstuhl Seminar in December 2020.

Over the Fall 2019 and Spring 2020 semesters Walid led a group of undergraduate students to conduct undergraduate research in topics related to this project. Walid graduated one Ph.D. student in the topic of “Attack-Resilient Adaptive Load-Balancing in Distributed Spatial Data Streaming Systems”, and trained six other Ph.D. students in project-related research.

Details about these research and education activities will be given below. 

An Investigation of Grid-enabled Tree Indexes for Spatial Query Processing. [ACM SIGSPATIAL 2019] Two-dimensional tree-based spatial indexes (e.g., the quadtree or the k-d tree) are commonly used for indexing spatial data. However, both types of spatial indexes have limitations. Although two-dimensional trees can handle skewed data, index traversal and tree maintenance can be expensive. In contrast, spatial grids have low update overhead, but is not suitable for skewed data. In this research, we investigate the augmentation of a grid into tree-based indexing for spatial query processing. For this purpose, we introduce the Grid-Enabled Tree index (GE-Tree, for short); a hybrid spatial tree structure that augments a spatial grid to two-dimensional tree indexes. In particular, we investigate the use of a grid at the leaf level of a two-dimensional tree to facilitate tree navigation and maintenance. 

STAR: A Distributed Stream Warehouse System for Spatial Data. [SIGMOD 2020 Demo] Mobile and location-based services produce massive streams of spatial and textual data. In order to enable spatial and textual data analytics, spatial and textual data need to be streamed into a data stream warehouse system that can provide real-time analytics over the incoming spatial and textual data in the warehouse. A spatial data stream warehouse system (DSWS) should be able to efficiently ingest the incoming data, and enable online analytical processing (OLAP) over the streamed data. Existing DSWSs are not tailored for spatial data. In this research, we introduce STAR; a spatial data stream warehouse system. STAR is a distributed in-memory data stream warehouse system that provides low-latency and up-to-date analytical results over a fast-arriving spatial data stream. STAR supports both snapshot and continuous queries that are composed of algebraic or holistic aggregate functions and ad hoc query constraints over spatial, textual, and temporal data attributes. STAR implements an effective view materialization algorithm, and adopts a memory-efficient framework that facilitates the processing of streamed data and the maintenance of materialized views. 

Prompt: Dynamic Data-Partitioning for Distributed Micro-batch Stream Processing Systems. [SIGMOD 2020] Advances in real-world applications require high-throughput processing over large data streams. Micro-batching has been proposed to support the needs of these applications. In micro-batching, the processing and batching of the data are interleaved, where the incoming data tuples are first buffered as data blocks, and then are processed collectively using par- allel function constructs (e.g., Map-Reduce). The size of a micro-batch is set to guarantee a certain response-time latency that is to conform to the application’s service-level agreement. In contrast to tuple-at-a-time data stream processing, micro-batching has the potential to sustain higher data rates. However, existing micro-batch stream processing systems use basic data-partitioning techniques that do not account for data skew and variable data rates. Load-awareness is necessary to maintain performance and to enhance resource utilization. A new data partitioning scheme, termed Prompt is presented that leverages the characteristics of the micro-batch processing model. In the batching phase, a frequency-aware buffering mechanism is introduced that progressively maintains run-time statistics, and provides on-line key-based sorting as data tuples arrive. Because achieving optimal data partitioning is NP-Hard in this context, a workload-aware greedy algorithm is introduced that partitions the buffered data tuples efficiently for the Map stage. In the processing phase, a load-aware distribution mechanism is presented that balances the size of the input to the Reduce stage without incurring inter-task communication overhead. Moreover, Prompt elastically adapts resource consumption according to workload changes. 

PartLy: Learning Data Partitioning for Distributed Data Stream Processing [aiDM 2020] In this research, we expand on our well-optimized Prompt online data partitioning mechanism by using reinforcement learning to learn proper data partitioning in a micro-batched data streaming setup. We realize that data partitioning plays a critical role in data stream processing. Current data partitioning techniques use simple, static heuristics that do not incorporate feedback about the quality of the partitioning decision (i.e., fire and forget strategy). Hence, the data partitioner often repeatedly chooses the same decision. In this paper, we argue that reinforcement learning techniques can be applied to address this problem. The use of artificial neural networks can facilitate learning of efficient partitioning policies. We identify the challenges that emerge when applying machine learning techniques to the data partitioning problem for distributed data stream processing. Furthermore, we introduce PartLy, a proof-of-concept data partitioner, and present preliminary results that indicate PartLy’s potential to match the performance of state-of-the-art techniques in terms of partitioning quality, while minimizing storage and processing overheads. 

SSTD: A Distributed System on Streaming Spatio-Textual Data [PVLDB 2020]Streaming spatio-textual data that contains geolocations and textual contents, e.g., geo-tagged tweets, is becoming increasingly available. Users can register continuous queries to receive up-to-date results continuously, or pose snapshot queries to receive results instantly. The large scale of spatio-textual data streams and huge amounts of queries pose great challenges to the current location-based services, and call for more efficient data management systems. In this research, we realize SSTD Streaming Spatio-Textual Data, a distributed in-memory system supporting both continuous and snapshot queries with spatial, textual, and temporal constraints over data streams. Compared with existing distributed data stream management systems, SSTD has three novel aspects: (1) It supports many types of queries over streamed  spatio-textual data; (2) SSTD adopts a new workload partitioning method, termed QT (Quad-Text) tree, that utilizes the joint distribution of queries and spatio-textual data to reduce query latency and enhance system throughput. (3) To achieve load balance and robustness, we develop three new workload adjustment methods for SSTD to fit the changes in the distributions of data or queries. Extensive experiments on real-life datasets demonstrate the superior performance of SSTD.

Local Trend Discovery on Real-time Microblogs with Uncertain Locations in Tight Memory Environments. [GeoInformatica 2019] In this research, we developed GeoTrend+; a system approach to support scalable local trend discovery on recent microblogs, e.g., tweets, comments, online reviews, and check-ins, that come in real time. GeoTrend+ discovers top-k trending keywords in arbitrary spatial regions from recent microblogs that continuously arrive with high rates and a significant portion has uncertain geolocations. GeoTrend+ distinguishes itself from existing techniques in different aspects: (1) Discovering trends in arbitrary spatial regions, e.g., city blocks. (2) Considering both exact geolocations, e.g., accurate latitude/longitude coordinates, and uncertain geolocations, e.g., district-level or city-level, that represents a significant portion of past years microblogs. (3) Promoting recent microblogs as first-class citizens and optimizes different components to digest a continuous flow of fast data in main-memory while removing old data efficiently. (4) Providing various main-memory optimization techniques that are able to distinguish useful from useless data to effectively utilize tight memory resources while maintaining accurate query results on relatively large amounts of data. (5) Supporting various trending measures that effectively capture trending items under a variety of definitions that suit different applications. GeoTrend+ limits its scope to real-time data that is posted during the last T time units. To support its queries efficiently, GeoTrend+ employs an in-memory spatial index that is able to efficiently digest incoming data and expire data that is beyond the last T time units. The index also materializes top-k keywords in different spatial regions so that incoming queries can be processed with low latency. In peak times, the main-memory optimization techniques are employed to shed less important data to sustain high query accuracy with limited memory resources. Experimental results based on real data and queries show the scalability of GeoTrend+ to support high arrival rates and low query response time, and at least 90+% query accuracy even under limited memory resources.

Shared Execution Techniques for Business Data Analytics over Big Data Streams. [SSDBM 2020] Data Analytics require processing of large numbers of data streams and create materialized views in order to provide near real-time answers to user queries. Materializing the view of each query and refreshing it continuously as a separate query execution plan is not efficient and is not scalable. In this research, we present a global query execution plan to simultaneously support multiple queries, and minimize the number of input scans, operators, and tuples flowing between the operators. We propose shared-execution techniques for creating and maintaining materialized views in support of business data analytics queries as an example. We utilize commonalities in multiple business data analytics queries to support scalable and efficient processing of big data streams. We analyze the cost and elasticity of various shared-execution query-processing techniques in a distributed environment. The paper highlights shared execution techniques for select predicates, group, and aggregate calculations. We present how global query execution plans are run in a distributed stream processing system that is built on top of cluster-based data streaming engine. 

Training and Professional Development

1. Undergraduate Research Opportunities

In the context of this project, Walid has offered research opportunities during the Fall 2019 and Spring 2020 for the following five undergraduate students (Two female and three male students):

 

Piyush Juneja (Graph Partitioning Algorithms), Nameer A. Qureshi (Big Spatial Data Systems), Dhanushikka Ravichandiran (Database Concurrency Control Algorithms), Hao Wu (ML-based Learned Data Indexing), Shotobhisha Sinha Ray (Big Data Systems).

 

The students worked in four groups:

 

- Group 1: Piyush Juneja worked in surveying graph partitioning algorithms.

 

- Group 2: Nameer Qureshi developed a web site for all spatio-temporal access methods and their corresponding published papers.

 

- Group 3: Shotobhisha Sinha Ray and Dhanushikka Ravichandiran studied and surveyed various Database Concurrency Control Algorithms.

 

- Group 4: Hao We surveyed the topic of learned indexes with focus on the multi-dimensional case.

 

To ensure continuity, graduate students from Walid’s group along with Walid, followed up the work of each of the four groups on weekly basis (one graduate student per group), and has been continuing this research in the hopes to finish and publish this work.

 

Of mention is the survey on multi-dimensional learned indexes which has been accepted as a tutorial in the ACM SIGSPATIAL 2020 Conference. Both Hao Wu (Undergraduate) and Abdullah Al Mamun (Ph.D. student) will be presenting this tutorial along with Walid at the conference.

 

Graduate Research Opportunities:

 

For graduate students, in Fall 2019 and Spring 2020, Walid has offered several research training opportunities as graduate-level PhD research and independent study courses that involve various project-related topics, where several students got training on presenting and discussing research papers and in conducting a semester-long project. These students include:

·      Amira Mamoun (Dataset Discovery Techniques)

·      Jaewoo Shin (Update-tolerant LSM-based Spatial Indexing)

·      Abdullah Al Mamun (Learned Spatial Indexing)

·      Ahmed Abdelhamid (Intelligent Data Partitioning in Micro-batched Big Data Streaming Systems)

·      Lu Xing (Concurrency Control and Query Optimization in Graph Data Management Systems)

·      Ruihong Wang (RDMA-based Big Data Systems)

·      Serkan Uzunbaz (Shared Execution for Data Analytics over Big Data Streams)

 

 

Walid has graduated one Ph.D. student, Anas Daghistani, co-advised with Professor Arif Ghafoor (ECE) in the topic of “Attack-Resilient Adaptive Load-Balancing in Distributed Spatial Data Streaming Systems”, in the context of this project. This research is currently under review for journal publication.

 

 

2020-2021 Project Activities:

 

In 2020-2021, we had several research and education activities in the context of this project. We have addressed the scalability challenge in real-time location services by developing SWARM, a light-weight adaptivity protocol that continuously monitors the data and query workloads across the distributed processes of the spatial data streaming system, and redistribute and rebalance the workloads soon as performance bottlenecks get detected. SWARM has been published in the ACM Transactions on Spatial Algorithms and Systems [ACM TSAS 2021]. We have investigated the adaptation of Log Structured Merge trees to support frequent updates in moving-object spatial databases that has been published in the IEEE International Conference on Data Engineering [ICDE 2021]. Along with undergraduate and graduate students, we presented a tutorial on the subject of learned multi-dimensional indexes in the 2020 ACM SIGSPATIAL Conference [ACM SIGSPATIAL 2020a]. We developed an online workload estimation technique that relies on a probabilistic model for estimating the workload of partitions and machines in a distributed spatial data streaming system [ACM SIGSPATIAL 2020b]. We developed an unbiased online sampling for the visual exploration of large spatiotemporal data that was published in the IEEE Visualization Conference [VAST 2020]. We have developed LocationSpark, a query executor, and an optimizer based on Spark to improve the query execution plan generated for spatial queries. The design and performance of LocationSpark were published in the Frontiers in Big Data [Frontiers 2020].

 

Walid introduced a project-based research component in the graduate-level database systems course at Purdue that has benefitted over 35 students. He mentored an undergraduate student to conduct undergraduate research in topics related to this project. In 2020/2021 Walid graduated one Ph.D. student in the topic of “Efficient Distributed Processing over Micro-Batched Data Streams”, and trained seven other Ph.D. students in project-related research.

 

Below, we highlight each of these contributions and other project’s research, education, and training activities.

 

SWARM: Adaptive Load Balancing in Distributed Streaming Systems for Big Spatial Data. [ACM TSAS 2021] As clearly from the scale of location services, their ubiquity, and the massive amounts of spatial data being generated in real-time, the current scale of spatial data cannot be handled using centralized systems. This has led to the development of distributed spatial streaming systems. Existing systems use static spatial partitioning to distribute the workload. In contrast, the real-time streamed spatial data follows non-uniform spatial distributions that are continuously changing over time. Distributed spatial streaming systems need to react to the changes in the distribution of spatial data and queries. This research introduces SWARM, a light-weight adaptivity protocol that continuously monitors the data and query workloads across the distributed processes of the spatial data streaming system, and redistribute and rebalance the workloads soon as performance bottlenecks get detected. SWARM is able to handle multiple query-execution and data-persistence models. A distributed streaming system can directly use SWARM to adaptively rebalance the system's workload among its machines with minimal changes to the original code of the underlying spatial application. Extensive experimental evaluation using real and synthetic datasets illustrate that, on average, SWARM achieves 200% improvement over a static grid partitioning that is determined based on observing a limited history of the data and query workloads. Moreover, SWARM reduces execution latency on average 4x compared with other existing techniques.

 

Scalable Relational Query Processing on Big Matrix Data [Submitted 2021] The use of large-scale machine learning methods is becoming ubiquitous in many applications ranging from business intelligence to self-driving cars. These methods require a complex computation pipeline consisting of various types of operations, e.g., relational operations for pre-processing or post-processing the dataset, and matrix operations for core model computations. Many existing systems focus on efficiently processing matrix-only operations, and assume that the inputs to the relational operators are already pre-computed and are materialized as intermediate matrices. However, the input to a relational operator may be complex in machine learning pipelines, and may involve various combinations of matrix operators. Hence, it is critical to realize scalable and efficient relational query processors that directly operate on big matrix data. This research develops new efficient and scalable relational query processing techniques on big matrix data for in- memory distributed clusters. The proposed techniques leverage algebraic transformation rules to rewrite query execution plans into ones with lower computation costs. A distributed query plan optimizer exploits the sparsity-inducing property of merge functions as well as Bloom join strategies for efficiently evaluating various flavors of the join operation. Furthermore, optimized partitioning schemes for the input matrices are developed to facilitate the performance of join operations based on a cost model that minimizes the communication overhead. The proposed relational query processing techniques are prototyped in Apache Spark. Experiments on both real and synthetic data demonstrate that the proposed techniques achieve up to two orders of magnitude performance improvement over state-of-the-art systems on a wide range of applications.


The LSM RUM-Tree: A Log Structured Merge R-Tree for Update-intensive Spatial Workloads [ICDE 2021] Many applications require update-intensive workloads on spatial objects, e.g., social-network services and shared-riding services that track moving objects (devices). By buffering insert and delete operations in memory, the Log Structured Merge Tree (LSM) has been used widely in various systems because of its ability to handle insert-intensive workloads. While the focus on LSM has been on key-value stores and their optimizations, there is a need to study how to efficiently support LSM-based secondary indexes. We investigate the augmentation of a main-memory-based memo structure into an LSM secondary index structure to handle update-intensive workloads efficiently. We conduct this study in the context of an R-tree-based secondary index. In particular, we introduce the LSM RUM-tree that demonstrates the use of an Update Memo in an LSM-based R-tree to enhance the performance of the R-tree’s insert, delete, update, and search operations. The LSM RUM-tree introduces novel strategies to reduce the size of the Update Memo to be a light-weight in-memory structure that is suitable for handling update-intensive workloads without introducing significant overhead.

 

A Tutorial on Learned Multi-dimensional Indexes [ACM SIGSPATIAL 2020] Recently, Machine Learning (ML, for short) has been successfully applied to database indexing. Initial experimentation on Learned Indexes has demonstrated better search performance and lower space requirements than their traditional database counterparts. Numerous attempts have been explored to extend learned indexes to the multi-dimensional space. This makes learned indexes potentially suitable for spatial databases. The goal of this tutorial is to provide up-to-date coverage of learned indexes both in the single and multi-dimensional spaces. The tutorial covers over 25 learned indexes. The tutorial navigates through the space of learned indexes through a taxonomy that helps classify the covered learned indexes both in the single and multi-dimensional spaces.

 

TrioStat: Online Workload Estimation in Distributed Spatial Data Streaming Systems [ACM SIGSPATIAL 2020] The wide spread of GPS-enabled devices and the Internet of Things (IoT) has increased the amount of spatial data being generated every second. The current scale of spatial data cannot be handled using centralized systems. This has led to the development of distributed spatial data streaming systems that scale to process in real-time large amounts of streamed spatial data. The performance of distributed streaming systems relies on how even the workload is distributed among their machines. However, it is challenging to estimate the workload of each machine because spatial data and query streams are skewed and rapidly change with time and users’ interests. Moreover, a distributed spatial streaming system often does not maintain a global system workload state because it requires high network and processing overheads to be collected from the machines in the system. In this research, we introduce TrioStat; an online workload estimation technique that relies on a probabilistic model for estimating the workload of partitions and machines in a distributed spatial data streaming system. It is infeasible to collect and exchange statistics with a centralized unit because it requires high network overhead. Instead, TrioStat uses a decentralized technique to collect and maintain the required statistics in real-time locally in each machine. TrioStat enables distributed spatial data streaming systems to com- pare the workloads of machines as well as the workloads of data partitions. TrioStat requires minimal network and storage overhead. Moreover, the required storage is distributed across the system’s machines.

 

STULL: Unbiased Online Sampling for Visual Exploration of Large Spatiotemporal Data [VAST 2020] Online sampling-supported visual analytics is increasingly important, as it allows users to explore large datasets with acceptable approximate answers at interactive rates. However, existing online spatiotemporal sampling techniques are often biased, as most re- searchers have primarily focused on reducing computational latency. Biased sampling approaches select data with unequal probabilities and produce results that do not match the exact data distribution, leading end users to incorrect interpretations. In this research, we propose a novel approach to perform unbiased online sampling of large spatiotemporal data. The proposed approach ensures the same prob- ability of selection to every point that qualifies the specifications of a user’s multidimensional query. To achieve unbiased sampling for accurate representative interactive visualizations, we design a novel data index and an associated sample retrieval plan. Our proposed sampling approach is suitable for a wide variety of visual analytics tasks, e.g., tasks that run aggregate queries of spatiotemporal data. Extensive experiments confirm the superiority of our approach over a state-of-the-art spatial online sampling technique, demonstrating that within the same computational time, data samples generated in our approach are at least 50% more accurate in representing the actual spatial distribution of the data and enable approximate visualizations to present closer visual appearances to the exact ones.

 

LocationSpark: In-memory Distributed Spatial Query Processing and Optimization [Frontiers 2020] Due to the ubiquity of spatial data applications and the large amounts of spatial data that these applications generate and process, there is a pressing need for scalable spatial query processing. In this research, we develop new techniques for spatial query processing and optimization in an in-memory and distributed setup to address scalability. More specifically, we introduce new techniques for handling query skew that commonly happens in practice, and minimizes communication costs accordingly. We propose a distributed query scheduler that uses a new cost model to minimize the cost of spatial query processing. The scheduler generates query execution plans that minimize the effect of query skew. The query scheduler utilizes new spatial indexing techniques based on bitmap filters to forward queries to the appropriate local nodes. Each local computation node is responsible for optimizing and selecting its best local query execution plan based on the indexes and the nature of the spatial queries in that node. All the proposed spatial query processing and optimization techniques are prototyped inside Spark, a distributed memory-based computation system. Our prototype system is termed LocationSpark. The experimental study is based on real datasets and demonstrates that LocationSpark can enhance distributed spatial query processing by up to an order of magnitude over existing in-memory and distributed spatial systems.

 

Training and Professional Development

One Ph.D. student, Ahmed S. Abdelhamid, has graduated who was partially funded under this project. His dissertation topic is: Efficient Distributed Processing over Micro-Batched Data Streams. Student Nameer Qureshi took a graduate independent study course on topics related to this project.

 

In Fall 2020, I revised the project component of the graduate-level database systems course (CS541) to allow for new research opportunities and training of graduate students be semester-long group research projects of three students each. Each group agrees on a project from a list of potential projects that I provided to them (List is given below). Projects vary in nature to match the various orientations of the students. Some projects are survey-oriented while others are research-oriented with heavy programming components. Projects cover a broad spectrum of research topics. 39 students benefitted from this opportunity and formed 13 group projects that met with Walid on weekly and biweekly bases. Two of these projects have resulted in submitted conference papers that are under review, and another paper is currently being prepared. I had some collaborators from industrial companies, e.g., Google, Uber, and Facebook, to help collaborate in the projects, and provide the students an industrial twist. Student feedback was positive as many of them appreciated this research experience and the close follow-up and feedback I provided on their biweekly or weekly progressions. 

 

Projects provided included:  

(1) A survey of 50 years of database models

(2) Studying and comparing the performance of the disk page and record layouts of PostgreSQL, MySQL, and SQLite

(3) Studying and comparing the performance of the table directory structures and free-space management for PostgreSQL, MySQL, and SQLite

(4) Surveying of SSD and persistent memory storage technologies, their properties, and how they impact on query processing techniques, index design, concurrency, and recovery

(5) Query compilation techniques for graph data systems

(6) Handling bulk-loading anomalies in spatial indexing techniques (is leading to a conference paper publication)

(7) Realizing an updatable compressed bitmap index

(8) Implementing different concurrency control protocols in RDMA and evaluating their tradeoffs

(9) Studying RDMA-based distributed transaction management and realizing an RDMA-based two phase commit protocol,

(10) Surveying multi-model database system techniques

(11) Studying and comparing techniques for supporting high-dimensional vector databases

(12) Decoupling the storage engine from the compute engine in MySQL

(13) Realizing a new algorithm for answering k-nearest-neighbor queries based on a newly published external merge sort in spatial data systems (is leading to a conference publication)

(14) Designing and implementing various memory-based indexing techniques over RDMA that support concurrency.

 

There were 39 students in class that formed 13 groups. I met biweekly with every group for 30 minutes each over Zoom. Some groups wanted to be more productive by meeting weekly in contrast to biweekly and I accommodated that. The students demonstrated their weekly progress in the project in the form of slide-show presentations that the students have prepared for every meeting, and/or if need be, showed program code or performance studies and results.

 

For graduate students, in Fall 2020 and Spring 2021, Walid has offered several research training opportunities as graduate-level PhD research and independent study courses that involve various project-related topics, where several students got training on presenting and discussing research papers and in conducting a semester-long project. These students include

1.     Jaewoo Shin (Update-tolerant LSM-based Spatial Indexing – resulted in an ICDE 2021 paper)

2.     Abdullah Al Mamun (Learned Spatial Indexing – resulted in a conference paper submission)

3.     Ahmed Abdelhamid (Intelligent Data Partitioning in Micro-batched Big Data Streaming Systems – resulted in a conference paper submission)

4.     Lu Xing (two projects: Concurrency Control in Graph Data Management Systems, and Waves of Misery in R-trees – resulted in a conference paper submission)

5.     Ruihong Wang (RDMA-based Big Data Systems – resulted in a conference paper submission)

6.     Libin Zhou (Distance Oracles for dynamic query constraints)

7.     Yeasir Rayhan (Learned spatial data partitioning and vectorized spatial query processing).

 

Undergraduate Research Opportunities: In the context of this project, Walid has offered research opportunities during Summer and Fall 2020 for undergraduate students. Most notably is Undergraduate Student Hao Wu who was interested in the topic of ML-based Learned Multidimensional Data Indexing. His research in the topic materialized into a co-authored tutorial in the ACM SIGSPATIAL 2020 Conference, where co-authors Hao Wu and Graduate Student Abdullah Al-Mamun participated in the presentation of the tutorial at the conference. A survey article on the topic is currently underway.

 

2021-2022 Project Activities

 

In 2021/2022, the final year of the project, we investigated supporting data analytics over fast arriving spatial data streams. We developed STAR, a distributed in-memory data stream warehouse system, that provides low-latency and up-to-date analytical results over a fast-arriving spatial data stream. STAR was published in the 2021 ACM SIGSPATIAL Conference and was selected as one of the best papers in the Conference, and was invited in an extended form for journal publication, and was accepted in the ACM Transactions in Spatial Algorithms and Systems. We developed Guard, a component that detects and blocks attacks that target the adaptive load balancing of distributed streaming systems. Guard uses an unsupervised machine-learning technique to detect malicious users that are involved in the attack. Guard was published in the IEEE Transactions on Dependable and Secure Computing. We investigated the presence of non-deterministic performance and “waves of misery” in update-intensive workloads over several R-tree variants, and studied how to mitigate this issue. This research was published in Proceedings of the VLDB 2022. We studied learned multi-dimensional indexes in the context of an instance-optimized R-tree that was published in the IEEE Mobile Data Management Conference 2022. We envisioned how to design location data systems that have location as first-class data type and not as an afterthought. Finally, we studied the impact of new hardware architectures on the design of database systems and separating memory from compute in a disaggregated setup would impact database system techniques. Our vision paper was published in the Proceedings of the VLDB 2022.

 

Below, we highlight each of these contributions and other project’s research, education, and training activities.

 

STAR: A Cache-based Stream Warehouse System for Spatial Data [SIGSPATIAL'2021], [ACM TSAS 2023] The proliferation of mobile phones and location-based services has given rise to an explosive growth in spatial data. In order to enable spatial data analytics, spatial data needs to be streamed into a data stream warehouse system that can provide real-time analytical results over the most recent and historical spatial data in the warehouse. Existing data stream warehouse systems are not tailored for spatial data. In this paper, we introduce the STAR system. STAR is a distributed in-memory data stream warehouse system that provides low-latency and up-to-date analytical results over a fast-arriving spatial data stream. STAR supports both snapshot and continuous queries that are composed of aggregate functions and ad hoc query constraints over spatial, textual, and temporal data attributes. STAR implements a cache-based mechanism to facilitate the processing of snapshot queries that collectively utilizes the techniques of query-based caching (i.e., view materialization) and object-based caching. Moreover, to speed-up processing continuous queries, STAR proposes a novel index structure that achieves high efficiency in both object checking and result updating. Extensive experiments over real data sets demonstrate the superior performance of STAR over existing systems.

 

Guard: Attack-Resilient Adaptive Load Balancing in Distributed Streaming Systems [IEEE TDSC 2021] The performance of distributed streaming systems relies on how even the workload is distributed among their machines. However, data and query workloads are skewed and change rapidly. Therefore, multiple adaptive load-balancing mechanisms have been proposed in the literature to rebalance distributed streaming systems according to the changes in their workloads. This paper introduces a novel attack model that targets adaptive load-balancing mechanisms of distributed streaming systems. The attack reduces the throughput and the availability of the system by making it stay in a continuous state of rebalancing. This paper proposes Guard, a component that detects and blocks attacks that target the adaptive load balancing of distributed streaming systems. Guard uses an unsupervised machine-learning technique to detect malicious users that are involved in the attack. Guard does not block any user unless it detects that the user is malicious. Guard does not depend on a specific application. Experimental evaluation for a high-intensity attack illustrates that Guard improves the throughput and the availability of the system by 85% and 86%, respectively. Moreover, Guard improves the minimum availability that the attacker achieves by 325%.

 

An Experimental Evaluation and Investigation of Waves of Misery in R-trees [VLDB 2021] Waves of misery is a phenomenon where spikes of many node splits occur over short periods of time in tree indexes. Waves of misery negatively affect the performance of tree indexes in insertion-heavy workloads. Waves of misery have been first observed in the context of the B-tree, where these waves cause unpredictable index performance. In particular, the performance of search and index-update operations deteriorate when a wave of misery takes place, but is more predictable between the waves. This paper investigates the presence or lack of waves of misery in several R-tree variants, and studies the extent of which these waves impact the performance of each variant. Interestingly, although having poorer query performance, the Linear and Quadratic R-trees are found to be more resilient to waves of misery than both the Hilbert and R*-trees. This paper presents several techniques to reduce the impact in performance of the waves of misery for the Hilbert and R*-trees. One way to eliminate waves of misery is to force node splits to take place at regular times before nodes become full to achieve deterministic performance. The other way is that upon splitting a node, do not split it evenly but rather at different node utilization factors. This allows leaf nodes not to fill at the same pace. We study the impact of two new techniques to mitigate waves of misery after the tree index has been constructed, namely Regular Elective Splits (RES, for short) and Unequal Random Splits (URS, for short). Our experimental investigation highlights the trade-offs in performance of the introduced techniques and the pros and cons of each technique.

 

The “AI+R”-tree: An Instance-optimized R-tree [MDM 2022] The emerging class of instance-optimized systems has shown potential to achieve high performance by specializing to a specific data and query workloads. Particularly, Machine Learning (ML) techniques have been applied successfully to build various instance-optimized components (e.g., learned indexes). This paper investigates to leverage ML techniques to enhance the performance of spatial indexes, particularly the R-tree, for a given data and query workloads. As the areas covered by the R-tree index nodes overlap in space, upon searching for a specific point in space, multiple paths from root to leaf may potentially be explored. In the worst case, the entire R-tree could be searched. In this paper, we define and use the overlap ratio to quantify the de- gree of extraneous leaf node accesses required by a range query. The goal is to enhance the query performance of a traditional R-tree for high-overlap range queries as they tend to incur long running-times. We introduce a new AI-tree that transforms the search operation of an R-tree into a multi-label classification task to exclude the extraneous leaf node accesses. Then, we augment a traditional R-tree to the AI-tree to form a hybrid “AI+R”- tree. The “AI+R”-tree can automatically differentiate between the high- and low-overlap queries using a learned model. Thus, the “AI+R”-tree processes high-overlap queries using the AI- tree, and the low-overlap queries using the R-tree. Experiments on real datasets demonstrate that the “AI+R”-tree can enhance the query performance over a traditional R-tree by up to 500%.

 

The Case for Distributed Shared-Memory Databases with RDMA-Enabled Memory Disaggregation [VLDB’22] Memory disaggregation (MD) allows for scalable and elastic data center design by separating compute (CPU) from memory. With MD, compute and memory are no longer coupled into the same server box. Instead, they are connected to each other via ultra-fast networking such as RDMA. MD can bring many advantages, e.g., higher memory utilization, better independent scaling (of compute and memory), and lower cost of ownership. This paper makes the case that MD can fuel the next wave of innovation on database systems. We observe that MD revives the great debate of "shared what" in the database community. We envision that distributed shared- memory databases (DSM-DB, for short) – that have not received much attention before – can be promising in the future with MD. We present a list of challenges and opportunities that can inspire next steps in system design making the case for DSM-DB.

 

ILX: Intelligent "Location+X" Data Systems (Vision Paper) Due to the ubiquity of mobile phones and location-detection devices, location data is being generated in very large volumes. Queries and operations that are performed on location data warrant the use of database systems. Despite that, location data is being supported in data systems as an afterthought. Typically, relational or NoSQL data systems that are mostly designed with non-location data in mind get extended with spatial or spatiotemporal indexes, some query op- erators, and higher level syntactic sugar in order to support location data. The ubiquity of location data and location data services call for systems that are solely designed and optimized for the efficient support of location data. This paper envisions designing intelligent location+X data systems, ILX for short, where location is treated as a first-class citizen type. ILX is tailored with location data as the main data type (location-first). Because location data is typically augmented with other data types X, e.g., graphs, text data, click streams, annotations, etc., ILX needs to be extensible to support other data types X along with location. This paper envisions the main features that ILX should support, and highlights research challenges in realizing and supporting ILX.

 

Training and Professional Development

 

During the 2021/2022 academic year, Walid has offered several research training opportunities for graduate students as graduate-level PhD research and independent study courses that involve various project-related topics, where several students got training on presenting and discussing research papers and in conducting a semester-long project. These students include

1.     Jaewoo Shin (Concurrency Control for Update-tolerant LSM-based Spatial and B-tee Indexing – resulted in a journal paper submission)

2.     Abdullah Al Mamun (Learned R-tree Spatial Indexing – resulted in an MDM’22 conference paper)

3.     Lu Xing (two projects: Waves of Misery in R-trees – resulted in a VLDB’21 conference paper, and adaptive tree indexing)

4.     Ruihong Wang (RDMA-based Big Data Systems – resulted in a VLDB’22 vision paper)

5.     Libin Zhou (two projects: Lock-free Concurrency Control for Graph Systems – resulted in a conference paper submission, and Distance Oracles for dynamic query constraints)

6.     Yeasir Rayhan (SIMD-aware R-tree index – resulted in a conference paper submission).

 

* Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

 

Date of Last Update: December 10, 2023