NSF III: Small: On the Conceptual Evaluation and Optimization of Queries in Spatiotemporal Data Systems

 

Walid G. Aref

Principal Investigator

Department of Computer Science
Purdue University

Contact Information

Walid G. Aref

Department of Computer Science

Purdue University

305 N. University Street

West Lafayette, Indiana 47907

Phone: (765) 494-1997
Fax: (765) 494-0739
Email: aref@cs.purdue.edu
URL: http://www.cs.purdue.edu/faculty/aref.html

Project Award Information

Project Web Page:
http://www.cs.purdue.edu/homes/aref/dbsystems2014-3.htm

Project Focus:

The widespread use of smartphones, social networks, and location devices has given rise to a wide spectrum of location-based applications and services. Along with the advancement and ubiquity of these applications and services, users' queries are steadily growing in complexity and sophistication. Current location-based query systems offer shallow query interfaces that do not match users' expectations. This project identifies and addresses the challenges in correctly expressing and efficiently executing complex location-based queries. A user's query can be arbitrarily complex, combining multiple location-based, relational, and textual operations, e.g., location-based selections, spatial joins, relational selections, relational joins, and relational group-by's. Current location-based query systems support only one location-based operator at a time, e.g., range queries or nearby queries (k-Nearest-Neighbors, or kNN, for short). Our focus is to develop a query system, termed the multi-predicate location-based query processing system, that supports any combination of location-based, relational, and textual operations.

 

Progress:

In the initial phase of this project, we identified location-based application scenarios and their associated queries that current state-of-the-art location-based query systems cannot handle. We isolated and classified the reasons these queries are hard to handle. The first class of such queries involves two k-nearest-neighbor operators; we term this class the 2-kNN queries. Many application scenarios call for this type of query, and we highlight several examples in our [VLDB2012] paper. Several challenges arise when handling 2-kNN queries, including:

(1) The result of a 2-kNN query differs depending on the order in which the two kNN operations are executed (see the sketch after this list).

(2) How to determine what the correct answer for a 2-kNN query should be, and on what basis?

(3) How to efficiently execute a 2-kNN query while preserving correctness?

(4) How to adapt well-known relational query optimization strategies (e.g., pushing selects under joins in query evaluation plans) so that they remain applicable in the context of 2-kNN queries?
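
To make challenge (1) concrete, the following minimal sketch (with hypothetical data and helper names, not code from the project) shows that pipelining two kNN-select predicates is order-sensitive, unlike ordinary relational selects:

    # A toy illustration: two kNN-selects do not commute.
    from math import dist

    def knn_select(points, focal, k):
        # Return the k points nearest to the focal point (a kNN-select).
        return sorted(points, key=lambda p: dist(p, focal))[:k]

    points = [(0, 0), (1, 0), (4, 0), (5, 0), (9, 0)]
    a, b, k = (0, 0), (9, 0), 2

    plan1 = knn_select(knn_select(points, a, k), b, k)  # kNN(a) first, then kNN(b)
    plan2 = knn_select(knn_select(points, b, k), a, k)  # kNN(b) first, then kNN(a)

    print(plan1)  # [(1, 0), (0, 0)]
    print(plan2)  # [(5, 0), (9, 0)] -- a different answer for the "same" query

Because the two evaluation orders produce different answers, an optimizer cannot freely reorder the predicates; it must first fix the intended semantics (challenge 2) and then optimize within them (challenges 3 and 4).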

 

Our research then focused on expanding the capabilities of multi-predicate location-based query servers beyond the class of 2-kNN queries along several dimensions. We explored various combinations of relational and spatial operations (journal papers: GeoInformatica 2013, VLDB Journal 2013). We studied crowd-based database indexing techniques (DBCrowd 2013, IEEE TKDE 2013) and semantic-web location-based linked-data architectures (IEEE BigData 2013).

 

In 2013-2014, we studied optimization issues for queries that involve both spatial and relational predicates. We developed a novel cost-estimation model for kNN operations executing in conjunction with relational operators. We extended our research to trajectories with limited histories (SIGSPATIAL 2014), studied the adaptivity of query execution plans in continuous streaming queries (EDBT 2014), and extended our work to "approximate" nearest-neighbor computations over high-dimensional data and proximity-based intersect operators (SISAP 2014). Our work on dependency models in which human actions are part of a database transaction appeared in IEEE TKDE 2014.

 

 

Research Activities and Publications:

Spatial Queries with Two kNN Predicates [VLDB 2012]

The widespread use of location-aware devices has led to countless location-based services in which a user query can be arbitrarily complex, i.e., one that embeds multiple spatial selection and join predicates. Amongst these predicates, the k-Nearest-Neighbor (kNN) predicate stands as one of the most important and widely used. Unlike related research, this work goes beyond the optimization of queries with a single kNN predicate and shows how queries with two kNN predicates can be optimized. In particular, our research addresses the optimization of queries with:

(i) two kNN-select predicates,

(ii) two kNN-join predicates, and

(iii) one kNN-join predicate and one kNN-select predicate.

For each query type, conceptually correct query evaluation plans (QEPs) and new algorithms that optimize the query execution time are studied. Experimental results demonstrate that the proposed algorithms outperform the conceptually correct QEPs by orders of magnitude. We conducted the first complete study of the optimization of queries with two kNN predicates, and demonstrated how traditional optimization techniques can compromise the correctness of evaluation for a query that involves two interacting kNN predicates. For the different combinations of two kNN predicates, we presented efficient algorithms that guarantee correctness of evaluation and outperform the corresponding conceptually correct QEPs by orders of magnitude. The algorithms introduced are designed for snapshot queries; applying further optimization techniques that support incremental evaluation of continuous queries with two kNN predicates is potential future work. Moreover, we believe the ideas introduced in this research pave the way towards a query optimizer that can support spatial queries with more than two kNN predicates.

 

M3: Stream Processing on Main-Memory MapReduce [ICDE 2012]

MapReduce is a popular framework for parallel computing over large-scale data on large computing clusters. However, most current implementations of the MapReduce framework support only the execution of fixed-input jobs. This restriction makes them inapplicable to most streaming applications, in which queries are continuous in nature and input data streams arrive continuously at high rates. In this research, we realize M3, a prototype implementation of the Hadoop MapReduce framework in which continuous queries over data streams can be efficiently answered. The data path in M3 is main-memory only, bypassing the Hadoop Distributed File System (HDFS) and any local disk accesses. M3 is resilient to machine failures and utilizes the machines' main memory by dynamically adapting to fluctuations in the arrival rates of the data streams. M3 operates on time pulses: a data-acquisition layer of DataTasks collects the streamed data chunks that arrive during each time period, while Map and Reduce tasks execute continuously and periodically take turns processing the collected chunks. This research highlights techniques and challenges for realizing various modes of synchronization and overlap in the execution of tasks in M3. We envision that M3 will be the underlying massively distributed streaming engine for handling the city-scale streamed spatiotemporal data that we will target next.

 

Similarity queries: their conceptual evaluation, transformations, and processing [VLDB Journal 2013]

This research forms the foundation for multi-predicate spatial queries that involve kNN predicates. Similarity queries are one possible generalization of range queries and kNN queries: the former can be viewed as a within-epsilon similarity operation, while the latter is a k-nearest-neighbors similarity. Many application scenarios can benefit significantly from the identification and processing of similarities in the data. Even though some work has been done to extend the semantics of some operators, for example join and selection, to be aware of data similarities, there has not been much study of the role and implementation of similarity-aware operations as first-class database operators. Furthermore, very little work has addressed the problem of evaluating and optimizing queries that combine several similarity operations. The focus of this research is the study of similarity queries that contain one or multiple first-class similarity database operators such as Similarity Selection, Similarity Join (SJ), and Similarity Group-by (SGB). In particular, we analyze implementation techniques for several similarity operators, introduce a consistent and comprehensive conceptual evaluation model for similarity queries, and present a rich set of transformation rules to extend cost-based query optimization to similarity queries.

Concretely, we propose and analyze several similarity database operators, and thoroughly study the evaluation and optimization of queries that combine multiple similarity operators. We introduce a model for the conceptual evaluation of similarity queries that clearly specifies how a similarity query should be evaluated even when it contains multiple similarity operations, and we present a rich set of transformation rules that turn the conceptual evaluation plan into more efficient plans. Furthermore, we demonstrate that transformation rules for similarity queries can exploit special properties of the similarity-aware operations and of the involved distance functions to enable additional useful query transformations. We also extend the important Eager/Lazy Aggregation transformation rules to the case of SGB and SJ. The experimental evaluation of the proposed rules shows they are highly effective, and we show that similarity operators can be implemented efficiently by exploiting structures and mechanisms already available in DBMSs.
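
As a concrete illustration of the two similarity-selection flavors mentioned above, the following minimal sketch (hypothetical data and names, simplified to one numeric attribute) contrasts within-epsilon selection with kNN selection:

    # Within-epsilon similarity selection: values within eps of a center.
    def sim_select_epsilon(rows, attr, center, eps):
        return [r for r in rows if abs(r[attr] - center) <= eps]

    # kNN similarity selection: the k values closest to a center.
    def sim_select_knn(rows, attr, center, k):
        return sorted(rows, key=lambda r: abs(r[attr] - center))[:k]

    rows = [{"id": i, "temp": t} for i, t in enumerate([10.0, 10.4, 11.9, 15.2, 15.3])]
    print(sim_select_epsilon(rows, "temp", 10.0, 0.5))  # ids 0 and 1
    print(sim_select_knn(rows, "temp", 15.0, 2))        # ids 3 and 4

The epsilon flavor fixes a similarity radius and returns a result of unpredictable size, while the kNN flavor fixes the result size and lets the radius vary; this difference in semantics is one reason the two operators need separate treatment during optimization.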

 

Continuous aggregate nearest neighbor queries [GeoInformatica 2013]

This research addresses the problem of supporting continuous aggregate nearest-neighbor (CANN) queries for moving objects in spatio-temporal data stream management systems. A CANN query specifies a set of landmarks, an integer k, and an aggregate distance function f (e.g., min, max, or sum) that combines the distances between a moving object and the landmarks. The answer to this continuous query is the set of k moving objects with the smallest aggregate distance f; a CANN query may also be viewed as a combined set of nearest-neighbor queries. We introduce several algorithms, including H-CANN and P-CANN, to continuously and incrementally answer CANN queries. P-CANN is a best-effort progressive algorithm that associates thresholds with the individual landmarks and uses these thresholds to eagerly prune the moving objects. Different elimination-order policies specify the order of the landmarks in the computation of the aggregate distance in P-CANN, and we address the optimization of P-CANN and how to assign its thresholds. In extensive experiments, the proposed operators outperform both H-CANN and the state-of-the-art CPM algorithm by up to a factor of 3 while incurring low memory overhead. For the optimization of P-CANN, the worst-fit elimination policy gives the least penalty for sum (an additional 14% CPU cost over optimal) when the prohibitively expensive optimal order is not used, while the best-fit elimination policy gives the least penalty for max. The optimization time of P-CANN is about 100 msec for typical CANN queries. Being best-effort, P-CANN produces about 95% of the required output size on average. As future work, we will investigate continuous aggregate nearest-neighbor queries over moving objects on road networks: on the one hand, we would like to develop similar incremental algorithms using the road-network distance instead of the Euclidean distance between objects; on the other hand, we would like to make CANN a first-class operator in a data stream management system so that it is considered while optimizing the query evaluation plan.
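
The answer semantics above can be made concrete with a minimal sketch (hypothetical data and names; the real algorithms are incremental and operate over streams, whereas this recomputes from scratch):

    # Rank moving objects by an aggregate distance to a set of landmarks.
    from math import dist

    def cann(objects, landmarks, k, agg=sum):
        # Return the k objects with the smallest aggregate landmark distance.
        return sorted(objects, key=lambda o: agg(dist(o, l) for l in landmarks))[:k]

    landmarks = [(0, 0), (10, 0)]
    objects = [(5, 3), (0, 1), (9, 9)]
    print(cann(objects, landmarks, k=2, agg=sum))  # minimize total distance
    print(cann(objects, landmarks, k=2, agg=max))  # minimize worst-case distance

A continuous version must refresh this top-k answer as the objects move, which is where the incremental H-CANN and P-CANN algorithms come in.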

 

The Palm-tree Index: Indexing with the crowd [DBCrowd 2013]

This research addresses using humans (the crowd) to perform database indexing and search operations. Crowdsourcing services allow employing human intelligence in tasks that are difficult to accomplish with computers, such as image tagging and data collection. At a relatively low monetary cost, and through web interfaces such as Amazon's Mechanical Turk (AMT), humans can act as computational operators in large systems. Recent work has built database management systems that harness crowd power in database operators such as sort, join, and count, but the fundamental problem of indexing within crowdsourced databases has not been studied. In this research, we study tree-based indexing within crowd-enabled databases. We investigate the effect of various tree and crowdsourcing parameters on the quality of index operations, propose a taxonomy of the problem and highlight its main challenges, and propose new algorithms for index construction, search, insert, and update. We intend to complete this study with an extensive analysis and experimental evaluation. Our current work focuses on one-dimensional indexing; we plan to study other variations of crowdsourced indexing, such as multidimensional, spatial, spatio-temporal, and fuzzy indexing.

 

Knowledge Cubes - A Proposal for Scalable and Semantically-Guided Management of Big Data [IEEE BigData 2013]

A Knowledge Cube, or cube for short, is an intelligent and adaptive database instance capable of storing, analyzing, and searching data. Each cube is established based on a set of semantic aspects, e.g., (1) topical, (2) contextual, (3) spatial, or (4) temporal, and specializes in handling data that is relevant to its semantic aspect. Knowledge cubes are inspired by two prime architectures: (1) Dataspaces, which provides an abstraction for data management in which heterogeneous data sources can co-exist without a unifying schema specified in advance, and (2) Linked Data, which provides a set of best practices for publishing and interlinking structured data on the web. A knowledge cube uses the best practices of Linked Data as the main building block of its data layer and encompasses some of the data-integration abstractions defined by Dataspaces. In this research, we propose knowledge cubes as a semantically-guided data management architecture, where data management is influenced by the data semantics rather than by a predefined schema. Knowledge cubes support the five pillars of Big Data, also known as the five V's: Volume, Velocity, Veracity, Variety, and Value. Many interesting opportunities can be leveraged when learning the semantics of the data; we highlight these opportunities and propose a strawman design for knowledge cubes along with the research challenges that arise when realizing them. Our research forms a vision for the knowledge cube architecture that motivates understanding the semantics of the data as a means of solving some of the research challenges we face today, specifically in Big Data. We also motivate the use of Linked Data as a means of achieving this target, given the rich semantics that Linked Data provides. We illustrate some of the challenges and potential solutions associated with the knowledge cubes proposal, and demonstrate how the proposed architectural components can be used to tackle the five pillars of Big Data.

 

JISC: Adaptive Stream Processing Using Just-In-Time State Completion [EDBT 2014]

The continuous and dynamic nature of data streams may cause the query execution plan (QEP) of a long-running continuous query to become suboptimal during execution and hence to need alteration. The ability to perform an efficient and flawless transition to an equivalent, yet optimal, QEP is essential for a data stream query processor. Such a transition is challenging for plans with stateful binary operators, such as joins, where the states of the QEP have to be maintained during query transition without compromising the correctness of the query output. In this research, we develop Just-In-Time State Completion (JISC), a new technique for query plan migration. JISC does not halt query execution and thus allows the query to maintain steady output. JISC is applicable to pipelined as well as eddy-based query evaluation frameworks. During plan transition, JISC maintains steady query output, in contrast to existing plan-adaptation techniques that can exhibit significant output latencies; steady output is essential for applications that require continuous monitoring or real-time response. JISC employs a lazy state-migration strategy that computes the missing states of the new QEP on demand, within a single running plan. In highly dynamic scenarios where the queries and streams have fluctuating selectivities and arrival rates, overlapping plan transitions are inevitable; the lazy migration strategy enables JISC to avoid performance thrashing, a critical issue in all other existing techniques. A key property of JISC is that, during plan transition, it detects the unchanged parts of the QEP and avoids recomputing their states. Our probabilistic analysis demonstrates that the number of operators with unchanged states in the new QEP is usually close to the total number of operators. For that reason, JISC outperforms existing techniques that also maintain steady query output, e.g., the Parallel Track strategy, with performance gains during plan migration of up to an order of magnitude.

 

Indexing Recent Trajectories of Moving Objects [ACM SIGSPATIAL 2014]

Advances in location-aware and object-tracking devices have led to the development of location-based services that process large amounts of spatio-temporal data. The need for efficient processing of queries on the locations of moving objects calls for trajectory-based indexing methods. These indexes need to capture both the spatial and temporal dimensions of the data in a way that minimizes the number of disk I/Os required for both updating and querying. Most existing spatio-temporal index structures capture either the current locations of the moving objects or their entire history. However, many applications require that only the recent history of a moving object's trajectory be stored. This research introduces the trails-tree, a disk-based data structure for indexing the recent history of moving objects' trajectories. The trails-tree maintains a temporal sliding window over the trajectories and uses: (1) an in-memory memo structure that reduces the I/O cost of updates via a lazy-update mechanism, and (2) lazy vacuum-cleaning mechanisms that delete the parts of the trajectories that fall out of the sliding window. Theoretical analysis and experimental evaluation illustrate that the trails-tree outperforms state-of-the-art index structures for indexing recent trajectory data by up to a factor of two.
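
The sliding-window bookkeeping described above can be sketched as follows (a minimal in-memory analogue with hypothetical names; the actual trails-tree is a disk-based index with a memo structure):

    # Keep only the recent history of each object's trajectory; lazily vacuum
    # segments that fall out of the sliding window.
    from collections import deque

    class RecentTrajectories:
        def __init__(self, window_secs):
            self.window = window_secs
            self.trails = {}  # object id -> deque of (timestamp, x, y)

        def update(self, oid, ts, x, y):
            self.trails.setdefault(oid, deque()).append((ts, x, y))

        def vacuum(self, now):
            # Lazy deletion: old segments are dropped in bulk, not on every update.
            for trail in self.trails.values():
                while trail and trail[0][0] < now - self.window:
                    trail.popleft()

Deferring deletion to an occasional vacuum pass, rather than cleaning on every update, is what keeps the amortized update cost low in the disk-based setting.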

 

The Similarity-aware Relational Intersect Database Operator [SISAP 2014] (Best Paper Award)

Identifying similarities in large datasets is an essential operation in many applications, such as bioinformatics, pattern recognition, and data integration. To make the underlying database system similarity-aware, the core relational operators have to be extended. Several similarity-aware relational operators that introduce similarity processing at the database-engine level have been proposed, e.g., similarity join and similarity group-by. In this research, we extend the semantics of the set intersection operator to operate over similar values.

We introduced the semantics and extended SQL syntax of the similarity-based set intersection operator, and developed an algorithm based on a Mark/Restore mechanism that avoids O(n^2) complexity. The proposed operator is implemented inside an open-source database system, namely PostgreSQL, and several queries from the TPC-H benchmark are extended to include similarity-based set intersection predicates. Our implementation of the proposed operator outperforms queries that produce the same result using only regular operations: the speedup ranges between 1000x and 4x for similarity threshold values ranging between 0.01% and 10% of the attribute domain range. We also demonstrated that the added functionality comes without significant overhead compared to the standard operators.
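
The intuition behind avoiding the quadratic pairwise comparison can be shown with a minimal sketch (a one-dimensional, sorted-input simplification with hypothetical names; the actual operator runs inside PostgreSQL's executor, where Mark/Restore plays the role of the forward-only cursor below):

    # Merge-style similarity intersection over sorted inputs: O(n log n) for
    # the sorts plus one coordinated scan, instead of O(n^2) pairwise checks.
    def similarity_intersect(r, s, eps):
        r, s = sorted(r), sorted(s)
        out, j = [], 0
        for a in r:
            while j < len(s) and s[j] < a - eps:  # skip s-values too small to match
                j += 1
            if j < len(s) and s[j] <= a + eps:    # nearest remaining candidate in range?
                out.append(a)
        return out

    print(similarity_intersect([1.0, 5.0, 9.0], [1.2, 4.0, 9.3], eps=0.5))  # [1.0, 9.0]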

 

HandsOn DB: Managing Data Dependencies Involving Human Actions [TKDE 2014]

Consider two values, x and y, in the database, where y = F(x). To maintain the consistency of the data, whenever x changes, F needs to be executed to re-compute y and update its value in the database. This is straightforward when F can be executed by the DBMS, e.g., an SQL or C function. In this research [TKDE 2014], we address the more challenging case where F is a human action, e.g., conducting a wet-lab experiment, taking manual measurements, or collecting instrument readings. In this case, when x changes, y remains invalid (inconsistent with the current value of x) until the human action involved in the derivation is performed and its output is reflected in the database. Many application domains, e.g., scientific applications in biology, chemistry, and physics, contain many such derivations and dependencies that involve human actions. In this research, we introduce HandsOn DB, a prototype database engine for managing dependencies that involve human actions while maintaining the consistency of the derived data. HandsOn DB includes the following features: (1) semantics and syntax for interfaces through which users can register human activities into the database and express the dependencies of data items on these activities; (2) mechanisms for invalidating and revalidating the derived data; and (3) new operator semantics that alert users when the returned query results contain potentially invalid data, and that enable evaluating queries on either valid data only, or on both valid and potentially invalid data. Performance results study the overheads associated with these features and demonstrate the feasibility and practicality of realizing HandsOn DB.
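
The invalidation/revalidation cycle can be sketched minimally as follows (hypothetical names; HandsOn DB realizes this inside the engine with SQL-level interfaces rather than application code):

    # y = F(x) where F is a human action: when x changes, y is only flagged
    # invalid; it is revalidated once the action's result is recorded.
    class Derived:
        def __init__(self, value):
            self.value, self.valid = value, True

    db = {"x": 3.0, "y": Derived(9.42)}  # y was produced by a wet-lab experiment on x

    def update_x(new_x):
        db["x"] = new_x
        db["y"].valid = False  # invalidate: the human action must be redone

    def record_human_result(new_y):
        db["y"] = Derived(new_y)  # revalidate with the action's new output

    update_x(4.0)
    print(db["y"].valid)  # False: a query touching y can now warn the user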

 

LIMO: A Web-based Environment for Learning Programming using Interactive Map Operations [Technical Report 2014]

Learning how to write computer programs is often challenging for novices with little or no coding experience. Many pedagogical programming environments motivate programming with intuitive syntax-free interfaces, interesting real-world contexts, and animated output. In this research, we introduce LIMO, a web-based programming environment centered around operations on interactive geographical maps, location-oriented data, and synthetic objects that move on the maps. LIMO materializes a low-cost, open-ended environment that integrates interactive maps and spatial data (e.g., OpenStreetMap). The unique advantage of LIMO is that it relates programming concepts to geographical maps and locations that have a basis in reality. LIMO offers an environment for learning how to program by providing (1) map and programming constructs, (2) high-quality interactive map graphics, and (3) example programs that introduce users to writing programs in the LIMO environment.

 

AQWA: Adaptive Query-Workload-Aware Partitioning of Big Spatial Data [VLDB 2016 – Full Paper (To Appear)] (paper) (AQWA code)

In support of big spatial data, this research focuses on the data layer in a cluster-based platform. The motivation for this research is the unprecedented spread of location-aware devices that has resulted in a plethora of location-based services in which huge amounts of spatial data need to be efficiently processed by large-scale computing clusters. Existing cluster-based systems for processing spatial data employ static data partitioning structures that cannot adapt to data changes, and that are insensitive to the query-workload. Hence, these systems are not able to consistently provide good performance. To close this gap, we present AQWA, an adaptive and query-workload-aware data partitioning mechanism for processing large-scale spatial data. AQWA does not assume prior knowledge of the data distribution or the query-workload. Instead, AQWA incrementally reacts to changes in both the data and the query-workload. As data is consumed and queries are processed, the data partitions are incrementally updated. With extensive experiments using real spatial data from Twitter, we demonstrate that AQWA can achieve an order of magnitude gain in query performance compared to standard spatial partitioning structures.
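
The core idea, choosing partition boundaries by workload cost rather than by data distribution alone, can be sketched in a simplified one-dimensional form (hypothetical names and cost model; AQWA itself partitions two-dimensional spatial data on a cluster):

    # Workload-aware split: accept a split only if it reduces the estimated
    # I/O of the observed query workload.
    def query_cost(partition, workload):
        lo, hi, npoints = partition
        # Each query scans every point of every partition it overlaps.
        return sum(npoints for qlo, qhi in workload if qlo < hi and qhi > lo)

    def best_split(partition, points, workload):
        lo, hi, _ = partition
        best_cost, best_mid = query_cost(partition, workload), None
        for mid in points:
            if lo < mid < hi:
                left = (lo, mid, sum(1 for p in points if lo <= p < mid))
                right = (mid, hi, sum(1 for p in points if mid <= p < hi))
                cost = query_cost(left, workload) + query_cost(right, workload)
                if cost < best_cost:
                    best_cost, best_mid = cost, mid
        return best_cost, best_mid  # best_mid is None when no split helps

Re-running this decision as the data and the query log change is what makes the partitioning adaptive; the engineering challenge AQWA addresses is doing so incrementally, without repartitioning everything.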

 

Spatial Queries with k-Nearest-Neighbor and Relational Predicates [ACM SIGSPATIAL 2015 – Full Paper] (paper) (Code)

This research is an important milestone for this project, as it identifies the feasible optimizations when k-Nearest-Neighbor (kNN, for short) predicates are intermixed with relational predicates in SQL-like query languages. The research is motivated by the ubiquity of location-aware devices and smartphones, which has unleashed an unprecedented proliferation of location-based services that require processing queries with both spatial and relational predicates. Many algorithms and index structures already exist for processing kNN predicates, either solely or combined with textual keyword search. Unfortunately, there has not been enough study of how to efficiently process queries in which kNN predicates are combined with general relational predicates, i.e., ones that have selects, joins, and group-by's. One major challenge is that, because kNN is a ranking operation, applying a relational predicate before or after a kNN predicate in a query evaluation pipeline (QEP, for short) can produce different outputs, and hence leads to different query semantics. In particular, this renders classical relational query-optimization heuristics, e.g., pushing selects below joins, inapplicable. This research introduces and studies various query-optimization heuristics for queries that involve combinations of kNN select/join predicates and relational predicates. The proposed optimizations can significantly enhance the performance of these queries while preserving their semantics. Experimental results based on queries from the TPC-H benchmark and real spatial data from OpenStreetMap demonstrate that the proposed optimizations can achieve orders-of-magnitude enhancement in query performance.
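
Why "pushing selects below kNN" changes the answer can be seen in a minimal sketch (hypothetical data and names):

    # Filtering before vs. after a kNN ranking yields different answers.
    from math import dist

    restaurants = [
        {"loc": (1, 0), "rating": 3},
        {"loc": (2, 0), "rating": 5},
        {"loc": (3, 0), "rating": 5},
    ]
    q, k = (0, 0), 2
    knn = lambda rows: sorted(rows, key=lambda r: dist(r["loc"], q))[:k]
    highly_rated = lambda rows: [r for r in rows if r["rating"] >= 5]

    print(highly_rated(knn(restaurants)))  # rank first: one restaurant, at (2, 0)
    print(knn(highly_rated(restaurants)))  # filter first: (2, 0) and (3, 0)

The first plan answers "which of the 2 nearest restaurants are highly rated?", while the second answers "which are the 2 nearest highly rated restaurants?"; both are legitimate queries, so the optimizer must preserve whichever semantics the user intended.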

 

Cost Estimation of Spatial k-Nearest-Neighbor Operators [EDBT 2015] (paper) (Code)

Advances in geo-sensing technology have led to an unprecedented spread of location-aware devices. In turn, this has resulted in a plethora of location-based services in which huge amounts of spatial data need to be efficiently consumed by spatial query processors. For a spatial query processor to choose properly among the various query-processing strategies, the cost of the spatial operators has to be estimated. In this research, we study the problem of estimating the cost of the spatial k-nearest-neighbor (k-NN, for short) operators, namely k-NN-Select and k-NN-Join. Given a query that has a k-NN operator, the objective is to estimate the number of blocks that will be scanned during the processing of this operator. Estimating the cost of a k-NN operator is challenging for several reasons. For instance, the cost of a k-NN-Select operator is directly affected by the value of k, the location of the query focal point, and the distribution of the data; a cost model that captures all these factors is relatively hard to realize. This study introduces cost-estimation techniques that maintain a compact set of catalog information that can be kept in main memory to enable fast estimation via lookups. A detailed study of the performance and accuracy trade-off of each proposed technique is presented. Experimental results using real spatial datasets from OpenStreetMap demonstrate the robustness of the proposed estimation techniques.
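
To make the estimation target concrete, here is a deliberately crude sketch under a uniform-data assumption (hypothetical names; the actual techniques use catalog lookups precisely because real data is not uniform):

    # Estimate the blocks a k-NN-Select scans: find the radius expected to
    # enclose k points, then count the blocks that radius touches.
    from math import pi, sqrt, ceil

    def knn_radius_uniform(k, n_points, area):
        # Under uniformity, a disk holding k of n points covers k/n of the area.
        return sqrt((k / n_points) * area / pi)

    def blocks_scanned(k, n_points, area, block_side):
        r = knn_radius_uniform(k, n_points, area)
        # Count the blocks overlapped by the bounding square of the search disk.
        return ceil(2 * r / block_side) ** 2

    print(blocks_scanned(k=100, n_points=1_000_000, area=10_000.0, block_side=1.0))  # 4

A catalog-based estimator replaces the uniformity assumption with precomputed per-region statistics, trading catalog size for accuracy, which is the trade-off the study examines.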

 

A Demonstration of AQWA: Adaptive Query Workload Aware Partitioning of Big Spatial Data [VLDB 2015 Demo paper] (Demo paper) (AQWA code)

During academic year 2014-2015, we realized a prototype open-source system for AQWA. AQWA is motivated by the need to efficiently organize Big Spatial Data. The ubiquity of location-aware devices, e.g., smartphones and GPS devices, has led to a variety of location-based services in which huge amounts of geo-tagged information are created every day. This prototype system demonstrates AQWA, a novel mechanism for partitioning large-scale spatial data over distributed clusters. Unlike existing cluster-based systems, e.g., SpatialHadoop, that apply static partitioning of spatial data, AQWA can react to changes in the query workload or the data distribution. A key feature of AQWA is that it does not assume prior knowledge of either; instead, AQWA adapts to changes in the query workload or data distribution in an online fashion by incrementally updating the partitioning of the data in a way that improves query execution time. We demonstrate two real prototypes of AQWA, applied to the Hadoop and Spark platforms. In both prototypes, we process spatial range and k-nearest-neighbor queries over large-scale real and synthetic datasets, and explore the performance of AQWA under different query workloads. Currently, we are making the code for AQWA on Hadoop publicly available.

 

Tornado: A Distributed SpatioTextual Stream Processing System [VLDB 2015 Demo paper] (Demo paper)

During academic year 2014-2015, we realized a prototype open-source system termed Tornado. Tornado is motivated by the need to efficiently query Big Spatial Data. Nowadays, most spatial data is also associated with text. The widespread use of location-aware devices, together with the increased popularity of micro-blogging applications (e.g., Twitter), has led to the creation of large streams of spatio-textual data. To serve real-time applications, the processing of these large-scale spatio-textual streams needs to be distributed. However, existing distributed stream-processing systems (e.g., Spark and Storm) are not optimized for spatial or textual content. In this demonstration, we introduce Tornado, a distributed in-memory spatio-textual stream-processing server that extends Storm. To efficiently process spatio-textual streams, Tornado introduces a spatio-textual indexing layer into the architecture of Storm. The indexing layer is adaptive, i.e., it dynamically redistributes the processing across the system according to changes in the data distribution and/or the query workload. In addition to keywords, higher-level textual concepts are identified and semantically matched against spatio-textual queries. Tornado also provides data de-duplication and fusion to eliminate redundant textual data. We demonstrate a prototype of Tornado running against real Twitter streams, where users can register continuous or snapshot spatio-textual queries using a map-assisted query interface.

 

Kangaroo: Workload-Aware Processing of Range Queries in Hadoop [WSDM 2016 – Full Paper] (paper)

Despite the importance and widespread use of range data, e.g., time intervals and spatial ranges, little attention has been devoted to the processing and querying of range data in the context of big data. The main challenge lies in the centralized nature of traditional index structures, e.g., the B-tree and the R-tree, which are almost crippled when deployed in a distributed environment. To address this challenge, in this study we prototype Kangaroo, a system built on top of Hadoop to optimize the execution of range queries over range data. The main idea behind Kangaroo is to split the data into non-overlapping partitions in a way that minimizes query execution time. Kangaroo is query-workload-aware, i.e., it produces partitioning layouts that minimize the query processing time of given query patterns. In this study, we expose and address the design challenges of deploying Kangaroo on top of a distributed file system, i.e., HDFS, and we study four different partitioning schemes that Kangaroo can support. With extensive experiments using real range data of more than one billion records and a real query workload of more than 30,000 queries, we show that Kangaroo's partitioning schemes can significantly reduce the I/O of range queries on range data.

 

Graph Indexing for Shortest-Path Finding over Dynamic Sub-Graphs [SIGMOD 2016 – Full Paper] (paper)

A variety of applications spanning various domains, e.g., social networks, transportation, and bioinformatics, have graphs as first-class citizens. These applications share a vital operation, namely finding the shortest path between two nodes, and in many scenarios users want to filter the graph before finding that path. For example, in social networks, one may need to compute the shortest path between two persons over a sub-graph containing only family relationships. This research focuses on dynamic graphs with labeled edges, where the target is to find a shortest path after filtering some edges based on user-specified query labels. This problem is termed the Edge-Constrained Shortest Path query (ECSP, for short). The research introduces Edge-Disjoint Partitioning (EDP, for short), a new technique for efficiently answering ECSP queries over dynamic graphs. EDP has two main components: a dynamic index based on graph partitioning, and a traversal algorithm that exploits the regular patterns in the answers of ECSP queries. EDP partitions the graph based on the labels of the edges. On demand, it computes specific sub-paths within each partition and updates its index; the computed sub-paths are cached and can be leveraged by future queries. To answer an ECSP query, EDP connects sub-paths from different partitions using its efficient traversal algorithm. EDP can dynamically handle various types of graph updates, e.g., label, edge, and node updates; the index entries potentially affected by an update are invalidated and recomputed on demand. EDP is evaluated using real graph datasets from various domains, and experimental results demonstrate query performance gains of up to four orders of magnitude over state-of-the-art techniques.
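
For reference, the naive ECSP baseline that such techniques are measured against can be stated in a few lines (hypothetical graph encoding; shown only to pin down the query semantics):

    # Baseline ECSP: drop edges whose labels are not allowed, then run Dijkstra.
    import heapq

    def ecsp_baseline(graph, src, dst, allowed):
        # graph: {node: [(neighbor, weight, label), ...]}
        pq, seen = [(0, src)], set()
        while pq:
            d, u = heapq.heappop(pq)
            if u == dst:
                return d
            if u in seen:
                continue
            seen.add(u)
            for v, w, label in graph.get(u, []):
                if label in allowed and v not in seen:
                    heapq.heappush(pq, (d + w, v))
        return None  # no path under the label constraint

    g = {"a": [("b", 1, "family"), ("c", 1, "friend")], "b": [("c", 1, "family")]}
    print(ecsp_baseline(g, "a", "c", {"family"}))  # 2: the friend edge is filtered out

Because the allowed-label set changes from query to query, no single precomputed shortest-path index applies to all queries; EDP's label-based partitions and cached sub-paths address exactly this.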

 

Cruncher: Distributed In-Memory Processing for Location-Based Services [IEEE ICDE 2016] (Demo paper)

Advances in location-based services (LBS) demand high-throughput processing of both static and streaming data. Recently, many systems have been introduced to support distributed main-memory processing in order to maximize query throughput; however, these systems are not optimized for spatial data processing. In this demonstration, we showcase Cruncher, a distributed main-memory spatial data warehouse and streaming system. Cruncher extends Spark with adaptive query-processing techniques for spatial data. Cruncher uses dynamic batch processing to distribute the queries and the data streams over commodity hardware according to an adaptive partitioning scheme. The batching technique also groups and orders the overlapping spatial queries to enable inter-query optimization. Both the data streams and the offline data share the same partitioning strategy, which allows for data co-locality optimization. Furthermore, Cruncher uses an adaptive caching strategy to keep frequently-used location data in main memory. Cruncher maintains operational statistics to optimize query processing, data partitioning, and caching at runtime. We demonstrate two LBS applications over Cruncher using real datasets from OpenStreetMap and two synthetic data streams, and we demonstrate that Cruncher achieves order(s)-of-magnitude throughput improvement over Spark when processing spatial data.

 

Efficient Processing of Hamming-Distance-Based Similarity Search Queries Over MapReduce [EDBT 2015] (paper)

Similarity and proximity searches are crucial to many applications. Of particular interest are two flavors of the Hamming-distance range query, namely the Hamming-select and the Hamming-join. Hamming distance is widely used in approximate nearest-neighbor search for high-dimensional data, such as images and document collections. For example, using predefined similarity hash functions, high-dimensional data is mapped into one-dimensional binary codes that are then linearly scanned to perform Hamming-distance comparisons. These distance comparisons on the binary codes are usually costly and often involve excessive redundancies. In this study, we introduce a new index, termed the HA-index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming-distance range queries. An efficient search algorithm based on the HA-index is presented. A distributed version of the HA-index is introduced, and algorithms for realizing Hamming-select and Hamming-join operations on a MapReduce platform are prototyped. Extensive experiments using real datasets demonstrate that the HA-index and the corresponding search algorithms achieve up to two orders of magnitude speedup over existing state-of-the-art approaches, while saving more than ten times in memory space.
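
The two query flavors have simple definitions, shown below in naive scan form (hypothetical data; this linear scan with per-pair popcounts is precisely the costly baseline the HA-index is designed to beat):

    # Hamming-select and Hamming-join over one-dimensional binary codes.
    def hamming(a, b):
        return bin(a ^ b).count("1")

    def hamming_select(codes, query, radius):
        return [c for c in codes if hamming(c, query) <= radius]

    def hamming_join(r_codes, s_codes, radius):
        return [(a, b) for a in r_codes for b in s_codes if hamming(a, b) <= radius]

    codes = [0b0000, 0b0011, 0b1111]
    print(hamming_select(codes, 0b0001, radius=1))  # [0, 3]: 0b0000 and 0b0011
    print(hamming_join(codes, [0b0111], radius=1))  # [(3, 7), (15, 7)]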

 

Geo-tagging Non-Spatial Concepts [ACM SIGSPATIAL Mobile GIS workshop 2015 - To Appear] (paper)

This research addresses a new and interesting type of query, namely geo-tagging non-spatial concepts. For example, consider the following queries: show on the map potential crime locations in Chicago, or show on the map potential pollution regions in San Francisco. Because "crime" and "pollution" are non-spatial concepts, it is hard to pinpoint a location on the map at which to place them. In general, concept geo-tagging is the process of assigning a textual identifier that describes a real-world entity to a physical geographic location. A concept can either be a spatial concept, with a spatial presence, or a non-spatial concept, with no explicit spatial presence. Geo-tagging locations with non-spatial concepts that have no direct relation to them is a very useful and important operation, but it is also very challenging precisely because a non-spatial concept, e.g., crime, has no explicit location. In this study, we propose to use the semantic information associated with concepts and locations, such as their types, as a means of identifying these relations. The co-occurrence of spatial and non-spatial concepts within the same textual resources, e.g., on the web, can be an indicator of a relationship between them. Techniques are presented for learning and modeling relations among spatial and non-spatial concepts from web textual resources. Co-occurring concepts are extracted and modeled as a graph of relations, and this graph is used to infer the location types related to a concept. A location type can be a hospital, a restaurant, an educational facility, and so forth. Because of the immense number of relations generated by the extraction process, a semantically-guided query-processing algorithm is introduced to prune the graph to the most relevant set of related concepts. For each concept, the most relevant types are matched against the location types. Experiments evaluate the proposed algorithm on its filtering efficiency and the relevance of the discovered relationships. Performance results illustrate how semantically-guided query processing outperforms the baseline in both efficiency and relevance, and the proposed approach achieves an average precision of 74% across three different datasets.

 

LIMO: Learning Programming using Interactive Map Activities [ACM SIGSPATIAL 2015 Demo Paper] (Demo paper)

Advances in geographic information and interactive two- and three-dimensional map visualization, accompanied by the proliferation of mobile devices and location data, have tremendously benefited the development of geo-educational applications. We demonstrate LIMO, a web-based programming environment centered around operations on interactive geographical maps, location-oriented data, and synthetic objects that move on the maps. LIMO materializes a low-cost, open-ended environment that integrates interactive maps and spatial data (e.g., OpenStreetMap). The unique advantage of LIMO is that it relates programming concepts to interactive geographical maps and location data. LIMO offers an environment for students to learn how to program by providing (1) an easy-to-program library of map and spatial operations, (2) high-quality interactive map graphics, and (3) example programs that introduce users to writing programs in the LIMO environment.

 

On Map-Centric Programming Environments [ACM SIGSPATIAL 2015 Vision Paper] (Vision paper)

2D maps and 3D globes can be used as programming toys to help students learn programming, in contrast to the robots or visual and multimedia components used in introductory (CS-I) programming courses. This study exposes the research challenges of supporting this concept and presents one instance of a 2D and 3D map-based programming environment to motivate these challenges.

 

Similarity Group-by Operators for Multi-dimensional Relational Data [IEEE TKDE] (paper)

The SQL group-by operator plays an important role in summarizing and aggregating large datasets in a data-analytics stack. While the standard group-by operator, which is based on equality, is useful in many applications, proximity-aware grouping provides a more realistic view of real-world data that can lead to better insights. The Similarity SQL-based Group-By operator (SGB, for short) extends the semantics of the standard SQL group-by by grouping data with similar, but not necessarily equal, values. While existing similarity-based grouping operators efficiently realize these approximate semantics, they primarily focus on one-dimensional attributes and treat the dimensions of multi-dimensional attributes independently. Consequently, correlated attributes, such as those in spatial data, are processed independently, and groups in the multi-dimensional space are not detected properly. To address this problem, we introduce two new SGB operators for multi-dimensional data. The first is the clique (or distance-to-all) SGB, where all the tuples in a group are within some distance of each other. The second is the distance-to-any SGB, where a tuple belongs to a group if it is within some distance of any other tuple in the group. Since a tuple may satisfy the membership criterion of multiple groups, we introduce three semantics for such cases: (i) eliminate the tuple, (ii) put the tuple in exactly one group, or (iii) create a new group for the tuple. We implement and test the new SGB operators and their algorithms inside PostgreSQL. The overhead introduced by these operators proves to be minimal, and their execution times are comparable to those of the standard group-by. The experimental study, based on TPC-H and social check-in data, demonstrates that the proposed algorithms achieve up to three orders of magnitude enhancement in performance over baseline methods developed to solve the same problem.
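
The difference between the two membership rules can be illustrated with a minimal sketch (hypothetical names; shown for points in the plane, using union-find for the distance-to-any rule):

    # Distance-to-any SGB: a point joins a group if it is within eps of ANY
    # member, so groups grow transitively (single-linkage style).
    from math import dist

    def sgb_distance_to_any(points, eps):
        parent = list(range(len(points)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path compression
                i = parent[i]
            return i
        for i in range(len(points)):
            for j in range(i + 1, len(points)):
                if dist(points[i], points[j]) <= eps:
                    parent[find(i)] = find(j)  # merge the two groups
        groups = {}
        for i in range(len(points)):
            groups.setdefault(find(i), []).append(points[i])
        return list(groups.values())

    pts = [(0, 0), (1, 0), (2, 0), (9, 9)]
    print(sgb_distance_to_any(pts, eps=1.0))  # [[(0,0), (1,0), (2,0)], [(9,9)]]

Under the clique (distance-to-all) rule with the same eps, the first group would have to split, since (0, 0) and (2, 0) are 2.0 apart; this is exactly the semantic gap between the two operators.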

 

The Similarity-aware Relational Database Set Operators [Information Systems Journal, 2015] (paper)

Finding data items in the proximity of a query item can be formulated as a similarity operation. Identifying such similarities in large datasets is an essential operation in several applications, such as bioinformatics, pattern recognition, and data integration. To make a relational database management system similarity-aware, the core relational operators have to be extended. While similarity-awareness has been introduced into database engines for relational operators such as join and group-by, little has been achieved for the relational set operators, namely Intersection, Difference, and Union. In this paper, we propose extending the semantics of the relational set operators to take into account the similarity of values. We develop efficient query-processing algorithms for evaluating the new operators and implement them inside an open-source database system, namely PostgreSQL. By extending several queries from the TPC-H benchmark to include predicates that involve similarity-based set operators, we perform extensive experiments that demonstrate up to three orders of magnitude speedup in performance over equivalent queries that employ only regular operators.

 

ORLF: A Flexible Framework for Online Record Linkage and Fusion [IEEE ICDE 2016] (Demo paper)

With the exponential growth of data on the Web comes the opportunity to integrate multiple sources to give more accurate answers to user queries. Upon retrieving records from multiple Web databases, a key task is to merge records that refer to the same real-world entity. We demonstrate ORLF (Online Record Linkage and Fusion), a flexible query-time record linkage and fusion framework. ORLF de-duplicates newly arriving query results jointly with previously processed query results. We use an iterative caching solution that leverages query locality to effectively de-duplicate newly incoming records with cached records. ORLF aims to deliver timely query answers that are duplicate-free and reflect knowledge collected from previous queries.

 

Approving Updates in Collaborative Databases [IEEE IC2E 2015] (paper)

Although this research was motivated by, and started in, the context of cleaning dirty spatial data, the techniques developed proved to be generally applicable to many application domains, including scientific data curation. Data curation activities, especially in collaborative databases, mandate that collaborators interact until they converge and agree on the content of their data. Typically, updates by a member of the collaboration are made visible to all collaborators for comment, while remaining pending approval or rejection by the data custodian, e.g., the principal scientist or investigator (PI). In current database technologies, approval and authorization of updates are based solely on the identity of the user, e.g., via the SQL GRANT and REVOKE commands. However, in collaborative environments, the updated data is open to collaborators for discussion and further editing, and is finally approved or rejected by the PI based on the content of the data, not the identity of the updater. In this research, we introduce a cloud-based collaborative database system that promotes and enables collaboration and data-curation scenarios. We realize content-based update approval and history tracking of updates inside HBase, a distributed and scalable open-source cluster-based database. We investigate the design and implementation of several approaches for update approval, present a detailed performance study, and contrast alternative techniques.

 

Precision-Bounded Access Control Using Sliding-Window Query Views for Privacy-Preserving Data Streams [TKDE 2015] (paper)

This research extends spatial indexing and spatial query-processing techniques to the context of access control and privacy preservation over data streams. Access-control mechanisms and Privacy Protection Mechanisms (PPMs) have been proposed for data streams. Access control for a data stream grants roles access to tuples satisfying an authorized predicate sliding-window query. Sharing the sensitive stream data without a PPM can compromise privacy. The PPM meets privacy requirements, e.g., k-anonymity or l-diversity, by generalizing the stream data. The imprecision introduced by generalization can be reduced by delaying the publishing of stream data; however, delaying stream tuples for better accuracy can lead to false negatives if the tuples are still held by the PPM while a query predicate is evaluated. The administrator of an access-control policy defines an imprecision bound for each query. The challenge for the PPM is to optimize the delay in publishing stream data so that the imprecision bound is satisfied for the maximum number of queries. We formulate the precision-bounded access control for privacy-preserving data streams problem, present hardness results, provide an anonymization algorithm, and conduct an experimental evaluation of the proposed algorithm. Experiments demonstrate that the proposed heuristic provides better precision for a given data-stream access-control policy than the minimum- or maximum-delay heuristics proposed in the existing literature.

 

 

Teaching, Education, Training, and Development

2011-2012

Education and Course Design:

In Fall 2011, Aref offered a graduate-level seminar class, CS698. The purpose of this course was to train students in spatiotemporal data management research. Students learned basic and advanced techniques in spatial and spatiotemporal data management through several projects. The main class project was to design an introductory-level (CS-I) programming course based on spatiotemporal data management techniques, e.g., using 2D and 3D map and moving-object operations; more detail on the CS Programming-I course is given below. In the class, students covered some of the fundamental papers in spatial and spatiotemporal data management, as well as the tools that facilitate spatial and spatiotemporal data systems research, e.g., data generators, benchmarks, open-source systems, and extensible indexing architectures. This course counts as one of the two CS698 research courses a Ph.D. student can take in his or her first three semesters, or as a CS590 independent-study course. The class met once a week, on Wednesdays, and each student was expected to give two to three presentations over the span of the semester. In addition to the CS Programming-I course project, in which every student was expected to participate, each student, or a group of one to three students, selected a semester-long project. Each student (or group) was expected to meet with Walid individually once a week to follow up on the project's progress. During class, group discussions revolved around the CS Programming-I project following the students' presentations.

 

CS Programming-I Project:

This part of the class involved extensive brainstorming and design activities. The purpose of this course project was to design an introductory programming course in computer science to address the low retention of freshman computer science students and their diminished interest in science. The target is to go beyond the classical 'Hello World' programming examples, which have shown their limits. The goal of this project is to design the CS Programming-I course around 2D and 3D map operations (e.g., using 3D Earth and 2D map APIs) to increase students' interest in programming, both in CS and in science. According to CRA reports, interest among undergraduate students in majoring in computer science has decreased significantly, and universities nationwide have experienced a noticeable decline in freshman registration in CS in recent years. For example, the number of new CS majors in Fall 2007 was half of what it was in Fall 2000 (7,915 versus 15,958). Several colleges attribute this decline to the increase in computer literacy among incoming students and the boring nature of the freshman Programming-I courses: a first program that prints 'Hello World!' on the screen is no longer thrilling to students who have played computer games with high-quality graphics since kindergarten. To add excitement, several colleges have redesigned their freshman Programming-I courses to use toy devices, e.g., robots, where students program the robot to perform certain tasks while learning various programming concepts in the process. Although interesting to some students, using robots has several disadvantages. The high setup cost of obtaining and maintaining the robots is one setback; another is the frustration many students experience when dealing with hardware and software debugging issues simultaneously. The goal of this project is to design a CS Programming-I course that uses 2D and 3D maps (e.g., Google Earth and Google Maps) as the 'toy' through which freshman CS students learn programming concepts. Besides the low setup cost, learning with 2D and 3D maps adds the needed excitement by introducing high-quality graphics and high interactivity, and by relating programming tasks to maps or photos of locations the students can relate to. Wrapper APIs need to be developed to simplify the setup and interaction with the map software. The new CS Programming-I course is to be developed along the same lines as Purdue's existing first programming course (CS-180).

 

Education and Course Design Progress:

Eleven graduate students (8 males and 3 females) attended the CS698 class that Aref offered in Fall 2011. Topics covered in class included:

- Dissection of Purdue's existing CS-177 and CS-180 introductory CS programming courses and their exercises

- Study of 2D and 3D map APIs (e.g., Google's and Microsoft's products, among others)

- Study of spatial and spatiotemporal data operations

- What are the basic, minimal, and complete sets of spatiotemporal operations?

- Study of map and GIS (Geographic Information Systems) operations

- Classic multi-dimensional (spatial) and spatiotemporal data indexing techniques

- Spatial and spatiotemporal query processing algorithms

- Spatial and spatiotemporal query optimization

- Spatial and spatiotemporal data management systems

- Spatiotemporal data streaming systems

- Study of publicly available spatial and spatiotemporal data sets

- Synthetic spatiotemporal data generators

- Benchmarking spatial and spatiotemporal data systems

 

Student projects in the course included: spatiotemporal query optimization, a study of publicly available spatial and spatiotemporal data sets (real and synthetic), parallel spatiotemporal data indexing, distance oracles, map matching, and a study of spatiotemporal data generators. Several 2D- and 3D-based programming project ideas and programming primitives were developed for the CS Programming-I course. These ideas need further elaboration and are currently being studied by Walid and his group.

 

Training and Development:

Students have gained research experience in the context of this project and have published their results in rigorous conference outlets, e.g., VLDB, one of the two most prestigious and competitive publication venues in the field of database systems (along with the ACM SIGMOD Conference). Mr. Aly got his paper accepted into VLDB 2012, an excellent experience for him as the first publication of his career. During the graduate seminar course (CS698) that Aref offered in Fall 2011, the students (8 males and 3 females) benefitted tremendously: through the class activities, they gave many presentations about fundamental ideas and papers in the field of spatial database systems and covered progress on their class projects. Students also learned how to critique the ideas of others in a balanced fashion, whether in published work or in their colleagues' class projects. Through this class experience, several students got started on their Ph.D. research, including Mr. Mahmood and Mr. Tang. Students have also gained tremendous experience working with large open-source software systems, e.g., Hadoop. Our M3 project exposed them to invaluable software-development experience while adapting Hadoop, eliminating all of its disk layers and turning it into a massively distributed and parallel data streaming system. Mr. Aly was later invited to spend the summer at Microsoft Research, where he sharpened his research skills further.

 

2012-2013

Education and Course Design

Aref offered over 15 graduate research independent study courses (CS590, CS698, CS699) in Fall 2012 and Spring 2013. The purpose of these courses is to train graduate students, individually or in small groups, in various topics in spatiotemporal data management research. Students learn advanced topics in spatial and spatiotemporal data management through several interesting independent study or group projects. Aref currently has 11 Ph.D. students (10 in CS, 1 in ECE), three of whom were supported by this grant; one additional woman Ph.D. student (Ruby Tahboub) will be supported by this grant as of January 2014.

One of the targets of this project's education plan is to design an introductory programming course in computer science that addresses the low retention of computer science freshmen and their diminished interest in science. The target is to go beyond the classical 'Hello World' programming examples, which are no longer thrilling to students who have been playing computer games with high-quality graphics since kindergarten. Our goal is to design a CS Programming-I course that uses 2D and 3D maps (e.g., Google Earth and Google Maps) as the 'toy' through which freshman CS students learn programming concepts, in contrast to the robots used by other programming environments. Aref and one Ph.D. student have been working on developing this introductory-level (CS-I) programming course based on spatiotemporal data management techniques, e.g., using 2D and 3D map and moving-object operations. We have focused on studying several existing introductory programming environments, e.g., Scratch and Alice, and have started developing 2D map-based programming primitives and abstract operations that serve as building blocks for the map-based programming language to be developed. Currently, we are using these primitives to develop simple programming exercises that serve as a way to debug the primitives. Once the design of these map primitives and programming abstractions converges and is finalized, we will start implementing them for testing.

 

Training and Development

Students have gained research experience in the context of this project. They have published their results in rigorous conference and journal outlets, e.g., the VLDB Journal, the IEEE Transactions on Knowledge and Data Engineering, the GeoInformatica Journal, the IEEE BigData Conference, and the VLDB DBCrowd workshop. Many of the participating students gained excellent experience, as these were the first publications of their careers. During the graduate independent study courses, the Ph.D. students (10 males and 1 female) benefitted from the various class activities, and many of them got "jumpstarted" into spatiotemporal data systems research through systems-oriented projects that resulted in conference and journal paper submissions.

Several students, e.g., Mr. Madkour and Mr. Mahmood (both supported under this project in the 2012/2013 academic year), got their first research papers accepted at IEEE BigData 2013 and the VLDB DBCrowd 2013 workshop, respectively. Students have also gained experience working with large open-source software systems, e.g., Hadoop and PostgreSQL. Because of this experience, four of the Ph.D. students were invited for summer internships at various industrial and research labs.

Post-doctoral researcher Dr. Shen Zhihong from the Chinese Academy of Sciences visited Purdue, at his government's expense, to collaborate with Aref’s research group and receive training in the context of this project. Also, Dr. Saleh Basalamah, Director of the GIS Technology Innovation Center, Umm Al-Qura University, Saudi Arabia, spent an academic year at Purdue to collaborate in the context of this project. Dr. Basalamah is a co-author of the DBCrowd 2013 and IEEE BigData 2013 conference papers.

Dissemination of Results to Communities of Interest and Outreach Activities:

In 2012/2013, Aref offered the over 15 independent graduate research courses mentioned above, which introduced the project's topics to 11 Ph.D. students during the fall and spring semesters. Walid served as an organizing committee member of the Spatial Computing 2020 Workshop sponsored by NSF/CCC to promote a unified agenda for spatial computing research (http://cra.org/ccc/visioning/visioning-activities/spatial-computing/196-spatial-computing-workshop), where he led the Systems track. This resulted in an article, co-authored by Walid, that is currently being submitted to CACM. In December 2012, Walid gave a keynote speech at the International Conference on Connected Vehicles and Expo (Beijing, China) reflecting advances in spatiotemporal database systems research. Walid continues to chair ACM SIGSPATIAL, the ACM special interest group on spatial algorithms and systems (2011-2014), of which he was one of the four founding members.

2013-2014

Education and Course Design

Aref offered over 20 graduate and undergraduate research independent study and thesis courses (CS490, CS590, CS698, CS699) in Fall 2013, and Spring and Summer 2014. The purpose of these courses is to train students, individually or in small groups, in various topics in spatiotemporal data management research. Students learn advanced topics in spatial and spatiotemporal data management through several interesting independent study or group projects, including distributed and parallel spatiotemporal data streaming, benchmarking of spatiotemporal databases, and declustered spatiotemporal query processing, among many other topics. These studies are expected to result in conference papers in the coming academic year. Aref currently has one Master's student and 12 Ph.D. students (11 in CS, 1 in ECE), three of whom were supported by this grant, including one woman Ph.D. student (Ruby Tahboub). Another woman Ph.D. student is scheduled to join this project in Spring 2015. Senior undergraduate student Linda Thongsavath did her senior design project on the topic of benchmarking spatiotemporal databases with deep query evaluation pipelines.

 

LIMO: A Web-based Environment for Learning Programming using Interactive Map Operations [Technical Report 2014]

The computer science education community has invested tremendous resources in raising interest in computing and improving retention. A substantial share of these efforts relies on developing innovative programming environments that target young age groups. Along this line, we introduce LIMO, a web-based environment for learning programming using activities related to interactive two-dimensional geographical maps and the operations of moving objects on these maps. An essential aspect of LIMO is that it provides diverse programming constructs that enable users to process, customize, and visualize location data on the map. In our view, the integration of interactive maps, location-based data, and seamless spatial constructs within an educational programming environment is a very promising direction. We plan to develop LIMO further with additional spatial and visual constructs to make it more attractive to novice programmers. We plan to conduct a usability study that involves novice users to assess the efficacy of using LIMO while learning programming. Finally, we plan to develop a curriculum for an introductory computer science course based on LIMO.

Aref and two Ph.D. students have continued to work on developing LIMO based on spatiotemporal data management techniques using 2D and 3D map and moving-object operations. In 2013/2014, we developed 2D and 3D map-based programming primitives and abstract operations around Google Maps and Google Earth. These 2D and 3D map primitives serve as building blocks for the map-based programming language that we are currently developing, and we are using them to develop simple programming exercises that serve as a way to debug the primitives. To enrich this map-based programming environment, we integrated an open-source map database (namely, OpenStreetMap) underneath the building blocks, so that the building blocks can refer to the underlying spatial objects as well as their textual descriptions (e.g., rivers, lakes, shopping centers). The programming exercises that we developed gave rise to interesting new research problems that Ph.D. student Ruby Tahboub is currently studying. A hypothetical sketch of the style of these primitives appears below.
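To give a flavor of these building blocks, here is a short hypothetical Python sketch. The limo module and every name in it (Map, MovingObject, find_features, etc.) are invented for illustration and do not represent LIMO's actual API.

    # A hypothetical illustration of map-based programming primitives of
    # the kind described above; "limo" and all names in it are invented.
    from limo import Map, MovingObject, find_features

    # A 2D map backed by an open-source map database, so that primitives
    # can refer to spatial objects and their textual descriptions.
    city = Map(provider="OpenStreetMap", center=(40.4237, -86.9212), zoom=14)

    # Look up named spatial objects near the map center, e.g., rivers,
    # and highlight them on the map.
    rivers = find_features(near=city.center, kind="river", radius_km=5)
    for river in rivers:
        city.highlight(river)

    # Animate a moving object and attach a simple spatial-event handler,
    # giving students a first taste of moving-object operations.
    bus = MovingObject(path=[(40.4237, -86.9212), (40.4301, -86.9100)],
                       speed_kmh=30)
    bus.on_enter(rivers[0], lambda: city.popup("Crossing the river!"))
    city.add(bus)
    city.play()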

 

Publications

 

1.     A.M. Aly, W.G. Aref, M. Ouzzani: Spatial Queries with Two kNN Predicates. Proceedings of the VLDB Endowment (VLDB 2012), vol. 5, pp. 1100-1111, Istanbul, Turkey (2012).

 

2.     A.M. Aly, A. Sallam, B.M. Gnanasekaran, L.-V. Nguyen-Dinh, W.G. Aref, M. Ouzzani, and A. Ghafoor, "M3: Stream Processing on Main-Memory MapReduce", Proceedings of the 28th International Conference on Data Engineering (ICDE), Washington DC, (2012).

 

3.     Y.N. Silva, W.G. Aref, P. Larson, S. Pearson, M.H. Ali: Similarity queries: their conceptual evaluation, transformations, and processing. VLDB J. 22(3): 395-420 (2013).

 

4.     H.G. Elmongui, M.F. Mokbel, W.G. Aref: Continuous aggregate nearest neighbor queries. GeoInformatica 17(1): 63-95 (2013).

 

5.     A.R. Mahmood, W.G. Aref, S.M. Basalamah, E. Dragut. The Palm-tree Index: Indexing with the crowd. The VLDB DBCrowd Workshop, 2013.

 

6.     A. Madkour, W.G. Aref, S.M. Basalamah. Knowledge Cubes - A Proposal for Scalable and Semantically-Guided Management of Big Data. IEEE BigData Conference, 2013.

 

7.     A.R. Mahmood, A.M. Aly, W.G. Aref, S.M. Basalamah. Indexing Recent Trajectories of Moving Objects. ACM SIGSPATIAL GIS, 2014.

 

8.     R.Y. Tahboub, J. Shin, W.G. Aref, S. Prabhakar: LIMO: A Web-based Environment for Learning Programming using Interactive Map Operations. Purdue Computer Science Technical Report, 2014. Submitted for conference publication.

 

9.     Z. Pervaiz, W.G. Aref, A. Ghafoor, N. Prabhu: Accuracy-Constrained Privacy-Preserving Access Control Mechanism for Relational Data. IEEE Trans. Knowl. Data Eng. 26(4): 795-807 (2014).

 

10.  M.Y. Eltabakh, W.G. Aref, A.K. Elmagarmid, M. Ouzzani: HandsOn DB: Managing Data Dependencies Involving Human Actions. IEEE Trans. Knowl. Data Eng. 26(9): 2193-2206 (2014).

 

11.  A.M. Aly, W.G. Aref, M. Ouzzani, H.M. Mahmoud: JISC: Adaptive Stream Processing Using Just-In-Time State Completion. EDBT 2014: 73-84.

 

12.  W.J. Al Marri, Q.M. Malluhi, M. Ouzzani, M. Tang, W.G. Aref: The Similarity-Aware Relational Intersect Database Operator. SISAP 2014: 164-175. (Best paper award and invited for special issue journal publication).

 

13.  A.M. Aly, W.G. Aref, M. Ouzzani: Cost Estimation of Spatial k-Nearest-Neighbor Operators. EDBT 2015.

 

14.  A.M. Aly, A.S. Abdelhamid, A.R. Mahmood, W.G. Aref, M.S. Hassan, H. Elmeleegy, M. Ouzzani (2015). A Demonstration of AQWA: Adaptive Query-Workload-Aware Partitioning of Big Spatial Data. Proceedings of the VLDB Endowment (VLDB), demo paper.

 

15.  K. Mershad, Q.M. Malluhi, M. Ouzzani, M. Tang, W.G. Aref (2015). Approving Updates in Collaborative Databases. IEEE International Conference on Cloud Engineering.

 

16.  A. Madkour, W.G. Aref, M. Mokbel, S. Basalamah (2015). Geo-tagging Non-Spatial Concepts. ACM SIGSPATIAL MobiGIS Workshop.

 

17.  A.M. Aly, H. Elmeleegy, Y. Qi, W.G. Aref (2016). Kangaroo: Workload-Aware Processing of Range Data and Range Queries in Hadoop. 9th ACM International Conference on Web Search and Data Mining (WSDM). Status = ACCEPTED.

 

18.  R.Y. Tahboub, J. Shin, A. Abdelsalam, J.W. Aref, W.G. Aref, S. Prabhakar (2015). LIMO: Learning Programming using Interactive Map Activities (Demo Paper). ACM SIGSPATIAL. Status = ACCEPTED.

 

19.  M. Tang, Y. Yu, W.G. Aref, Q.M. Malluhi, M. Ouzzani (2015). Efficient Processing of Hamming-Distance-Based Similarity-Search Queries Over MapReduce. EDBT 2015: 361-372.

 

20.  W.G. Aref, S. Prabhakar, J. Shin, R.Y. Tahboub, A. Abdelsalam, J.W. Aref (2015). On Map-Centric Programming Environments (Vision Paper). ACM SIGSPATIAL.  Status = ACCEPTED.

 

21.  M. Tang, R.Y. Tahboub, W.G. Aref, M.J. Atallah, M. Gribskov, Q.M. Malluhi, M. Ouzzani, Y.N. Silva (2015). Similarity Group-by Operators for Multi-dimensional Relational Data. IEEE Transactions on Knowledge and Data Engineering. Status = ACCEPTED.

 

22.  A.M. Aly, W.G. Aref, M. Ouzzani (2015). Spatial Queries with k-Nearest-Neighbor and Relational Predicates. ACM SIGSPATIAL. Status = ACCEPTED.

 

23.  A.R. Mahmood, A.M. Aly, T. Qadah, A. Madkour, A.S. Abdelhamid, M.S. Hassan, W.G. Aref, S. Basalamah (2015). Tornado: A Distributed Spatio-Textual Stream Processing System. VLDB, Proceedings of the VLDB. Status = PUBLISHED.

 

24.  W.J. Al Marri, Q.M. Malluhi, M. Ouzzani, M. Tang, W.G. Aref (2015). The Similarity-Aware Relational Database Set Operators. Information Systems Journal. Status = ACCEPTED.

 

25.  Z. Pervaiz, A. Ghafoor, W.G. Aref (2015). Precision-Bounded Access Control Using Sliding-Window Query Views for Privacy-Preserving Data Streams. IEEE Trans. Knowl. Data Eng. 27(7): 1992-2004.

 

26.  M.S. Hassan, A.M. Aly, W.G. Aref, Graph Indexing for Shortest-Path Finding over Dynamic Sub-Graphs, ACM SIGMOD 2016 (Full Paper).

 

27.  A.M. Aly, M.S. Hassan, A.R. Mahmood, W.G. Aref, H. Elmeleegy, M. Ouzzani, AQWA: Adaptive Query-Workload-Aware Partitioning of Big Spatial Data, VLDB 2016 (Full Paper).

 

28.  M. Tang, R.Y. Tahboub, W.G. Aref, M.J. Atallah, Q.M. Malluhi, M. Ouzzani, Y.N. Silva: Similarity Group-by Operators for Multi-Dimensional Relational Data. IEEE Trans. Knowl. Data Eng. 28(2): 510-523 (2016).

 

29.  S. Shekhar, S.K. Feiner, W.G. Aref: Spatial Computing. Communications of the ACM 59(1): 72-81 (2016).

 

30.  A.S. Abdelhamid, M. Tang, A.M. Aly, A.R. Mahmood, T. Qadah, W.G. Aref, S. Basalamah: Cruncher: Distributed In-Memory Processing for Location-Based Services. IEEE ICDE 2016, demo paper.

 

31.  E.K. Rezig, E.C. Dragut, M. Ouzzani, A.K. Elmagarmid, W.G. Aref: ORLF: A Flexible Framework for Online Record Linkage and Fusion. IEEE ICDE 2016, demo paper.

 

Collaborators:

·       Dr. Paul Larson: Microsoft Research, Redmond, WA collaborated in writing a research paper [VLDBJ2013] that provides foundational research for the correctness ideas in this grant.

 

·       Dr. M.H. Ali: Microsoft Corporation, Redmond, Washington, collaborated in writing a research paper [VLDBJ2013] that provides foundational research for the correctness ideas in this grant.

 

·       Dr. Hicham Elmongui: Amazon, collaborated in writing a research paper [GeoInformatica2013] that presents algorithms for efficiently evaluating aggregate nearest-neighbor queries.

 

·       Prof. M. Eltabakh: WPI, Massachusetts, collaborated in writing a research paper [IEEE TKDE 2014] that presents models and techniques for integrating human activities into database systems.

 

·       Graduate Students: A. Aly (5th year Ph.D. student), M. Tang (4th year Ph.D. student), A. Mahmood (4th year Ph.D. student), R. Tahboub (4th year Ph.D. student), A. Madkour (4th year Ph.D. student), and M. Hassan (2nd year Ph.D. student) were funded at different periods under this grant and worked on various aspects of this project.

 

·       S. Pearson: Graduate student at Arizona State who collaborated in the context of generalizing the results of this project to similarity-based databases. Mr. Pearson conducted the experiments requested by the reviewers in order to revise the submitted paper, which was later accepted to appear in the VLDB Journal [VLDBJ2013].

 

·       B. Coffey: Graduate student at Purdue University who collaborated in the context of collecting meta-data about publicly available spatial data sets.

 

·       Undergraduate Students: Linda Thongsavath: Undergraduate student at Purdue University who collaborated on realizing a benchmark, for spatiotemporal databases, of multi-predicate queries that include relational, spatial, and kNN predicates.

 

·       Visitors and Post-docs:

 

o   Dr. Shen Zhihong: Chinese Academy of Sciences, Beijing, China (August 2013- October 2013) participated in designing cluster-based user workspaces.

 

o   Dr. Saleh Basalamah: Director of the GIS Technology Innovation Center, Umm Al-Qura University, Saudi Arabia (April 2013-Present), participated in coauthoring multiple research papers, including [DBCrowd2013], [IEEE BigData 2013], and [ACM SIGSPATIAL 2014].

 

o   Dr. Eduard Dragut: Dr. Dragut was a post-doc at Purdue University during 2012-2013 and participated, in the context of this project, in the crowdsourced indexing paper [DBCrowd 2013].