NSF III: Small: On the Conceptual Evaluation and Optimization of Queries in Spatiotemporal Data Systems

 

Walid G. Aref

Principal Investigator

Department of Computer Science
Purdue University


Contact Information

Walid G. Aref

Department of Computer Science

Purdue University

305 N. University Street

West Lafayette, Indiana 47907

Phone: (765) 494-1997
Fax: (765) 494-0739
Email: aref@cs.purdue.edu
URL: http://www.cs.purdue.edu/faculty/aref.html

 

Project Award Information

Project Web Page:
http://www.cs.purdue.edu/homes/aref/MPS

Project Focus:

The widespread use of smartphones, social networks, and location devices has given rise to a wide spectrum of location-based applications and services. Along with the advancement and ubiquity of these applications and services, users' queries are steadily growing in complexity and sophistication. Current location-based query systems offer shallow query interface capabilities that do not match users' expectations. This project identifies and addresses the challenges in correctly expressing and efficiently executing complex location-based queries. A user's query can be arbitrarily complex, combining multiple location-based, relational, and textual operations, e.g., location-based selections, spatial joins, relational selections, relational joins, and relational group-by's. In contrast, current location-based query systems support only one location-based operator per query, e.g., range queries or nearby queries (k-Nearest-Neighbors, or kNN, for short). Our focus is to develop a query system, termed the multi-predicate location-based query processing system, that supports any combination of location-based, relational, and textual operations.

 

Project Goals:

Spatial data management systems have become vital due to the ubiquity of location-detection devices. The challenge is not only in supporting a massive number of moving objects that generate spatiotemporal data streams, but also in supporting a massive number of location-based continuous queries. With location-based services becoming widespread and popular, location-based queries have gone beyond the simple single-predicate range and k-nearest-neighbor queries. Location-based services demand complex multi-predicate spatial (MPS, for short) queries. Although a large body of research has been devoted to continuous spatial query processing, surprisingly none addresses the processing of MPS queries. In contrast to relational queries, MPS queries may produce different results depending on the order in which their predicates are evaluated. This makes the intended semantics of an MPS query ambiguous and hampers the portability and consistency of query execution across platforms: different implementations of the same MPS query can return different answers. This project addresses this key limitation as well as related optimization issues.

 

This project has research and education goals. The research goal of the project is to develop a database query processing system, termed the multi-predicate location-based query processing system, that supports any combination of location-based and relational operations that match the needs and complexity of location-aware services. The specific research goals of the project are as follows:

 

1.   Study MPS queries and propose a consistent and correct conceptual evaluation strategy similar to the one for relational SQL queries.

 

2.   Based on this foundation that addresses the correctness of MPS query execution, the target is to develop various algebraic transformation rules that retain correctness but help transform MPS query execution plans into more efficient ones.

 

3.   Investigate the validity of well-known optimization heuristics in the context of MPS queries with the aim to identify several potential optimization scenarios unique to MPS queries.

 

4.   Develop a cost model for query optimization purposes to select the best query execution plans for MPS queries.

 

5.   Study the impact of variations in the temporal and spatial distributions of objects on MPS query execution plans.

 

6.   Develop new adaptive spatiotemporal query processing techniques that can cope with the dynamic nature and scale of moving object environments.

 

This project has a significant educational component. The target is to address the low retention of computer science freshmen by designing more interesting entry-level programming languages that go beyond the classical "Hello World" programming examples, which have shown their limits. The specific educational and training goals of this project are as follows.

 

1.   Design a map-centric programming language and programming environment that is centered around 2D and 3D map operations (e.g., using 3D Earth and Map APIs) to increase the interest of students in programming, both in CS and in Science.

 

2.   Train graduate and undergraduate students; this training will result in graduated Ph.D. students and the involvement of undergraduate students in research projects.

 

Major activities, highlights, and findings of the project are presented below.

 

 

Summary of Project Activities and Highlights:

2011-2012:

In the initial phase of this project in 2011/2012, we identified location-based application scenarios and their associated queries that current state-of-the-art location-based query systems cannot handle. We isolated, identified, and classified what makes these queries hard to handle. In our VLDB 2012 paper, we identified the first such class of queries, namely the class of 2-kNN queries, i.e., queries that contain two k-nearest-neighbor operators. In the VLDB paper, we not only illustrated that many application scenarios call for queries in this class, but also identified the challenges that arise when handling 2-kNN queries. These challenges include:

1.   Semantics Challenge: The result of a 2-kNN query differs depending on the order in which its two kNN operations are executed (see the sketch after this list).

 

2.   Correctness Challenge: How to determine what the correct answer for a 2-kNN query should be, and on what basis?

 

3.   Efficiency Challenge: How to efficiently execute a 2-kNN query while preserving correctness?

 

4.   Query Optimization Challenge: How to adapt well-known relational query optimization strategies (e.g., pushing selects below joins in query evaluation plans) so that they remain applicable in the context of 2-kNN queries?

Our VLDB 2012 paper addressed all four challenges.
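
To make this order-dependence concrete, the following minimal Python sketch (our illustration, with made-up points and focal points) evaluates the same two kNN-select operations in the two possible pipeline orders and obtains two different answers:

    import heapq

    def knn(points, q, k):
        # The k points nearest to the focal point q (squared Euclidean distance).
        return heapq.nsmallest(k, points, key=lambda p: (p[0] - q[0])**2 + (p[1] - q[1])**2)

    points = [(0, 0), (1, 0), (2, 0), (9, 0), (10, 0), (11, 0)]
    q1, q2, k = (0, 0), (10, 0), 3

    plan_a = knn(knn(points, q1, k), q2, k)  # kNN(q1) first, then kNN(q2) over its output
    plan_b = knn(knn(points, q2, k), q1, k)  # the same two predicates, opposite order
    print(sorted(plan_a))  # [(0, 0), (1, 0), (2, 0)]
    print(sorted(plan_b))  # [(9, 0), (10, 0), (11, 0)]

Both plans apply both kNN predicates, yet they return disjoint answers; this is why a conceptual evaluation order has to be fixed before a 2-kNN query has well-defined semantics.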

 

2012-2013:

In 2012/2013, our research focused on expanding the capabilities of multi-predicate location-based query servers beyond the class of 2-kNN queries along several dimensions. We explored various combinations of relational and spatial operations (Journal Papers: GeoInformatica 2013, VLDBJ 2013). We studied crowd-based database indexing techniques (DBCrowd 2013), and semantic web location-based linked data architectures (IEEE BigData 2013).

 

2013-2014:

In 2013/2014, we studied query optimization issues of queries that involve both spatial and relational predicates. We developed a novel cost estimation model for kNN operations when executed in conjunction with relational operators. We extended our work to trajectories with limited histories (SIGSPATIAL 2014). We studied the adaptivity of query execution plans in continuous streaming queries (EDBT 2014). We extended our work to "approximate" nearest-neighbor computations in the context of high-dimensional data, and to proximity-based intersect operators (SISAP 2014). We completed and published our work on dependency models where human actions are part of a database transaction (TKDE 2014).

 

2014-2015:

In 2014/2015, we completed an important milestone in supporting multi-predicate location-based query processing systems. We studied the various semantics of post- and pre-filtering for k-nearest-neighbor operators and their interactions with relational operators, and the various algorithmic optimizations that lead to orders-of-magnitude enhancements in performance (ACM SIGSPATIAL 2015). We developed an accurate cost estimation model that helps arbitrate among these optimizations and decide which one to apply at what time (EDBT 2015a). We studied how to efficiently process Hamming-distance-based similarity-search queries using similarity hashing techniques in both centralized and distributed setups (EDBT 2015b).

 

2015-2016:

In 2015/2016, we studied proximity-based group-by operations (IEEE TKDE 2016). Our study on the proximity-based set intersection operation (SISAP 2014) received the best-paper award and was invited for journal publication. The journal version (Information Systems 2016) covers all proximity-based set operators, including the proximity-based set intersect, set difference, and set union operators. Also, we studied how to use semantics over web data to learn how to geo-tag non-spatial concepts (e.g., crime, pollution, or education) and show potential locations of these non-spatial concepts on the map (MobiGIS 2015).

We developed Tornado, a prototype system for scalable organization and query processing of Big Spatial Data. Tornado (VLDB 2015 demo) addresses the needs of micro-blogging applications where text data is often associated with location data, e.g., as in tweets and check-in systems, and focuses on distributed query processing of online streamed spatio-textual data. We explored how the workload-awareness techniques that we developed for big spatial data can be extended to support range queries over general streaming data (WSDM 2016). We continued to develop the LIMO map-centric programming language environment (ACM SIGSPATIAL 2015 demo) that helps freshmen in Computer Science learn programming concepts using maps. Through the development of the LIMO system, we realized that programming environments centered around maps are a potentially important topic that deserves further research. Our vision paper (ACM SIGSPATIAL 2015 vision paper) about map-centric programming environments highlights the research challenges and open questions in the area. We also adapted spatial indexing and spatial query processing techniques to the context of access control and privacy preservation over data streams (TKDE 2015).

Along with collaborators Drs. Shashi Shekhar and Steven Feiner, Aref published two overview and visionary articles on the future of Spatial Computing (GeoInformatica 2015 and Communications of the ACM 2016); the CACM 2016 article was featured on the cover of CACM. Aref, Mokbel, and Chow won the ten-year best-paper award at the VLDB 2016 conference for their paper "The New Casper: Query Processing for Location Services without Compromising Privacy" (Mohamed F. Mokbel, Chi-Yin Chow, Walid G. Aref. VLDB 2006: 763-774), for inspiring research to support privacy in spatial query processing. The award was accompanied by a presentation at the 2016 VLDB conference, ten years after the original Casper paper. We investigated adaptive query processing techniques and adaptive caching for big spatial data in in-memory distributed spatial data systems (ICDE 2016). We developed a flexible framework for online record linkage and fusion (ICDE 2016). Finally, we developed EDP, a graph indexing technique that computes shortest paths over any subset of a graph (SIGMOD 2016); EDP allows graph updates and sub-graph selections based on categorical attributes.

 

2016-2017:

In 2016/2017, we developed two prototypes, AQWA and LocationSpark, towards scalable organization and query processing of Big Spatial Data. AQWA focuses on dynamic and online query-workload-aware spatial data organization, and achieves up to two orders of magnitude enhancement in performance over existing static big spatial data platforms (VLDB 2015 demo and VLDB 2016 full paper). In contrast, LocationSpark is a distributed in-memory data management system for big spatial data that uses advanced spatial filtering techniques. We designed a spatial version of a bloom filter and demonstrated its use in LocationSpark (VLDB 2016 demo). We studied spatial trending queries over real-time microblogs (SIGSPATIAL 2016). We designed a new query language, Atlas, to express all spatial-keyword queries (SIGSPATIAL 2016). Atlas follows the same spirit as the relational query language SQL: it introduces new simple building-block constructs (similar to SQL's selects, projects, and joins) that can be composed to express all types of spatial-keyword queries. We surveyed shortest-path algorithms (CoRR 2017). We developed in-memory distributed matrix computation processing and optimization techniques for use with large-scale graph processing algorithms (ICDE 2017). Aref and one of his Ph.D. students presented a tutorial on query processing techniques for big spatial-keyword data at the SIGMOD 2017 conference. Finally, Aref gave a keynote speech titled "Location+X Big Data Systems: Challenges and Some Solutions" at the 2017 SIGMOD GeoRich Workshop.

 

Project Accomplishments, Significant Results, and Outcomes:

The research and education goals of the project have all been achieved. We have published over 30 journal and refereed conference articles. More specifically, we published 16 journal papers in top-tier journal venues, including the IEEE Transactions on Knowledge and Data Engineering, The VLDB Journal, Communications of the ACM, GeoInformatica, Information Systems, the Proceedings of the VLDB Endowment, and the IEEE Software magazine, as well as 21 papers in top-tier conference venues, including ACM SIGMOD, VLDB, IEEE ICDE, EDBT, ACM SIGSPATIAL, SISAP, and WSDM, and a tutorial at ACM SIGMOD 2017 (on the subject of spatial-keyword query processing techniques). Moreover, our research in the scope of this project received the following awards and produced the following notable outcomes:

 

-- 2016 VLDB Endowment Award: VLDB 10-year Best Paper Award for inspiring research to support privacy in spatial query processing. The award was accompanied by a presentation at the 2016 VLDB conference titled "Location Data Management: A Tale of Two Systems and the 'Next Destination'!" (Mohamed F. Mokbel, Chi-Yin Chow, Walid G. Aref. PVLDB 9(13): 1622, 2016).

 

-- 2015 Best Demonstration Award, ACM SIGSPATIAL 2015, for our system, LIMO, a map-centric programming language environment. Our LIMO system is hosted at: http://ibnkhaldun.cs.purdue.edu:8181/limo/GWTMaps.html with sample LIMO programs.

 

-- 2014 Best Paper Award at the International Conference on Similarity Search and Applications (SISAP) 2014 for our paper "The Similarity-Aware Relational Intersect Database Operator" (Wadha J. Al Marri, Qutaibah M. Malluhi, Mourad Ouzzani, MingJie Tang, Walid G. Aref. SISAP 2014: 164-175), which was later extended and published in the Information Systems Journal (2016).

 

-- Two Ph.D. students have graduated based on the research conducted within the scope of this project.

 

-- LIMO (ACM SIGSPATIAL 2015 Demo and Vision Papers):

We finished the development of the LIMO map-centric programming language and programming environment.

LIMO can be found at: http://ibnkhaldun.cs.purdue.edu:8181/limo/GWTMaps.html

LIMO was initially designed to serve as a map-centric programming language that uses 2D and 3D maps as the programming "toy", in contrast to using robots or animations. LIMO provides high-level programming constructs that are easy to use and that help a user navigate a 2D or 3D map. The LIMO website contains 16 sample LIMO programs that demonstrate how to use the LIMO commands to interactively navigate the 2D or 3D map that is displayed on the right of the screen. In addition, users can create their own accounts and store their own LIMO programs in their hosted accounts on our machine. LIMO is not open for public use as of yet. However, after a demo of LIMO at ACM SIGSPATIAL 2015 that received the best-demo award, and a vision paper about map-centric programming languages at the same SIGSPATIAL conference, LIMO was very well received, and several researchers highlighted the need for using LIMO not only as an entry-level programming language but also as a tool for spatial computing. While the goal of LIMO as a programming language is simplicity of use, the goal of LIMO as a spatial computing tool is to be comprehensive in functionality. The use of LIMO for spatial computing is the subject of a future study.

 

-- Survey, Overview, Tutorial, and Vision Articles on Multi-Predicate Spatial Querying and Related Topics (SIGMOD'17, CACM'16, GeoInformatica'15, SIGSPATIAL'15, IEEE Software Magazine'12)

One of the important outcomes of this project is a collection of survey and overview articles that are published in prestigious outlets including:

1.   An article about Spatial Computing that was featured on the cover of the widely distributed Communications of the ACM in 2016.

 

2.   An article titled: "From GPS and virtual globes to spatial computing" that was published in the Springer Verlag GeoInformatica Journal 2015.

 

3.   An article titled: "On map-centric programming environments: vision paper" that was published in the Vision Papers section at ACM SIGSPATIAL 2015.

 

4.   A tutorial titled: "Query Processing Techniques for Big Spatial-Keyword Data" that was presented at ACM SIGMOD 2017.

 

5.   An overview article about Distributed Access Control Architectures for Cloud Computing that was published in the IEEE Software Magazine 2012.

 

-- Multi-predicate Spatial Query Processing and Optimization (VLDB'12, VLDBJ'13, SIGSPATIAL'15, EDBT'15)

With this series of papers, we have addressed all the ingredients necessary for building a multi-predicate spatial query processing engine.

In our VLDB'12 paper, we presented the first complete study for the optimization of queries with two k-nearest-neighbor (kNN) predicates. We demonstrated how traditional optimization techniques can compromise the correctness of evaluation for a query that involves two interacting kNN predicates. For different combinations of two kNN predicates, we presented efficient algorithms that guarantee the correctness of evaluation, and outperform the corresponding conceptually correct QEPs by orders of magnitude.

Our VLDB Journal'13 paper lays the foundation for correctly evaluating multi-predicate spatial queries that involve kNN predicates. We present a rich set of transformation rules for MPS queries that transform the conceptual evaluation plan into more efficient plans. The experimental evaluation of the proposed transformation rules shows that they are highly effective.

In our SIGSPATIAL'15 paper, we studied various query optimization heuristics for queries that involve combinations of kNN select/join predicates and relational predicates. We demonstrated how spatial locality can significantly enhance the performance of relational joins. The presented optimizations enhance the performance of these queries while preserving their semantics. Experimental results that are based on queries from the TPC-H benchmark and real spatial data from OpenStreetMap demonstrate that the presented optimizations can achieve orders of magnitude enhancement in query performance.

In our EDBT'15 paper, we study the problem of estimating the cost of the k-NN-Select and k-NN-Join operators, a necessary ingredient for the query optimizer to choose among candidate plans by accurately costing the kNN operators. We present various estimation techniques. Performance evaluation using real spatial datasets from OpenStreetMap demonstrates that:

 

1.   The Staircase technique for k-NN-Select cost estimation is faster than the state-of-the-art technique by more than two orders of magnitude and has better estimation accuracy;

 

2.   The Catalog-Merge and Virtual-Grid techniques for k-NN-Join cost estimation achieve less than 5 and 20 percent error-ratio, respectively, while keeping the estimation time below one microsecond and one millisecond, respectively; and

 

3.   The Virtual-Grid technique reduces the storage required to maintain the catalogs by an order of magnitude compared to the Catalog-Merge technique.

 

-- Adaptivity and Big Spatial Data Techniques (EDBT'14, VLDB'15a, VLDB'16, VLDB'16b)

We studied adaptivity for general data streams, and then in the context of big spatial data. In our EDBT'14 paper, we developed JISC, a novel technique for plan adaptation of continuous queries over data streams. During plan transition, JISC maintains steady query output in contrast to existing plan adaptation techniques that can exhibit significant output latencies. JISC employs a lazy state migration strategy that computes the missing states in the new QEP on-demand in a single QEP. The lazy migration strategy enables JISC to avoid performance thrashing, which is a vital issue in all other existing techniques. A key property of JISC is that during plan transition, it detects the unchanged parts of the QEP and avoids recomputing their states. Our probabilistic analysis demonstrates that the number of operators with unchanged states in the new QEP is usually close to the total number of operators. For that reason, JISC outperforms existing techniques that also maintain steady query output, e.g., the Parallel Track Strategy, leading to performance gains during a plan migration stage that can reach up to an order of magnitude.

In the context of big spatial data, in our VLDB'15 and VLDB'16 papers, we developed AQWA, a new adaptive and query-workload-aware partitioning mechanism for large-scale spatial data. We used Hadoop as our big data platform for illustration. In AQWA, we addressed several performance and system challenges; these include the limitations of Hadoop (i.e., the NameNode bottleneck), the overhead of rebuilding the partitions in HDFS, the dynamic nature of the data where new batches are created every day, and the issue of workload-awareness where the query workload is not only skewed, but can also change. We showed that AQWA successfully addresses these challenges and provides an efficient spatial-query processing framework. Using our implementation of AQWA on a Hadoop cluster, real spatial data from Twitter, and various workloads of range and kNN queries, we demonstrated that AQWA outperforms SpatialHadoop, the state-of-the-art system. Moreover, we demonstrated that AQWA incurs little overhead during the process of repartitioning the data, negligible when compared to the overhead of recreating the partitions from scratch. Although our experimental evaluation is based on Hadoop, AQWA can be applied to other platforms, e.g., Storm, a distributed platform for processing streaming data, where AQWA can dynamically reorganize the data in Storm's computing units (bolts) according to the query workload and the data distribution.

 

-- Big Spatial Data Streaming Techniques (ICDE'12, VLDB'15b, ICDE'16, VLDB'16, SIGMOD'16)

We have published a collection of papers that demonstrate very high throughput for spatial-keyword (e.g., micro-blogging) data streaming systems, where we demonstrated up to two orders of magnitude enhancement over existing systems.

Other findings can be found in the publications below.

 

Research Publications:

Spatial Queries with Two kNN Predicates [VLDB 2012]

The widespread use of location-aware devices has led to countless location-based services in which a user query can be arbitrarily complex, i.e., one that embeds multiple spatial selection and join predicates. Amongst these predicates, the k-Nearest-Neighbor (kNN) predicate stands out as one of the most important and widely used. Unlike related research, this research goes beyond the optimization of queries with a single kNN predicate and shows how queries with two kNN predicates can be optimized. In particular, our research addresses the optimization of queries with:

(i) two kNN-select predicates,
(ii) two kNN-join predicates, and
(iii) one kNN-join predicate and one kNN-select predicate.

For each type of query, conceptually correct query evaluation plans (QEPs) and new algorithms that optimize the query execution time are studied. Experimental results demonstrate that the proposed algorithms outperform the conceptually correct QEPs by orders of magnitude. We conducted the first complete study of the optimization of queries with two kNN predicates. We demonstrated how traditional optimization techniques can compromise the correctness of evaluation for a query that involves two interacting kNN predicates. For different combinations of two kNN predicates, we presented efficient algorithms that guarantee the correctness of evaluation and outperform the corresponding conceptually correct QEPs by orders of magnitude. The algorithms introduced are designed for snapshot queries. Applying further optimization techniques that can support incremental evaluation of continuous queries with two kNN predicates is potential future work. Moreover, we believe that the ideas introduced through this research pave the way towards a query optimizer that can support spatial queries with more than two kNN predicates.

 

M3: Stream Processing on Main-Memory MapReduce [ICDE 2012]

MapReduce has been a popular framework for parallel computing to process large-scale data on large computing clusters. However, most current implementations of the MapReduce framework support only the execution of fixed-input jobs. This restriction makes these implementations inapplicable to most streaming applications, in which queries are continuous in nature and input data streams are continuously received at high arrival rates. In this research, we realize M3, a prototype implementation of the Hadoop MapReduce framework in which continuous queries over streams of data can be efficiently answered. The data path in M3 is main-memory only, bypassing the Hadoop Distributed File System (HDFS) and any local disk accesses. M3 is resilient to machine failures and utilizes the main memory of the machines by dynamically adapting to fluctuations in the data arrival rates of the streams. M3 operates in time pulses: a data-acquisition layer of DataTasks collects the streamed data chunks that arrive during each time period, while Map and Reduce tasks execute continuously and periodically take turns to process the collected chunks. This research highlights techniques and challenges for realizing various modes of synchronization and overlap in the execution of tasks in M3. We envision that M3 will be the underlying massively distributed streaming engine for handling the city-scale streamed spatiotemporal data that we will be realizing next.

 

Similarity queries: their conceptual evaluation, transformations, and processing [VLDB Journal 2013]

This research forms the foundation for multi-predicate spatial queries that involve kNN predicates. Similarity queries are one possible generalization of range queries and kNN queries: the former can be viewed as a within-epsilon similarity operation, while the latter is a k-nearest-neighbors similarity operation. Many application scenarios can significantly benefit from the identification and processing of spatial similarities in the data. Even though some work has been done to extend the semantics of some operators, for example join and selection, to be aware of data similarities, there has not been much study of the role and implementation of similarity-aware operations as first-class database operators. Furthermore, very little work has addressed the problem of evaluating and optimizing queries that combine several similarity operations. The focus of this research is the study of similarity queries that contain one or multiple first-class similarity database operators such as Similarity Selection, Similarity Join, and Similarity Group-by. In particular, we analyze the implementation techniques of several similarity operators, introduce a consistent and comprehensive conceptual evaluation model for similarity queries, and present a rich set of transformation rules to extend cost-based query optimization to the case of similarity queries.

The focus of this research is the proposal and analysis of several similarity database operators, and the thorough study of the evaluation and optimization of similarity queries that combine multiple similarity operators. We introduce a model for the conceptual evaluation of similarity queries that clearly specifies the way a similarity query should be evaluated even if it has multiple similarity operations. We present a rich set of transformation rules for similarity queries to transform the conceptual evaluation plan into more efficient plans. Furthermore, we demonstrate that transformation rules for similarity queries can take advantage of special properties of the similarity-aware operations and the involved distance functions to enable more useful query transformations. We also extend the important Eager/Lazy Aggregation transformation rules to the case of Similarity Group-by (SGB) and Similarity Join (SJ). The experimental evaluation of the proposed rules shows that they are highly effective. We also show that similarity operators can be efficiently implemented by taking advantage of structures and mechanisms already available in DBMSs.
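
A minimal Python sketch of the epsilon flavor of a Similarity Join (our simplification; the paper implements such operators inside the DBMS, and duplicate and type handling are omitted here): after sorting both inputs, a single synchronized pass pairs the tuples whose join-attribute values are within epsilon of each other.

    def similarity_join(r, s, eps):
        # Epsilon-join: emit (a, b) for every a in r and b in s with |a - b| <= eps.
        r, s = sorted(r), sorted(s)
        out, lo = [], 0
        for a in r:
            while lo < len(s) and s[lo] < a - eps:
                lo += 1  # s[lo] is too small for this a and for every later a
            j = lo
            while j < len(s) and s[j] <= a + eps:
                out.append((a, s[j]))
                j += 1
        return out

    print(similarity_join([1.0, 5.0, 9.0], [1.2, 4.8, 8.0], eps=0.5))
    # [(1.0, 1.2), (5.0, 4.8)]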

 

Continuous aggregate nearest neighbor queries [GeoInformatica 2013]

This research addresses the problem of supporting continuous aggregate nearest-neighbor (CANN) queries for moving objects in spatio-temporal data stream management systems. A CANN query specifies a set of landmarks, an integer k, and an aggregate distance function f (e.g., min, max, or sum), where f computes the aggregate distance between a moving object and each of the landmarks. The answer to this continuous query is the set of k moving objects that have the smallest aggregate distance f. A CANN query may also be viewed as a combined set of nearest-neighbor queries. We introduce several algorithms to continuously and incrementally answer CANN queries. Extensive experimentation shows that the proposed operators outperform the state-of-the-art algorithms by up to a factor of 3 and incur low memory overhead. P-CANN is a best-effort progressive algorithm that associates thresholds with the individual landmarks; these thresholds are used to eagerly prune the moving objects. Different elimination-order policies are identified to specify the order of the landmarks in the computation of the aggregate distance in P-CANN. The optimization of P-CANN and how to assign the thresholds are addressed. In the extensive experiments performed, we observe cases whose performance improves by up to a factor of 3 over the state-of-the-art algorithm: P-CANN outperforms both H-CANN and the CPM algorithm (the state of the art). For the optimization of P-CANN, the worst-fit elimination policy gives the least penalty for sum (an additional 14% CPU cost away from optimal) when we do not use the prohibitively expensive optimal order, while the best-fit elimination policy gives the least penalty for the max case. The optimization time of P-CANN is about 100 msec for typical CANN queries. P-CANN, being a best-effort algorithm, produces about 95% of the required output size on average. As future work, we will investigate continuous aggregate nearest-neighbor queries on moving objects in road networks. First, we would like to develop similar incremental algorithms using the road-network distance instead of the Euclidean distance between the objects. Second, we would like to make CANN a first-class operator in a data stream management system so that the CANN operator is considered while optimizing the query evaluation plan.
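
For illustration, the answer to a CANN query over a single snapshot of object locations can be stated in a few lines of Python (a naive re-evaluation from scratch; the operators proposed in the paper compute this incrementally as the objects move):

    import heapq, math

    def cann(objects, landmarks, k, f=sum):
        # The k objects with the smallest aggregate distance to the landmarks,
        # where the aggregate function f can be sum, min, or max.
        def agg_dist(o):
            return f(math.dist(o, lm) for lm in landmarks)
        return heapq.nsmallest(k, objects, key=agg_dist)

    objects = [(0, 0), (3, 4), (6, 8), (1, 1)]
    landmarks = [(0, 0), (2, 2)]
    print(cann(objects, landmarks, k=2))         # smallest total distance
    print(cann(objects, landmarks, k=2, f=max))  # smallest maximum distance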

 

The Palm-tree Index: Indexing with the crowd [DBCrowd 2013]

This research addresses using humans (the crowd) to perform database indexing and searching operations. Crowdsourcing services allow employing human intelligence in tasks that are difficult to accomplish with computers such as image tagging and data collection. At a relatively low monetary cost and through web interfaces such as Amazon’s Mechanical Turk (AMT), humans can act as a computational operator in large systems. Recent work has been conducted to build database management systems that can harness the crowd power in database operators, such as sort, join, count, etc. The fundamental problem of indexing within crowd-sourced databases has not been studied. In this research, we study the problem of tree-based indexing within crowd-enabled databases. We investigate the effect of various tree and crowdsourcing parameters on the quality of index operations. We propose new algorithms for index search, insert, and update. We propose a taxonomy of the problem and highlight its main challenges. We propose techniques for index construction and querying. We intend to complete this study with an extensive analysis and experimental evaluation. Our current work focuses on one-dimensional indexing. We plan to study other variations of crowd-sourced indexing, such as multidimensional indexing, spatial and spatio-temporal indexing and fuzzy indexing.

 

Knowledge Cubes - A Proposal for Scalable and Semantically-Guided Management of Big Data [IEEE BigData 2013]

A Knowledge Cube, or cube for short, is an intelligent and adaptive database instance capable of storing, analyzing, and searching data. Each cube is established based on a set of semantic aspects, e.g., (1) Topical, (2) Contextual, (3) Spatial, or (4) Temporal. A cube specializes in handling data that is relevant only to its semantic aspect. Knowledge cubes are inspired by two prime architectures: (1) "Dataspaces", which provides an abstraction for data management where heterogeneous data sources can co-exist and no unifying schema needs to be specified in advance, and (2) "Linked Data", which provides a set of best practices for publishing and interlinking structured data on the web. A knowledge cube uses the best practices of Linked Data as the main building block for its data layer and encompasses some of the data integration abstractions defined by Dataspaces. In this research, we propose knowledge cubes as a semantically-guided data management architecture, where data management is influenced by the data semantics rather than by a predefined scheme. Knowledge cubes support the five pillars of Big Data, also known as the five V's: Volume, Velocity, Veracity, Variety, and Value. Many interesting opportunities can be leveraged when learning the semantics of the data. In this research, we highlight these opportunities and propose a strawman design for knowledge cubes along with the research challenges that arise when realizing them. Our research forms a vision for the knowledge cube architecture that motivates understanding the semantics of the data as a means for solving some of the research challenges we face today, specifically in Big Data. We also motivate the use of Linked Data as a means for achieving this target, given the rich semantics that Linked Data provides. We illustrate some of the challenges and potential solutions associated with the knowledge cubes proposal. Finally, we demonstrate how the proposed architectural components can be used to tackle the five pillars of Big Data.

 

JISC: Adaptive Stream Processing Using Just-In-Time State Completion [EDBT 2014]

The continuous and dynamic nature of data streams may lead a query execution plan (QEP) of a long-running continuous query to become suboptimal during execution, and hence the plan will need to be altered. The ability to perform an efficient and flawless transition to an equivalent, yet optimal, QEP is essential for a data stream query processor. Such a transition is challenging for plans with stateful binary operators, such as joins, where the states of the QEP have to be maintained during query transition without compromising the correctness of the query output. In this research, we develop Just-In-Time State Completion (JISC), a new technique for query plan migration. JISC does not cause any halt to the query execution, and thus allows the query to maintain steady output. JISC is applicable to pipelined as well as eddy-based query evaluation frameworks. During plan transition, JISC maintains steady query output, in contrast to existing plan adaptation techniques that can exhibit significant output latencies. Maintaining steady query output is essential for applications that require continuous monitoring or real-time response. JISC employs a lazy state migration strategy that computes the missing states in the new QEP on demand in a single QEP. In highly dynamic scenarios where the queries and streams have fluctuating selectivities and arrival rates, overlapped plan transitions are imminent. The lazy migration strategy enables JISC to avoid performance thrashing, which is a vital issue in all other existing techniques. A key property of JISC is that, during plan transition, it detects the unchanged parts of the QEP and avoids recomputing their states. Our probabilistic analysis demonstrates that the number of operators with unchanged states in the new QEP is usually close to the total number of operators. For that reason, JISC outperforms existing techniques that also maintain steady query output, e.g., the Parallel Track Strategy, leading to performance gains during the plan migration stage that can reach up to an order of magnitude.
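
The state-reuse idea at the heart of JISC can be sketched in a few lines of Python (our abstraction, with hypothetical plan representations, not the JISC implementation): joins that appear in both the old and the new QEP keep their states, while the remaining states are merely recorded as missing and are completed just in time, when a tuple first probes them.

    def migrate(old_plan, new_plan):
        # Each plan maps a join operator (identified by its input pair) to its state.
        reused = {op: old_plan[op] for op in new_plan if op in old_plan}
        missing = [op for op in new_plan if op not in old_plan]
        return reused, missing  # missing states are built lazily, on first probe

    old_plan = {("R", "S"): "state_RS", ("RS", "T"): "state_RST", ("RST", "U"): "state_RSTU"}
    new_plan = {("R", "S"): None, ("T", "U"): None, ("RS", "TU"): None}
    reused, missing = migrate(old_plan, new_plan)
    print(reused)   # {('R', 'S'): 'state_RS'} -- unchanged sub-plan, state kept
    print(missing)  # [('T', 'U'), ('RS', 'TU')] -- completed on demand, no halt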

 

Indexing Recent Trajectories of Moving Objects [ACM SIGSPATIAL 2014]

Advances in location-aware and object-tracking devices have led to the development of location-based services that process large amounts of spatio-temporal data. The need for efficient processing of queries on the locations of moving objects calls for trajectory-based indexing methods. These indexes need to capture both the spatial and temporal dimensions of the data in a way that minimizes the number of disk I/Os required for both updating and querying. Most existing spatio-temporal index structures capture either the current locations of the moving objects or the entire history of the moving objects. However, many applications require that only the recent history of a moving object's trajectory be stored. This research introduces the trails-tree, a disk-based data structure for indexing the recent history of moving objects' trajectories. The trails-tree maintains a temporal sliding window over trajectories and uses: (1) an in-memory memo structure that reduces the I/O cost of updates using a lazy-update mechanism, and (2) lazy vacuum-cleaning mechanisms to delete the parts of the trajectories that fall out of the sliding window. Theoretical analysis and experimental evaluation illustrate that the trails-tree outperforms the state-of-the-art index structures for indexing recent trajectory data by up to a factor of two.

 

The Similarity-aware Relational Intersect Database Operator [SISAP 2014] (Best Paper Award)

Identifying similarities in large datasets is an essential operation in many applications such as bioinformatics, pattern recognition, and data integration. To make the underlying database system similarity-aware, the core relational operators have to be extended. Several similarity-aware relational operators have been proposed that introduce similarity processing at the database engine level, e.g., similarity joins and similarity group-by. In this research, we extend the semantics of the set intersection operator to operate over similar values.

We introduced the semantics and extended SQL syntax of the similarity-based set intersection operator. We developed an algorithm that is based on a Mark/Restore mechanism to avoid O(n^2) complexity. The proposed operator is implemented inside an open-source database system, namely PostgreSQL. Several queries from the TPC-H benchmark are extended to include similarity-based set intersection predicates. Our implementation of the proposed operator outperforms queries that produce the same result using only regular operations: the speedup ranges from 1000x down to 4x as the similarity threshold grows from 0.01% to 10% of the attribute domain range. We also demonstrated that the added functionality is achieved without significant overhead when compared to the standard operators.
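
The merge-based idea can be sketched as follows in Python (a simplification of the paper's Mark/Restore implementation inside PostgreSQL; duplicate handling and SQL set semantics are omitted): after sorting both inputs, one synchronized pass decides, for each left value, whether some right value lies within the similarity threshold.

    def similarity_intersect(a, b, eps):
        # Values of a that have at least one value of b within distance eps.
        a, b = sorted(set(a)), sorted(set(b))
        out, j = [], 0
        for x in a:
            while j < len(b) and b[j] < x - eps:
                j += 1  # b[j] is too small for x and for every later x
            if j < len(b) and b[j] <= x + eps:
                out.append(x)
        return out

    print(similarity_intersect([1.0, 2.0, 7.0], [1.9, 6.6], eps=0.5))
    # [2.0, 7.0] -- found in one pass, avoiding the O(n^2) pairwise comparison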

 

HandsOn DB: Managing Data Dependencies Involving Human Actions [TKDE 2014]

Consider two values, x and y, in the database, where y = F(x). To maintain the consistency of the data, whenever x changes, F needs to be executed to re-compute y and update its value in the database. This is straightforward when F can be executed by the DBMS, e.g., an SQL or C function. In this research (TKDE 2014), we address the more challenging case where F is a human action, e.g., conducting a wet-lab experiment, taking manual measurements, or collecting instrument readings. In this case, when x changes, y remains invalid (inconsistent with the current value of x) until the human action involved in the derivation is performed and its output result is reflected into the database. Many application domains, e.g., scientific applications in biology, chemistry, and physics, contain multiple such derivations and dependencies that involve human actions. In this research, we introduce HandsOn DB, a prototype database engine for managing dependencies that involve human actions while maintaining the consistency of the derived data. HandsOn DB includes the following features: (1) semantics and syntax for interfaces through which users can register human activities into the database and express the dependencies of data items on these activities; (2) mechanisms for invalidating and revalidating the derived data; and (3) new operator semantics that alert users when the returned query results contain potentially invalid data, and that enable evaluating queries on either valid data only, or on both valid and potentially invalid data. Performance results are presented that study the overheads associated with these features and demonstrate the feasibility and practicality of realizing HandsOn DB.
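
A toy Python sketch of the invalidation mechanism (our illustration, not HandsOn DB's actual engine): each derived item records the inputs it depends on; updating an input marks its dependents as potentially invalid until the human action is re-performed and its result is stored back.

    class Store:
        def __init__(self):
            self.vals, self.valid, self.deps = {}, {}, {}

        def put(self, item, value, depends_on=()):
            self.vals[item], self.valid[item] = value, True
            for src in depends_on:
                self.deps.setdefault(src, set()).add(item)

        def update(self, item, value):
            self.put(item, value)
            for d in self.deps.get(item, ()):
                self.valid[d] = False  # stays invalid until the human re-derives it

        def query(self, item):
            flag = "" if self.valid[item] else " (potentially invalid)"
            return f"{item} = {self.vals[item]}{flag}"

    db = Store()
    db.put("x", 10)
    db.put("y", 42, depends_on=["x"])  # y = F(x), where F is a wet-lab experiment
    db.update("x", 11)
    print(db.query("y"))               # y = 42 (potentially invalid)
    db.put("y", 45, depends_on=["x"])  # the human performs F again; y is revalidated
    print(db.query("y"))               # y = 45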

 

LIMO: A Web-based Environment for Learning Programming using Interactive Map Operations [Technical Report 2014]

Learning how to write computer programs is often challenging for novices with little or no coding experience. Many pedagogical programming environments motivate programming with intuitive syntax-free interfaces, interesting real-world contexts, and animated output. In this research, we introduce LIMO, a web-based programming environment that is centered around operations on interactive geographical maps, location-oriented data, and operations on synthetic objects that move on the maps. LIMO materializes a low-cost open-ended environment that integrates interactive maps and spatial data (e.g., OpenStreetMap). The unique advantage of LIMO is that it relates programming concepts to geographical maps and locations that have a basis in reality. LIMO offers an environment for learning how to program by providing 1) map and programming constructs, 2) high-quality interactive map graphics, and 3) example programs that introduce users to writing programs in the LIMO environment.

 

AQWA: Adaptive Query-Workload-Aware Partitioning of Big Spatial Data [VLDB 2016 – Full Paper] (paper) (AQWA code)

In support of big spatial data, this research focuses on the data layer in a cluster-based platform. The motivation for this research is the unprecedented spread of location-aware devices that has resulted in a plethora of location-based services in which huge amounts of spatial data need to be efficiently processed by large-scale computing clusters. Existing cluster-based systems for processing spatial data employ static data partitioning structures that cannot adapt to data changes, and that are insensitive to the query-workload. Hence, these systems are not able to consistently provide good performance. To close this gap, we present AQWA, an adaptive and query-workload-aware data partitioning mechanism for processing large-scale spatial data. AQWA does not assume prior knowledge of the data distribution or the query-workload. Instead, AQWA incrementally reacts to changes in both the data and the query-workload. As data is consumed and queries are processed, the data partitions are incrementally updated. With extensive experiments using real spatial data from Twitter, we demonstrate that AQWA can achieve an order of magnitude gain in query performance compared to standard spatial partitioning structures.
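
A condensed Python sketch of the workload-aware splitting criterion (our one-dimensional abstraction of AQWA's mechanism, not its actual implementation): a partition boundary is chosen to minimize an estimated workload cost, computed from the number of points each recent query would have to scan.

    def workload_cost(partitions, queries):
        # Each query pays the point count of every partition it overlaps,
        # a proxy for the data that must be scanned to answer it.
        cost = 0
        for qlo, qhi in queries:
            for plo, phi, npoints in partitions:
                if qlo < phi and qhi > plo:
                    cost += npoints
        return cost

    def best_split(points, queries):
        # Try each candidate boundary; keep the workload-cheapest split.
        best = None
        for s in sorted(points)[1:-1]:
            left = (min(points), s, sum(1 for p in points if p < s))
            right = (s, max(points) + 1, sum(1 for p in points if p >= s))
            c = workload_cost([left, right], queries)
            if best is None or c < best[0]:
                best = (c, s)
        return best

    points = [1, 2, 3, 50, 51, 52, 53]
    queries = [(0, 5), (49, 55), (50, 54)]  # the workload is skewed to the right
    print(best_split(points, queries))      # (14, 50): split between the clusters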

 

Spatial Queries with k-Nearest-Neighbor and Relational Predicates [ACM SIGSPATIAL 2015 – Full Paper] (paper) (Code)

This research is an important milestone for this project as it shows all the feasible optimizations when k-Nearest-Neighbor (kNN, for short) predicates are intermixed with relational predicates in SQL-like query languages. This research is motivated by the ubiquity of location-aware devices and smartphones that has unleashed an unprecedented proliferation of location-based services that require processing queries with both spatial and relational predicates. Many algorithms and index structures already exist for processing kNN predicates either solely or when combined with textual keyword search. Unfortunately, there has not been enough study on how to efficiently process queries where kNN predicates are combined with general relational predicates, i.e., ones that have selects, joins and group-by’s. One major challenge is that because the kNN is a ranking operation, applying a relational predicate before or after a kNN predicate in a query evaluation pipeline (QEP, for short) can result in different outputs, and hence leads to different query semantics. In particular, this renders classical relational query optimization heuristics, e.g., pushing selects below joins, inapplicable. This research introduces and studies various query optimization heuristics for queries that involve combinations of kNN select/join predicates and relational predicates. The proposed optimizations can significantly enhance the performance of these queries while preserving their semantics. Experimental results that are based on queries from the TPC-H benchmark and real spatial data from OpenStreetMap demonstrate that the proposed optimizations can achieve orders of magnitude enhancement in query performance.
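
The semantic pitfall is easy to reproduce. In the hedged Python sketch below (synthetic data, one-dimensional locations), pushing the relational select below the kNN-select changes the answer: filtering first and then taking the k nearest ("pre-filter") differs from taking the k nearest and then filtering ("post-filter").

    import heapq

    def knn(rows, q, k):
        # rows are (location, rating) pairs, ranked by distance to focal point q.
        return heapq.nsmallest(k, rows, key=lambda r: abs(r[0] - q))

    restaurants = [(1, 2.0), (2, 5.0), (3, 2.5), (8, 4.5)]  # (location, rating)
    q, k = 0, 2
    good = lambda r: r[1] >= 4.0                            # relational predicate

    post_filter = [r for r in knn(restaurants, q, k) if good(r)]
    pre_filter = knn([r for r in restaurants if good(r)], q, k)
    print(post_filter)  # [(2, 5.0)]           -- of the 2 nearest, keep the good ones
    print(pre_filter)   # [(2, 5.0), (8, 4.5)] -- the 2 nearest among the good ones

Both plans are plausible readings of the same query, which is exactly why the optimizer cannot blindly push the select below the kNN operator.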

 

Cost Estimation of Spatial k-Nearest-Neighbor Operators [EDBT 2015] (paper) (Code)

Advances in geo-sensing technology have led to an unprecedented spread of location-aware devices. In turn, this has resulted in a plethora of location-based services in which huge amounts of spatial data need to be efficiently consumed by spatial query processors. For a spatial query processor to properly choose among the various query processing strategies, the cost of the spatial operators has to be estimated. In this research, we study the problem of estimating the cost of the spatial k-nearest-neighbor (k-NN, for short) operators, namely, k-NN-Select and k-NN-Join. Given a query that has a k-NN operator, the objective is to estimate the number of blocks that are going to be scanned during the processing of this operator. Estimating the cost of a k-NN operator is challenging for several reasons. For instance, the cost of a k-NN-Select operator is directly affected by the value of k, the location of the query focal point, and the distribution of the data. Hence, a cost model that captures these factors is relatively hard to realize. This study introduces cost estimation techniques that maintain a compact set of catalog information that can be kept in main memory to enable fast estimation via lookups. A detailed study of the performance and accuracy trade-off of each proposed technique is presented. Experimental results using real spatial datasets from OpenStreetMap demonstrate the robustness of the proposed estimation techniques.
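
As a minimal illustration of catalog-based estimation (our sketch, not the paper's Staircase, Catalog-Merge, or Virtual-Grid techniques), one can precompute, for a few anchor values of k, the radius that encloses the k nearest points around sampled focal points, and then answer a cost query by interpolating a radius and converting the resulting disk area into a block count:

    import bisect, math

    # Hypothetical catalog for one region, built offline from the data:
    # for anchor values of k, the radius that encloses the k nearest points.
    anchor_k = [1, 10, 100, 1000]
    anchor_radius = [0.5, 1.6, 5.0, 16.0]

    def estimate_blocks(k, blocks_per_unit_area=2.0):
        # Interpolate an enclosing radius for this k, then charge the blocks
        # overlapping the corresponding disk around the query focal point.
        i = bisect.bisect_left(anchor_k, k)
        if i == 0:
            r = anchor_radius[0]
        elif i == len(anchor_k):
            r = anchor_radius[-1]
        else:
            t = (k - anchor_k[i - 1]) / (anchor_k[i] - anchor_k[i - 1])
            r = anchor_radius[i - 1] + t * (anchor_radius[i] - anchor_radius[i - 1])
        return math.ceil(math.pi * r * r * blocks_per_unit_area)

    print(estimate_blocks(50))  # a catalog lookup plus interpolation; no data access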

 

A Demonstration of AQWA: Adaptive Query Workload Aware Partitioning of Big Spatial Data [VLDB 2015 Demo paper] (Demo paper) (AQWA code)

During the 2014-2015 academic year, we realized a prototype open-source system for AQWA. AQWA is motivated by the need to efficiently organize Big Spatial Data. The ubiquity of location-aware devices, e.g., smartphones and GPS devices, has led to a variety of location-based services in which huge amounts of geo-tagged information are created every day. This prototype system demonstrates AQWA, a novel mechanism for partitioning large-scale spatial data over distributed clusters. Unlike existing cluster-based systems, e.g., SpatialHadoop, that apply static partitioning of spatial data, AQWA has the ability to react to changes in the query workload or data distribution. A key feature of AQWA is that it does not assume prior knowledge of the query workload or data distribution. Instead, AQWA adapts to changes in the query workload or data distribution in an online fashion by incrementally updating the partitioning of the data in a way that enhances the query execution time. We demonstrate two real prototypes of AQWA applied to both the Hadoop and Spark platforms. In both prototypes, we process spatial range and k-nearest-neighbor queries over large-scale real and synthetic datasets, and explore the performance of AQWA under different query workloads. Currently, we are making the code for AQWA on Hadoop publicly available.

 

Tornado: A Distributed SpatioTextual Stream Processing System [VLDB 2015 Demo paper] (Demo paper)

During the 2014-2015 academic year, we realized a prototype open-source system termed Tornado. Tornado is motivated by the need to efficiently query Big Spatial Data. Nowadays, most spatial data is also associated with text data. The widespread use of location-aware devices together with the increased popularity of micro-blogging applications (e.g., Twitter) has led to the creation of large streams of spatio-textual data. In order to serve real-time applications, the processing of these large-scale spatio-textual streams needs to be distributed. However, existing distributed stream processing systems (e.g., Spark and Storm) are not optimized for spatial or textual content. In this demonstration, we introduce Tornado, a distributed in-memory spatio-textual stream processing server that extends Storm. To efficiently process spatio-textual streams, Tornado introduces a spatio-textual indexing layer into the architecture of Storm. The indexing layer is adaptive, i.e., it dynamically re-distributes the processing across the system according to changes in the data distribution and/or the query workload. In addition to keywords, higher-level textual concepts are identified and are semantically matched against spatio-textual queries. Tornado provides data de-duplication and fusion to eliminate redundant textual data. We demonstrate a prototype of Tornado running against real Twitter streams, where users can register continuous or snapshot spatio-textual queries using a map-assisted query interface.
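
A toy Python sketch of the spatio-textual filtering idea (our simplification of Tornado's indexing layer, not its implementation): continuous queries are registered in a uniform grid according to their spatial ranges, and each incoming geo-tagged message is routed only to the queries in its cell whose keywords it contains.

    from collections import defaultdict

    CELL = 10  # a uniform grid with CELL x CELL units per cell (toy resolution)

    class SpatioTextualIndex:
        def __init__(self):
            self.grid = defaultdict(list)  # grid cell -> registered queries

        def register(self, qid, xr, yr, keywords):
            # Insert the query into every grid cell its rectangle overlaps.
            for cx in range(int(xr[0] // CELL), int(xr[1] // CELL) + 1):
                for cy in range(int(yr[0] // CELL), int(yr[1] // CELL) + 1):
                    self.grid[(cx, cy)].append((qid, xr, yr, set(keywords)))

        def match(self, x, y, text):
            words = set(text.lower().split())
            return [qid
                    for qid, xr, yr, kws in self.grid[(int(x // CELL), int(y // CELL))]
                    if xr[0] <= x <= xr[1] and yr[0] <= y <= yr[1] and kws <= words]

    idx = SpatioTextualIndex()
    idx.register("q1", (0, 15), (0, 15), ["coffee"])
    idx.register("q2", (40, 60), (40, 60), ["traffic", "jam"])
    print(idx.match(5, 5, "best coffee in town"))          # ['q1']
    print(idx.match(50, 50, "huge traffic jam downtown"))  # ['q2']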

 

Kangaroo: Workload-Aware Processing of Range Queries in Hadoop [WSDM 2016 – Full Paper] (paper)

Despite the importance and widespread use of range data, e.g., time intervals, spatial ranges, etc., little attention has been devoted to studying the processing and querying of range data in the context of big data. The main challenge lies in the centralized nature of traditional index structures, e.g., the B-Tree and the R-Tree, which renders them almost crippled when deployed in a distributed environment. To address this challenge, in this study we prototype Kangaroo, a system built on top of Hadoop to optimize the execution of range queries over range data. The main idea behind Kangaroo is to split the data into non-overlapping partitions in a way that minimizes the query execution time. Kangaroo is query-workload-aware, i.e., it produces partitioning layouts that minimize the query processing time of given query patterns. In this study, we present and address the design challenges of deploying Kangaroo on top of a distributed file system, i.e., HDFS. We also study four different partitioning schemes that Kangaroo can support. With extensive experiments using real range data of more than one billion records and a real query workload of more than 30,000 queries, we show that the partitioning schemes of Kangaroo can significantly reduce the I/O of range queries on range data.

 

Graph Indexing for Shortest-Path Finding over Dynamic Sub-Graphs [SIGMOD 2016 – Full Paper] (paper)

A variety of applications spanning various domains, e.g., social networks, transportation, and bioinformatics, have graphs as first-class citizens. These applications share a vital operation, namely, finding the shortest path between two nodes. In many scenarios, users are interested in filtering the graph before finding the shortest path. For example, in social networks, one may need to compute the shortest path between two persons on a sub-graph containing only family relationships. This research focuses on dynamic graphs with labeled edges, where the target is to find a shortest path after filtering some edges based on user-specified query labels. This problem is termed the Edge-Constrained Shortest Path query (or ECSP, for short). This research introduces Edge-Disjoint Partitioning (EDP, for short), a new technique for efficiently answering ECSP queries over dynamic graphs. EDP has two main components: a dynamic index that is based on graph partitioning, and a traversal algorithm that exploits the regular patterns of the answers of ECSP queries. EDP partitions the graph based on the labels of the edges. On demand, EDP computes specific sub-paths within each partition and updates its index. The computed sub-paths are cached and can be leveraged by future queries. To answer an ECSP query, EDP connects sub-paths from different partitions using its efficient traversal algorithm. EDP can dynamically handle various types of graph updates, e.g., label, edge, and node updates. The index entries that are potentially affected by graph updates are invalidated and are re-computed on demand. EDP is evaluated using real graph datasets from various domains. Experimental results demonstrate that EDP can achieve query performance gains of up to four orders of magnitude in comparison to state-of-the-art techniques.
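
For reference, an ECSP query can be answered without any index by a Dijkstra search that simply skips the filtered-out edge labels; the Python sketch below states that baseline (EDP's contribution is to avoid this full traversal via its edge-disjoint partitions and cached sub-paths).

    import heapq

    def ecsp(graph, src, dst, allowed):
        # graph: node -> list of (neighbor, weight, label); shortest path using
        # only the edges whose label is in the user-specified set 'allowed'.
        dist, pq = {src: 0}, [(0, src)]
        while pq:
            d, u = heapq.heappop(pq)
            if u == dst:
                return d
            if d > dist.get(u, float("inf")):
                continue
            for v, w, label in graph.get(u, ()):
                if label in allowed and d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(pq, (d + w, v))
        return None  # dst is unreachable under this label constraint

    g = {"a": [("b", 1, "family"), ("c", 1, "friend")],
         "b": [("d", 1, "family")],
         "c": [("d", 1, "friend")]}
    print(ecsp(g, "a", "d", {"family"}))    # 2, via b
    print(ecsp(g, "a", "d", {"coworker"}))  # None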

 

Cruncher: Distributed In-Memory Processing for Location-Based Services [IEEE ICDE 2016] (Demo paper)

Advances in location-based services (LBS) demand high-throughput processing of both static and streaming data. Recently, many systems have been introduced to support distributed main-memory processing to maximize the query throughput. However, these systems are not optimized for spatial data processing. In this demonstration, we showcase Cruncher, a distributed main-memory spatial data warehouse and streaming system. Cruncher extends Spark with adaptive query processing techniques for spatial data. Cruncher uses dynamic batch processing to distribute the queries and the data streams over commodity hardware according to an adaptive partitioning scheme. The batching technique also groups and orders the overlapping spatial queries to enable inter-query optimization. Both the data streams and the offline data share the same partitioning strategy, which allows for data co-locality optimization. Furthermore, Cruncher uses an adaptive caching strategy to maintain the frequently used location data in main memory. Cruncher maintains operational statistics to optimize query processing, data partitioning, and caching at runtime. We demonstrate two LBS applications over Cruncher using real datasets from OpenStreetMap and two synthetic data streams. We demonstrate that Cruncher achieves order(s) of magnitude throughput improvement over Spark when processing spatial data.

 

Efficient Processing of Hamming-Distance-Based Similarity Search Queries Over MapReduce [EDBT 2015] (paper)

Similarity and proximity searches are crucial to many applications. Of particular interest are two flavors of the Hamming distance range query, namely, the Hamming select and the Hamming join (Hamming-select and Hamming-join, respectively). Hamming distance is widely used in approximate near-neighbor search for high-dimensional data, such as images and document collections. For example, using predefined similarity hash functions, high-dimensional data is mapped into one-dimensional binary codes that are then linearly scanned to perform Hamming-distance comparisons. These distance comparisons on the binary codes are usually costly and often involve excessive redundancies. In this study, we introduce a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming distance range queries. An efficient search algorithm based on the HA-Index is presented. A distributed version of the HA-Index is introduced, and algorithms for realizing Hamming-distance select and join operations on a MapReduce platform are prototyped. Extensive experiments using real datasets demonstrate that the HA-Index and the corresponding search algorithms achieve up to two orders of magnitude speedup over existing state-of-the-art approaches, while saving more than a factor of ten in memory space.
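
For concreteness, the baseline that the HA-Index improves upon is a linear scan that tests each binary code with an XOR-and-popcount distance computation; a minimal Python sketch (binary codes stored as integers):

    def hamming(a, b):
        # Hamming distance between two binary codes stored as integers.
        return bin(a ^ b).count("1")

    def hamming_select(codes, query, r):
        # Hamming-select: all codes within distance r of the query code.
        return [c for c in codes if hamming(c, query) <= r]

    codes = [0b101010, 0b101011, 0b010101, 0b111010]
    print([bin(c) for c in hamming_select(codes, 0b101010, r=1)])
    # ['0b101010', '0b101011', '0b111010']

The HA-Index avoids repeating such comparisons across redundant groups of codes, which is where the reported speedups come from.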

 

Geo-tagging Non-Spatial Concepts [ACM SIGSPATIAL Mobile GIS workshop 2015] (paper)

This research addresses a new and interesting type of query, namely, geo-tagging non-spatial concepts. For example, consider the following queries: show on the map potential crime locations in Chicago, or show on the map potential pollution regions in San Francisco. Because “crime” and “pollution” are non-spatial concepts, it is hard to pinpoint a location on the map at which to place them. In general, concept geo-tagging is the process of assigning a textual identifier that describes a real-world entity to a physical geographic location. A concept can either be a spatial concept, where it possesses a spatial presence, or a non-spatial concept, where it has no explicit spatial presence. Geo-tagging locations with non-spatial concepts is a very useful and important operation, but it is also very challenging precisely because a non-spatial concept, e.g., crime, has no explicit location to anchor it. In this study, we propose to use the semantic information associated with concepts and locations, such as their types, as a means of identifying these relations. The co-occurrence of spatial and non-spatial concepts within the same textual resources, e.g., on the web, can be an indicator of a relationship between these spatial and non-spatial concepts. Techniques are presented for learning and modeling relations among spatial and non-spatial concepts from web textual resources. Co-occurring concepts are extracted and modeled as a graph of relations. This graph is used to infer the location types related to a concept, where a location type can be a hospital, a restaurant, an educational facility, and so forth. Due to the immense number of relations that are generated by the extraction process, a semantically-guided query processing algorithm is introduced to prune the graph to the most relevant set of related concepts. For each concept, the most relevant types are matched against the location types. Experiments evaluate the proposed algorithm based on its filtering efficiency and the relevance of the discovered relationships. Performance results illustrate how semantically-guided query processing can outperform the baseline in terms of efficiency and relevancy. The proposed approach achieves an average precision of 74% across three different datasets.
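
A toy version of the co-occurrence idea is sketched below (Python; the location-type vocabulary and documents are invented for the example, and the actual system mines web-scale text into a much richer relation graph). Location types are ranked by how often they co-occur with the non-spatial concept in the same document.

```python
from collections import Counter

LOCATION_TYPES = {"bar", "alley", "factory", "park", "school"}  # made up

def related_types(concept, documents):
    """Rank location types by how often they co-occur with `concept`
    in the same document; these counts play the role of the edges in
    the relation graph described above."""
    counts = Counter()
    for doc in documents:
        words = set(doc.lower().split())
        if concept in words:
            counts.update(words & LOCATION_TYPES)
    return counts.most_common()

docs = ["crime reported near the bar last night",
        "a dark alley where a crime occurred",
        "the school fair was a success"]
print(related_types("crime", docs))  # [('bar', 1), ('alley', 1)]
```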

 

LIMO: Learning Programming using Interactive Map Activities [ACM SIGSPATIAL 2015 Demo Paper] (Demo paper)

Advances in geographic information and interactive two- and three-dimensional map visualization, together with the proliferation of mobile devices and location data, have tremendously benefited the development of geo-educational applications. We demonstrate LIMO, a web-based programming environment that is centered around operations on interactive geographical maps, location-oriented data, and the operations of synthetic objects that move on the maps. LIMO materializes a low-cost, open-ended environment that integrates interactive maps and spatial data (e.g., OpenStreetMap). The unique advantage of LIMO is that it relates programming concepts to interactive geographical maps and location data. LIMO offers an environment for students to learn how to program by providing (1) an easy-to-program library of map and spatial operations, (2) high-quality interactive map graphics, and (3) example programs that introduce users to writing programs in the LIMO environment.

 

On Map-Centric Programming Environments [ACM SIGSPATIAL 2015 Vision Paper] (Vision paper)

2D maps and 3D globes can be used as programming 'toys' to help students learn programming, in contrast to the robots and the visual or multimedia components used in introductory Computer Science I programming courses. This study exposes research challenges related to supporting this concept, and presents one instance of a 2D and 3D map-based programming environment to motivate these challenges.

 

Similarity Group-by Operators for Multi-dimensional Relational Data [IEEE TKDE 2016] (paper)

The SQL group-by operator plays an important role in summarizing and aggregating large datasets in a data analytics stack. While the standard group-by operator, which is based on equality, is useful in several applications, allowing proximity-aware grouping provides a more realistic view of real-world data that can lead to better insights. The Similarity SQL-based Group-By operator (SGB, for short) extends the semantics of the standard SQL group-by by grouping data with similar, but not necessarily equal, values. While existing similarity-based grouping operators efficiently realize these approximate semantics, they primarily focus on one-dimensional attributes and treat the dimensions of multi-dimensional attributes independently. Consequently, for correlated attributes, such as the coordinates of spatial data, groups in the multi-dimensional space are not detected properly. To address this problem, we introduce two new SGB operators for multi-dimensional data. The first operator is the clique (or distance-to-all) SGB, where all the tuples in a group are within some distance of each other. The second operator is the distance-to-any SGB, where a tuple belongs to a group if the tuple is within some distance of any other tuple in the group. Since a tuple may satisfy the membership criterion of multiple groups, we introduce three different semantics to deal with such a case: (i) eliminate the tuple, (ii) put the tuple in any one group, and (iii) create a new group for this tuple. We implement and test the new SGB operators and their algorithms inside PostgreSQL. The overhead introduced by these operators proves to be minimal, and the execution times are comparable to those of the standard group-by. The experimental study, based on TPC-H and social check-in data, demonstrates that the proposed algorithms can achieve up to three orders of magnitude enhancement in performance over baseline methods developed to solve the same problem.
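
The distance-to-any semantics can be read as computing connected components of the eps-neighborhood graph, as the sketch below illustrates (plain Python with a quadratic pairing loop for clarity; the operators in the paper are implemented inside PostgreSQL with index support, and the clique variant additionally requires every pair in a group to be within eps).

```python
import math

def distance_to_any_groups(tuples, eps):
    """Distance-to-any SGB: a tuple joins a group if it is within eps of
    ANY member, i.e., groups are connected components of the
    eps-neighborhood graph, found here with union-find."""
    parent = list(range(len(tuples)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(tuples)):
        for j in range(i + 1, len(tuples)):
            if math.dist(tuples[i], tuples[j]) <= eps:
                parent[find(i)] = find(j)  # union the two components

    groups = {}
    for i, t in enumerate(tuples):
        groups.setdefault(find(i), []).append(t)
    return list(groups.values())

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(distance_to_any_groups(pts, 1.5))  # [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```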

 

The Similarity-aware Relational Database Set Operators [Information Systems Journal, 2016] (paper)

Finding data items in the proximity of a query item can be formulated as a similarity operation. Identifying these similarities in large datasets is an essential operation in several applications such as bioinformatics, pattern recognition, and data integration. To make a relational database management system similarity-aware, the core relational operators have to be extended. While similarity-awareness has been introduced in database engines for relational operators such as joins and group-by, little has been achieved for relational set operators, namely Intersection, Difference, and Union. In this paper, we propose to extend the semantics of relational set operators to take into account the similarity of values. We develop efficient query processing algorithms for evaluating them, and implement these operators inside an open-source database system, namely PostgreSQL. By extending several queries from the TPC-H benchmark to include predicates that involve similarity-based set operators, we perform extensive experiments that demonstrate up to three orders of magnitude speedup in performance over equivalent queries that only employ regular operators.
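
As a concrete illustration of one plausible semantics, the sketch below (Python, over a single numeric attribute; the paper defines the operators over full tuples inside the engine) realizes a similarity-aware Intersection: keep each element of A that has a match within eps in B.

```python
import bisect

def sim_intersect(A, B, eps):
    """Similarity-aware Intersection sketch: a in A qualifies if some
    b in B satisfies |a - b| <= eps. Sorting B once makes each
    membership probe O(log |B|)."""
    Bs = sorted(B)
    out = []
    for a in A:
        i = bisect.bisect_left(Bs, a - eps)  # first b >= a - eps
        if i < len(Bs) and Bs[i] <= a + eps:
            out.append(a)
    return out

print(sim_intersect([1.0, 4.0, 9.0], [1.2, 8.8], eps=0.5))  # [1.0, 9.0]
```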

 

ORLF: A Flexible Framework for Online Record Linkage and Fusion [IEEE ICDE 2016] (Demo paper)

With the exponential growth of data on the Web comes the opportunity to integrate multiple sources to give more accurate answers to user queries. Upon retrieving records from multiple Web databases, a key task is to merge records that refer to the same real-world entity. We demonstrate ORLF (Online Record Linkage and Fusion), a flexible query-time record linkage and fusion framework. ORLF de-duplicates newly arriving query results jointly with previously processed query results. We use an iterative caching solution that leverages query locality to effectively de-duplicate newly incoming records with cached records. ORLF aims to deliver timely query answers that are duplicate-free and reflect knowledge collected from previous queries.
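
A minimal sketch of the query-time caching idea follows (Python; the match predicate and the simple fusion rule are placeholders, not ORLF's actual linkage and fusion logic). Each newly arriving record is linked against cached records, and matches are fused so that later queries reuse earlier work.

```python
class QueryTimeLinker:
    def __init__(self, match):
        self.match = match  # pairwise "same real-world entity?" predicate
        self.cache = []     # previously linked-and-fused records

    def dedupe(self, incoming):
        """Return a duplicate-free batch: a record that matches a cached
        record is fused into it (filling missing attributes) instead of
        being emitted again, exploiting query locality."""
        out = []
        for rec in incoming:
            dup = next((c for c in self.cache if self.match(c, rec)), None)
            if dup is None:
                self.cache.append(rec)
                out.append(rec)
            else:
                dup.update({k: v for k, v in rec.items() if k not in dup})
        return out

same_title = lambda a, b: a["title"].lower() == b["title"].lower()
linker = QueryTimeLinker(same_title)
print(linker.dedupe([{"title": "DB Systems"},
                     {"title": "db systems", "year": 2016}]))
print(linker.cache)  # the cached record now carries the fused 'year'
```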

 

Approving Updates in Collaborative Databases [IEEE IC2E 2015] (paper)

Although this research was motivated by and was started in the context of cleaning dirty spatial data, the techniques developed proved to be useful and generally applicable to many application domains, including scientific data curation. Data curation activities, especially in collaborative databases, mandate that collaborators interact until they converge and agree on the content of their data. Typically, updates by a member of the collaboration are made visible to all collaborators for comments but, at the same time, are pending the approval or rejection of the data custodian, e.g., the principal scientist or investigator (PI). In current database technologies, approval and authorization of updates are based solely on the identity of the user, e.g., via the SQL GRANT and REVOKE commands. However, in collaborative environments, the updated data is open to collaborators for discussion and further editing, and is finally approved or rejected by the PI based on the content of the data and not on the identity of the updater. In this research, we introduce a cloud-based collaborative database system that promotes and enables collaboration and data curation scenarios. We realize content-based update approval and history tracking of updates inside HBase, a distributed and scalable open-source cluster-based database. We investigate the design and implementation of several approaches for update approval, present a detailed performance study, and contrast the alternative techniques.

 

Precision-Bounded Access Control Using Sliding-Window Query Views for Privacy-Preserving Data Streams [TKDE 2015] (paper)

This research extends spatial indexing and spatial query processing techniques to apply in the context of access control and privacy preservation over data streams. Access control mechanisms and Privacy Protection Mechanisms (PPMs) have been proposed for data streams. Access control for a data stream grants roles access to the tuples that satisfy an authorized predicate sliding-window query. Sharing sensitive stream data without a PPM can compromise privacy. A PPM meets privacy requirements, e.g., k-anonymity or l-diversity, by generalizing the stream data. The imprecision introduced by generalization can be reduced by delaying the publishing of stream data. However, delaying the sharing of stream tuples to achieve better accuracy can lead to false negatives if tuples are held by the PPM while the query predicate is evaluated. The administrator of an access control policy defines an imprecision bound for each query. The challenge for the PPM is to optimize the delay in publishing stream data so that the imprecision bounds of as many queries as possible are satisfied. We formulate the problem of precision-bounded access control for privacy-preserving data streams, present hardness results, provide an anonymization algorithm, and conduct an experimental evaluation of the proposed algorithm. Experiments demonstrate that the proposed heuristic provides better precision for a given data stream access control policy compared to the minimum- or maximum-delay heuristics proposed in the existing literature.

 

GeoTrend: spatial trending queries on real-time microblogs [SIGSPATIAL 2016] (paper)

In this research, we develop GeoTrend, a system for the scalable support of spatial trend discovery over recent microblogs, e.g., tweets and online reviews, that arrive in real time. GeoTrend is distinguished from existing techniques in three aspects: (1) It discovers trends in arbitrary spatial regions, e.g., city blocks. (2) It supports trending measures that effectively capture trending items under a variety of definitions that suit different applications. (3) It promotes recent microblogs as first-class citizens and optimizes its system components to digest a continuous flow of fast data in main memory while removing old data efficiently. GeoTrend queries are top-k queries that discover the k most trending keywords posted within an arbitrary spatial region during the last T time units. To support its queries efficiently, GeoTrend employs an in-memory spatial index that is able to efficiently digest incoming data and expire data that is beyond the last T time units. The index also materializes the top-k keywords in different spatial regions so that incoming queries can be processed with low latency. During peak times, a main-memory optimization technique is employed to shed less important data, so that the system still sustains high query accuracy with limited memory resources. Experimental results based on a real Twitter feed and Bing Mobile spatial search queries show the scalability of GeoTrend in supporting arrival rates of up to 50,000 microblogs/second, an average query latency of 3 milliseconds, and 90+% query accuracy even under limited memory resources.
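
Functionally, a GeoTrend query reduces to the baseline below (Python sketch; raw frequency counts stand in for GeoTrend's richer trending measures, and GeoTrend answers the query from its materialized in-memory spatial index rather than the full scan shown here).

```python
import heapq, time
from collections import Counter

def top_k_trending(microblogs, region, T, k, now=None):
    """The k most frequent keywords posted inside `region` (a bounding
    box (min_lon, min_lat, max_lon, max_lat)) during the last T seconds.
    A microblog here is a (timestamp, lon, lat, keywords) tuple."""
    now = time.time() if now is None else now
    min_lon, min_lat, max_lon, max_lat = region
    counts = Counter()
    for ts, lon, lat, keywords in microblogs:
        if now - ts <= T and min_lon <= lon <= max_lon \
                         and min_lat <= lat <= max_lat:
            counts.update(keywords)
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

blogs = [(time.time() - 10, -87.62, 41.88, ["parade", "jazz"]),
         (time.time() - 99999, -87.62, 41.88, ["stale", "news"])]
print(top_k_trending(blogs, (-88, 41, -87, 42), T=3600, k=2))
```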

 

Atlas: On the expression of spatial-keyword group queries using extended relational constructs [SIGSPATIAL 2016] (paper)

This research addresses the query language and expressiveness of spatial-keyword queries. The popularity of GPS-enabled cellular devices has introduced numerous applications, e.g., social networks, micro-blogs, and crowd-powered reviews. These applications produce large amounts of geo-tagged textual data that need to be processed and queried. Nowadays, many complex spatio-textual operators, along with matching complex indexing structures, are being proposed in the literature to process this spatio-textual data. For example, there exist several complex variations of the spatio-textual group queries that retrieve groups of objects that collectively satisfy certain spatial and textual criteria. However, having complex operators is against the spirit of SQL and relational algebra. In contrast, relational algebra offers simple operators, e.g., relational select, project, order-by, and group-by, that are composable to form more complex queries. In this research, we introduce Atlas, an SQL extension for expressing complex spatial-keyword group queries. Atlas follows the philosophy of SQL and relational algebra in that it extends SQL with simple, declarative, building-block spatial and textual operators and predicates. Atlas can not only represent the spatio-textual group queries from the literature, but can also compose other important queries, e.g., retrieve spatio-textual groups from subsets of object datasets, where the selected subset satisfies user-defined relational predicates and the groups of close-by objects contain misspelled keywords. We demonstrate that Atlas is able to represent a wide range of spatial-keyword queries that existing indexes and algorithms cannot address. The building-block paradigm adopted by Atlas also creates room for query optimization, where multiple query execution plans can be formed.

 

Spatial Computing [CACM 2016] (Overview paper)

Spatial computing encompasses the ideas, solutions, tools, technologies, and systems that transform our lives by creating a new understanding of locations: how we know, communicate, and visualize our relationship to locations, and how we navigate through them. Pervasive GPS allows hikers in national parks, boaters on lakes, children visiting new places, taxis (or Uber drivers or self-driving cars), and unmanned aerial vehicles to know their locations, nearby facilities, and routes to reach places of interest. This article provides an overview of the future of spatial computing and its roles in the various aspects of our lives.

 

LocationSpark: A Distributed In-Memory Data Management System for Big Spatial Data [VLDB 2016] (demo paper)

We present LocationSpark, a spatial data processing system built on top of Apache Spark, a widely used distributed data processing system. LocationSpark offers a rich set of spatial query operators, e.g., range search, kNN, spatio-textual operation, spatial join, and kNN join. To achieve high performance, LocationSpark employs various spatial indexes for in-memory data, and keeps the overhead of maintaining immutable spatial indexes low while preserving fault tolerance. In addition, we build two new layers over Spark, namely a query scheduler and a query executor. The query scheduler is responsible for mitigating skew in spatial queries, while the query executor selects the best plan based on the available indexes and the nature of the spatial queries. Furthermore, to avoid unnecessary network communication overhead when processing overlapping spatial data, we embed an efficient spatial Bloom filter into LocationSpark's indexes. Finally, LocationSpark tracks frequently accessed spatial data, and dynamically flushes less frequently accessed data to disk. We evaluate our system on real workloads and demonstrate that it achieves an order of magnitude performance gain over a baseline framework.
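
The role of the spatial Bloom filter can be illustrated with the toy version below (Python; the sizes, hash scheme, and grid-cell encoding are invented for this sketch and are not LocationSpark's implementation). A partition inserts the grid cells in which it holds data; a query is shipped to the partition only if some cell it overlaps might be present, trading a small false-positive rate for large savings in network traffic.

```python
import hashlib

class SpatialBloom:
    """Toy Bloom filter over grid-cell ids: no false negatives, a small
    false-positive rate, so a 'no' answer safely skips a partition."""
    def __init__(self, m=1024, hashes=3):
        self.m, self.hashes = m, hashes
        self.bits = bytearray(m)

    def _positions(self, cell):
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{cell}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, cell):
        for p in self._positions(cell):
            self.bits[p] = 1

    def might_contain(self, cell):
        return all(self.bits[p] for p in self._positions(cell))

bf = SpatialBloom()
bf.add((12, 7))                   # the partition holds data in cell (12, 7)
print(bf.might_contain((12, 7)))  # True
print(bf.might_contain((99, 99))) # False with high probability
```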

 

In-Memory Distributed Matrix Computation Processing and Optimization [ICDE 2017] (paper)

The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains, ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This research introduces new efficient and scalable matrix processing and optimization techniques for in-memory distributed clusters. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation-plan generator for complex matrix computations is introduced, as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics to optimize the cost of matrix computations in an in-memory distributed environment. The result of a matrix operation often serves as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrices based on the data dependencies in the execution plan. We implemented the proposed matrix processing and optimization techniques in Spark, a distributed in-memory computing platform. Experiments on both real and synthetic data demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement over state-of-the-art distributed matrix computation systems on a wide range of applications.
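
One of the optimizations described above, estimating the sparsity of intermediate results to pick a cheaper evaluation order, can be illustrated with the standard independence-based estimate below (Python sketch; this textbook approximation is not necessarily the exact estimator used in the system).

```python
def est_density(a, b, k):
    """Estimated density of C = A @ B for A (m x k, density a) and
    B (k x n, density b), assuming independently placed nonzeros:
    P[C[i, j] != 0] = 1 - (1 - a*b)**k."""
    return 1.0 - (1.0 - a * b) ** k

def est_flops(m, k, n, a, b):
    """Expected scalar multiplications in a sparse A @ B."""
    return m * k * n * a * b

def cheaper_order(m, k, n, p, dA, dB, dC):
    """Compare (A@B)@C against A@(B@C) for the chain A: m x k,
    B: k x n, C: n x p, using the estimated intermediate densities."""
    dAB = est_density(dA, dB, k)
    cost1 = est_flops(m, k, n, dA, dB) + est_flops(m, n, p, dAB, dC)
    dBC = est_density(dB, dC, n)
    cost2 = est_flops(k, n, p, dB, dC) + est_flops(m, k, p, dA, dBC)
    return ("(A@B)@C" if cost1 <= cost2 else "A@(B@C)", cost1, cost2)

# A sparse tall A times a dense-ish B times a sparse C: the estimates
# steer the optimizer toward A@(B@C), avoiding the large m x n intermediate.
print(cheaper_order(1000, 10, 1000, 10, dA=0.01, dB=0.5, dC=0.01))
```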

 

Query Processing Techniques for Big Spatial-Keyword Data [SIGMOD Tutorial 2017] (paper) (slides)

The widespread use of GPS-enabled cellular devices, i.e., smartphones, has led to the popularity of numerous mobile applications, e.g., social networks, micro-blogs, mobile web search, and crowd-powered reviews. These applications generate large amounts of geo-tagged textual data, i.e., spatial-keyword data. This data needs to be processed and queried at an unprecedented scale, one that goes beyond the capabilities of centralized systems. We live in the era of big data, and the big data model is currently being used to address scalability issues in various application domains. This has led to the development of various big spatial-keyword processing systems that are designed to ingest, store, index, and query huge amounts of spatial-keyword data. In this 1.5-hour tutorial, we explore recent research efforts in the area of big spatial-keyword processing. First, we give the main motivations behind big spatial-keyword systems along with real-life applications. We describe the main models for big spatial-keyword processing and list the popular spatial-keyword queries. Then, we present the approaches that have been adopted in big spatial-keyword processing systems, with special attention to data indexing and to spatial and keyword data partitioning. Finally, we conclude the tutorial with a discussion of some of the open problems and research directions in the area of big spatial-keyword query processing.

 

A Survey of Shortest-Path Algorithms [CoRR 2017] (paper)

A shortest-path algorithm finds a path of minimal cost between two vertices in a graph. A plethora of shortest-path algorithms spanning multiple disciplines has been studied in the literature. This research surveys shortest-path algorithms based on a taxonomy that is introduced in the paper. One dimension of this taxonomy is the various flavors of the shortest-path problem; there is no single general algorithm capable of solving all variants of the problem, given the space and time complexities associated with each algorithm. Other important dimensions of the taxonomy include whether the shortest-path algorithm operates over a static or a dynamic graph, whether it produces exact or approximate answers, and whether it supports time-dependence or is only goal-directed. This survey studies and classifies shortest-path algorithms according to the proposed taxonomy. The survey also presents the challenges and proposed solutions associated with each category in the taxonomy.

 

Teaching, Education, Training, and Development

2011-2012

Education and Course Design:

In Fall 2011, Aref offered a graduate-level seminar class, CS698. The purpose of this course was to train students in spatiotemporal data management research. Students learned basic and advanced techniques in spatial and spatiotemporal data management through several interesting projects. The main class project was to design an introductory-level (CS-I) programming course based on spatiotemporal data management techniques, e.g., using 2D and 3D map and moving-object operations. More detail on the CS Programming-I course is given below. In the class, students covered some of the fundamental papers in spatial and spatiotemporal data management, as well as the useful tools that facilitate spatial and spatiotemporal data systems research, e.g., data generators, benchmarks, open-source systems, extensible indexing architectures, etc. This course counts as one of the two CS698 research courses a Ph.D. student can take in his/her first three semesters, or as a CS590 independent study course. The class met once a week on Wednesdays. Each student was expected to give around two to three presentations over the span of the semester. In addition to the CS Programming-I course project, in which every student was expected to participate, each student (or a group of one to three students) selected a semester-long project. Each student (or group) was expected to meet with Walid on an individual basis once a week to follow up on the project's progress. During class, group discussions revolved around the CS Programming-I project following students' presentations.

 

CS Programming-I Project:

This part of the class involved many brainstorming and design activities. The purpose of this course project was to design an introductory programming course in computer science to address the low retention of computer science freshmen and their diminished interest in science. The target is to go beyond the classical 'Hello World' programming examples, which have shown their limits. The goal of this project is to design the CS Programming-I course around 2D and 3D map operations (e.g., using 3D Earth and 2D Map APIs) to increase the interest of students in programming, both in CS and in science. According to CRA reports, interest among undergraduate students in majoring in computer science has decreased significantly. Universities nationwide have been experiencing a noticeable decline in freshman registration in CS over the past years. For example, the number of new CS majors in Fall 2007 was half of what it was in Fall 2000 (7,915 in Fall 2007 versus 15,958 in Fall 2000). Several colleges attribute this decline to the increase in computer literacy among incoming students and the boring nature of freshman Programming I courses. A first program that prints 'Hello World!' on the screen is no longer thrilling to students who have been playing computer games with high-quality graphics since kindergarten. To add excitement, several colleges have redesigned their freshman Programming I courses to use toy devices, e.g., robots. The target is to program the robot to perform certain tasks while learning various programming concepts in the process. Although interesting to some students, using robots has several disadvantages. The high setup cost of obtaining and maintaining the robots is one setback. Another disadvantage is the frustration that many students experience when dealing with hardware and software debugging issues simultaneously. The goal of this project is to design a CS Programming-I course that uses 2D and 3D maps (e.g., Google Earth and Google Maps) as the 'toy' through which freshman CS students learn programming concepts. Besides the low setup cost, learning using 2D and 3D maps adds the needed excitement by introducing high-quality graphics and high interactivity, and by relating the programming tasks to maps or photos of locations that the students can relate to. Wrapper APIs need to be developed to simplify the setup and interaction with the map software. The new CS Programming-I course needs to be developed along the same lines as Purdue's first CS Programming-I course (CS-180).

 

Education and Course Design Progress:

Eleven graduate students (8 males and 3 females) attended the CS698 class that Aref offered in Fall 2011. Topics covered in class included:

-       Dissection of Purdue's existing CS-177 and CS-180 introductory CS programming courses and their exercises

-       Study of 2D and 3D Map APIs (e.g., Google and Microsoft products, among others)

-       Study of spatial and spatiotemporal data operations

-       What are the basic, minimal, and complete sets of spatiotemporal operations?

-       Study of Map and GIS (Geographic Information Systems) operations

-       Classic multi-dimensional (spatial) and spatiotemporal data indexing techniques

-       Spatial and spatiotemporal query processing algorithms

-       Spatial and spatiotemporal query optimization

-       Spatial and spatiotemporal data management systems

-       Spatiotemporal data streaming systems

-       Study of publicly available spatial and spatiotemporal data sets

-       Synthetic spatiotemporal data generators

-       Benchmarking spatial and spatiotemporal data systems

 

Student projects in the course included: spatiotemporal query optimization, a study of publicly available spatial and spatiotemporal data sets (real and synthetic), parallel spatiotemporal data indexing, distance oracles, map matching, and a study of spatiotemporal data generators. Several 2D- and 3D-based programming project ideas and programming primitives were developed for the CS Programming-I course. These ideas need further elaboration and are currently being studied by Walid and his group.

 

Training and Development:

Students have gained research experience in the context of this project. They have published their results in rigorous conference outlets, e.g., VLDB, which is one of the two most prestigious and competitive publication venues in the field of database systems (along with the ACM SIGMOD Conference). Mr. Aly got his paper accepted into VLDB 2012, which was an excellent experience for him as the first publication of his career. During the graduate seminar course (CS698) that Aref offered in Fall 2011, the students (8 males and 3 females) benefitted tremendously: through the class activities, they gave many presentations about fundamental ideas and papers in the field of spatial database systems, and covered the progress of their class projects. Students also learned through this experience how to critique the ideas of others in a balanced fashion, whether the published work of others or the work of their colleagues in the class. Through this class experience, several students got started on their Ph.D. research, including Mr. Mahmood and Mr. Tang. Students have also gained tremendous experience working with large open-source software systems, e.g., Hadoop. Our M3 project exposed them to invaluable software development experience while adapting Hadoop, eliminating all of its disk layers, and turning it into a massively distributed and parallel data streaming system. Mr. Aly was later invited to spend the summer at Microsoft Research, where he sharpened his research skills further.

 

2012-2013

Education and Course Design

Aref has offered over 15 graduate research independent study courses (CS590, CS698, CS699) in Fall 2012 and Spring 2013. The purpose of these courses is to train graduate students, individually or in small groups, in various topics in spatiotemporal data management research. Students learn advanced topics in spatial and spatiotemporal data management through several interesting independent study or group projects. Aref currently has 11 Ph.D. students (10 in CS, 1 in ECE), three of whom were supported by this grant; one additional woman Ph.D. student (Ruby Tahboub) will be supported by this grant as of January 2014. One of the targets of this project's education plan is to design an introductory programming course in computer science to address the low retention of computer science freshmen and their diminished interest in science. The target is to go beyond the classical 'Hello World' programming examples, which are no longer thrilling to students who have been playing computer games with high-quality graphics since kindergarten. Our goal is to design a CS Programming-I course that uses 2D and 3D maps (e.g., Google Earth and Google Maps) as the 'toy' through which freshman CS students learn programming concepts, in contrast to the robots that other programming environments use. Aref and one Ph.D. student have been working on developing this introductory-level (CS-I) programming course based on spatiotemporal data management techniques, e.g., using 2D and 3D map and moving-object operations. We have focused on studying several existing introductory programming environments, e.g., Scratch and Alice. We have started developing 2D map-based programming primitives and abstract operations that serve as building blocks for the map-based programming language to be developed. Currently, we are using these primitives to develop simple programming exercises that serve as a way to debug the primitives. Once the design of these map primitives and programming abstractions converges and is finalized, we will start building these primitives for testing.

 

Training and Development

Students have gained research experience in the context of this project. They have published their results in rigorous conference and journal outlets, e.g., the VLDB Journal, the IEEE Transactions on Knowledge and Data Engineering, the GeoInformatica Journal, the IEEE BigData Conference, and the VLDB DBCrowd workshop. Many of the participating students gained excellent experience, as these were the first publications of their careers. During the graduate independent study courses, the Ph.D. students (10 males and 1 female) benefitted from the various class activities, and many of them got "jumpstarted" into spatiotemporal data systems research through systems-oriented projects that resulted in conference and journal paper submissions.

Several students, e.g., Mr. Madkour and Mr. Mahmood (both were supported under this project in the 2012/2013 academic year), got their first research papers accepted into IEEE BigData 2013 and VLDB DBCrowd 2013, respectively. Students have also gained experience working with large open-source software systems, e.g., Hadoop and PostgreSQL. Because of this experience, four of the Ph.D. students were invited for summer internships at various industrial and research labs.

Post-doctorate Dr. Shen Zhihong from the Chinese Academy of Sciences visited Purdue, at his government's expense, to collaborate with Aref's research group and receive training in the context of this project. Also, Dr. Saleh Basalamah, Director of the GIS Technology Innovation Center, Umm Al-Qura University, Saudi Arabia, spent an academic year at Purdue to collaborate in the context of this project. Dr. Basalamah is a co-author of the DBCrowd 2013 and IEEE BigData 2013 conference papers.

Dissemination of Results to Communities of Interest and Outreach Activities:

In 2012/2013, Aref offered the over 15 independent graduate research courses mentioned above, which introduced the topics of the project to 11 Ph.D. students during the fall and spring semesters. Walid served as an organizing committee member of the Spatial Computing 2020 Workshop sponsored by NSF/CCC to promote a unified agenda for Spatial Computing research (http://cra.org/ccc/visioning/visioning-activities/spatial-computing/196-spatial-computing-workshop), where he led the Systems track of the workshop. This resulted in an article, currently being submitted to CACM, in which Walid is a co-author. In December 2012, Walid gave a keynote speech at the International Conference for Connected Vehicles and Expo (Beijing, China) reflecting advances in spatiotemporal database systems research. Walid continues to chair ACM SIGSPATIAL, the ACM special interest group on spatial algorithms and systems (2011-2014), of which he was one of the four founding members.

2013-2014

Education and Course Design

Aref has offered over 20 graduate and undergraduate research independent study and thesis courses (CS490, CS590, CS698, CS699) in Fall 2013 and Spring and Summer 2014. The purpose of these courses is to train students, individually or in small groups, in various topics in spatiotemporal data management research. Students learn advanced topics in spatial and spatiotemporal data management through several interesting independent study or group projects, including distributed and parallel spatiotemporal data streaming, benchmarking of spatiotemporal databases, and declustered spatiotemporal query processing, among many other topics. These studies are expected to result in conference papers in the coming academic year. Aref currently has one Masters student and 12 Ph.D. students (11 in CS, 1 in ECE), three of whom were supported by this grant, including one woman Ph.D. student (Ruby Tahboub). Another woman Ph.D. student is scheduled to join this project in Spring 2015. Undergraduate senior student Linda Thongsavath did her senior design project on the topic of benchmarking spatiotemporal databases with deep query evaluation pipelines.

 

LIMO: A Web-based Environment for Learning Programming using Interactive Map Operations [Technical Report 2014]

The computer science education community has devoted tremendous resources to raising interest in computing and improving retention. A substantial share of these efforts relies on developing innovative programming environments that target young age groups. Along this line, we introduce LIMO, a web-based environment for learning programming using activities related to interactive two-dimensional geographical maps and the operations of moving objects on the maps. An essential aspect of LIMO is that it provides diverse programming constructs that enable users to process, customize, and visualize location data on the map. In our view, the integration of interactive maps, location-based data, and seamless spatial constructs within an educational programming environment is a very promising direction. We plan to develop LIMO further with additional spatial and visual constructs to make it more attractive to novice programmers. We plan to conduct a usability study that involves novice users to assess the efficacy of using LIMO while learning programming. Finally, we plan to develop a curriculum for an introductory computer science course based on LIMO.

Aref and two Ph.D. students have continued to work on developing LIMO based on spatiotemporal data management techniques using 2D and 3D map and moving-object operations. In 2013/2014, we developed 2D and 3D map-based programming primitives and abstract operations around Google Maps and Google Earth. These 2D and 3D map primitives serve as building blocks for the map-based programming language that we are currently developing. Currently, we are using these primitives to develop simple programming exercises that serve as a way to debug the primitives. To enrich this map-based programming environment, we integrated an open-source map database (namely, OpenStreetMap) underneath the building blocks, so that the building blocks can refer to the underlying spatial objects as well as their textual descriptions (e.g., rivers, lakes, shopping centers, etc.). The programming exercises that we developed gave rise to interesting new research problems that Ph.D. student Ruby Tahboub is currently studying.
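
Purely as a hypothetical illustration of the kind of small, composable primitive such exercises build on (LIMO's actual building blocks live in a web environment over Google Maps/Google Earth, and the class and method names below are invented for this sketch):

```python
class MovingObject:
    """Invented stand-in for a LIMO-style moving-object primitive."""
    def __init__(self, lat, lon):
        self.lat, self.lon = lat, lon
        self.trail = [(lat, lon)]

    def move_by(self, dlat, dlon):
        """One exercise step: translate the object and record its trail
        so the map layer can draw the traveled path."""
        self.lat += dlat
        self.lon += dlon
        self.trail.append((self.lat, self.lon))

# Exercise: walk a small square and inspect the trail (coordinates are
# illustrative).
car = MovingObject(40.4237, -86.9212)
for dlat, dlon in [(0.001, 0), (0, 0.001), (-0.001, 0), (0, -0.001)]:
    car.move_by(dlat, dlon)
print(car.trail)
```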

 

2014-2015:

In 2014/2015, Walid's research group comprised 17 Ph.D. students spanning five continents (all except Australia :-)), including four female students (who themselves cover four continents). In addition, over the past year, Walid worked with 15 undergraduate students, three of whom are female, who participated in big spatial data projects for their senior design project work. Two of the undergraduate students co-authored two papers with Walid, and one systems-development paper on supporting big spatial data over Hadoop is in the works.

One student (A. Aly) from Walid's group graduated with his Ph.D. and joined Google in August 2015. One Ph.D. student passed his Ph.D. preliminary exam, and five other students will take their preliminary exams during this academic year.

2015-2016:

-- The second Ph.D. student (Mingjie Tang) has graduated based on the research conducted within the scope of this project.

-- Over the course of this project, Aref has offered over 50 graduate and undergraduate research independent study and theses courses (CS390, CS490, CS590, CS698, CS699). Around 20% of these course offerings were for female students. The purpose of these courses was to train students individually or in small groups in various topics in spatiotemporal data management research. Students learned advanced topics in spatial and spatiotemporal data management through several interesting independent study or group projects including distributed and parallel spatiotemporal data streaming, benchmarking of spatiotemporal databases, declustered spatiotemporal query processing, among many other topics. Some of these studies resulted in conference and journal papers.

2016-2017:

We have maintained and released the website for the LIMO system. LIMO Website: http://ibnkhaldun.cs.purdue.edu:8181/limo/GWTMaps.html

We have introduced the notion of user accounts in LIMO so that users can log in remotely through the web interface and code and store their LIMO programs on our LIMO server.

We have published 16 journal papers in top-tier journal venues, including the IEEE Transactions on Knowledge and Data Engineering, the VLDB Journal, Communications of the ACM, GeoInformatica, Information Systems, the Proceedings of the VLDB Endowment, and the IEEE Software magazine, as well as 21 papers in top-tier conference venues, including ACM SIGMOD, VLDB, IEEE ICDE, EDBT, ACM SIGSPATIAL, SISAP, and WSDM, and one upcoming tutorial at ACM SIGMOD (on the subject of spatial-keyword query processing techniques).

As a way of disseminating the results to communities of interest, Walid and his Ph.D. student gave a tutorial on the subject of spatial-keyword big data processing at the 2017 SIGMOD Conference based on the research conducted in this project. Moreover, Walid gave a keynote speech at the 2017 SIGMOD GeoRich Workshop on the subject of "Location+X Big Data Systems: Challenges and Some Solutions," based on the research conducted in this project.

 

Publications

 

1.   A.M. Aly, W.G. Aref, M. Ouzzani: Spatial Queries with Two kNN Predicates. Proceedings of the VLDB Endowment (VLDB 2012), vol. 5, p. 1100-1111, Istanbul, Turkey (2012).

 

2.   A.M. Aly, A. Sallam, B.M. Gnanasekaran, L.-V. Nguyen-Dinh, W.G. Aref, M. Ouzzani, and A. Ghafoor, "M3: Stream Processing on Main-Memory MapReduce", Proceedings of the 28th International Conference on Data Engineering (ICDE), Washington DC, (2012).

 

3.   Y.N. Silva, W.G. Aref, P. Larson, S. Pearson, M.H. Ali: Similarity queries: their conceptual evaluation, transformations, and processing. VLDB J. 22(3): 395-420 (2013).

 

4.   H.G. Elmongui, M.F. Mokbel, W.G. Aref: Continuous aggregate nearest neighbor queries. GeoInformatica 17(1): 63-95 (2013).

 

5.   A.R. Mahmood, W.G. Aref, S.M. Basalamah, E. Dragut. The Palm-tree Index: Indexing with the crowd. The VLDB DBCrowd Workshop, 2013.

 

6.   A. Madkour, W.G. Aref, S.M. Basalamah. Knowledge Cubes - A Proposal for Scalable and Semantically-Guided Management of Big Data. IEEE BigData Conference, 2013.

 

7.   A.R. Mahmood, A.M. Aly, W.G. Aref, S.M. Basalamah. Indexing Recent Trajectories of Moving Objects. ACM SIGSPATIAL GIS, 2014.

 

8.   R. Tahboub, J. Shen, W.G. Aref, S. Prabhakar: LIMO: A Web-based Environment for Learning Programming using Interactive Map Operations. Purdue Computer Science Technical Report, 2014. Submitted for conference publication.

 

9.   Z. Pervaiz, W.G. Aref, A. Ghafoor, N. Prabhu: Accuracy-Constrained Privacy-Preserving Access Control Mechanism for Relational Data. IEEE Trans. Knowl. Data Eng. 26(4): 795-807 (2014).

 

10.  A.M. Aly, W.G. Aref, M. Ouzzani, H.M. Mahmoud: JISC: Adaptive Stream Processing Using Just-In-Time State Completion. EDBT 2014: 73-84.

 

11.  M.Y. Eltabakh, W.G. Aref, A.K. Elmagarmid, M. Ouzzani: HandsOn DB: Managing Data Dependencies Involving Human Actions. IEEE Trans. Knowl. Data Eng. 26(9): 2193-2206 (2014).

 

12.  W.J. Al Marri, Q.M. Malluhi, M. Ouzzani, M. Tang, W.G. Aref: The Similarity-Aware Relational Intersect Database Operator. SISAP 2014: 164-175. (Best paper award and invited for special issue journal publication).

 

13.  A.M. Aly, W.G. Aref, M. Ouzzani, Cost Estimation of Spatial k-Nearest-Neighbor Operators, EDBT 2015.

 

14.  A.M. Aly, A.S. Abdelhamid, A.R. Mahmood, W.G. Aref, M.S. Hassan, H. Elmeleegy, M. Ouzzani: A Demonstration of AQWA: Adaptive Query-Workload-Aware Partitioning of Big Spatial Data. Proceedings of the VLDB Endowment (2015).

 

15.  K. Mershad, Q.M. Malluhi, M. Ouzzani, M. Tang, W.G. Aref (2015). Approving Updates in Collaborative Databases. 2015 IEEE International Conference on Cloud Engineering. 

 

16.  A. Madkour, W.G. Aref, M. Mokbel, S. Basalamah: Geo-tagging Non-Spatial Concepts. ACM SIGSPATIAL MobiGIS Workshop (2015).

 

17.  A.M. Aly, H. Elmeleegy, Y. Qi, W.G. Aref (2016). Kangaroo: Workload-Aware Processing of Range Data and Range Queries in Hadoop. 9th ACM International Conference on Web Search and Data Mining (WSDM). 

 

18.  R.Y. Tahboub, J. Shin, A. Abdelsalam, J.W. Aref, W.G. Aref, S. Prabhakar: LIMO: Learning Programming using Interactive Map Activities (Demo Paper). ACM SIGSPATIAL (2015).

 

19.  M. Tang, Y. Yu, W.G. Aref, Q.M. Malluhi, M. Ouzzani: Efficient Processing of Hamming-Distance-Based Similarity-Search Queries Over MapReduce. EDBT 2015: 361-372.

 

20.  W.G. Aref, S. Prabhakar, J. Shin, R.Y. Tahboub, A. Abdelsalam, J.W. Aref (2015). On Map-Centric Programming Environments (Vision Paper). ACM SIGSPATIAL. 

 

21.  A.M. Aly, W.G. Aref, M. Ouzzani (2015). Spatial Queries with k-Nearest-Neighbor and Relational Predicates. ACM SIGSPATIAL. 

 

22.  A.R. Mahmood, A.M. Aly, T. Qadah, A. Madkour, A.S. Abdelhamid, M.S. Hassan, W.G. Aref, S. Basalamah: Tornado: A Distributed Spatio-Textual Stream Processing System. Proceedings of the VLDB Endowment (2015).

 

23.  W.J. Al Marri, Q.M. Malluhi, M. Ouzzani, M. Tang, W.G. Aref: The Similarity-aware Relational Database Set Operators. Information Systems Journal, Vol. 59: pp. 79-93 (2016).

 

24.  Z. Pervaiz, A. Ghafoor, W.G. Aref: Precision-Bounded Access Control Using Sliding-Window Query Views for Privacy-Preserving Data Streams. IEEE Trans. Knowl. Data Eng. 27(7): 1992-2004 (2015).

 

25.  M. Tang, Y. Yu, Q.M. Malluhi, M. Ouzzani, W.G. Aref: LocationSpark: A Distributed In-Memory Data Management System for Big Spatial Data. PVLDB 9(13): 1565-1568 (2016) (Demo paper).

 

26.  M.S. Hassan, A.M. Aly, W.G. Aref, Graph Indexing for Shortest-Path Finding over Dynamic Sub-Graphs, ACM SIGMOD 2016 (Full Paper).

 

27.  A.M. Aly, M.S. Hassan, A.R. Mahmood, W.G. Aref, H. Elmeleegy, M. Ouzzani, AQWA: Adaptive Query-Workload-Aware Partitioning of Big Spatial Data, VLDB 2016 (Full Paper).

 

28.  M. Tang, R.Y. Tahboub, W.G. Aref, M.J. Atallah, Q.M. Malluhi, M. Ouzzani, Y.N. Silva: Similarity Group-by Operators for Multi-Dimensional Relational Data. IEEE Trans. Knowl. Data Eng. 28(2): 510-523 (2016).

 

29.  S. Shekhar, S.K. Feiner, W.G. Aref: Spatial Computing. Communications of the ACM 59(1): 72-81 (2016).

 

30.  A.S. Abdelhamid, M. Tang, A.M. Aly, A.R. Mahmood, T. Qadah, W.G. Aref, S. Basalamah: Cruncher: Distributed In-Memory Processing for Location-Based Services. IEEE ICDE 2016 (Demo paper).

 

31.  E.K. Rezig, E.C. Dragut, M. Ouzzani, A.K. Elmagarmid, W.G. Aref. ORLF: A Flexible Framework for Online Record Linkage and Fusion. IEEE ICDE 2016, Demo paper.

 

32.  A.R. Mahmood, W.G. Aref, A.M. Aly, M. Tang: Atlas: on the expression of spatial-keyword group queries using extended relational constructs. ACM SIGSPATIAL 2016: 45:1-45:10.

 

33.  A. Magdy, A.M. Aly, M.F. Mokbel, S. Elnikety, Y. He, S. Nath, W.G. Aref: GeoTrend: spatial trending queries on real-time microblogs. ACM SIGSPATIAL 2016: 7:1-7:10.

 

34.  A. Madkour, W.G. Aref, F. Rehman, M.A. Rahman, S. Basalamah: A Survey of Shortest-Path Algorithms. CoRR, https://arxiv.org/pdf/1705.02044.pdf, 26 pages (2017).

 

35.  Y. Yu, M. Tang, W.G. Aref, Q.M. Malluhi, M.M. Abbas, M. Ouzzani: In-Memory Distributed Matrix Computation Processing and Optimization. ICDE 2017: 1047-1058.

 

36.  A.R. Mahmood, W.G. Aref: Query Processing Techniques for Big Spatial-Keyword Data. SIGMOD Conference 2017: 1777-1782 (Tutorial).

Collaborators:

·       Dr. Paul Larson: Microsoft Research, Redmond, WA collaborated in writing a research paper [VLDBJ2013] that provides foundational research for the correctness ideas in this grant.

 

·       Dr. M. Ouzzani: QCRI, was originally a co-PI on this project while at Purdue and withdrew as co-PI after joining QCRI, but continued to collaborate extensively in the context of this project, co-authoring many of the papers and co-advising one of the Ph.D. students (A. Aly).

 

·       Dr. M.H. Ali: Microsoft Corporation, Redmond, Washington, collaborated in writing a research paper [VLDBJ2013] that provides foundational research for the correctness ideas in this grant.

 

·       Dr. Y. Silva: Arizona State University, collaborated in writing several papers in the context of this project related to similarity query processing [VLDBJ 2013 and TKDE 2016].

 

·       Dr. Hicham Elmongui: Amazon, collaborated in writing a research paper [GeoInformatica2013] that presents algorithms for efficiently evaluating aggregate nearest-neighbor queries. Now at U. Alexandria, Egypt.

 

·       Prof. M. Eltabakh: WPI, Massachusetts, collaborated in writing a research paper [IEEE TKDE 2014] that presents models and techniques for integrating human activities into database systems.

 

·       Dr. Ahmed Elmagarmid: QCRI, collaborated in writing a demo paper related to data cleaning [ICDE 2016].

 

·       Prof. Mikhail Atallah: Purdue CS, collaborated in writing a research paper [TKDE 2016].

 

·       Prof. M. Gribskov: Purdue Biology, collaborated in writing a research paper [TKDE 2016] providing real-world driving applications from biology and providing relevant datasets.

 

·       Graduate Students: A. Aly (Ph.D., 2015, now at Google), M. Tang (Ph.D., 2016, now at Hortonworks), A. Mahmood (Ph.D. student), R. Tahboub (Ph.D. student), A. Madkour (Ph.D. student), and M. Hassan (Ph.D. student) were funded at different periods under this grant and worked on various aspects of this project.

 

·       S. Pearson: Graduate student at Arizona State who collaborated in the context of generalizing the results in this project for similarity-based databases. Mr. Pearson conducted experiments requested by the reviewers in order to revise the submitted paper that was later accepted to appear in the VLDB Journal [VLDBJ2013]. Now at Purdue.

 

·       Aya A. Ismail: Graduate student at Maryland, College Park. Collaborated in this project in the context of the LIMO system.

 

·       B. Coffey: Graduate student at Purdue University who collaborated in the context of collecting meta-data about publicly available spatial data sets.

 

·       Undergraduate Students: Linda Thongsavath: Undergraduate student at Purdue University who collaborated in the context of realizing a benchmark for spatiotemporal databases that use k-nearest neighbors in multi-predicate queries that include relational, spatial, and kNN predicates.

 

·       Visitors and Post-docs:

 

o   Dr. Shen Zhihong: Chinese Academy of Sciences, Beijing, China (August 2013- October 2013) participated in designing cluster-based user workspaces.

 

o   Dr. Saleh Basalamah: Director of the GIS Technology Innovation Center, Umm Al-Qura University, Saudi Arabia (April 2013-Present), participated in coauthoring multiple research papers, including [DBCrowd2013], [IEEE BigData 2013], and [ACM SIGSPATIAL 2014].

 

o   Dr. Eduard Dragut: Dr. Dragut was a post-doc at Purdue University during 2012-2013 and participated, in the context of this project, in the crowdsourced indexing paper [DBCrowd 2013].