A Tutorial on Learned Multidimensional Indexes*~

 

Abdullah Al Mamun (Ph.D. Student)

Hao Wu (Undergraduate Student)

Walid G. Aref

 

Department of Computer Science

Purdue University

West Lafayette, Indiana 47907, U.S.A.

 

*We acknowledge the support of the National Science Foundation under Grant Numbers III-1815796 and IIS-1910216.

 

~This page is based on a tutorial given at the 2020 ACM SIGSPATIAL Conference:

Abdullah Al-Mamun, Hao Wu, Walid G. Aref: A Tutorial on Learned Multi-dimensional Indexes. ACM SIGSPATIAL Conference, pp. 1-4, Nov. 2020. (pdf of the tutorial proposal).

 

 

ABSTRACT

Recently, Machine Learning (ML, for short) has been successfully applied to database indexing. Initial experimentation on learned indexes has demonstrated better search performance and lower space requirements than their traditional database counterparts. Numerous approaches have been explored to extend learned indexes to the multi-dimensional space, making them potentially suitable for spatial databases. The goal of this tutorial is to provide up-to-date coverage of learned indexes in both the single- and multi-dimensional spaces. The tutorial covers over 25 learned indexes. It navigates the space of learned indexes using a taxonomy that classifies the covered indexes in both the single- and multi-dimensional settings.

 

OUTLINE OF THE TAXONOMY

 

An updated version of the taxonomy is given below.

 

 

In the taxonomy, we distinguish between two main approaches:

-       Indexing the Learned Models vs.

-       Learning the Index

By indexing the learned models, we refer to the following problem. Assume that we are given a collection of learned models, where each model represents (and helps recognize) a certain object class, e.g., a learned model that represents the class dog, another that represents the class cat, etc. Given a query object, one needs to execute each of the models to identify the one that produces the highest matching score for the given query object. The question is:

Can we index the learned models to speed up the matching process?

Several indexes have been proposed that index these learned models to speed up the matching process (e.g., see [1,46]).
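To make the matching problem concrete, the following is a minimal Python sketch of the brute-force baseline that model-indexing techniques (e.g., [1, 46]) aim to avoid: every model is evaluated on the query object, and the highest-scoring one wins. The LearnedModel class and the toy scoring functions below are hypothetical and only illustrate the interface, not any particular published index.

# A minimal sketch of the matching problem that "indexing the learned models"
# aims to speed up. The models and scoring functions are hypothetical; real
# systems index the models themselves so that only a small subset needs to be
# evaluated per query.

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class LearnedModel:
    label: str                                   # class the model recognizes, e.g., "dog"
    score: Callable[[Sequence[float]], float]    # matching score for a query object


def best_match_bruteforce(models, query):
    """Baseline: run every model on the query and keep the highest score."""
    return max(models, key=lambda m: m.score(query))


# Example with two toy "models" (hypothetical scoring functions).
models = [
    LearnedModel("dog", lambda q: -abs(q[0] - 1.0)),
    LearnedModel("cat", lambda q: -abs(q[0] - 5.0)),
]
print(best_match_bruteforce(models, [1.2]).label)   # -> "dog"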

 

In contrast, by learning the index, we refer to the problem of replacing a traditional database index, e.g., a B+-tree, with a machine-learning-based model. Instead of searching the B+-tree to locate the leaf page that contains an input search key, one uses an ML model that predicts the location (or the leaf page) that contains the search key [22].
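As a rough illustration of this idea, the sketch below fits a single linear model to the key-to-position mapping of a sorted array and corrects its prediction with a bounded local search, in the spirit of [22]. Real learned indexes use hierarchies of models (e.g., the Recursive Model Index of [22]); the single-model SimpleLearnedIndex class here is an illustrative assumption, not a particular published design.

# A minimal sketch of "learning the index": a simple model approximates the
# key -> position mapping of a sorted array, and a bounded local search
# corrects the prediction within the model's worst-case error.

import bisect


class SimpleLearnedIndex:
    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        # Fit position ~ slope * key + intercept by least squares over (key, rank).
        mean_k = sum(sorted_keys) / n
        mean_p = (n - 1) / 2
        var_k = sum((k - mean_k) ** 2 for k in sorted_keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(sorted_keys))
        self.slope = cov / var_k if var_k else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Record the worst-case prediction error to bound the local search.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(sorted_keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of `key`, or -1 if it is absent."""
        pos = self._predict(key)
        lo = max(0, pos - self.max_err)
        hi = min(len(self.keys), pos + self.max_err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else -1


idx = SimpleLearnedIndex(list(range(0, 1000, 3)))   # keys 0, 3, 6, ...
print(idx.lookup(300))   # -> 100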

In the tutorial, we give example indexes that follow each of the two approaches.

 

In the taxonomy, we further distinguish between learned indexes along the following dimension:

-       Learned indexes that support static data sets vs.

-       Learned indexes that support updates

Notice that the distinction between static and dynamic data sets is crucial because learning an index requires offline training, which is relatively slow. Thus, learned indexes that support updates need to accommodate this slow training while still answering queries online; one common strategy is sketched below.
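The sketch below illustrates one common way this tension is handled (used, in varying forms, by several of the updatable indexes covered in the tutorial): recent inserts go into a small delta buffer that is searched alongside the learned structure, and the model is retrained only periodically. The BufferedLearnedIndex class, the retraining trigger, and the placeholder _retrain method are illustrative assumptions, not taken from any specific index.

# A minimal sketch of reconciling slow offline training with online updates:
# new keys go into a delta buffer that is searched alongside the (periodically
# retrained) learned structure.

import bisect


class BufferedLearnedIndex:
    def __init__(self, sorted_keys, buffer_limit=64):
        self.main = sorted_keys          # keys covered by the trained model
        self.buffer = []                 # recent inserts, kept sorted
        self.buffer_limit = buffer_limit
        self._retrain()

    def _retrain(self):
        # Placeholder for offline model training over self.main (e.g., fitting
        # the key -> position model of the previous sketch).
        pass

    def insert(self, key):
        bisect.insort(self.buffer, key)
        if len(self.buffer) >= self.buffer_limit:
            # Merge the buffer and retrain; in practice this runs in the background.
            self.main = sorted(self.main + self.buffer)
            self.buffer = []
            self._retrain()

    def contains(self, key):
        # A lookup consults both the trained part and the delta buffer.
        i = bisect.bisect_left(self.main, key)
        if i < len(self.main) and self.main[i] == key:
            return True
        j = bisect.bisect_left(self.buffer, key)
        return j < len(self.buffer) and self.buffer[j] == key


idx = BufferedLearnedIndex(list(range(0, 100, 2)))
idx.insert(7)
print(idx.contains(7), idx.contains(8), idx.contains(9))   # True True False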

 

Next, we distinguish between learned indexes along the dimensionality of the data; a common technique for bridging the two categories is sketched after the list.

-       Learned indexes for one-dimensional data vs.

-       Learned indexes for multi-dimensional data
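One widely used bridge between the two categories is to map each multi-dimensional point to a one-dimensional key, e.g., with a space-filling curve, and then reuse a one-dimensional learned index over those keys; the ZM-index [41] follows this general pattern. The sketch below computes a Z-order (Morton) code for 2-D integer points; the bit width and the restriction to two dimensions are illustrative choices.

# A minimal sketch: interleave coordinate bits into a Z-order (Morton) code so
# that multi-dimensional points get one-dimensional keys, which can then be
# handled by any one-dimensional (learned) index.

def z_order(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two non-negative integer coordinates."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # x bits at even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y bits at odd positions
    return code


points = [(3, 5), (10, 2), (7, 7)]
keys = sorted(z_order(x, y) for x, y in points)
print(keys)   # one-dimensional keys, ready for a learned (or any 1-D) index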

 

Finally, we close by discussing open research problems.

 

Here are the slides and video for the SIGSPATIAL 2020 tutorial: <slides> and <video>

 

Extended Version of the slides (V2.0): <slides>

In this version, in addition to covering more learned multi-dimensional indexes, we extend the taxonomy to cover the following new dimension (a sketch of the idea follows the list):

-       Learned indexes with fixed data layout vs.

-       Learned indexes with dynamic data layout
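As a rough illustration of what a learned (dynamic) data layout can mean, the sketch below derives per-dimension grid boundaries from data quantiles, so the resulting cells adapt to skew. This only hints at the general idea behind layout-learning approaches such as Flood [31] and the Qd-tree [45]; the function names, the quantile heuristic, and the cell count are illustrative assumptions.

# A minimal sketch of a data-dependent layout: per-dimension split points are
# chosen from the observed distribution (simple quantiles), instead of a fixed
# uniform grid.

import bisect


def learned_boundaries(values, cells):
    """Pick cells-1 split points at the quantiles of one dimension."""
    s = sorted(values)
    return [s[(i * len(s)) // cells] for i in range(1, cells)]


def cell_of(point, boundaries_per_dim):
    """Map a point to its cell id (one index per dimension) in the learned grid."""
    return tuple(bisect.bisect_right(b, v) for v, b in zip(point, boundaries_per_dim))


points = [(0.1, 5.0), (0.2, 5.1), (0.3, 4.9), (9.5, 0.2)]   # skewed toy data
bounds = [learned_boundaries([p[d] for p in points], cells=2) for d in range(2)]
print([cell_of(p, bounds) for p in points])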

 

PRESENTERS

 

Abdullah Al Mamun

Abdullah is a Ph.D. student at the Department of Computer Science (CS), Purdue University. His research interests are in Learned Index Structures, particularly Learned Multi-dimensional and Spatial Indexes. Previously, he completed his M.S. in CS at Memorial University of Newfoundland, Canada.

 

 

 

Hao Wu

Hao is a senior undergraduate student at Purdue University with majors in Data Science, Statistics-Math, and Aviation Management. He is interested in ML-oriented research as well as its applications in data-driven multi-disciplinary projects.

 

 

 

Walid G. Aref

Walid is a professor of computer science at Purdue. His research interests are in extending the functionality of database systems in support of emerging applications, e.g., spatial, spatio-temporal, graph, biological, and sensor databases. His focus is on query processing, indexing, data streaming, and geographic information systems (GIS). Walid’s research has been supported by the National Science Foundation, the National Institutes of Health, Purdue Research Foundation, CERIAS, Panasonic, and Microsoft Corp. In 2001, he received the CAREER Award from the National Science Foundation, and in 2004, he received a Purdue University Faculty Scholar award. Walid is a member of Purdue’s CERIAS. He is the Editor-in-Chief of the ACM Transactions on Spatial Algorithms and Systems (ACM TSAS), an editorial board member of the Journal of Spatial Information Science (JOSIS), and has served as an editor of the VLDB Journal and the ACM Transactions on Database Systems (ACM TODS). Walid has won several best paper awards, including the 2016 VLDB ten-year best paper award. He is a Fellow of the IEEE and a member of the ACM. Between 2011 and 2014, Walid served as the chair of the ACM Special Interest Group on Spatial Information (SIGSPATIAL).

 

REFERENCES

 

[1] Walid Aref, Daniel Barbará, and Padmavathi Vallabhaneni. 1995. The Handwritten Trie: Indexing Electronic Ink. SIGMOD Rec.24, 2 (May 1995), 151–162.   https://doi.org/10.1145/568271.223811

 

[2] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger.1990. The R*-tree: an efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM SIGMOD international conference on Management of data. 322–331.

 

[3] Angjela Davitkova, Evica Milchevski, and Sebastian Michel. 2020. The ML-Index: A Multidimensional, Learned Index for Point, Range, and Nearest-Neighbor Queries.. In EDBT. 407–410.

 

[4] Jialin Ding, Umar Farooq Minhas, Hantian Zhang, Yinan Li, Chi Wang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, and David Lomet. 2019. ALEX: An Updatable Adaptive Learned Index. arXiv preprint arXiv:1905.08898(2019).

 

[5] Jialin  Ding,  Vikram  Nathan,  Mohammad  Alizadeh,  and  Tim  Kraska.  2020. Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads. arXiv preprint arXiv:2006.13282(2020).

 

[6] Paolo Ferragina and Giorgio Vinciguerra. 2020.  Learned Data Structures.  In Recent Trends in Learning From Data, Luca Oneto, Nicolò Navarin, AlessandroSperduti, and Davide Anguita (Eds.). Springer International Publishing, 5-41. https://doi.org/10.1007/978-3-030-43883-8_2

 

[7] Paolo Ferragina and Giorgio Vinciguerra. 2020.  The PGM-index. Proceedings of the VLDB Endowment13, 8 (Apr 2020), 1162–1175.   https://doi.org/10.14778/3389133.3389135

 

[8] Paolo Ferragina, Giorgio Vinciguerra, and Michele Miccinesi. 2019.  Superseding traditional indexes by orchestrating learning and geometry. arXiv preprintarXiv:1903.00507(2019).

 

[9] Raphael A. Finkel and Jon Louis Bentley. 1974. Quad trees a data structure for retrieval on composite keys. Acta informatica4, 1 (1974), 1–9.

 

[10] Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. 2019. Fiting-tree: A data-aware index structure. In Proceedings of the 2019International Conference on Management of Data. 1189–1206.

 

[11] Behzad Ghaffari, Ali Hadian, and Thomas Heinis. 2020. Leveraging Soft Functional  Dependencies  for  Indexing  Multi-dimensional  Data. arXiv preprintarXiv:2006.16393(2020).

 

[12] Antonin Guttman. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD international conference on Management of data. 47–57.

 

[13] Ali Hadian and Thomas Heinis. 2019.  Considerations for handling updates in learned index structures. In Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. ACM, 3.

 

[14] Ali Hadian and Thomas Heinis. 2019. Interpolation-friendly B-trees: Bridging the gap between algorithmic and learned indexes. In22nd International Conference on Extending Database Technology (EDBT 2019).  https://doi.org/10.5441/002/edbt.2019.93

 

[15] Ali Hadian and Thomas Heinis. 2020. MADEX: Learning-augmented Algorithmic Index Structures. In Proceedings of the 2nd International Workshop on Applied AI for Database Systems and Applications.

 

[16] Ali Hadian, Ankit Kumar, and Thomas Heinis. 2020. Hands-off Model Integration in Spatial Index Structures. In Proceedings of the 2nd International Workshop on Applied AI for Database Systems and Applications.

 

[17] Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, and Carsten Binnig. 2020. DeepDB: Learn from Data, Not from Queries!13, 7 (2020).

 

[18] Stratos Idreos and Tim Kraska. 2019.  From Auto-tuning One Size Fits All to Self-designed and Learned Data-intensive Systems (Tutorial). In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference2019, Amsterdam, The Netherlands, June 30 - July 5, 2019. 2054–2059. http://people.csail.mit.edu/kraska/pub/sigmod19tutorialpart2.pdf

 

[19] Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper,Tim Kraska, and Thomas Neumann. 2019.  SOSD: A Benchmark for Learned Indexes. ArXivabs/1911.13014 (2019).

 

[20] Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper,Tim Kraska, and Thomas Neumann. 2020. RadixSpline: A Single-Pass Learned Index.ArXivabs/2004.14541 (2020).

 

[21] Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H Chi, Jialin Ding, Ani Kristo,Guillaume Leclerc, Samuel Madden, Hongzi Mao, and Vikram Nathan. 2019.Sagedb: A learned database system. (2019).

 

[22] Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018.The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data. ACM, 489–504.

 

[23] Pengfei Li, Yu Hua, Pengfei Zuo, and Jingnan Jia. 2019.   A Scalable Learned Index Scheme in Storage Systems. CoRRabs/1905.06256 (2019). arXiv:1905.06256http://arxiv.org/abs/1905.06256

 

[24] Pengfei Li, Hua Lu, Qian Zheng, Long Yang, and Gang Pan. 2020. LISA: A Learned Index Structure for Spatial Data. SIGMOD(2020)

 

[25] Xin Li, Jingdong Li, and Xiaoling Wang. 2019. ASLM: Adaptive single layer model for learned index. In International Conference on Database Systems for Advanced Applications. Springer, 80–95.

 

[26] A Llavesh, Utku Sirin, R West, and A Ailamaki. 2019. Accelerating b+ tree search by using simple machine learning techniques. In Proceedings of the 1stInternational Workshop on Applied AI for Database Systems and Applications.

 

[27] Stephen Macke, Alex Beutel, Tim Kraska, Maheswaran Sathiamoorthy,Derek Zhiyuan Cheng, and EH Chi. 2018. Lifting the curse of multidimensional data with learned existence indexes. InWorkshop on ML for Systems at NeurIPS.

 

[28] Ryan Marcus, Andreas Kipf, Alexander van Renen, Mihail Stoian, Sanchit Misra,Alfons Kemper, Thomas Neumann, and Tim Kraska. 2020. Benchmarking Learned Indexes. arXiv preprint arXiv:2006.12804(2020).

 

[29] Ryan Marcus, Emily Zhang, and Tim Kraska. 2020. CDFShop: Exploring and Optimizing Learned Index Structures. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2789–2792.

 

[30] Michael Mitzenmacher. 2018. A model for learned bloom filters and optimizing by sandwiching. In Advances in Neural Information Processing Systems. 464–473.

 

[31] Vikram Nathan, Jialin Ding, Mohammad Alizadeh, and Tim Kraska. 2020. Learning Multi-dimensional Indexes. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 985–1000.

 

[32] Varun Pandey, Alexander van Renen, Andreas Kipf, Ibrahim Sabek, Jialin Ding,and Alfons Kemper. 2020. The Case for Learned Spatial Indexes. arXiv preprintarXiv:2008.10349(2020).

 

[33] Jianzhong Qi, Guanli Liu, Christian S Jensen, and Lars Kulik. 2020. Effectively learning spatial indices. Proceedings of the VLDB Endowment13, 12 (2020), 2341–2354.

 

[34] Wenwen Qu, Xiaoling Wang, Jingdong Li, and Xin Li. 2019. Hybrid indexes by exploring traditional B-tree and linear regression. In International Conference on Web Information Systems and Applications. Springer, 601–613.

 

[35] Ibrahim Sabek and Mohamed F Mokbel. 2020. Machine learning meets big spatial data. In2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE,1782–1785.

 

[36] Hanan Samet. 1984. The quadtree and related hierarchical data structures. ACM Computing Surveys (CSUR)16, 2 (1984), 187–260.

 

[37] Hanan Samet. 2006.Foundations of multidimensional and metric data structures. Morgan Kaufmann.

 

[38] Chuzhe Tang, Zhiyuan Dong, Minjie Wang, Zhaoguo Wang, and Haibo Chen.2019. Learned Indexes for Dynamic Workloads. arXiv preprint arXiv:1902.00655(2019).

 

[39] Chuzhe Tang, Youyun Wang, Zhiyuan Dong, Gansen Hu, Zhaoguo Wang, MinjieWang, and Haibo Chen. 2020.  XIndex: a scalable learned index for multicore data storage. Proceedings of the 25th ACM SIGPLAN Symposium on Principles andPractice of Parallel Programming(2020).

 

[40] Peter Van Sandt, Yannis Chronis, and Jignesh M Patel. 2019. Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search?. In Proceedingsof the 2019 International Conference on Management of Data. ACM, 36–53.

 

[41] Haixin Wang, Xiaoyi Fu, Jianliang Xu, and Hua Lu. 2019. Learned Index for Spatial Queries. In2019 20th IEEE International Conference on Mobile Data Management(MDM). IEEE, 569–574.

 

[42] Youyun Wang, Chuzhe Tang, Zhaoguo Wang, and Haibo Chen. 2020. SIndex: a scalable learned index for string keys. In Proceedings of the 11th ACM SIGOPSAsia-Pacific Workshop on Systems. 17–24.

 

[43] Yingjun Wu, Jia Yu, Yuanyuan Tian, Richard Sidle, and Ronald Barber. 2019.Designing Succinct Secondary Indexing Mechanism by Exploiting Column Correlations. arXiv preprint arXiv:1903.11203(2019).

 

[44] Wenkun Xiang, Hao Zhang, Rui Cui, Xing Chu, Keqin Li, and Wei Zhou. 2018.Pavo: A RNN-Based Learned Inverted Index, Supervised or Unsupervised?IEEEAccess7 (2018), 293–303.

 

[45] Zongheng Yang, Badrish Chandramouli, Chi Wang, Johannes Gehrke, Yinan Li,Umar Farooq Minhas, Per-Åke Larson, Donald Kossmann, and Rajeev Acharya.2020. Qd-tree: Learning Data Layouts for Big Data Analytics. In Proceedings of the2020 ACM SIGMOD International Conference on Management of Data. 193–208.

 

[46] Jin, Hui, and H. V. Jagadish. "Indexing Hidden Markov Models for Music Retrieval." In ISMIR. 2002.

 

[47] Babu, G. Phanendra. "Self-organizing neural networks for spatial data." Pattern Recognition Letters 18, no. 2 (1997): 133-142.

 

[48] J. Qi, Y. Tao, Y. Chang, and R. Zhang. Theoretically optimal and empirically efficient R-trees with strong parallelizability. PVLDB, 11(5):621–634, 2018.