The Database Systems Seminar

 

 

 

Wednesday, 10/2/2024 (DB Systems Seminar)

10:30am, LWSN 3102AB (In-person only)

Da Yan, Associate Professor, Indiana University, Bloomington

Talk Title: T-thinker: A Task-Based Parallel Computing Model for Compute-Intensive Graph Analytics and Beyond

 

 

Abstract

Pioneered by Google's Pregel, the think-like-a-vertex (TLAV) computing model has dominated the area of parallel and distributed graph processing. However, TLAV models are only scalable for data-intensive iterative graph algorithms such as random walks and graph traversal. Unfortunately, researchers have been using TLAV models to solve compute-intensive graph problems, leading to performance not much beyond that of a serial algorithm due to the IO bottleneck incurred by unnecessarily materializing a lot of intermediate data. This talk advocates a new parallel computing model called T-thinker, which adopts the think-like-a-task (TLAT) computing paradigm to divide the computing workloads of compute-intensive problems while allowing backtracking search to avoid data materialization as much as possible. We will explain how the T-thinker model can achieve an ideal speedup ratio for many compute-intensive problems such as mining dense subgraphs, frequent subgraph pattern mining, and subgraph matching/enumeration. A number of TLAT-based systems will be covered, including G-thinker, G-thinkerQ, T-FSM, PrefixFPM, G2-AIMD, and T-DFS, which tackle compute-intensive graph problems in various settings such as on a shared-memory multi-core machine, on a distributed cluster, and on multiple GPUs. We will also explain how the T-thinker model applies beyond the graph domain to problems such as training big models consisting of many decision trees and massively parallel spatial data processing.

 

Bio:

Da Yan is an Associate Professor in the Department of Computer Sciences of the Luddy School of Informatics, Computing, and Engineering (SICE) at Indiana University Bloomington. He received his Ph.D. degree in Computer Science from the Hong Kong University of Science and Technology in 2014, and his B.S. degree in Computer Science from Fudan University in Shanghai in 2009. He is a DOE Early Career Research Program (ECRP) awardee in 2023, and the sole winner of the Hong Kong 2015 Young Scientist Award in Physical/Mathematical Science. His research interests include parallel and distributed systems for big data analytics, data mining, and machine learning (esp. deep learning). He frequently publishes in top DB and AI conferences such as SIGMOD, VLDB, ICDE, KDD, ICML, ICLR, AAAI, IJCAI, and EMNLP, and in top journals such as ACM TODS, VLDB Journal, IEEE TKDE, IEEE TPDS, and ACM Computing Surveys. He also serves extensively as a reviewer for the major DB and AI conferences and journals, has co-organized events such as the BIOKDD workshop with SIGKDD, Dagstuhl seminars, and a few top conferences, and has served as a guest editor of journals such as IEEE/ACM TCBB, BMC Bioinformatics, and IEEE CG&A.

 

——————————————————————————————————————————————————————————————————

 

Wednesday, 10/4/2023 (DB Systems Seminar)

10:30am, LWSN 3102AB (In-person only)

Faisal Nawab, Assistant Professor, University of California, Irvine

Talk Title: Enabling Emerging Edge and IoT Applications with Edge-Cloud Data Management

 

Abstract

The potential of Edge and IoT applications encompasses realms like smart cities, mobility solutions, and immersive technologies. Yet, the actualization of these promising applications stumbles upon a fundamental impediment: the prevailing cloud data management technologies are often tethered to remote data centers. This architectural choice introduces daunting challenges, including substantial wide-area latency, burdensome connectivity and communication bandwidth demands, and regulatory constraints related to personal and sensitive data.

 

This talk presents our research in introducing edge-cloud data management that provides a framework for managing data across edge nodes to overcome the limits of cloud-only data management. We encounter various challenges to achieving this vision, such as managing the sheer number of edge nodes, their sporadic availability, and device constraints in terms of compute, storage, and trust. To navigate these multifaceted challenges, our work redesigns distributed data management technologies to adapt to the edge environment. This includes introducing design concepts in the domains of hierarchical and asymmetric edge-cloud data management, decentralized edge coordination techniques, and edge-friendly mechanisms to maintain security and trust. The talk includes a demonstration of 'AnyLog', an edge-cloud data management solution that integrates our research findings.

 

Bio: Faisal Nawab is an assistant professor in the computer science department at the University of California, Irvine. He is the director of EdgeLab, which is dedicated to building edge-cloud data management solutions for emerging edge and IoT applications. Faisal's research is influenced by practical industry problems through his involvement with the startup 'AnyLog', where he acts as the lead architect designing an edge-cloud database. Faisal has received recognition for his work, winning the "Next-Generation Data Infrastructure" award from Facebook, being named the runner-up for the IEEE TEMS Blockchain Early-Career Award, and being awarded several NSF grants as well as industry funding from Meta and Roblox.

 

——————————————————————————————————————————————————————————————————

 

April 18, 2022 (DB Systems Seminar)

Stratos Idreos, Harvard University

Talk Title: The Data Systems Grammar

 

Abstract

Data structures are everywhere. They define the behavior of modern data systems and data-driven algorithms. For example, with data systems that utilize the correct data structure design for the problem at hand, we can reduce the monthly bill of large-scale data systems applications on the cloud by hundreds of thousands of dollars. We can accelerate data science tasks by being able to dramatically speed up the computation of statistics over large amounts of data. We can train drastically more neural networks within a given time budget, improving accuracy. 

 

However, knowing the right data structure and data system design for any given scenario is a notoriously hard problem; there is a massive space of possible designs while there is no single design that is perfect across all data, queries, and hardware scenarios. We will discuss our quest for the first principles of data structures and data system design. We will show signs that it is possible to reason about this massive design space, and we will show early results from a prototype self-designing data system which can take drastically different shapes to optimize for the workload, hardware, and available cloud budget using machine learning and what we call machine knowing. These shapes include data structure and system designs which are discovered automatically and do not exist in the literature or industry. 

 

Bio: Stratos Idreos is an associate professor of Computer Science at Harvard University where he leads the Data Systems Laboratory. His research focuses on making it easy and even automatic to design workload- and hardware-conscious data structures and data systems with applications on relational, NoSQL, and data science problems. For his PhD thesis on adaptive indexing, Stratos was awarded the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation award and the 2011 ERCIM Cor Baayen award from the European Research Consortium for Informatics and Mathematics. In 2015 he was awarded the IEEE TCDE Rising Star Award from the IEEE Technical Committee on Data Engineering for his work on adaptive data systems and in 2020 he received the ACM SIGMOD Contributions award for his work on reproducible research. He is also a recipient of the National Science Foundation Career award and the Department of Energy Early Career award. Stratos was PC Chair of ACM SIGMOD 2021 and IEEE ICDE 2022, and he is the founding editor of the ACM/IMS Journal of Data Science and the chair of the ACM SoCC Steering Committee.

 

——————————————————————————————————————————————————————————————————

 

February 14, 2022 (DB Systems Seminar)

Mahmoud Sakr, Free University of Brussels, Belgium

Talk Title: How to build a Moving Object Database System


Abstract:
The increased availability of geospatial trajectory data, fueled by the advances in embedded sensors, has opened opportunities and ambitions to build new applications that make use of location intelligence in domains like maritime, social mobility, urban, and logistics. MobilityDB, an open source SQL moving object database, provides data management support for such applications. It uses the extensibility features of PostgreSQL to implement an abstract data type model of moving objects. It defines the TEMPORAL type constructor, which extends the base types of PostgreSQL and the geometry types of PostGIS into temporal and spatiotemporal types, respectively, for representing temporal integers, temporal Booleans, temporal geometries, etc. On top of these types there is a rich API of data management functions in SQL. MobilityDB is engineered as an extension to PostgreSQL, which can be dynamically loaded without a need to restart the server. In contrast to a fork, an extension is fully compatible with PostgreSQL and PostGIS and their ecosystem, which enables users to seamlessly use MobilityDB in deployment environments. In this talk, I'll describe the architecture of MobilityDB, then dive into the R&D challenges of selected functions and open problems.

 

Bio: Dr. Mahmoud Sakr is an assistant professor at the Brussels School of Engineering in the Free University of Brussels (Université Libre de Bruxelles, ULB). His main research scope is mobility data science. He is a main contributor and a co-founder of the MobilityDB MOD, an OSGeo community project that extends PostgreSQL and PostGIS with temporal and spatiotemporal data types. It provides data management of geospatial trajectories in SQL. He is also a main contributor and co-chair of the Moving Feature Standards Working Group (MF-SWG) of the Open Geospatial Consortium (OGC). He contributes to making standards on moving object data representation, exchange, and analysis. He contributed as initial convener in amending the MF-SWG charter to cover these topics. He participates in the Erasmus Mundus Joint Master programme on Big Data Management and Analytics (BDMA), and in the Marie Sklodowska-Curie ITN-European Joint Doctorate programme on Data Engineering for Data Science (DEDS). He holds a PhD (dr.rer.nat) from the FernUniversität in Hagen, Germany. Besides his academic activities, he actively participates and gives talks in open-source community conferences including FOSS4G, PGConf, and FOSDEM. For more information: https://cs.ulb.ac.be/members/mahmoud/

 

——————————————————————————————————————————————————————————————————

 

November 5, 2019 (Samuel Conte Distinguished Lecture Series and DB Systems Seminar):

Michael Stonebraker, Adjunct Professor of Computer Science, MIT

Talk Title: We Are Often Working on the Wrong Problem (10 misconceptions about what is important)

 

Abstract: 

In the DBMS/Data Systems area, many of us seem to have lost our way.  This talk discusses 10 different problem areas in which there is considerable current research.  Then, I present why I believe much of the work is misguided, either because our assumptions about these problems are incorrect or because we are not paying attention to real users.  Topics considered include machine learning (deep and conventional), public blockchain, data warehouses, schema evolution and the cloud.

 

Bio: Dr. Stonebraker has been a pioneer of database research and technology for more than forty years.  He was the main architect of the INGRES relational DBMS, and the object-relational DBMS, POSTGRES.  These prototypes were developed at the University of California at Berkeley where Stonebraker was a Professor of Computer Science for twenty-five years.  More recently at M.I.T. he was a co-architect of the Aurora/Borealis stream processing engine, the C-Store column-oriented DBMS, the H-Store transaction processing engine, the SciDB array DBMS, and the Data Tamer data curation system. Presently he serves as Chief Technology Officer of Paradigm4 and Tamr, Inc.

 

Professor Stonebraker was awarded the ACM System Software Award in 1992 for his work on INGRES.  Additionally, he was awarded the first annual SIGMOD Innovation award in 1994, and was elected to the National Academy of Engineering in 1997.  He was awarded the IEEE John von Neumann Medal in 2005 and the 2014 Turing Award, and is presently an Adjunct Professor of Computer Science at M.I.T.

 

——————————————————————————————————————————————————————————————————

 

October 29, 2019 (Samuel Conte Distinguished Lecture Series and DB Systems Seminar):

Raghu Ramakrishnan, CTO for Data and Technical Fellow at Microsoft

Talk Title: Data in the Cloud

Abstract: 

The cloud has forced a rethinking of database architectures.  Does this offer an opportunity to address the siloed nature of data management systems?  The question is especially important given the rise of machine learning and data governance. In this talk, I'll discuss these issues through the lens of the Microsoft data journey, both internal and external.

 

Bio: Raghu Ramakrishnan has been CTO for Data and a Technical Fellow at Microsoft since 2012. From 1987 to 2006, he was a professor at the University of Wisconsin-Madison, where he wrote the widely used text “Database Management Systems”.  In 1999, he founded QUIQ, a company powering crowd-sourced question-answering as a cloud service. He joined Yahoo! in 2006 as a Yahoo! Fellow and served as Chief Scientist for the portal, cloud, and search divisions. Ramakrishnan has received several awards, including the ACM SIGMOD Edgar F. Codd Innovations Award, the ACM SIGKDD Innovations Award, the ACM SIGMOD Contributions Award, 10-year Test-of-Time Awards from the ACM SIGMOD, ACM SOCC, ICDT and VLDB conferences, the IIT Madras Distinguished Alumnus Award, the NSF Presidential Young Investigator Award, and the Packard Fellowship in Science and Engineering. He is a Fellow of the ACM and IEEE and has served as Chair of ACM SIGMOD.

——————————————————————————————————————————————————————————————————

 

October 25, 2019: 

Cyrus Shahabi, Professor and Chair, Department of Computer Science, Univ. of Southern California

Talk Title: Transportation Data, Applications & Research for Smart Cities

 

Abstract: 

In this talk, I first introduce the Integrated Media Systems Center (IMSC), a data science research center at USC that focuses on data-driven solutions for real-world applications. IMSC is motivated by the need to address fundamental Data Science problems related to applications with major societal impact. Towards this end, I delve into one specific application domain, Transportation, and discuss the design and development of a large-scale transportation data platform and its application to address real-world problems in Smart Cities.  I will then continue covering some of our fundamental research in this area, in particular: 1) traffic forecasting and 2) ride matching.

 

Bio: Cyrus Shahabi is a Professor of Computer Science, Electrical Engineering and Spatial Sciences; Helen N. and Emmett H. Jones Professor of Engineering; the chair of the Computer Science Department; and the director of the Integrated Media Systems Center (IMSC) at USC’s Viterbi School of Engineering.  He was the co-founder of two USC spin-offs, Geosemble Technologies and Tallygo, which were acquired in July 2012 and March 2019, respectively. He received his B.S. in Computer Engineering from Sharif University of Technology in 1989 and then his M.S. and Ph.D. degrees in Computer Science from the University of Southern California in May 1993 and August 1996, respectively. He has authored two books and more than three hundred research papers in databases, GIS, and multimedia, with more than 12 US patents.

Dr. Shahabi has received funding from several agencies such as NSF, NIJ, NASA, NIH, DARPA, AFRL, NGA and DHS, as well as from several industries such as Chevron, Google, HP, Intel, Microsoft, NCR, NGC and Oracle. He was an Associate Editor of IEEE Transactions on Parallel and Distributed Systems (TPDS) from 2004 to 2009, IEEE Transactions on Knowledge and Data Engineering (TKDE) from 2010 to 2013, and VLDB Journal from 2009 to 2015. He is currently the chair of ACM SIGSPATIAL for the 2017-2020 term and also on the editorial board of the ACM Transactions on Spatial Algorithms and Systems (TSAS) and ACM Computers in Entertainment. He is the founding chair of the IEEE NetDB workshop and also the general co-chair of SSTD’15, ACM GIS 2007, 2008 and 2009. He chaired the founding nomination committee of ACM SIGSPATIAL for its first term (2011-2014 term). He has been PC co-chair of several conferences such as APWeb+WAIM’2017, BigComp’2016, MDM’2016, DASFAA 2015, IEEE MDM 2013 and IEEE BigData 2013, and regularly serves on the program committee of major conferences such as VLDB, SIGMOD, IEEE ICDE, ACM SIGKDD, IEEE ICDM, and ACM Multimedia.

Dr. Shahabi is a fellow of IEEE, and was a recipient of the ACM Distinguished Scientist award in 2009, the 2003 U.S. Presidential Early Career Awards for Scientists and Engineers (PECASE), the NSF CAREER award in 2002, and the 2001 Okawa Foundation Research Grant for Information and Telecommunications. He was also a recipient of the US Vietnam Education Foundation (VEF) faculty fellowship award in 2011 and 2012, an organizer of the 2011 National Academy of Engineering “Japan-America Frontiers of Engineering” program, an invited speaker in the 2010 National Research Council (of the National Academies) Committee on New Research Directions for the National Geospatial-Intelligence Agency, and a participant in the 2005 National Academy of Engineering “Frontiers of Engineering” program.

 

 

——————————————————————————————————————————————————————————————————

 

February 27, 2019:

Spyros Blanas, Professor, Department of Computer Science and Engineering, Ohio State University

Talk Title: Scaling Database Systems to High-performance Computers 

Abstract:

We are witnessing the increasing use of warehouse-scale computers to analyze massive datasets quickly. This poses two challenges for database systems. The first challenge is interoperability with established analytics libraries and tools. Massive datasets often consist of images (arrays) in file formats like FITS and HDF5. We will first present ArrayBridge, an open-source I/O library that allows SciDB, TensorFlow and HDF5-based programs to co-exist in a pipeline without converting between file formats. The second challenge is scalability, as warehouse-scale computers expose communication bottlenecks in foundational data processing operations. We will present GRASP, a parallel aggregation algorithm for high-cardinality aggregation that avoids unscalable all-to-all communication and leverages similarity to complete the aggregation faster than repartitioning. Finally, we will present an RDMA-aware data shuffling algorithm that transmits data up to 4X faster than MPI. We conclude by highlighting additional challenges that need to be overcome to scale database systems to massive computers.

 

 

Bio: Spyros Blanas is an associate professor in the Department of Computer Science and Engineering at The Ohio State University. His research interest is high-performance database systems. He is particularly interested in understanding and optimizing the interaction between the database kernel and the underlying hardware. His current research goal is to build a data management system for high-end computing facilities. He has received a Google Faculty Research Award and an IEEE TCDE Rising Star Award. Before joining Ohio State, he received his Ph.D. at the University of Wisconsin–Madison, where he was a member of the Database Systems group and the Microsoft Jim Gray Systems Lab. Part of his dissertation was commercialized in Microsoft's flagship data management product, SQL Server 2014, as the Hekaton in-memory transaction processing engine. He holds a five-year diploma in Computer Engineering from the Technical University of Crete, in Greece.

 

——————————————————————————————————————————————————————————————————

 

December 3, 2018: 

Semih Salihoglu, Professor, Department of Computer Science, University of Waterloo

Talk Title: How can (worst-case optimal) joins be so interesting?

 

Abstract:

Worst-case optimality is perhaps the weakest notion of optimality for algorithms. A recent surprising theoretical development in databases has been the realization that the traditional join algorithms, which are based on binary joins, are not even worst-case optimal. Upon this realization, several surprisingly simple join algorithms have been developed that are provably worst-case optimal. Unlike traditional algorithms, which join subsets of tables at a time, worst-case optimal join algorithms perform the join one attribute (or column) at a time. This talk gives an overview of several lines of work that my colleagues and I have been doing on worst-case optimal join algorithms, focusing on their application to subgraph queries. I will cover work from both distributed and serial settings. In the distributed setting, worst-case optimality is a yardstick for two costs of an algorithm: (i) the load, i.e., amount of data per machine; and (ii) the total communication. Both load and communication complexity trade off against the number of rounds an algorithm runs. I will describe how to achieve worst-case optimality in total communication and the performance of this algorithm on subgraph queries. It is an open theoretical problem to design constant-round algorithms with worst-case optimal load. In the serial setting, I will describe the optimizer of a prototype graph database called Graphflow that we are building at the University of Waterloo. Graphflow's optimizer for subgraph queries mixes worst-case optimal join-style column-at-a-time processing seamlessly with traditional binary joins.

 

Bio: Semih Salihoglu is an Assistant Professor at the University of Waterloo. His research focuses on graph databases, distributed systems for processing graphs, and algorithms and theories for distributed evaluation of database queries. He holds a PhD from Stanford University and is a recipient of the 2018 VLDB best paper award.

 

 

 

 

——————————————————————————————————————————————————————————————————