The Database Systems Seminar
Wednesday, 10/2/2024 (DB Systems Seminar)
10:30am, LWSN 3102AB (In-person only)
Da Yan, Associate Professor, Indiana University, Bloomington
Talk Title: T-thinker: A Task-Based Parallel Computing Model for Compute-Intensive Graph Analytics and Beyond
Abstract:
Pioneered
by Google's Pregel, the think-like-a-vertex (TLAV) computing model has
dominated the area of parallel and distributed graph processing. However, TLAV
models are only scalable for data-intensive iterative graph algorithms such as
random walks and graph traversal. Unfortunately, researchers have often used TLAV
models to solve compute-intensive graph problems, leading to performance not
much beyond that of a serial algorithm due to the I/O bottleneck incurred by
unnecessarily materializing large amounts of intermediate data. This talk advocates a
new parallel computing model called T-thinker, which adopts the
think-like-a-task (TLAT) computing paradigm to divide the computing workloads
of compute-intensive problems while allowing backtracking search to avoid data
materialization as much as possible. We will explain how the T-thinker model
can achieve an ideal speedup ratio for many compute-intensive problems such as
mining dense subgraphs, frequent subgraph pattern mining, and subgraph
matching/enumeration. A number of TLAT-based systems will be covered including
G-thinker, G-thinkerQ, T-FSM, PrefixFPM,
G2-AIMD, and T-DFS, which tackle compute-intensive graph problems in various
settings such as on a shared-memory multi-core machine, on a distributed
cluster, and on multiple GPUs. We will also explain how the T-thinker model
applies beyond the graph domain to problems such as training big models
consisting of many decision trees, and massively parallel spatial data
processing.
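To make the think-like-a-task idea concrete, below is a minimal Python sketch, not the actual T-thinker or G-thinker API: the toy graph, function names, and task decomposition are illustrative assumptions. A compute-intensive problem, counting k-cliques, is divided into independent per-vertex tasks, and each task runs an in-memory backtracking search instead of materializing intermediate results, so the tasks can be spread across cores.

```python
# Minimal sketch of the think-like-a-task idea (NOT the T-thinker/G-thinker API).
# The toy graph, function names, and task decomposition below are illustrative.
from concurrent.futures import ProcessPoolExecutor
from functools import partial

# Toy undirected graph as a symmetric adjacency dict.
GRAPH = {
    0: {1, 2, 3, 4},
    1: {0, 2, 3, 4},
    2: {0, 1, 3},
    3: {0, 1, 2},
    4: {0, 1},
}

def count_cliques_from(seed, graph, k):
    """One independent task: count the k-cliques whose smallest vertex is `seed`,
    using in-memory backtracking instead of materializing intermediate results."""
    def backtrack(clique, candidates):
        if len(clique) == k:
            return 1
        total = 0
        for v in sorted(candidates):
            # Keep only larger-ID candidates adjacent to v (and, by invariant,
            # to every vertex already in the clique).
            total += backtrack(clique + [v],
                               {u for u in candidates if u > v} & graph[v])
        return total

    return backtrack([seed], {v for v in graph[seed] if v > seed})

if __name__ == "__main__":
    k = 3
    with ProcessPoolExecutor() as pool:  # independent tasks -> near-linear scaling
        per_seed = pool.map(partial(count_cliques_from, graph=GRAPH, k=k), GRAPH)
    print(f"{k}-cliques:", sum(per_seed))  # 5 triangles in the toy graph
```

The same decomposition pattern carries over to the problems mentioned in the abstract, such as dense-subgraph mining and subgraph matching, where each task owns a small search subtree and backtracks within it.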
Bio: Da Yan is
an Associate Professor in the Department of Computer Science of the Luddy School of Informatics, Computing, and Engineering
(SICE) at Indiana University Bloomington. He received his Ph.D. degree in
Computer Science from the Hong Kong University of Science and Technology in
2014, and he received his B.S. degree in Computer Science from Fudan University
in Shanghai in 2009. He is a DOE Early Career Research Program (ECRP) awardee
in 2023, and the sole winner of the Hong Kong 2015 Young Scientist Award in
Physical/Mathematical Science. His research interests include parallel and
distributed systems for big data analytics, data mining, and machine learning
(esp. deep learning). He frequently publishes in top DB and AI conferences such
as SIGMOD, VLDB, ICDE, KDD, ICML, ICLR, AAAI, IJCAI, EMNLP, and in top journals
such as ACM TODS, VLDB Journal, IEEE TKDE, IEEE TPDS, and ACM Computing Surveys. He also serves extensively as a reviewer for major DB and AI conferences and journals, has co-organized events such as the BIOKDD workshop at SIGKDD, Dagstuhl seminars, and several top conferences, and has served as a guest editor for journals such as IEEE/ACM TCBB, BMC Bioinformatics, and IEEE CG&A.
——————————————————————————————————————————————————————————————————
Wednesday, 10/4/2023 (DB Systems Seminar)
10:30am, LWSN 3102AB (In-person only)
Faisal Nawab, Assistant Professor, University of California, Irvine
Talk Title: Enabling Emerging Edge and IoT Applications with Edge-Cloud Data Management
Abstract:
The potential of Edge and IoT
applications encompasses realms like smart cities, mobility solutions, and
immersive technologies. Yet, the actualization of these promising applications
stumbles upon a fundamental impediment: the prevailing cloud data management
technologies are often tethered to remote data centers. This architectural
choice introduces daunting challenges, including substantial wide-area latency,
burdensome connectivity and communication bandwidth demands, and regulatory
constraints related to personal and sensitive data.
This talk presents our research on edge-cloud data management, which provides a framework for managing data across edge nodes to overcome the limits of cloud-only data management. We
encounter various challenges to achieving this vision such as managing the
sheer number of edge nodes, their sporadic availability, and device constraints
in terms of compute, storage, and trust. To navigate these multifaceted
challenges, our work redesigns distributed data management technologies to
adapt to the edge environment. This includes introducing design concepts in the
domains of hierarchical and asymmetric edge-cloud data management,
decentralized edge coordination techniques, and edge-friendly mechanisms to
maintain security and trust. The talk includes a demonstration of AnyLog, an edge-cloud data management solution that
integrates our research findings.
Bio: Faisal
Nawab is an assistant professor in the computer science department at the
University of California, Irvine. He is the director of EdgeLab,
which is dedicated to building edge-cloud data management solutions for
emerging edge and IoT applications. Faisal's research is influenced by
practical industry problems through his involvement with the startup AnyLog, where he serves as the lead architect of an edge-cloud database. Faisal has received recognition for his work, winning the
"Next-Generation Data Infrastructure" award from Facebook, being
named the runner-up for the IEEE TEMS Blockchain Early-Career Award, and being
awarded several NSF grants and industry funding from Meta and Roblox.
——————————————————————————————————————————————————————————————————
April 18, 2022 (DB Systems Seminar)
Stratos Idreos, Harvard University
Talk Title: The Data Systems Grammar
Abstract:
Data
structures are everywhere. They define the behavior of modern data systems and
data-driven algorithms. For example, with data systems that utilize the correct
data structure design for the problem at hand, we can reduce the monthly bill
of large-scale data systems applications on the cloud by hundreds of thousands
of dollars. We can accelerate data science tasks by being able to dramatically
speed up the computation of statistics over large amounts of data. We can train
drastically more neural networks within a given time budget, improving
accuracy.
However,
knowing the right data structure and data system design for any given scenario
is a notoriously hard problem; there is a massive space of possible designs
while there is no single design that is perfect across all data, queries, and
hardware scenarios. We will discuss our quest for the first principles of data
structures and data system design. We will show signs that it is possible to
reason about this massive design space, and we will show early results from a
prototype self-designing data system which can take drastically different
shapes to optimize for the workload, hardware, and available cloud budget using
machine learning and what we call machine knowing. These shapes include data
structure and system designs which are discovered automatically and do not
exist in the literature or industry.
Bio: Stratos Idreos is an associate professor of Computer Science at
Harvard University where he leads the Data Systems Laboratory. His research
focuses on making it easy and even automatic to design workload and hardware
conscious data structures and data systems with applications on relational,
NoSQL, and data science problems. For his PhD thesis on adaptive indexing,
Stratos was awarded the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation award
and the 2011 ERCIM Cor Baayen Award from the European Research Consortium for Informatics and Mathematics. In 2015 he was awarded the
IEEE TCDE Rising Star Award from the IEEE Technical Committee on Data
Engineering for his work on adaptive data systems and in 2020 he received the
ACM SIGMOD Contributions award for his work on reproducible research. He is
also a recipient of the National Science Foundation Career award and the
Department of Energy Early Career award. Stratos was PC Chair of ACM SIGMOD
2021 and IEEE ICDE 2022, and he is the founding editor of the ACM/IMS Journal
of Data Science and the chair of the ACM SoCC
Steering Committee.
——————————————————————————————————————————————————————————————————
February 14, 2022 (DB Systems Seminar)
Mahmoud Sakr, Free University of Brussels, Belgium
Talk Title: How to build a Moving Object Database System
Abstract:
The increased
availability of geospatial trajectory data, fueled by the advances in embedded
sensors, has opened opportunities and ambitions to build new applications that
make use of location intelligence in domains like maritime, social mobility,
urban, and logistics. MobilityDB, an open source SQL moving object database, provides data
management support for such applications. It uses the extensibility
features of PostgreSQL to implement an abstract data type model of moving
objects. It defines the TEMPORAL type constructor, which extends the base types
of PostgreSQL and the geometry types of PostGIS
respectively into temporal and spatiotemporal types for representing temporal
integers, temporal Booleans, temporal geometries, etc. On top of these types
there is a rich API of data management functions in SQL. MobilityDB
is engineered as an extension to PostgreSQL, which can be dynamically loaded,
without a need to restart the server. In contrast to a fork, an extension is
fully compatible with PostgreSQL and PostGIS and
their ecosystem, which enables users to seamlessly use MobilityDB
in deployment environments. In this talk, I'll describe the architecture of MobilityDB, then dive into the R&D challenges of
selected functions and open problems.
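As a rough intuition for what the TEMPORAL type constructor provides, here is a hedged toy sketch in Python rather than SQL. It is not MobilityDB's actual API or implementation, only an application-side model of lifting a base type (a float) into a time-varying value with interpolation, which is conceptually what temporal types such as a temporal float or temporal point offer inside the database.

```python
# Toy model of the TEMPORAL type-constructor idea (NOT MobilityDB's SQL API):
# a base value type is lifted into a time-varying type by storing timestamped
# instants and interpolating between them.
from bisect import bisect_right
from datetime import datetime, timedelta

class TFloat:
    """A 'temporal float': a piecewise-linear function from time to float."""

    def __init__(self, instants):
        # instants: list of (datetime, float) pairs, assumed sorted by time.
        self.times, self.values = zip(*instants)

    def value_at(self, t):
        """Return the interpolated value at timestamp t, or None outside the range."""
        if t < self.times[0] or t > self.times[-1]:
            return None
        i = bisect_right(self.times, t) - 1
        if i == len(self.times) - 1:
            return self.values[-1]
        t0, t1 = self.times[i], self.times[i + 1]
        v0, v1 = self.values[i], self.values[i + 1]
        frac = (t - t0) / (t1 - t0)          # timedelta / timedelta -> float
        return v0 + frac * (v1 - v0)

# Example: a vehicle's speed sampled at three instants, queried in between.
start = datetime(2022, 2, 14, 10, 0)
speed = TFloat([(start, 30.0),
                (start + timedelta(minutes=5), 50.0),
                (start + timedelta(minutes=10), 40.0)])
print(speed.value_at(start + timedelta(minutes=2, seconds=30)))  # -> 40.0
```

In MobilityDB itself, such operations are exposed as SQL types and functions inside PostgreSQL/PostGIS rather than in application code, which is what keeps trajectory analytics close to the data.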
Bio: Dr. Mahmoud Sakr is an assistant professor at the Brussels School of Engineering of the Free University of Brussels (Université Libre de Bruxelles, ULB). His main research area is mobility data
science. He is a main contributor and a co-founder of the MobilityDB
MOD, an OSGeo community project that extends
PostgreSQL and PostGIS with temporal and
spatiotemporal data types. It provides data management of geospatial
trajectories in SQL. He is also a main contributor and co-chair of the Moving Features Standards Working Group (MF-SWG) of the Open Geospatial Consortium (OGC). He contributes to developing standards on moving object data representation, exchange, and analysis, and served as the initial convener in amending the MF-SWG charter to cover these topics. He participates in the Erasmus Mundus Joint Master programme on Big Data Management and Analytics (BDMA) and in the Marie Skłodowska-Curie ITN-European Joint Doctorate programme on Data Engineering for Data Science (DEDS). He holds a PhD (Dr. rer. nat.) from the FernUniversität in Hagen, Germany. Besides his academic activities, he actively participates and gives talks in open-source community conferences including FOSS4G, PGConf, and FOSDEM. For more information: https://cs.ulb.ac.be/members/mahmoud/
——————————————————————————————————————————————————————————————————
November 5, 2019 (Samuel Conte Distinguished Lecture Series and DB Systems Seminar):
Michael Stonebraker, Adjunct Professor of Computer Science, MIT
Talk Title: We Are Often Working on the Wrong Problem (10 misconceptions about what is important)
Abstract:
In the DBMS/Data Systems area, many of us seem to have lost our way. This talk discusses 10 different problem areas in which there is considerable current research. Then, I present why I believe much of the work is misguided, either because our assumptions about these problems are incorrect or because we are not paying attention to real users. Topics considered include machine learning (deep and conventional), public blockchain, data warehouses, schema evolution, and the cloud.
Bio: Dr. Stonebraker has been a pioneer of database research and technology for more than forty years. He was the main architect of the INGRES relational DBMS and of the object-relational DBMS POSTGRES. These prototypes were developed at the University of California at Berkeley, where Stonebraker was a Professor of Computer Science for twenty-five years. More recently at M.I.T. he was a co-architect of the Aurora/Borealis stream processing engine, the C-Store column-oriented DBMS, the H-Store transaction processing engine, the SciDB array DBMS, and the Data Tamer data curation system. Presently he serves as Chief Technology Officer of Paradigm4 and Tamr, Inc.
Professor
Stonebraker was awarded the ACM Software System Award in 1992 for his work on INGRES. Additionally, he was awarded the first annual SIGMOD Innovations Award in 1994, and was elected to the National Academy of Engineering in 1997. He was awarded the IEEE John von Neumann Medal in
2005 and the 2014 Turing Award, and is presently an Adjunct Professor of
Computer Science at M.I.T.
——————————————————————————————————————————————————————————————————
October 29, 2019 (Samuel Conte Distinguished Lecture Series and DB Systems Seminar):
Raghu Ramakrishnan, CTO for Data and Technical Fellow at Microsoft
Talk Title: Data in the Cloud
Abstract:
The
cloud has forced a rethinking of database architectures. Does this offer
an opportunity to address the siloed nature of data management systems?
The question is especially important given the rise of machine learning and
data governance. In this talk, I'll discuss these issues through the lens of
the Microsoft data journey, both internal and external.
Bio: Raghu Ramakrishnan has been CTO for Data and a Technical Fellow at Microsoft since 2012. From 1987 to 2006, he was a professor at the University of Wisconsin-Madison, where he wrote the
widely used text “Database Management Systems”. In 1999, he founded QUIQ,
a company powering crowd-sourced question-answering as a cloud service. He
joined Yahoo! in 2006 as a Yahoo! Fellow and served as Chief Scientist for the
portal, cloud, and search divisions. Ramakrishnan has received several awards,
including the ACM SIGMOD Edgar F. Codd Innovations Award, the ACM SIGKDD
Innovations Award, the ACM SIGMOD Contributions Award, 10-year Test-of-Time
Awards from the ACM SIGMOD, ACM SOCC, ICDT and VLDB conferences, the IIT Madras
Distinguished Alumnus Award, the NSF Presidential Young Investigator Award, and
the Packard Fellowship in Science and Engineering. He is a Fellow of the ACM
and IEEE and has served as Chair of ACM SIGMOD.
——————————————————————————————————————————————————————————————————
October 25, 2019:
Cyrus Shahabi, Professor and Chair, Department of Computer Science, University of Southern California
Talk Title: Transportation Data, Applications & Research for Smart Cities
Abstract:
In this talk, I first introduce the Integrated Media Systems Center (IMSC), a data science research center at USC that focuses on data-driven solutions for real-world applications. IMSC is motivated by the need to address fundamental data science problems related to applications with major societal impact. Towards this end, I delve into one specific application domain, transportation, and discuss the design and development of a large-scale transportation data platform and its application to address real-world problems in smart cities. I will then continue covering some of our fundamental research in this area, in particular: 1) traffic forecasting and 2) ride matching.
Bio: Cyrus
Shahabi is a Professor of Computer Science,
Electrical Engineering and Spatial Sciences; Helen N. and Emmett H. Jones
Professor of Engineering; the chair of the Computer Science Department; and the
director of the Integrated Media Systems Center (IMSC) at USC’s Viterbi School
of Engineering. He was the co-founder of two USC spin-offs, Geosemble Technologies and Tallygo,
which were acquired in July 2012 and March 2019, respectively. He received his
B.S. in Computer Engineering from Sharif University of Technology in 1989 and
then his M.S. and Ph.D. Degrees in Computer Science from the University of
Southern California in May 1993 and August 1996, respectively. He has authored
two books and more than three hundred research papers in databases, GIS, and
multimedia, with more than 12 US patents.
Dr.
Shahabi has received funding from several agencies
such as NSF, NIJ, NASA, NIH, DARPA, AFRL, NGA, and DHS, as well as from industry partners such as Chevron, Google, HP, Intel, Microsoft, NCR, NGC, and Oracle.
He was an Associate Editor of IEEE Transactions on Parallel and Distributed
Systems (TPDS) from 2004 to 2009, IEEE Transactions on Knowledge and Data
Engineering (TKDE) from 2010-2013, and VLDB Journal from 2009-2015. He is
currently the chair of ACM SIGSPATIAL for the 2017-2020 term and also on the
editorial board of the ACM Transactions on Spatial Algorithms and Systems
(TSAS) and ACM Computers in Entertainment. He is the founding chair of the IEEE NetDB workshop and was the general co-chair of SSTD’15 and ACM GIS 2007, 2008, and 2009. He chaired the founding nomination committee of ACM SIGSPATIAL for its first term (2011-2014). He has been PC co-chair of
several conferences such as APWeb+WAIM’2017, BigComp’2016, MDM’2016, DASFAA
2015, IEEE MDM 2013 and IEEE BigData 2013, and
regularly serves on the program committee of major conferences such as VLDB,
SIGMOD, IEEE ICDE, ACM SIGKDD, IEEE ICDM, and ACM Multimedia.
Dr.
Shahabi is a fellow of IEEE, and was a recipient of
the ACM Distinguished Scientist award in 2009, the 2003 U.S. Presidential Early
Career Awards for Scientists and Engineers (PECASE), the NSF CAREER award in
2002, and the 2001 Okawa Foundation Research Grant for Information and
Telecommunications. He was also a recipient of the US Vietnam Education
Foundation (VEF) faculty fellowship award in 2011 and 2012, an organizer of the
2011 National Academy of Engineering “Japan-America Frontiers of Engineering”
program, an invited speaker in the 2010 National Research Council (of the
National Academies) Committee on New Research Directions for the National
Geospatial-Intelligence Agency, and a participant in the 2005 National Academy
of Engineering “Frontiers of Engineering” program.
——————————————————————————————————————————————————————————————————
February 27, 2019:
Spyros Blanas, Associate Professor, Department of Computer Science and Engineering, Ohio State University
Talk Title: Scaling Database Systems to High-Performance Computers
Abstract:
We
are witnessing the increasing use of warehouse-scale computers to analyze
massive datasets quickly. This poses two
challenges for database systems. The first challenge is interoperability with
established analytics libraries and tools.
Massive datasets often consist of images (arrays) in file formats like FITS and
HDF5. We will first present ArrayBridge, an
open-source I/O library that allows SciDB, TensorFlow
and HDF5-based programs to co-exist in a pipeline
without converting between file formats. The second challenge is scalability,
as warehouse-scale computers expose
communication bottlenecks in foundational data processing operations. We will
present GRASP, a parallel aggregation
algorithm for high-cardinality aggregation that avoids unscalable all-to-all
communication and leverages similarity
to complete the aggregation faster than repartitioning. Finally, we will
present an RDMA-aware data shuffling
algorithm that transmits data up to 4X faster than MPI. We conclude by
highlighting additional challenges that
need to be overcome to scale database systems to massive computers.
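As a hedged, generic illustration of the communication bottleneck that motivates GRASP (this sketch is not the GRASP algorithm itself; the data and function names are made up), the following Python compares naive repartitioning, which ships every row across the network, with local pre-aggregation, which ships only one partial result per node and key. GRASP addresses the harder high-cardinality case, where the abstract notes it leverages similarity between partitions rather than relying on such local combining.

```python
# Generic illustration of the aggregation communication trade-off
# (NOT the GRASP algorithm; data and function names are illustrative).
from collections import Counter

# Toy "cluster": each node holds a partition of (group_key, value) rows.
node_partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5), ("a", 6)],
    [("c", 7), ("c", 8), ("b", 9)],
]

def naive_repartition(partitions):
    """Ship every row to the node owning its key: O(total rows) communication."""
    shipped = [row for part in partitions for row in part]  # every row crosses the network
    totals = Counter()
    for key, value in shipped:
        totals[key] += value
    return totals, len(shipped)

def local_preaggregate(partitions):
    """SUM locally first, then ship one partial sum per (node, key)."""
    shipped = []
    for part in partitions:
        partial = Counter()
        for key, value in part:
            partial[key] += value
        shipped.extend(partial.items())                     # far fewer records cross
    totals = Counter()
    for key, value in shipped:
        totals[key] += value
    return totals, len(shipped)

print(naive_repartition(node_partitions))   # totals a=10, b=15, c=20; 9 rows shipped
print(local_preaggregate(node_partitions))  # same totals; only 7 partial sums shipped
```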
Bio: Spyros Blanas is an associate professor in the Department of Computer Science and Engineering at The Ohio State University. His research interest is high-performance database systems. He is particularly interested in understanding and optimizing the interaction between the database kernel and the underlying hardware. His current research goal is to build a data management system for high-end computing facilities. He has received a Google Faculty Research Award and an IEEE TCDE Rising Star Award. Before joining Ohio State, he received his Ph.D. at the University of Wisconsin–Madison, where he was a member of the Database Systems group and the Microsoft Jim Gray Systems Lab. Part of his dissertation was commercialized in Microsoft's flagship data management product, SQL Server 2014, as the Hekaton in-memory transaction processing engine. He holds a five-year diploma in Computer Engineering from the Technical University of Crete, in Greece.
——————————————————————————————————————————————————————————————————
December 3, 2018:
Semih Salihoglu, Assistant Professor, Department of Computer Science, University of Waterloo
Talk Title: How can (worst-case optimal) joins be so interesting?
Abstract:
Worst-case
optimality is perhaps the weakest notion of optimality for algorithms. A recent
surprising theoretical development
in databases has been the realization that the traditional join algorithms,
which are based on binary joins,
are not even worst-case optimal. Upon this realization, several surprisingly
simple join algorithms have been developed
that are provably worst-case optimal. Unlike traditional algorithms, which join
subsets of tables at a time, worst-case
join algorithms perform the join one attribute (or column) at a time. This talk
gives an overview of several
lines of work that my colleagues and I have been doing on worst-case join
algorithms focusing on their application
to subgraph queries. I will cover work from both distributed and serial
settings. In the distributed setting, worst-case optimality is a yardstick for two costs of an algorithm: (i) the load, i.e., the amount of data per machine; and (ii) the total communication. Both load and communication complexity trade off against the number of rounds an algorithm runs. I will describe how to achieve worst-case optimality
in total communication and the performance
of this algorithm on subgraph queries. It is an open theoretical problem to
design constant-round algorithms
with worst-case optimal load. In the serial setting, I will describe the
optimizer of a prototype graph database
called Graphflow that we are building at the University of Waterloo. Graphflow's optimizer for subgraph
queries mixes worst-case
optimal join-style column-at-a-time processing seamlessly with traditional
binary joins.
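To illustrate the attribute-at-a-time idea on the simplest interesting case, here is a hedged Python sketch of triangle enumeration in the style of worst-case optimal joins (a generic illustration, not Graphflow's actual optimizer or code; the toy graph is an assumption). Rather than first materializing a binary join of two edge relations, each attribute is bound in turn, and the candidates for the last attribute come from intersecting adjacency sets.

```python
# Attribute-at-a-time evaluation of the triangle query R(a,b) ⋈ S(b,c) ⋈ T(c,a),
# in the style of worst-case optimal joins (illustrative; not Graphflow code).
adj = {            # toy undirected graph as symmetric adjacency sets
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}

def triangles(adj):
    """Enumerate each triangle once by binding a < b < c one attribute at a time."""
    out = []
    for a in adj:                        # bind attribute a
        for b in adj[a]:                 # bind attribute b from a's neighbors
            if b <= a:
                continue
            # Bind attribute c by intersecting the neighbor sets of a and b,
            # instead of materializing the full binary join on b first.
            for c in adj[a] & adj[b]:
                if c > b:
                    out.append((a, b, c))
    return sorted(out)

print(triangles(adj))  # -> [(1, 2, 3), (1, 3, 4)]
```

The intersection when binding c is the step that keeps the work proportional to the smaller adjacency list rather than to the size of an intermediate binary join, which is the intuition behind the worst-case optimality guarantee.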
Bio: Semih Salihoglu is an
Assistant Professor at the University of Waterloo. His research focuses on graph databases, distributed systems for processing graphs, and algorithms and theory for the distributed evaluation of database queries. He holds a PhD from Stanford University and is a recipient of the 2018 VLDB Best Paper Award.
——————————————————————————————————————————————————————————————————