The Database Systems Seminar
November 5, 2019 (Samuel Conte Distinguished Lecture Series):
Michael Stonebraker, Adjunct Professor of Computer Science, MIT
We Are Often Working on the Wrong Problem (10 misconceptions about what is important)
In the DBMS/Data Systems area, many of us seem to have lost our way. This talk discusses 10 different problem areas
in which there is considerable current research. Then, I present why I believe much of the work is misguided, either
because our assumptions about these problems are incorrect or because we are not paying attention to real users.
Topics considered include machine learning (deep and conventional), public blockchain, data warehouses, schema
evolution, and the cloud.
October 29, 2019 (Samuel Conte Distinguished Lecture Series):
Raghu Ramakrishnan, CTO for Data and Technical Fellow at Microsoft
Data in the Cloud
The cloud has forced a rethinking of database architectures. Does this offer an opportunity to address the siloed nature of data management systems? The question is especially important given the rise of machine learning and data governance. In this talk, I'll discuss these issues through the lens of the Microsoft data journey, both internal and external.
October 25, 2019:
Cyrus Shahabi, Professor and Chair, Department of Computer Science, Univ. of Southern California
Transportation Data, Applications & Research for Smart Cities
In this talk, I first introduce the Integrated Media Systems Center (IMSC), a data science research center at USC
that focuses on data-driven solutions for real-world applications. IMSC is motivated by the need to address
fundamental Data Science problems related to applications with major societal impact. Towards this end, I delve into
one specific application domain, Transportation, and discuss the design and development of a large-scale
transportation data platform and its application to real-world problems in Smart Cities. I will then cover some of
our fundamental research in this area, in particular: 1) traffic forecasting and 2) ride matching.
February 27, 2019:
Spyros Blanas, Professor, Department of Computer Science and Engineering, Ohio State University
Scaling Database Systems to High-performance Computers
We are witnessing the increasing use of warehouse-scale computers to analyze massive datasets quickly. This poses
two challenges for database systems. The first challenge is interoperability with established analytics libraries and
tools. Massive datasets often consist of images (arrays) in file formats like FITS and HDF5. We will first present
ArrayBridge, an open-source I/O library that allows SciDB, TensorFlow and HDF5-based programs to co-exist in a
pipeline without converting between file formats. The second challenge is scalability, as warehouse-scale computers
expose communication bottlenecks in foundational data processing operations. We will present GRASP, a parallel
aggregation algorithm for high-cardinality aggregation that avoids unscalable all-to-all communication and leverages
similarity to complete the aggregation faster than repartitioning. Finally, we will present an RDMA-aware data
shuffling algorithm that transmits data up to 4X faster than MPI. We conclude by highlighting additional challenges
that need to be overcome to scale database systems to massive computers.
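To make the communication bottleneck concrete, the baseline that GRASP is contrasted against can be sketched: hash-repartitioned aggregation, in which every node sends each tuple to the node that owns its key's hash bucket, an all-to-all exchange whose volume grows with the number of distinct groups. The sketch below is a single-process illustration of that baseline under my own naming, not the GRASP algorithm or code from the talk; per the abstract, GRASP avoids this exchange by exploiting similarity between the nodes' key sets.

```python
# Illustrative single-process model of hash-repartitioned aggregation,
# the all-to-all baseline that GRASP is designed to outperform on
# high-cardinality workloads. Names and structure are hypothetical.

def repartition_aggregate(node_data, num_nodes):
    """node_data: one list of (key, value) tuples per node.
    Returns (per-node aggregated dicts, count of cross-node messages)."""
    # Step 1: each node routes every tuple to the node hash(key) % num_nodes.
    inboxes = [[] for _ in range(num_nodes)]
    messages_sent = 0
    for sender, tuples in enumerate(node_data):
        for key, value in tuples:
            dest = hash(key) % num_nodes
            inboxes[dest].append((key, value))
            if dest != sender:
                messages_sent += 1  # would be network traffic in a cluster

    # Step 2: each node aggregates only the keys it owns (SUM here).
    results = []
    for inbox in inboxes:
        agg = {}
        for key, value in inbox:
            agg[key] = agg.get(key, 0) + value
        results.append(agg)
    return results, messages_sent
```

When most keys are distinct (high cardinality), Step 1 sends nearly every tuple across the network, which is the unscalable all-to-all pattern the abstract refers to.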
December 3, 2018:
Semih Salihoglu, Professor, Department of Computer Science, University of Waterloo
How can (worst-case optimal) joins be so interesting?
Worst-case optimality is perhaps the weakest notion of optimality for algorithms. A recent surprising theoretical
development in databases has been the realization that the traditional join algorithms, which are based on binary
joins, are not even worst-case optimal. Upon this realization, several surprisingly simple join algorithms have been
developed that are provably worst-case optimal. Unlike traditional algorithms, which join two tables at a time,
worst-case join algorithms perform the join one attribute (or column) at a time. This talk gives an overview of
several lines of work that my colleagues and I have been doing on worst-case join algorithms focusing on their
application to subgraph queries. I will cover work from both distributed and serial settings. In the distributed
setting, worst-case optimality is a yardstick for two costs of an algorithm: (i) the load, i.e., the amount of data per
machine, and (ii) the total communication. Both load and communication trade off against the number of rounds an
algorithm runs. I will describe how to achieve worst-case optimality in total communication and the
performance of this algorithm on subgraph queries. It is an open theoretical problem to design constant-round
algorithms with worst-case optimal load. In the serial setting, I will describe the optimizer of a prototype graph
database called Graphflow that we are building at the University of Waterloo. Graphflow's optimizer for subgraph queries
mixes worst-case optimal join-style column-at-a-time processing seamlessly with traditional binary joins.
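The attribute-at-a-time idea can be sketched on the classic triangle query Q(a,b,c) = R(a,b) ⋈ S(b,c) ⋈ T(a,c): rather than joining two relations first, a Generic Join-style algorithm fixes one attribute value at a time and intersects the candidate sets from every relation mentioning that attribute. The following minimal sketch is my own illustration of that style on a toy triangle query, not the Graphflow implementation from the talk.

```python
# Illustrative attribute-at-a-time (Generic Join style) evaluation of the
# triangle query R(a,b) JOIN S(b,c) JOIN T(a,c). Relations are sets of pairs.
# This toy code illustrates the column-at-a-time style only; it is not the
# Graphflow code discussed in the talk.
from collections import defaultdict

def triangle_join(R, S, T):
    """Return all (a, b, c) with (a,b) in R, (b,c) in S, (a,c) in T."""
    # Index each relation by its first attribute for fast set lookups.
    R_by_a, S_by_b, T_by_a = defaultdict(set), defaultdict(set), defaultdict(set)
    for a, b in R:
        R_by_a[a].add(b)
    for b, c in S:
        S_by_b[b].add(c)
    for a, c in T:
        T_by_a[a].add(c)

    results = []
    # Fix attribute a: it must appear in both R and T.
    for a in set(R_by_a) & set(T_by_a):
        # Fix attribute b: it must extend a via R and appear in S.
        for b in R_by_a[a] & set(S_by_b):
            # Fix attribute c: it must be consistent with both S and T.
            for c in S_by_b[b] & T_by_a[a]:
                results.append((a, b, c))
    return results

R = {(1, 2), (1, 3)}
S = {(2, 3), (3, 1)}
T = {(1, 3)}
print(sorted(triangle_join(R, S, T)))  # [(1, 2, 3)]
```

Because each step intersects candidate sets instead of materializing a potentially huge binary join of two relations, this style avoids the intermediate-result blowup that makes binary-join plans worse than worst-case optimal on such queries.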