Zhang wins ACM SIGOPS Thesis Award
Assistant Professor Yongle Zhang
Congratulations to Assistant Professor Yongle Zhang of the Department of Computer Science at Purdue University on winning the 2022 Dennis M. Ritchie Doctoral Dissertation Award.
Zhang’s thesis is titled “Automatic Failure Diagnosis for Distributed Systems.” He was advised by Associate Professor Ding Yuan of the Department of Electrical and Computer Engineering at the University of Toronto.
The Dennis M. Ritchie Doctoral Dissertation Award was created in 2013 by ACM SIGOPS to recognize research in software systems and to encourage the creativity that Dennis Ritchie embodied, providing a reminder of Ritchie’s legacy and what a difference one person can make in the field of software systems research. The Award committee for 2022 consists of Irene Zhang (chair), James Bornholt, Haibo Chen, and Alain Tchana.
Automatic Failure Diagnosis for Distributed Systems
Author: Yongle Zhang
Advisor: Ding Yuan
University of Toronto, Department of Electrical and Computer Engineering
Distributed software systems have become the backbone of Internet services, and failures in production distributed systems have severe consequences. A 63-minute outage of Amazon in 2018 caused a $100 million loss in revenue. Diagnosing such failures is therefore critical, because it can reduce service downtime and the associated cost. However, failure diagnosis at data center scale is notoriously difficult because these systems are complex: numerous threads, processes, and nodes communicate concurrently. Despite decades of effort dedicated to automated failure diagnosis, existing techniques are either intrusive and incur non-negligible performance overhead in a production environment, or face scalability challenges when applied to complex software systems. This dissertation aims to automate the human diagnosis procedure for distributed system failures. It makes two main contributions towards improving automated failure diagnosis techniques.

The first contribution is a technique that automatically locates the root cause in a failed distributed system execution. Identifying the root cause in a failed execution with billions of executed instructions is like finding a needle in a haystack. This dissertation designs and evaluates a tool, called Kairux, that pinpoints the root cause of a failure in a distributed system in a fully automated way.

The second contribution is a technique that automatically reproduces failures from production distributed systems. Given a failure report, the first step of a developer’s diagnosis is typically to reproduce the failure. To automate this step, this dissertation designs and evaluates a technique, called Pensieve, that uses log analysis and program analysis to mimic a developer’s analysis of the chain of causally dependent events that lead to the failure.
This dissertation provides the implementation of a practical tool capable of reconstructing near-minimal failure reproduction steps from log files and system bytecode, without human involvement. Through an evaluation on some of the most complex real-world failures from widely deployed distributed systems such as HBase, HDFS, and ZooKeeper, this dissertation shows that Pensieve can formulate a minimal set of operations necessary to reproduce each failure, and that Kairux can then pinpoint each failure’s root cause.