Below are some of the research projects actively being
pursued in my group. Besides these, there are a number of other
currently less developed ideas my students and I are seriously exploring, broadly in the
areas of programming language design, program analysis, type systems,
protocol and specification inference, testing, and verification.
Specification, Verification, and Implementation of Weak Consistency and Isolation.
Geo-distributed web applications often favor high availability over strong consistency. In response to this bias, modern-day replicated data stores often eschew sequential consistency in favor of weaker eventual consistency data semantics. While most operations supported by a typical web application can be engineered, with sufficient care, to function under EC, there are oftentimes critical operations that require stronger consistency guarantees. Unfortunately, existing approaches to tunable consistency suffers from poorly-defined implementation-specific semantics, often expressed at a level of abstraction far removed from the application. Programming modern-day distributed systems is thus a complex error-prone endeavor.
We have embarked on a number of initiatives in this space that explore new logical formalisms, language design abstractions, and program verification techniques
to help specify, define, and reason about applications deployed in these environments. For example, MRDTs are a new programming abstraction that allows the synthesis of any OCaml datatype into a distributed object through the use of three-way merge functions supplied with the type. Q9 is a programming and verification environment for reasoning about distributed OCaml applications that adopts symbolic execution and bounded model checking methods to validate the correctness of geo-replicated OCaml applications with respect to user-specified safety properties. Quelea uses axiomatically-defined contracts to realize a declarative programming model for eventually consistent data stores that abstracts the actual implementation of the data store via high-level programming and system-level models that are agnostic to a specific implementation. By doing so, Quelea frees application programmers from having to reason about their application in terms of low-level implementation specific data store semantics. Instead, programmers can now reason in terms of an abstract model of the data store, and develop applications by defining and composing high-level replicated data types. Acidifier is a rely-guarantee proof system that verifies the correctness of database applications under weak isolation semantics, a semantics that admits behaviors weaker than serializability. CLOTHO is an automated white-box testing framework for weakly-consistent distributed database applications. Atropos is a refactoring tool that automatically transforms database schema to a more efficient representation suitable for weakly-consistent storage systems. We have also explored parameterized verification of classical distributed data structures such as CRDTs using novel proof techniques that are amenable for verification using automated theorem provers, and have also considered new specification and verification methods that allow transplanting algorithms and implementations developed for shared-memory concurrency to a gepo-replicated distributed environment.
Data-Driven Specification Inference.
In this line of work, we examine how data, in the form of inputs, traces,
observations, and samples, can be used to capture salient aspects of
a system. The systems we have considered include sophisticated functional
programs that make use of complex heap-allocated data structures, loop
intensive C programs, and machine learning pipelines. In all of these
instances, we have demonstrated how data-driven techniques can be used
to discover deep specifications that are nonetheless amenable to
In the software space, our motivation centers around specification
inference. Having precise specifications that describe a program's intended
behavior is a critical pre-requisite to ensure reliabiity,
extensibiity, and maintainability of modern-day software. Generating
these specifications manually is a challenging, often unsuccessful,
exercise; unfortunately, existing static and dynamic analysis
techniques often produce poor quality specifications that are
ineffective in aiding program verification tasks to guarantee a
program behaves as expected.
We have pursued a recent line of work on automated synthesis of
specifications that overcome many of the deficiencies that plague
existing specification inference methods. Our main contribution is a
formulation of the problem as a sample driven one, in which
specifications, represented as terms in a decidable refinement type
representation, are discovered from observing a program's sample runs
in terms of either program execution paths or input-output values, and
automatically verified through the use of expressive refinement type
Our approach is realized as a series of inductive synthesis
frameworks, which use various logic-based or classification-based
learning algorithms to provide sound and precise machine-checked
specifications. Experimental results indicate that the learning
algorithms are both efficient and effective, capable of automatically
producing sophisticated specifications in nontrivial hypothesis
domains over a range of complex real-world programs, going well beyond
the capabilities of existing solutions.
In related work, we have generalized these approaches to automatic
invariant discovery that solves verification conditions described
within a CHC system. Using a combination of learning algorithms (support
vector machines, decision tree learning, etc.), our solver is able to
operate without having to constrain the search space from which an
unknown predicate is drawn, or the shape and number of coefficients or
variables in a presumed invariant. Integrating the solver within a
CEGAR-style verification loop leads to an implementation that outperforms
the state-of-the-art on challenging benchmark suites.
We have also applied similiar decision-tree learning techniques to solve
multi-abduction specification inference tasks. Here, we assume client
programs interacting with black-box library methods. Our goal is to verify
the client with respect to a safety property. We apply data-driven learning
to postulate meaningful specifications that are both consistent with observed
tests, but not overfitted to them.
Trustworthy Machine Learning Systems.
In the machine learning space, we have considered verification of
reinforcement learning (RL) systems where the safety properties of
interest are embedded in a specfication, typically given as a set of
differential equations or expressed using some form of formal logic such
as Linear Temporal Logic (LTL), and in which the neural network implementing
the policy is viewed as a black-box. We have considered verification
approaches that use the network's input-output behavior
to help construct a safe (deterministic) synthesized program. By
framing the problem as an instance of inductive synthesis, we have
demonstrated the ability to automatically generate provably sound
non-intrusive safety shields that guarantee that the networks trained
against an RL policy will never enter an unsafe state. We have also
considered fully automated correct-by-construction training methods
that integrate abstraction and refinement procedures directly within
the training procedure.
In recent work, We have extended these ideas to consider issues of
robustness, adversarial learning, and applied our techniques to
complex problems in multi-agent environments, robotics path-planning,
and cyber-physical systems, in both model-based and model-free
settings. Our techniques marry techniques from control theory (e.g.,
Lyupanov methods), program verification, neuro-symbolic reasoning, and
multi-agent composition techniques.
Program Synthesis, Analyses and Type Systems.
In recent work, we have explored new approaches to component-based
deductive synthesis techniques that uses Hoare-style pre- and post-conditions
affixed to library methods to guide a specification-aware synthesis
search procedure, We've also shown to equip a conflict-driven learning
procedure with information that characterizes the nature of these specifications
to enable greater scalability and efficiency. Our tool, called Cobalt, has
been shown to be highly effective across a number of different domains both
with respect to expressivity and scalability.
We have developed a compositional and lightweight
dependent type inference system for ML capable of inferring useful
safety properties of ML programs that go beyond the kinds of
properties expressible using classical Hindley-Milner polymorphic
type inference. Our technique infers refinements, requires no
programmer annotations or guidance, is context-sensitive, and
is not based on a whole-program analysis.
We are also looking at a variety of different path-sensitive
static analysis techniques to enable effective automatic mining of
specifications, and to facilitate more effective program
optimizations. Our work on specification inference is in the
context of C and Java, while our current focus on program optimizations
is centered around developing scalable path sensitive analyses for the
SSA intermediate representation in MLton.
Verified Compilation and Reasoning for Concurrency.
With colleagues at Cambridge, INRIA, and MPI-SWS, we have built a
compiler for ClightTSO, a C-like language for expressing shared-memory
concurrency above x86 processors. Our approach defines a TSO (total store
order) relaxed-memory semantics for ClightTSO that exposes the processor's
memory model for expressing high-performance (potentially racy) code. We
have implemented a verifying compiler for the language as an extension to
Compcert. We have recently extended this work to consider an alternative
definition of the Java Memory Model that is more closely aligned with the
semantics of hardware memory models such as TSO and Power.
These efforts have also considered the verification of modern runtime services like
garbage collectors found in managed high-level languages like Java.
Along with colleagues at Cambridge, and Microsoft Research, we have also
explored formal specification and verification techniques for deterministic
parallelism. We define a high-level specification that expresses the
necessary conditions for correct execution, and examine ways to tie this
high-level specification to low-level implementations. We are particularly
interested in modular reasoning technique that allow us to reason about
program and library correctness without breaking abstraction boundaries. We
have recently applied techniques like frame inference and abduction to
build program analyses that automatically inject synchronization barriers to
enforce determinism in data parallel applications.
Abstractions, Compilation, and Runtime Support for
High-Level Concurrent Programming.
We are working on a number of topics related to the design and
implementation of scalable concurent and parallel functional
programs. Our research centers around extensions to
message-passing dialects of ML (such as CML), software transactions and
transactional events, and associated program analyses. We
are also working on runtime infrastructure to build lightweight and
parallel and concurrent garbage collectors for ML-like languages.
Much of our implementation effort centers around Multi-MLton, a multicore
aware extension of the MLton whole-program optimizing compiler for Standard ML.
In addition, we have an active effort exploring compilation techniques
and runtime support for safe futures, a lightweight annotation for
expressing deterministic parallelism; we 've studied the behavior and
analysis of safe futures for both ML as well as Java.
for Debugging and Testing Concurrent Programs.
Debugging and testing shared memory concurrent
programs is difficult because of complex non-deterministic interactions
introduced by scheduler-induced thread interleavings. We have
explored a number of different dynamic analyses to enable efficent
debugging, replay, and regression testing of large concurrent C and Java
programs. Our techniques involve analyzing core dumps and
execution traces, aligning correct and faulty executions to identify
salient sources of divergence that can be used to identify potential
Programming for Graph-Structured Applications.
We investigate linguistic extensions to map/reduce
abstractions for programming large-scale distributed systems, with
special focus on applications that manipulate large, unstructured
graphs. We target real-world graph analysis tasks found in comparative
analysis of biological networks as an important case study.
We address the following specific questions: (i) how can highly
unstructured graph-based formalisms be cast in the map/reduce
framework? (ii) how effectively can these specifications leverage
existing map/reduce infrastructure? (iii) how can these abstractions
and their execution environments be enhanced to provide the semantic
expressiveness necessary for programmability and scalable performance?
(iv) how can these analysis tasks be integrated into comprehensive
scientific resources usable by the wider applications community?
Answers to these questions entail exploration of linguistic extensions
to existing map/reduce abstractions, defining new implementations on
wide-area multicore/SMP platforms, and crafting an expressive graph
analysis toolkit suitable for realistic deployment in important domains
such as systems biology.