Fault-Tolerant Computing in Large-Scale Distributed Systems

Principal Investigator: Kihong Park

Achieving various forms of fault tolerance in large-scale distributed systems, the Internet being a prime example, has become an important problem. Using a relaxed definition of faults, one that includes temporary server congestion and other transient adverse conditions, we seek to design distributed control algorithms that are resilient under certain patterns of worst-case faults called k-sparse errors. Sparsity has a recursive definition which says, roughly, that bad events are well separated in space-time; the space-time loci where this condition is violated can be covered by small balls which are, in turn, well separated in space-time at the next scale, and so on.

Of intense theoretical interest is the question of whether simple distributed algorithms can be designed that remain fault-tolerant in the presence of k-sparse errors, locally containing and correcting faults up to level k. Technically, this involves proving ergodicity and bounding the mixing rate of both finite and infinite systems.

On the practical side, the goal is to implement local algorithms on the Internet for dynamic server allocation that can tolerate k-sparse worst-case server downtimes or congestion, for fault-tolerant routing and information access, and for intrusion detection and the containment of intrusion spread in computer security, among other applications.
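To make the recursive sparsity condition concrete, the following minimal Python sketch checks one possible reading of k-sparseness: faults are covered by radius-r balls, well-separated ball centers count as contained at the current level, and the crowded remainder must form a (k-1)-sparse pattern at the next, coarser scale. The sup-norm metric on (site, time) pairs, the base radius r, and the separation factor sep are illustrative assumptions, not parameters fixed by the project.

```python
def dist(p, q):
    """Sup-norm distance between space-time points (site, time)."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def cover_greedy(points, radius):
    """Greedily pick ball centers so every point lies within
    `radius` of some center."""
    centers = []
    for p in points:
        if all(dist(p, c) > radius for c in centers):
            centers.append(p)
    return centers

def is_k_sparse(points, k, r, sep):
    """Recursive sparsity check: cover the faults with radius-r balls;
    centers closer than sep*r to another center are 'crowded' and must
    themselves be (k-1)-sparse at the coarser scale sep*r."""
    if not points:
        return True
    if k == 0:
        return False  # faults remain but no containment levels are left
    centers = cover_greedy(points, r)
    crowded = [c for c in centers
               if any(dist(c, d) < sep * r for d in centers if d is not c)]
    return is_k_sparse(crowded, k - 1, sep * r, sep)

# Two nearby fault bursts violate separation at the base scale but
# merge into a single well-separated ball one level up.
burst = [(0, 0), (5, 0)]
print(is_k_sparse(burst, k=1, r=2, sep=8))  # False
print(is_k_sparse(burst, k=2, r=2, sep=8))  # True
```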
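The local containment and correction the theoretical question asks about can be illustrated, again only as a toy analogue rather than the project's actual algorithm, by a synchronous majority-vote repair rule on a ring of nodes: isolated (sparse) faults are erased in a single local step, while dense fault clusters survive, which is precisely why a hierarchical, multi-scale sparsity condition is needed.

```python
def majority_repair_step(state):
    """One synchronous step: each node adopts the majority value of
    itself and its two ring neighbors (states are 0/1 bits)."""
    n = len(state)
    return [1 if state[(i - 1) % n] + state[i] + state[(i + 1) % n] >= 2
            else 0
            for i in range(n)]

# An isolated fault pattern is corrected locally in one step...
sparse = [0] * 12
sparse[3] = 1
sparse[9] = 1
print(majority_repair_step(sparse))  # back to all zeros

# ...whereas a dense burst is immune to the purely local rule.
burst = [0] * 12
burst[3] = burst[4] = burst[5] = 1
print(majority_repair_step(burst))   # the cluster persists
```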