PI: Vitek
Large-scale compute-cycle sharing for complex and large scientific applications is a long-standing challenge in distributed computing. Many cycle-sharing systems, such as SETI@Home, Distributed.Net and Entropia, have been developed and have delivered orders-of-magnitude improvements in available resources. Nevertheless, they are limited in applicability by (i) the lack of a resilient computational model, (ii) their inability to spawn subcomputations and control their lifecycle, and (iii) the lack of access and usage control for shared resources. Our goal is to develop technologies that will allow researchers in diverse disciplines to conduct large-scale computational experiments on an open distributed infrastructure. In particular, we will investigate a cluster computing platform with support for:
Fault Determination: We define faults broadly as situations that arise from unexpected changes in resource availability or in performance assumptions. This definition encompasses the traditional notion of program and node failures, as well as quality of service and other non-functional characteristics of distributed program behavior. Our emphasis is on an aspectual, declarative fault specification and recovery language called RESCUE. A distinctive feature of our approach is the ability for applications to define arbitrary predicates that describe fault conditions.
Fault Recovery: Taking corrective action upon detection of a fault condition often involves complex logic, such as state migration, computation offloading, compensation, or rollback. RESCUE allows end users to define recovery actions within a high-level language framework.
Customizable Recovery Semantics: In general, there is no single correct recovery strategy: almost every application has slightly different correctness criteria and performance characteristics. RESCUE provides a low-level interface to an application's compiler and execution environment to allow experts to define the semantics of application-specific recovery facilities.
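
To make the pairing of application-defined fault predicates with user-specified recovery actions concrete, the following is a minimal Python sketch. It is not RESCUE syntax, which this summary does not show; the names FaultRule, Monitor, heartbeat_age_s and throughput, as well as the thresholds, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical runtime observations for one node (illustrative only).
Metrics = Dict[str, float]


@dataclass
class FaultRule:
    """Pairs an application-defined fault predicate with a recovery action."""
    name: str
    predicate: Callable[[Metrics], bool]  # True when the fault condition holds
    recover: Callable[[Metrics], str]     # returns a description of the action taken


class Monitor:
    """Evaluates every registered rule against the current observations."""

    def __init__(self) -> None:
        self.rules: List[FaultRule] = []

    def register(self, rule: FaultRule) -> None:
        self.rules.append(rule)

    def check(self, metrics: Metrics) -> List[str]:
        actions = []
        for rule in self.rules:
            if rule.predicate(metrics):
                actions.append(f"{rule.name}: {rule.recover(metrics)}")
        return actions


monitor = Monitor()

# A traditional failure: the node has stopped sending heartbeats.
monitor.register(FaultRule(
    name="node_failure",
    predicate=lambda m: m.get("heartbeat_age_s", 0.0) > 30.0,
    recover=lambda m: "roll back to last checkpoint",
))

# A non-functional fault: the node is alive but too slow.
monitor.register(FaultRule(
    name="slow_node",
    predicate=lambda m: m.get("throughput", float("inf")) < 10.0,
    recover=lambda m: "migrate state to another node",
))

for action in monitor.check({"heartbeat_age_s": 45.0, "throughput": 2.0}):
    print(action)
```

In this sketch a fault is whatever the application's predicate says it is, so a quality-of-service violation is handled by the same mechanism as a crashed node; only the recovery action differs.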