Research Assistants: J.-C. Gomez, R. Pasquini
Sponsors: ONR, ARO, NSF, DoE
Conch is a topology-based message passing system for fail-safe
distributed computing on heterogeneous platforms. It supports a
variety of user-defined topologies, including bus, star, ring, mesh,
and tree. The system is able to detect process failure during
execution and make up-calls to the application. With appropriate
application intervention, startup of new processes, state
restoration, and execution resumption
are supported. Conch trees are used to enable high-efficiency
data-combining and reduction operations in EcliPSe simulations.
Thus, the Conch substrate is a key component of the EcliPSe
architecture and supports EcliPSe's fault tolerance. The Conch system is
connection-based: it uses TCP/IP for messaging and simple signal-based
threading to implement computation and communication subtasks.
All message routing is accomplished by communication threads. The
system currently supports computations on a heterogeneous mix of
SPARC (SunOS 4.x, SunOS 5.x), Sequent Symmetry, Intel i860,
Silicon Graphics IRIX, and IBM RS/6000 environments.
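
The data-combining role of a tree topology can be illustrated with a small sketch. It is not Conch code and does not use the Conch API, which is not shown here. The function names (build_binary_tree, tree_reduce), the binary tree shape, and the in-process recursion standing in for TCP/IP messages are all assumptions made for illustration. Each node combines its own value with the partial results of its children, so the root ends up with the reduced value and no node handles more than a few messages.

```python
from typing import Callable, Dict, List


def build_binary_tree(n: int) -> Dict[int, List[int]]:
    """Map each node id 0..n-1 to its children in a binary tree rooted at 0.

    Hypothetical layout: node i has children 2i+1 and 2i+2 when they exist.
    """
    return {i: [c for c in (2 * i + 1, 2 * i + 2) if c < n] for i in range(n)}


def tree_reduce(
    node: int,
    children: Dict[int, List[int]],
    local: Dict[int, float],
    combine: Callable[[float, float], float],
) -> float:
    """Return the combined value of the subtree rooted at `node`.

    In a real message passing system each child would send its partial
    result to its parent; here a recursive call stands in for that message.
    """
    result = local[node]
    for child in children[node]:
        result = combine(result, tree_reduce(child, children, local, combine))
    return result


if __name__ == "__main__":
    n = 7
    children = build_binary_tree(n)
    # Each simulated process contributes one local value (1 through 7).
    local = {i: float(i + 1) for i in range(n)}
    total = tree_reduce(0, children, local, lambda a, b: a + b)
    print("sum at root:", total)
```

With a balanced tree, the number of combining steps on the path to the root grows with the depth of the tree rather than with the number of processes, which is the usual reason to prefer a tree over a star for reductions.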