Parallel Simulation Infrastructure
The most significant component of the ACES system is the versatile application interface. This interface exploits multithreading at the kernel level to allow the rapid conversion of a model into executable code (in C or Fortran). A user is expected to be proficient in C or Fortran, but is not required to know parallel programming - no small advantage for computational scientists who cannot invest the time needed to learn how to implement parallel executables for different problems.
Major components of the ACES system include:
To enable a user to interact with a distributed application - for the purposes of debugging or performance tuning, or for viewing graphical results or animations - we have constructed an application-independent DISplay visualization and user interaction interface, motivated largely by our work in Distributed Interactive Simulation. DISplay runs as a server, under the control of a connection server, for user interaction with sequential or parallel computations. Client calls from an application are made to the DISplay server, which uses a well-defined protocol to process requests and display "tasks" and "interaction dialogs". It connects to an X server for the necessary display and user interaction commands.
For EcliPSe and ACES applications that execute on workstation networks and the Paragon mesh, large data transfers between the application and the DISplay component warrant a network with high transfer rates. The Sun SPARCstation 20 Model 612 MP will provide us with a two-processor machine that can run the DISplay server and X server functions in parallel, improving our 2D and 3D visualization performance significantly. Further, the multiprocessor will allow us to explore a variety of multithreading strategies that overlap data transfer with computation during interactive simulation with Sol.
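One way to picture the overlap of data transfer and computation described above is a two-thread pipeline joined by a bounded buffer. The Python sketch below is illustrative only: the names `simulate`, `transfer`, and `run_overlapped`, the frame contents, and the buffer depth are assumptions made for this example, and the "transfer" simply appends to a list rather than talking to a DISplay or X server.

```python
import queue
import threading


def simulate(steps, frames):
    """Producer: compute one frame per step and hand it off to the sender thread."""
    for step in range(steps):
        frame = [step * i for i in range(4)]  # stand-in for one step of simulation output
        frames.put((step, frame))             # blocks only when the buffer is full
    frames.put(None)                          # sentinel: no more frames


def transfer(frames, delivered):
    """Consumer: drain frames; a real system would send each one to a display server."""
    while True:
        item = frames.get()
        if item is None:
            break
        delivered.append(item)


def run_overlapped(steps=5, depth=2):
    """Run computation and transfer concurrently and return the frames in arrival order."""
    frames = queue.Queue(maxsize=depth)  # bounded buffer gives backpressure
    delivered = []
    sender = threading.Thread(target=transfer, args=(frames, delivered))
    sender.start()
    simulate(steps, frames)
    sender.join()
    return delivered
```

The bounded queue is the design choice of interest here: it lets the simulation run ahead of the transfer by at most `depth` frames, so a slow display path throttles the computation instead of exhausting memory.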
Our ongoing work also addresses alternative protocols, where systems such as PVM and Conch make use of distributed services and algorithms that are built directly into protocol suites providing appropriate functionality at high performance. Our work will be based on multiparty session-level protocols, a much-needed departure from traditional pairwise protocols, which fall short of meeting the requirements of distributed and concurrent computing.
This protocol architecture is based on the notion of multiple entities forming a session or connection, with facilities for dynamic participant enrollment. The protocol will integrally support synchronization as well as communication functions: oft-needed primitives such as distributed mutual exclusion, barrier synchronization, and global operations will be built into the protocol.
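To make the idea of a session with built-in synchronization concrete, the following Python sketch models participants as threads inside one process. It is not the proposed protocol: the `Session` class, its `enroll` and `all_reduce` methods, and the `run_round` helper are hypothetical names introduced here, enrollment is assumed to happen between rounds, and distributed mutual exclusion is left out for brevity.

```python
import threading


class Session:
    """Hypothetical multiparty session: members enroll, then synchronize as a group."""

    def __init__(self):
        self._cond = threading.Condition()
        self._members = set()
        self._contributions = []
        self._generation = 0
        self._result = None

    def enroll(self, name):
        """Dynamic enrollment: add a participant (assumed to occur between rounds)."""
        with self._cond:
            self._members.add(name)

    def all_reduce(self, value, op=sum):
        """Barrier plus global operation: block until every enrolled member has
        contributed, then return op(all contributions) to each caller."""
        with self._cond:
            generation = self._generation
            self._contributions.append(value)
            if len(self._contributions) == len(self._members):
                # Last arrival completes the round and releases the waiters.
                self._result = op(self._contributions)
                self._contributions = []
                self._generation += 1
                self._cond.notify_all()
            else:
                while generation == self._generation:
                    self._cond.wait()
            return self._result


def run_round(session, values):
    """Run one collective round with one thread per (name, value) pair."""
    results = {}

    def worker(name, value):
        results[name] = session.all_reduce(value)

    threads = [threading.Thread(target=worker, args=item) for item in values.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In this sketch, a round with three enrolled members contributing 1, 2, and 3 returns 6 to each of them; after a fourth member enrolls and contributes 4, the next round returns 10 to all four. Folding the barrier and the reduction into a single call mirrors the point made in the text that synchronization and communication are supported together rather than layered separately.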
In terms of data transfer itself, the proposed communication architecture and protocol will provide multiway data delivery facilities at varying (user-selectable) qualities of service, e.g., exactly-once, at-most-once, or best-effort delivery. In addition, the protocol will contain mechanisms to provide for jitter-free, isochronous transmission, and quarantined delivery - features essential for image and multimedia data distribution.
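The difference between the three delivery qualities can be shown with a small simulation. The Python sketch below is an assumption-laden illustration, not the proposed protocol: `QoS`, `ScriptedChannel`, `Receiver`, `send`, and `deliver` are hypothetical names, the channel follows a scripted sequence of outcomes instead of failing at random, and "exactly once" holds only while the sender's retry budget outlasts the losses. Isochronous and quarantined delivery are not modeled.

```python
from enum import Enum


class QoS(Enum):
    BEST_EFFORT = "best effort"
    AT_MOST_ONCE = "at most once"
    EXACTLY_ONCE = "exactly once"


class ScriptedChannel:
    """Test channel: each transmission follows the next scripted outcome."""

    # outcome -> (copies that reach the receiver, whether an ack reaches the sender)
    OUTCOMES = {"drop": (0, False), "ok": (1, True), "dup": (2, True), "ack_lost": (1, False)}

    def __init__(self, script):
        self._script = list(script)

    def transmit(self, receiver, msg_id, payload):
        outcome = self._script.pop(0) if self._script else "ok"
        copies, acked = self.OUTCOMES[outcome]
        for _ in range(copies):
            receiver.on_packet(msg_id, payload)
        return acked


class Receiver:
    def __init__(self, qos):
        self.qos = qos
        self.delivered = []
        self._seen = set()

    def on_packet(self, msg_id, payload):
        if self.qos is not QoS.BEST_EFFORT:
            if msg_id in self._seen:
                return  # duplicate suppressed by message id
            self._seen.add(msg_id)
        self.delivered.append(payload)


def send(channel, receiver, msg_id, payload, max_tries=5):
    """Retransmit until acknowledged only for EXACTLY_ONCE; otherwise send a single time."""
    tries = max_tries if receiver.qos is QoS.EXACTLY_ONCE else 1
    for _ in range(tries):
        if channel.transmit(receiver, msg_id, payload):
            return True
    return False


def deliver(qos, script, payload="x"):
    """Send one message over a scripted channel and return what the application saw."""
    receiver = Receiver(qos)
    send(ScriptedChannel(script), receiver, msg_id=1, payload=payload)
    return receiver.delivered
```

Under these assumptions, a duplicating channel hands the application two copies with best effort but one copy with at most once, a dropped packet is lost under both, and exactly once survives a drop followed by a lost acknowledgement because the sender retransmits while the receiver discards the repeat.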
A particularly important and successful aspect of this work is semi-transparent fault-tolerance at the application level. Given the overall abstract scenario of multiple geographically distributed computing entities engaged in data storage, exchange, and transformation, our research seeks to provide a framework in which these activities can be completed in the presence of failures of interconnection channels or intermediate/terminal computing nodes.
Our early approach is derived from our recent successes with heterogeneous source-level checkpoint-restart mechanisms - demonstrably useful in domain decomposition and distributed simulation applications with EcliPSe.
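As a rough illustration of source-level checkpoint-restart, the Python sketch below has the application itself save its state in a machine-independent text form at iteration boundaries and resume from the last saved state after a failure. It does not reproduce the EcliPSe mechanism: the functions `save_checkpoint`, `load_checkpoint`, and `run`, the JSON format, and the running-sum workload are all assumptions chosen for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state as JSON to a temporary file, then rename it into place so that
    a crash during the write never leaves a partial checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, fail_at=None):
    """Accumulate 0 + 1 + ... + (total_steps - 1), checkpointing after every step.
    If fail_at is given, simulate a node failure just before that step."""
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
    return state["total"]
```

Calling `run(path, 6, fail_at=3)` raises after three completed steps; a second call `run(path, 6)` on the same path resumes from step 3 and returns 15, the same result as an uninterrupted run. Because the state is written as text by the application and not as a memory image, the restart could in principle take place on a machine with a different architecture, which is the sense of "heterogeneous" assumed here.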