Parallel Simulation Infrastructure


Our research focuses on fail-safe, scalable computing on heterogeneous processor clusters that combine hardware multiprocessors (i860 hypercube, Intel Paragon, KSR) and workstations. Our efforts are embodied in the ACES system, which is motivated primarily by problems in interdisciplinary computing science:
  • physics (simulations of polymer systems),
  • chemistry (MD simulations, particle simulations), and
  • operations research (discrete-event simulations of large systems).

Major emphasis is placed on a supporting infrastructure for domain-specific object libraries, layered upon the Sol kernel for multithreaded parallel simulation. Working in close collaboration with physicists at Purdue and Emory, and scientists at Oak Ridge National Labs, we have demonstrated the success of the ACES model for domain layering in particle problems (scale-invariant phenomena), self-avoiding random walks, and stochastic multidimensional (d > 7) integration.

    ACES

    The most significant component of the ACES system is the versatile application interface. This interface exploits multithreading at the kernel level so as to allow the rapid conversion of a model into executable code (in the C or Fortran languages). A user is expected to be proficient in C or Fortran, but is not required to know parallel programming - no small advantage for computational scientists who cannot invest the amount of time required for learning to implement parallel executables for different problems.

    Major components of the ACES system include:

  • Facilities for replicative and parallel/distributed stochastic simulation,
  • Fault-tolerance for long-running distributed computations, and
  • Visual interaction with an application and its host (distributed) system.
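
    As a rough illustration of replicative stochastic simulation, the sketch below runs several independent replications of a Monte Carlo integration in eight dimensions and pools their estimates. It is not ACES code: the function names, the integrand, and the use of Python threads as stand-ins for processors are all assumptions made for the example.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, stdev

def replicate(seed, dim=8, samples=5000):
    """One independent replication (hypothetical helper): estimate the integral
    of f(x) = x_1 + ... + x_dim over the unit hypercube (exact value dim / 2)."""
    rng = random.Random(seed)              # private random stream per replication
    total = 0.0
    for _ in range(samples):
        total += sum(rng.random() for _ in range(dim))
    return total / samples

def run_replications(seeds):
    """Run the replications concurrently and pool their estimates."""
    with ThreadPoolExecutor() as pool:
        estimates = list(pool.map(replicate, seeds))
    # Standard error of the pooled mean across replications.
    return mean(estimates), stdev(estimates) / len(estimates) ** 0.5

if __name__ == "__main__":
    estimate, std_error = run_replications(range(8))
    print(f"estimate = {estimate:.4f} +/- {std_error:.4f} (exact value 4.0)")
```

    Because each replication owns its random stream and shares no state, the replications can be placed on separate processors without changing the pooled result.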

    DISplay

    To enable a user to interact with a distributed application - for the purposes of debugging or performance tuning, or for viewing graphical results or animations - we have constructed an application-independent DISplay visualization and user interaction interface, motivated largely by our work in Distributed Interactive Simulation. DISplay runs as a server, under the control of a connection server, for user interaction with sequential or parallel computations. Client calls from an application are made to the DISplay server, which uses a well-defined protocol to process requests and display "tasks" and "interaction dialogs". It connects to an X server for the necessary display and user interaction commands.

    For EcliPSe and ACES applications that execute on workstation networks and the Paragon mesh, large data-transfers between the application and the DISplay component warrant a network with high transfer rates. The Sun SPARCstation 20 Model 612 MP will provide us with a two-processor machine which can run the DISplay server and X server functions in parallel, improving our 2D and 3D visualization performance significantly. Further, the multiprocessor will allow us to explore a variety of multithreading strategies to overlap data-transfer and interactive computation for interactive simulation with Sol.
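
    One simple way to overlap data transfer with computation is a bounded buffer between a compute thread and a transfer thread. The sketch below is only an illustration of that idea under assumed names (`simulate`, `transfer`, `run_overlapped`); it does not use Sol or the DISplay protocol, and "sending" a frame is modeled as appending it to a list.

```python
import queue
import threading

def simulate(steps, outbox):
    """Compute thread: produce one frame per step; blocks only if the buffer is full."""
    state = 0
    for step in range(steps):
        state += step                      # stand-in for one simulation step
        outbox.put((step, state))          # hand the frame to the transfer thread
    outbox.put(None)                       # sentinel: no more frames

def transfer(outbox, received):
    """Transfer thread: drain frames and 'send' them (here, append to a list)."""
    while True:
        frame = outbox.get()
        if frame is None:
            break
        received.append(frame)

def run_overlapped(steps=5):
    outbox = queue.Queue(maxsize=2)        # bounded buffer between the two threads
    received = []
    sender = threading.Thread(target=transfer, args=(outbox, received))
    sender.start()
    simulate(steps, outbox)                # computation proceeds while frames drain
    sender.join()
    return received
```

    The buffer size is the tuning knob: a larger buffer lets the computation run further ahead of a slow display link, at the cost of staler frames.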


    Ongoing Work

    Our ongoing work also addresses alternative protocols, through which systems such as PVM and Conch can avail themselves of distributed services and algorithms that are built directly into protocol suites and that provide appropriate functionality at high performance. Our work will be based on multiparty session-level protocols, which represent a much-needed departure from traditional pairwise protocols that fall short of the requirements of distributed and concurrent computing.

    This protocol architecture is based on the notion of multiple entities forming a session or connection, with facilities for dynamic participant enrollment. The protocol will integrally support synchronization as well as communication functions: frequently needed primitives such as distributed mutual exclusion, barrier synchronization, and global operations will be built into the protocol.
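
    To make these primitives concrete, the following sketch models a session in a single process, with threads standing in for distributed participants. The `Session` class and its methods are hypothetical, not part of the proposed protocol, and the sketch simplifies in two ways: membership is frozen once `start()` is called, and each session supports a single global operation.

```python
import threading

class Session:
    """Toy multiparty session: enroll participants, then synchronize them."""

    def __init__(self):
        self._members = []
        self._lock = threading.Lock()      # mutual exclusion on shared state
        self._barrier = None
        self._total = 0

    def enroll(self, name):
        # Simplification: enrollment is only allowed before start().
        self._members.append(name)

    def start(self):
        self._barrier = threading.Barrier(len(self._members))

    def global_sum(self, value):
        """Every member contributes a value; all receive the same total."""
        with self._lock:
            self._total += value
        self._barrier.wait()               # returns once all contributions are in
        return self._total

def run_session(values):
    """Run one participant thread per (name, value) pair and collect results."""
    session = Session()
    for name in values:
        session.enroll(name)
    session.start()
    results = {}

    def participant(name, value):
        results[name] = session.global_sum(value)

    threads = [threading.Thread(target=participant, args=item)
               for item in values.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

    Calling `run_session({"a": 1, "b": 2, "c": 3})` gives every participant the same total, 6, which is the behavior a global operation built into the protocol would provide without application-level message passing.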

    In terms of data transfer itself, the proposed communication architecture and protocol will provide multiway data delivery facilities at varying (user-selectable) qualities of service, e.g. exactly once vs. at most once vs. best effort delivery. In addition, the protocol will contain mechanisms to provide for jitter-free, isochronous transmission, and quarantined delivery - features essential for image and multimedia data distribution.
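
    The difference between the three delivery qualities can be sketched with a toy sender and receiver. Everything here is an assumption made for illustration: the names (`QoS`, `Receiver`, `send`, `flaky`), the model of the network as a function returning how many copies arrive, and the use of retransmission plus duplicate suppression to approximate exactly-once delivery. Acknowledgements are assumed never to be lost, and retries are bounded.

```python
from enum import Enum

class QoS(Enum):
    BEST_EFFORT = "best effort"
    AT_MOST_ONCE = "at most once"
    EXACTLY_ONCE = "exactly once"

class Receiver:
    def __init__(self, qos):
        self.qos = qos
        self.seen = set()                  # sequence numbers already delivered
        self.delivered = []

    def on_packet(self, seq, payload):
        """Called for every copy the network hands over."""
        if self.qos is not QoS.BEST_EFFORT:
            if seq in self.seen:
                return                     # duplicate: suppress redelivery
            self.seen.add(seq)
        self.delivered.append(payload)

def send(receiver, seq, payload, network, max_tries=5):
    """network(seq, attempt) -> number of copies that arrive (0 means lost)."""
    tries = max_tries if receiver.qos is QoS.EXACTLY_ONCE else 1
    for attempt in range(tries):
        copies = network(seq, attempt)
        for _ in range(copies):
            receiver.on_packet(seq, payload)
        if copies > 0:
            return True                    # at least one copy got through
    return False

def flaky(seq, attempt):
    """Scripted network: message 0 is lost once, message 1 is duplicated."""
    if seq == 0:
        return 0 if attempt == 0 else 1
    return 2 if seq == 1 else 1

def deliver_all(qos, payloads, network=flaky):
    receiver = Receiver(qos)
    for seq, payload in enumerate(payloads):
        send(receiver, seq, payload, network)
    return receiver.delivered
```

    On the scripted network, sending `["a", "b", "c"]` delivers `["b", "b", "c"]` under best effort (one loss, one duplicate), `["b", "c"]` under at most once (the loss remains, the duplicate is suppressed), and `["a", "b", "c"]` under exactly once.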

    A particularly important and successful aspect of this work is semi-transparent fault-tolerance at the application level. Given the overall abstract scenario of multiple geographically distributed computing entities engaged in data storage, exchange, and transformation, our research seeks to provide a framework in which these activities can be completed in the presence of failures of interconnection channels or intermediate/terminal computing nodes.

    Our early approach is derived from our recent successes with heterogeneous source-level checkpoint-restart mechanisms - demonstrably useful in domain decomposition and distributed simulation applications with EcliPSe.
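
    The general idea of source-level checkpoint-restart can be sketched as follows. This is not the EcliPSe mechanism: the function names, the toy computation, and the choice of JSON as a machine-independent checkpoint format are assumptions made for the example.

```python
import json
import os
import tempfile

def save(path, state):
    """Write to a temporary file, then rename, so a crash never leaves a torn checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def run(path, steps, fail_at=None):
    """Accumulate 0 + 1 + ... + (steps - 1), checkpointing after every step."""
    state = {"step": 0, "total": 0}
    if os.path.exists(path):               # restart: resume from the saved state
        with open(path) as f:
            state = json.load(f)
    while state["step"] < steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        save(path, state)
    return state["total"]

if __name__ == "__main__":
    checkpoint = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    try:
        run(checkpoint, 10, fail_at=4)     # first attempt dies partway through
    except RuntimeError:
        pass
    print(run(checkpoint, 10))             # second attempt resumes and finishes
```

    Because the checkpoint stores program-level variables in a text format rather than a memory image, the restarted run could in principle execute on a machine with a different architecture, which is the property a heterogeneous cluster needs.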