Many PGAS languages and libraries rely on high performance transport layers such as GASNet and MPI to achieve low communication latency, portability and scalability. As systems increase in scale, failures are expected to become normal events rather than exceptions. Unfortunately, GASNet and standard MPI do not provide fault tolerance capabilities. This limitation hinders PGAS languages and other high-level programming models from supporting resilience at scale. For this reason, Resilient X10 has previously been supported over sockets only, not over MPI.
This paper describes the use of a fault tolerant MPI implementation, called ULFM (User Level Failure Mitigation), as a transport layer for Resilient X10. By providing fault tolerant collective and agreement algorithms, on demand failure propagation, and support for InfiniBand, ULFM provides the required infrastructure to create a high performance transport layer for Resilient X10.
We show that replacing X10’s emulated collectives with ULFM’s blocking collectives results in significant performance improvements. For three iterative SPMD-style applications running on 1000 X10 places, the improvement ranged between 31% and 60%. The per-step overhead for resilience was less than 10%.
A proposal for adding ULFM to the coming MPI-4 standard is currently under assessment by the MPI Forum. Our results show that adding user-level fault tolerance support in MPI makes it a suitable base for resilience in high-level programming models.