Josh Milthorpe
ACM SIGPLAN Workshop on X10
Publication year: 2015
X10 programs have achieved high efficiency on petascale clusters by making significant use of parallelism between places; however, there has been less focus on exploiting local parallelism within a place. This paper introduces a standard mechanism – foreach – for efficient local parallel iteration in X10, including support for worker-local data. Library code transforms parallel iteration into an efficient pattern of activities for execution by X10’s work-stealing runtime. Parallel reductions and worker-local data help to avoid unnecessary synchronization between worker threads. The foreach mechanism is compared with leading programming technologies for shared-memory parallelism using kernel codes from high-performance scientific applications. Experiments on a typical Intel multicore architecture show that X10 with foreach achieves parallel speedup comparable with OpenMP and TBB for several important patterns of iteration. foreach is composable with X10’s asynchronous partitioned global address space model, and therefore represents a step towards a parallel programming model that can express the full range of parallelism in modern high-performance computing systems.