Joseph John, Josh Milthorpe, Thomas Herault, George Bosilca
Future Generation Computer Systems
Publication year: 2024
Today, multi-GPU computing nodes are the mainstay of most high-performance computing systems. Despite notable progress in programmability, building an application that efficiently utilizes all the GPUs in a computing node remains a significant challenge, especially with the existing shared-memory and message-passing paradigms. In this context, the task-based dataflow programming model has emerged as an alternative for programming multi-GPU computing nodes.
Most task-based dataflow runtimes perform dynamic task mapping, assigning tasks to GPUs based on the current load; once a mapping has been established, however, tasks are not rebalanced even when an imbalance is detected. In this paper, we examine how automatic dynamic work sharing between GPUs within a compute node can improve application performance through better workload distribution. We demonstrate this performance improvement using a Block-Sparse GEneral Matrix Multiplication (BSpGEMM) benchmark. Although we use the PaRSEC task-based dataflow runtime as the vehicle for this research, the ideas discussed here are transferable to other task-based dataflow runtimes.
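To make the distinction concrete, the sketch below illustrates the idea of rebalancing an already-established mapping. It is not PaRSEC code: all names (gpu_queue_t, rebalance, the queue sizes, and the imbalance threshold) are hypothetical, and a real runtime would operate on ready-task queues under proper synchronization rather than plain counters.

/*
 * Minimal, self-contained sketch (not PaRSEC code) of dynamic work
 * sharing: per-GPU task queues are filled by the scheduler's initial
 * mapping, and when an imbalance is detected, surplus tasks migrate
 * from the most-loaded queue to the least-loaded one.
 */
#include <stdio.h>

#define NUM_GPUS 4

typedef struct {
    int pending;  /* tasks still waiting in this GPU's queue */
} gpu_queue_t;

/* Move tasks from the most-loaded to the least-loaded queue
 * whenever the gap between them exceeds a threshold. */
static void rebalance(gpu_queue_t q[], int n, int threshold)
{
    int max = 0, min = 0;
    for (int i = 1; i < n; i++) {
        if (q[i].pending > q[max].pending) max = i;
        if (q[i].pending < q[min].pending) min = i;
    }
    int gap = q[max].pending - q[min].pending;
    if (gap > threshold) {
        int moved = gap / 2;  /* share half of the surplus */
        q[max].pending -= moved;
        q[min].pending += moved;
        printf("moved %d tasks: GPU %d -> GPU %d\n", moved, max, min);
    }
}

int main(void)
{
    /* A hypothetical imbalanced mapping, as might arise from
     * load-based dynamic mapping of irregular block-sparse work. */
    gpu_queue_t q[NUM_GPUS] = { {40}, {8}, {12}, {10} };

    rebalance(q, NUM_GPUS, 4);

    for (int i = 0; i < NUM_GPUS; i++)
        printf("GPU %d: %d pending tasks\n", i, q[i].pending);
    return 0;
}

In a runtime without work sharing, the rebalance step never runs: the initial mapping is final, and a GPU that finishes its queue early sits idle while another remains overloaded.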