SYCL is a single-source programming model for heterogeneous systems that promises improved maintainability, productivity, and opportunity for compiler optimization compared to programming models like OpenCL and CUDA. Several implementations of the SYCL standard have been developed over the past few years, including several with backends that target contemporary accelerator languages such as OpenCL, CUDA, and HIP. These implementations vary widely in their support for specific features of the standard and in their performance. As SYCL grows in popularity, developers need to know how features are implemented across popular implementations in order to make sound design choices.
In this paper, we evaluate the existing SYCL implementations across a range of hardware and prominent SYCL features to understand SYCL’s performance portability. This work uses the newest SYCL benchmark suite (SYCL-Bench) to evaluate all four existing implementations, comparing support for language features across backends and highlighting those that are missing or perform poorly. We offer a detailed evaluation of the major SYCL parallel constructs in the context of a matrix multiplication benchmark. Our results show that basic kernel parallelism is the best choice for performance on current SYCL implementations, and they identify opportunities for improvement in several of the target SYCL runtimes.