
An Experimental Study of SYCL Task Graph Parallelism for Large-Scale Machine Learning Workloads

EasyChair Preprint 6531

12 pages
Date: September 1, 2021

Abstract

Task graph parallelism has emerged as an important tool for efficiently executing large machine learning workloads on GPUs. Users describe a GPU workload as a single task dependency graph rather than as separately submitted GPU operations and dependencies, allowing the runtime to perform whole-graph scheduling optimizations that significantly improve performance. While the new CUDA Graph execution model has demonstrated significant success on this front, the counterpart for SYCL, a general-purpose heterogeneous programming model using standard C++, remains nascent. Unlike CUDA Graph, the SYCL runtime leverages out-of-order queues to implicitly create a task execution graph induced by data dependencies. For explicit task dependencies, users are responsible for creating SYCL events and synchronizing them, at a non-negligible cost. Furthermore, there is no specialized graph execution model that allows users to offload a task graph directly onto a SYCL device the way CUDA Graph does. This paper conducts an experimental study of SYCL's default task graph parallelism by comparing it with CUDA Graph on large-scale machine learning workloads from the recent HPEC Graph Challenge. Our results highlight the need for a new SYCL graph execution model in the standard.
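To make the two dependency mechanisms concrete, below is a minimal SYCL 2020 sketch (illustrative only, not code from the paper; the buffer names, sizes, and kernels are assumptions). The first half shows the implicit task graph that an out-of-order queue derives from buffer-accessor data dependencies; the second half shows the explicit sycl::event bookkeeping required when no accessors are present, e.g., with USM device pointers.

#include <sycl/sycl.hpp>
#include <vector>

int main() {
  // A SYCL queue is out-of-order by default: the runtime infers the task
  // graph from data dependencies between buffer accessors.
  sycl::queue q;
  constexpr size_t N = 1024;  // illustrative size
  std::vector<float> host(N, 0.0f);

  {
    sycl::buffer<float> buf(host.data(), sycl::range<1>(N));

    // Task A writes buf.
    q.submit([&](sycl::handler& h) {
      sycl::accessor a(buf, h, sycl::write_only);
      h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) { a[i] = float(i[0]); });
    });

    // Task B reads and writes buf; the read-after-write hazard makes the
    // runtime schedule B after A with no user-managed synchronization.
    q.submit([&](sycl::handler& h) {
      sycl::accessor a(buf, h, sycl::read_write);
      h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) { a[i] *= 2.0f; });
    });
  }  // the buffer destructor copies the result back into host

  // With USM device pointers there are no accessors, so each dependency
  // edge must be expressed explicitly via sycl::event and depends_on().
  float* p = sycl::malloc_device<float>(N, q);
  sycl::event e1 = q.parallel_for(sycl::range<1>(N),
                                  [=](sycl::id<1> i) { p[i] = float(i[0]); });
  q.submit([&](sycl::handler& h) {
     h.depends_on(e1);  // explicit edge: run only after e1 completes
     h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) { p[i] += 1.0f; });
   }).wait();
  sycl::free(p, q);
  return 0;
}

The second pattern is the per-edge event synchronization whose cost the abstract calls non-negligible. CUDA Graph, by contrast, lets users instantiate a captured graph once and launch it repeatedly as a single unit, which is the kind of graph execution model the paper argues SYCL should adopt.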

Keyphrases: CUDA Graph, SYCL, task graph parallelism

BibTeX entry
BibTeX does not have a dedicated entry type for preprints; the following @booklet entry is a workaround that produces a correct reference:
@booklet{EasyChair:6531,
  author       = {Cheng-Hsiang Chiu and Dian-Lun Lin and Tsung-Wei Huang},
  title        = {An Experimental Study of SYCL Task Graph Parallelism for Large-Scale Machine Learning Workloads},
  howpublished = {EasyChair Preprint 6531},
  year         = {2021}}