This article analyzes the computational and communication overhead patterns in LLM serving systems under different parallelism strategies.
In this paper, the authors introduce a new work scheduler that improves both work efficiency and parallelism for Single-Source Shortest Path (SSSP) search.
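For context on the trade-off the scheduler targets: classic Dijkstra is work-efficient but serializes on its priority queue, which limits parallelism. A minimal baseline sketch (the graph and function name are mine, not from the paper):

```python
import heapq

def sssp(adj, src):
    # Classic Dijkstra: work-efficient, but the priority queue
    # serializes vertex extraction, limiting parallelism.
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry; a shorter path was found
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist
```

Parallel SSSP schedulers typically relax this strict priority order to expose more concurrent work, at the cost of some redundant relaxations.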
This paper shows that, by combining the execution statistics of standalone workloads with their fairness of execution when co-run with three representative microbenchmarks, one can obtain reasonably accurate performance predictions.
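One common way to quantify fairness under co-execution is to compare each workload's co-run slowdown against its standalone runtime; a minimal sketch of such a metric (the function name and min/max definition are illustrative assumptions, not necessarily the paper's exact formulation):

```python
def fairness(standalone, corun):
    """Per-workload slowdown and a min/max fairness score.

    standalone, corun: dicts mapping workload name -> runtime (seconds).
    A score of 1.0 means all workloads slow down equally; values below
    1.0 indicate uneven interference. (Illustrative metric only; the
    paper's definition may differ.)
    """
    slowdown = {w: corun[w] / standalone[w] for w in standalone}
    return slowdown, min(slowdown.values()) / max(slowdown.values())
```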
In this paper, the authors introduce AntMan, a system that accommodates the fluctuating resource demands of deep learning training jobs.
This paper introduces Gandiva, a new cluster scheduling framework that uses domain-specific knowledge to improve the latency and efficiency of training deep learning models on a GPU cluster.
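One of Gandiva's mechanisms is time-slicing jobs on shared GPUs. A toy round-robin sketch of that idea (function name and simplifications are mine; Gandiva additionally suspends jobs at mini-batch boundaries and uses migration and packing, none of which is modeled here):

```python
from collections import deque

def time_slice(jobs, gpus, rounds):
    """Toy round-robin time-slicing of DL jobs over a pool of GPUs.

    jobs: list of job names; gpus: number of GPUs; rounds: number of
    scheduling rounds. Returns, per round, which jobs ran.
    """
    queue = deque(jobs)
    schedule = []
    for _ in range(rounds):
        running = [queue[i] for i in range(min(gpus, len(queue)))]
        schedule.append(running)
        queue.rotate(-len(running))  # preempt the running jobs and requeue them
    return schedule
```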
This paper designs multiple experiments to uncover the rules of GPU kernel-level scheduling.