ATC'22 | Memory Harvesting in Multi-GPU Systems with Hierarchical Unified Virtual Memory

Abstract

Memory across GPUs is not utilized effectively when consolidating workloads with highly varying resource demands, because current memory management techniques were designed for individual GPUs rather than shared multi-GPU environments. This study introduces hierarchical unified virtual memory (HUVM), which provides the illusion of a larger virtual memory space for each GPU by incorporating the temporarily idle memory of neighbor GPUs, exploiting the fast interconnect that links modern GPUs. On top of HUVM, the authors design a new memory manager, called memHarvester, to effectively and efficiently harvest the temporarily available memory of neighbor GPUs. Across diverse consolidation scenarios with DNN training and graph analytics workloads, the experiments show up to 2.71× performance improvement over the prior approach in multi-GPU environments.

Introduction & Background

Multi-GPU servers are widely built to satisfy demands ranging from deep learning to graph applications. To reduce cost, these servers are typically shared by multiple workloads, yet the memory space across GPUs is not fully utilized.

Observation

Each workload has highly varying memory demands.

  • Deep learning: a larger batch size yields higher throughput but also a higher memory demand.
  • Graph processing: larger graphs require graph partitioning. GPUs support memory oversubscription, but workloads take 2x–64x longer to complete at 40% oversubscription, as the sketch after this list illustrates.
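To make the oversubscription cost concrete, here is a minimal CUDA sketch (not from the paper) that allocates more managed memory than the GPU physically has and touches every page, forcing the UVM driver to page data in and out on demand. The 140% ratio, launch configuration, and kernel are illustrative assumptions, and oversubscribing managed memory requires a Pascal-or-newer GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Touch every 4 KiB page so the UVM driver has to migrate it to the GPU on demand.
__global__ void touch_pages(char *buf, size_t n) {
    size_t start  = ((size_t)blockIdx.x * blockDim.x + threadIdx.x) * 4096;
    size_t stride = (size_t)gridDim.x * blockDim.x * 4096;
    for (size_t i = start; i < n; i += stride)
        buf[i] = 1;
}

int main() {
    size_t free_b, total_b;
    cudaMemGetInfo(&free_b, &total_b);

    // Oversubscribe: request ~140% of the GPU's physical memory (illustrative ratio).
    size_t n = (size_t)(total_b * 1.4);
    char *buf;
    cudaMallocManaged(&buf, n);          // UVM allocation larger than the GPU

    touch_pages<<<256, 256>>>(buf, n);   // demand paging evicts and fetches behind the scenes
    cudaDeviceSynchronize();
    printf("touched %zu bytes on a %zu-byte GPU\n", n, total_b);

    cudaFree(buf);
    return 0;
}
```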
Opportunity

NVLink provides a fast interconnect to neighbor GPUs.

Approach

We can harvest neighbor GPU memory. There are multiple data paths for exchanging data between the host and a device: as the figure below shows, data can move either over PCIe or through a neighbor GPU.

/posts/huvm/image.png
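A rough user-level illustration of these paths (not the authors' in-driver mechanism): the CUDA runtime exposes a host-to-GPU copy over PCIe and a GPU-to-GPU peer copy over NVLink, so data destined for GPU0 can also be staged through GPU1. Device indices, the 64 MiB size, and the omitted error handling are simplifying assumptions, and peer access must be supported between the two GPUs.

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t sz = 64 << 20;                 // 64 MiB, illustrative
    char *host, *dev0, *dev1;

    cudaMallocHost(&host, sz);                  // pinned host buffer

    cudaSetDevice(1);
    cudaMalloc(&dev1, sz);

    cudaSetDevice(0);
    cudaMalloc(&dev0, sz);
    cudaDeviceEnablePeerAccess(1, 0);           // let GPU0 reach GPU1 over NVLink

    // Path 1: host -> GPU0 over GPU0's PCIe lane.
    cudaMemcpyAsync(dev0, host, sz, cudaMemcpyHostToDevice);
    // Path 2: host -> GPU1 over GPU1's PCIe lane (data can be staged here).
    cudaMemcpyAsync(dev1, host, sz, cudaMemcpyHostToDevice);
    // Path 3: GPU1 -> GPU0 over NVLink (peer-to-peer copy), completing the
    // host -> GPU1 -> GPU0 route that HUVM exploits.
    cudaMemcpyPeerAsync(dev0, 0, dev1, 1, sz);

    cudaDeviceSynchronize();
    cudaFree(dev0);
    cudaSetDevice(1); cudaFree(dev1);
    cudaFreeHost(host);
    return 0;
}
```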

Goals of HUVM
  • Effective Harvesting
    • Harvest the small, temporarily available spare memory of neighbor GPUs
    • Reduce eviction/fetch latency by using that spare memory
  • Minimal Interference
    • Minimize the performance impact on workloads running on neighbor GPUs
  • Framework-agnostic
    • No modification of applications or frameworks

The path diversity is shown in the figure below, and the following subsections describe each technique in detail.

/posts/huvm/image-2.png

Pre-eviction

  • Effective Harvesting (space is reclaimed in advance, so when data must be fetched there is no need to evict first) & Framework-agnostic
  • Avoids PCIe contention (the eviction traffic makes use of the neighbor GPU's PCIe lane, keeping the local lane free)

/posts/huvm/image-3.png
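The following is a minimal user-level sketch of the pre-eviction idea, using managed memory and cudaMemPrefetchAsync to stand in for what memHarvester does inside the UVM driver; the device indices, the 2 MiB chunk size, and the buffer sizes are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

constexpr size_t CHUNK = 2u << 20;   // 2 MiB chunks, an assumed driver granularity

int main() {
    char *cold, *incoming;
    cudaMallocManaged(&cold,     64 * CHUNK);
    cudaMallocManaged(&incoming, 64 * CHUNK);

    cudaSetDevice(0);
    cudaMemPrefetchAsync(cold, 64 * CHUNK, 0);       // cold data starts out resident on GPU0
    cudaDeviceSynchronize();

    cudaStream_t evict_s, fetch_s;
    cudaStreamCreate(&evict_s);
    cudaStreamCreate(&fetch_s);

    // Pre-eviction: push cold chunks from GPU0 into the neighbor GPU's spare
    // memory over NVLink before GPU0 runs out of space, so a later on-demand
    // fault finds free memory and never waits for an eviction.
    cudaMemPrefetchAsync(cold, 64 * CHUNK, /*dstDevice=*/1, evict_s);

    // The freed space on GPU0 is filled from the host over GPU0's PCIe lane,
    // which the eviction traffic above never touched.
    cudaMemPrefetchAsync(incoming, 64 * CHUNK, /*dstDevice=*/0, fetch_s);

    cudaDeviceSynchronize();
    cudaStreamDestroy(evict_s);
    cudaStreamDestroy(fetch_s);
    cudaFree(cold);
    cudaFree(incoming);
    return 0;
}
```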

Large page eviction

Minimal Interference (the chunks that GPU0 placed on GPU1 are written back from GPU1 to the host at a large-page granularity, which reclaims GPU1's memory quickly and reduces the influence on GPU1's own workload) & Framework-agnostic

/posts/huvm/image-4.png
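A sketch of the batching effect, again with cudaMemPrefetchAsync standing in for the driver's write-back; the 2 MiB large-page size and the device indices are assumptions for illustration.

```cuda
#include <cuda_runtime.h>

constexpr size_t PAGE  = 4096;        // base UVM page
constexpr size_t LARGE = 2u << 20;    // assumed 2 MiB large page / chunk

int main() {
    char *harvested;
    cudaMallocManaged(&harvested, 32 * LARGE);
    cudaMemPrefetchAsync(harvested, 32 * LARGE, /*dstDevice=*/1);  // lives in GPU1's spare memory
    cudaDeviceSynchronize();

    // Naive reclaim: writing back 4 KiB pages one at a time issues hundreds of
    // small transfers on GPU1's PCIe lane and interferes with GPU1's own workload:
    //   for (size_t off = 0; off < LARGE; off += PAGE)
    //       cudaMemPrefetchAsync(harvested + off, PAGE, cudaCpuDeviceId);

    // Large-page eviction: write back a whole 2 MiB chunk in one batched transfer,
    // so GPU1's memory is reclaimed quickly and its PCIe lane is busy only briefly.
    cudaMemPrefetchAsync(harvested, LARGE, cudaCpuDeviceId);
    cudaDeviceSynchronize();

    cudaFree(harvested);
    return 0;
}
```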

Parallel fetch

Effective Harvesting (pages in a fault batch are fetched in parallel over both PCIe and NVLink) & Framework-agnostic

/posts/huvm/image-5.png
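A user-level sketch of fetching one fault batch over two links at once, assuming half of the needed chunks are on the host and the other half sit in GPU1's spare memory as a victim cache; the sizes, device indices, and two-stream split are illustrative.

```cuda
#include <cuda_runtime.h>

constexpr size_t CHUNK = 2u << 20;   // assumed 2 MiB chunk

int main() {
    char *on_host, *on_neighbor;
    cudaMallocManaged(&on_host,     32 * CHUNK);
    cudaMallocManaged(&on_neighbor, 32 * CHUNK);

    // Assume these chunks were evicted earlier into GPU1's spare memory,
    // so they now act as a victim cache for GPU0.
    cudaMemPrefetchAsync(on_neighbor, 32 * CHUNK, 1);
    cudaDeviceSynchronize();

    cudaSetDevice(0);
    cudaStream_t pcie_s, nvlink_s;
    cudaStreamCreate(&pcie_s);
    cudaStreamCreate(&nvlink_s);

    // Parallel fetch: service one fault batch over both links at the same time.
    cudaMemPrefetchAsync(on_host,     32 * CHUNK, 0, pcie_s);    // host -> GPU0 over PCIe
    cudaMemPrefetchAsync(on_neighbor, 32 * CHUNK, 0, nvlink_s);  // GPU1 -> GPU0 over NVLink

    cudaDeviceSynchronize();
    cudaStreamDestroy(pcie_s);
    cudaStreamDestroy(nvlink_s);
    cudaFree(on_host);
    cudaFree(on_neighbor);
    return 0;
}
```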

Multi-path prefetch

Effective Harvesting (proactive prefetching) & Minimal Interference (if GPU1's PCIe lane is contended, the prefetcher switches the data path to the PCIe lane between GPU0 and the host) & Framework-agnostic

/posts/huvm/image-7.png

/posts/huvm/image-6.png
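A sketch of the path selection, with a hypothetical neighbor_pcie_busy() predicate standing in for the contention tracking that memHarvester performs inside the driver; the staging route and sizes are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

constexpr size_t CHUNK = 2u << 20;   // assumed 2 MiB chunk

// Hypothetical placeholder: the real prefetcher tracks outstanding transfers
// on each PCIe lane; here the answer is hard-coded for illustration.
static bool neighbor_pcie_busy() { return false; }

int main() {
    char *next;                                   // data GPU0 is expected to need soon
    cudaMallocManaged(&next, 16 * CHUNK);         // currently resident on the host

    cudaSetDevice(0);
    cudaStream_t s;
    cudaStreamCreate(&s);

    if (!neighbor_pcie_busy()) {
        // Multi-path: stage the chunks into GPU1 first, using GPU1's otherwise
        // idle PCIe lane, then move them on to GPU0 over NVLink. GPU0's own
        // PCIe lane stays free for demand fetches.
        cudaMemPrefetchAsync(next, 16 * CHUNK, /*dstDevice=*/1, s);
        cudaMemPrefetchAsync(next, 16 * CHUNK, /*dstDevice=*/0, s);  // GPU1 -> GPU0 over NVLink
    } else {
        // GPU1's lane is contended: fall back to the direct host -> GPU0 path.
        cudaMemPrefetchAsync(next, 16 * CHUNK, /*dstDevice=*/0, s);
    }

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFree(next);
    return 0;
}
```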

Putting It All Together

Figure 4(a) shows the baseline, which uses none of the techniques.

In Figure 4(b), GPU0 reserves three pages so that A and B can be fetched directly rather than evicting X and Y first; at the same time, X and Y are pre-evicted to free up the reserved device memory.

Figure 4(c) shows the timeline of parallel fetch.

/posts/huvm/image-8.png

Figure 5 shows the multi-path prefetch.

/posts/huvm/image-9.png

Evaluation

Setup

  • NVIDIA UVM driver version 460.67
  • 4 x V100 with NVSwitch and NVLink 2.0
  • Benchmarks
    • cuGraph v21.12
    • PyTorch v1.10.1
  • Comparisons
    • The stock version of the unified virtual memory (Base)
    • The prior approach employing the pre-eviction and prefetch techniques for the host memory (Pre-ef-host)

Workload running scenarios

/posts/huvm/image-10.png

Results

Speedup

/posts/huvm/image-11.png

Effectiveness of individual techniques

  1. Spare memory harvesting (H) utilizes the spare memory as an eviction buffer and a victim cache to reduce the latency of migrating chunks by using NVLink rather than PCIe;
  2. Pre-eviction (PE) eliminates the eviction latency from the critical path by reducing on-demand page faults;
  3. Large page support (LP) reduces the time to make pages evictable by writing the chunks back to the host in batches;
  4. Parallel fetch (PLF) reduces the latency of handling on-demand page faults by fetching the pages in the fault batch in parallel with both PCIe and NVLink.
  5. Multi-path parallel prefetcher (MPF) drastically improves performance compared with the local prefetcher (LPF).

/posts/huvm/image-12.png

Sensitivity

/posts/huvm/image-13.png

Model Parallelism

A single training workload runs across multiple GPUs (pipeline parallelism).

/posts/huvm/image-14.png