MICRO'19 | NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs

Introduction

Definition
Code instrumentation refers to methods and techniques that add extra code to a computer program to collect, log, and monitor information about program execution.
Applications
Instrumentation is used for software profiling, optimization, testing, error detection, memory leak detection, and virtualization.

Static instrumentation

Characteristics

  • Adds extra code to the source code at compile time
  • Ranges from simple manual coding techniques to automated compiler- or assembler-based editing of instrumentation code
  • Requires full access to the source code and its build environment as the system needs to be recompiled
  • Executing the augmented system dumps report data
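
As a rough illustration of the static approach, here is a minimal sketch in which the counting code is written directly into the source of the function being measured. Python stands in for a compiled language, and the names `saxpy` and `call_counts` are made up for this example.

```python
from collections import Counter

# Hypothetical report data gathered by the instrumentation code.
call_counts = Counter()

def saxpy(a, xs, ys):
    # Instrumentation added by hand to the source, before the program is built.
    call_counts["saxpy"] += 1
    return [a * x + y for x, y in zip(xs, ys)]

result = saxpy(2.0, [1.0, 2.0], [10.0, 20.0])
saxpy(1.0, [0.0], [0.0])

# Running the augmented program produces the report data.
print(dict(call_counts))
```

The counter only exists where the source was edited, which is why code in a prebuilt external library stays invisible to this approach.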

Limitations

  • Increases both source code size and size of application binaries
  • Cannot instrument external libraries, modules, and subsystems that are linked to the application

Dynamic Code Insertion / Instrumentation

Characteristics

  • Enables injecting customized analysis routines into arbitrary locations within a system binary to record a wide variety of performance data. Alterations can be inserted while the system is running.
  • Instrumentation alterations can be focused on relevant parts or execution time frames so that highly accurate and focused statistics can be gathered.
  • Broadens the behavior information that can be collected, since library functions can be instrumented as well
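
The idea of attaching and removing an analysis routine while the program runs can be sketched in plain Python by swapping a library function for a wrapper. This is only an analogy for binary-level instrumentation; `attach`, `count_call`, and `stats` are hypothetical names chosen for this example.

```python
import math

stats = {"calls": 0}

def attach(module, name, on_call):
    """Swap module.name for a wrapper at run time; return a function that undoes it."""
    original = getattr(module, name)

    def wrapper(*args, **kwargs):
        on_call(name, args)              # injected analysis routine
        return original(*args, **kwargs)

    setattr(module, name, wrapper)
    return lambda: setattr(module, name, original)

def count_call(name, args):
    stats["calls"] += 1

def workload():
    # Calls a library function whose source we do not modify.
    return sum(math.sqrt(v) for v in (1.0, 4.0, 9.0))

before = workload()                      # not instrumented
detach = attach(math, "sqrt", count_call)
during = workload()                      # instrumented: three calls recorded
detach()
after = workload()                       # instrumentation removed again
```

The workload's result is unchanged in all three runs, and only the middle run contributes to `stats`, which mirrors focusing instrumentation on a chosen time frame.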

Limitations

  • Increases execution time of the instrumented applications, which may cause different system behavior
  • A “random” insertion of code into a binary can affect the flow of instructions through a processor pipeline, thus modifying the performance characteristics of the application
  • Analysis and instrumentation routines created with one tool are often incompatible with all others

NVBit

  • GPU architectures currently only have limited support for similar capabilities through static compile-time tools
  • This work presents NVBit, a fast, dynamic, and portable binary instrumentation framework

Design

/posts/nvbit/2023-04-15-23-59-26.png

/posts/nvbit/2023-04-15-23-59-49.png

/posts/nvbit/2023-04-15-23-58-04.png

User-level API

  • Callback API: triggered at particular events.
  • Inspection API:
    • Functions that retrieve the instructions of a CUfunction and its related CUfunctions.
    • The Instr class, which abstracts a machine-level SASS instruction.
  • Instrumentation API: function injection and arguments passing.
  • Control API: enable/disable running of the instrumented function and reset instrumentation.
  • Device API: read or write any register from injected functions.
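
To make the division of labor between these API groups concrete, here is a toy Python model of the flow: a launch callback inspects a function's instructions, injects a call before the memory instructions, and a control switch later disables the instrumented version. Every name here (`get_instrs`, `insert_call`, `enable_instrumented`, `on_kernel_launch`, `run`) is a hypothetical stand-in and not the real NVBit signature; the Device API is left out.

```python
from dataclasses import dataclass, field

@dataclass
class Instr:                           # stand-in for the Instr abstraction
    opcode: str
    injected: list = field(default_factory=list)

@dataclass
class Function:                        # stand-in for a CUfunction
    name: str
    instrs: list
    instrumented_enabled: bool = True

counters = {"mem_ops": 0}

def count_mem(instr):                  # function to be injected
    counters["mem_ops"] += 1

def get_instrs(func):                  # Inspection API (hypothetical name)
    return func.instrs

def insert_call(instr, fn):            # Instrumentation API (hypothetical name)
    instr.injected.append(fn)

def enable_instrumented(func, flag):   # Control API (hypothetical name)
    func.instrumented_enabled = flag

def on_kernel_launch(func):            # Callback API: runs on a launch event
    if any(i.injected for i in get_instrs(func)):
        return                         # already instrumented
    for instr in get_instrs(func):
        if instr.opcode.startswith(("LDG", "STG")):
            insert_call(instr, count_mem)

def run(func):                         # toy "GPU" executing a single thread
    on_kernel_launch(func)
    for instr in func.instrs:
        if func.instrumented_enabled:
            for fn in instr.injected:
                fn(instr)

kernel = Function("vecadd", [Instr("LDG.E"), Instr("LDG.E"), Instr("FADD"), Instr("STG.E")])
run(kernel)
first = counters["mem_ops"]
enable_instrumented(kernel, False)     # fall back to the original code
run(kernel)
second = counters["mem_ops"]
```

After the first run the counter reflects the three memory instructions; the second run leaves it unchanged because the instrumented version is disabled.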

Core Components

  • Driver Interposer
  • Tool Functions Loader
    • Responsible for loading all the device functions within the dynamic library of the NVBit tool itself
  • Hardware Abstraction Layer (HAL)
    • Handles the differences between hardware families and versions
  • Instruction Lifter
    • Disassembles the GPU binary into the instruction format exposed through the NVBit API
  • Code Loader/Unloader
    • At run-time, the user can decide to enable or disable instrumentation for a particular CUfunction.
  • Code Generator
    • Shown in the figure below

/posts/nvbit/2023-04-16-17-43-25.png

Evaluation

JIT-Compilation Overhead

Six parts of JIT-compilation overhead:

  1. retrieving the original GPU code
  2. disassembling the GPU program
  3. converting the binary into the format presented to the developer via the NVBit API
  4. executing the user code to inject instrumentation functions and arguments
  5. running the Code Generator to produce the final instrumented code
  6. swapping the original code with the instrumented code.

While the components (1), (2), (3) and (6) depend on the characteristics of the application, the components (4) and (5) depend on how much of the application is being instrumented.

The authors ran the SPEC ACCEL OpenACC benchmarks and injected an instruction-counting instrumentation function. The overhead shown in Figure 5 is acceptable; the paper discusses it in more detail.

/posts/nvbit/2023-04-17-23-11-44.png

In fact, most of the overhead comes from the body of the instrumentation function itself. The next subsection covers the details.

Reduce Overhead by Sampling

Observation: In some applications the same kernels are launched many times (e.g. matrix multiplications in deep learning).

Sample: Launch the instrumented version only once for each set of unique grid dimension values.
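
The sampling decision can be sketched as a small lookup keyed on the grid dimensions. Keying on the kernel name as well is an assumption made here, and `make_sampler` and `should_instrument` are hypothetical names.

```python
def make_sampler():
    seen = set()

    def should_instrument(kernel_name, grid_dim):
        """Return True only the first time this (kernel, grid) pair is launched."""
        key = (kernel_name, grid_dim)
        if key in seen:
            return False               # run the original, uninstrumented code
        seen.add(key)
        return True                    # run the instrumented code once

    return should_instrument

should_instrument = make_sampler()
launches = [
    ("gemm", (64, 1, 1)),
    ("gemm", (64, 1, 1)),
    ("gemm", (128, 1, 1)),
    ("relu", (64, 1, 1)),
    ("gemm", (64, 1, 1)),
]
decisions = [should_instrument(name, grid) for name, grid in launches]
```

Here only three of the five launches pay the instrumentation cost; the repeated `gemm` launches with a known grid run the original code.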

Instrumentation function: performs an analysis of all the instructions executed to construct a histogram of the Top-5 instructions.
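
On the host side, building the Top-5 histogram boils down to counting opcodes. A minimal sketch, with a made-up opcode trace standing in for what the injected function would record:

```python
from collections import Counter

def top_instructions(executed_opcodes, n=5):
    """Count executed opcodes and return the n most frequent ones."""
    histogram = Counter(executed_opcodes)
    return histogram.most_common(n)

# Made-up trace of executed opcodes, for illustration only.
trace = ["FFMA"] * 6 + ["LDG"] * 5 + ["IADD"] * 4 + ["STG"] * 3 + ["BRA"] * 2 + ["MOV"]
top5 = top_instructions(trace)
```

With sampling, this histogram is built only from the instrumented launches and then taken as representative of the skipped ones.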

/posts/nvbit/2023-04-17-23-26-14.png

Sampling achieves high efficiency with low error. The error comes from the dynamic execution path of the code (i.e. branches and loops that depend on input data).

Dynamic instrumentation is necessary

Instrumentation function: Address divergence analysis

In the figure, Green ≈ Dynamic instrumentation and Orange ≈ Static instrumentation.

In some scenarios, static instrumentation is inaccurate.
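
A small sketch of why run-time information matters here, assuming address divergence is measured as the number of distinct cache lines touched by one warp-level memory instruction. The 128-byte line size, the 32-thread warp, and all names are assumptions for illustration.

```python
CACHE_LINE_BYTES = 128                 # assumed cache-line size
WARP_SIZE = 32                         # assumed number of threads per warp

def unique_cache_lines(addresses, line_bytes=CACHE_LINE_BYTES):
    """Number of distinct cache lines touched by one warp-level access."""
    return len({addr // line_bytes for addr in addresses})

base = 0x1000

# Address patterns that are visible in the code itself.
coalesced = [base + 4 * lane for lane in range(WARP_SIZE)]
strided = [base + 128 * lane for lane in range(WARP_SIZE)]

# Indirect access: the addresses depend on index values only known at run time.
indices = [7, 7, 300, 12] * 8
indirect = [base + 4 * i for i in indices]

divergence = {
    "coalesced": unique_cache_lines(coalesced),
    "strided": unique_cache_lines(strided),
    "indirect": unique_cache_lines(indirect),
}
```

For the indirect case the result depends entirely on the contents of `indices`, so an analysis that never observes the actual addresses has to guess.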

/posts/nvbit/2023-04-17-23-29-00.png