MICRO'21 | NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs

2023-04-14

Introduction

Definition

Code instrumentation refers to methods and techniques that add extra code to a computer program to collect, log, and monitor information about program execution.

Applications

Instrumentation is used for software profiling, optimization, testing, error detection, memory leak detection, and virtualization.

Static Instrumentation

Characteristics

Adds extra code to the source code at compile time
Ranges from simple manual coding techniques to automated compiler- or assembler-based instrumentation
Requires full access to the source code and its build environment, because the system must be recompiled
Running the instrumented system produces the report data
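
As a minimal sketch of the manual end of that range, the Python snippet below hand-inserts counting probes into a function's source. The function `saxpy` and the counter names are hypothetical and chosen only for illustration; Python has no separate compile step, so this stands in for the general idea of source-level instrumentation rather than for any specific tool.

```python
from collections import Counter

# Instrumentation state added to the source by hand (hypothetical names).
event_counts = Counter()


def saxpy(a, xs, ys):
    """Compute a*x + y element-wise, with probes inserted into the source."""
    event_counts["saxpy.calls"] += 1            # inserted probe
    out = []
    for x, y in zip(xs, ys):
        event_counts["saxpy.iterations"] += 1   # inserted probe
        out.append(a * x + y)
    return out


result = saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])

# "Running the instrumented system produces the report data":
report = dict(event_counts)
print(report)
```

The probes live in the source itself, so removing or changing them means editing and rebuilding the program.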

Limitations

Increases both source code size and size of application binaries
Cannot instrument external libraries, modules, and subsystems that are linked to the application

Dynamic Code Insertion / Instrumentation

Characteristics

Enables injecting customized analysis routines into arbitrary locations within a system binary to record a wide variety of performance data. Alterations can be inserted while the system is running.
Instrumentation alterations can be focused on relevant parts or execution time frames so that highly accurate and focused statistics can be gathered.
Broadens the behavior information that can be collected, because library functions can be instrumented as well
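
The following sketch illustrates the idea by analogy only: it attaches a counting wrapper to a library function (`math.sqrt`) while the program is running and detaches it again, without touching the calling code. Real dynamic binary instrumentation rewrites machine code instead of rebinding a Python attribute; the helper `attach` and the function `hypotenuse` are hypothetical names.

```python
import math
from collections import Counter

call_counts = Counter()


def attach(module, name):
    """Swap module.<name> for a counting wrapper; return a function that undoes it."""
    original = getattr(module, name)

    def wrapper(*args, **kwargs):
        call_counts[name] += 1          # analysis routine
        return original(*args, **kwargs)

    setattr(module, name, wrapper)

    def detach():
        setattr(module, name, original)

    return detach


def hypotenuse(a, b):
    # Unmodified "application" code that calls into a library.
    return math.sqrt(a * a + b * b)


before = hypotenuse(3, 4)        # runs uninstrumented
detach = attach(math, "sqrt")    # instrument while the program is running
during = hypotenuse(3, 4)        # this call is counted
detach()                         # focus instrumentation on one time frame only
after = hypotenuse(3, 4)         # runs uninstrumented again
```

Only the middle call is recorded, which mirrors restricting instrumentation to a relevant execution time frame.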

Limitations

Increases execution time of the instrumented applications, which may cause different system behavior
A “random” insertion of code into a binary can affect the flow of instructions through a processor pipeline, thus modifying the performance characteristics of the application
Analysis and instrumentation routines written for one tool are often incompatible with other tools

NVBit

GPU architectures currently only have limited support for similar capabilities through static compile-time tools
This work presents NVBit, a fast, dynamic, and portable binary instrumentation framework

Design

User-level API

Callback API: triggered at particular events.
Inspection API:
- Functions that retrieve the instructions of a CUfunction and its related CUfunctions.
- The Instr class, which abstracts a machine-level SASS instruction.
Instrumentation API: function injection and arguments passing.
Control API: enable/disable running of the instrumented function and reset instrumentation.
Device API: read or write any register from injected functions.
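
To make the division of labor between these API groups more concrete, here is a toy Python model of how a tool might use them together. It is not NVBit's actual API: every class, method, and opcode list below (`ToyFramework`, `on_launch`, `get_instrs`, `insert_call`, `enable_instrumented`) is a hypothetical stand-in, and the Device API (register access from injected functions) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    """Stand-in for an abstract machine instruction."""
    opcode: str
    injected: list = field(default_factory=list)


@dataclass
class Kernel:
    name: str
    instrs: list
    instrumented_enabled: bool = True


class ToyFramework:
    def __init__(self):
        self.launch_callbacks = []

    def on_launch(self, fn):                      # Callback API
        self.launch_callbacks.append(fn)

    def get_instrs(self, kernel):                 # Inspection API
        return kernel.instrs

    def insert_call(self, instr, fn, *args):      # Instrumentation API
        instr.injected.append((fn, args))

    def enable_instrumented(self, kernel, flag):  # Control API
        kernel.instrumented_enabled = flag

    def launch(self, kernel):
        for callback in self.launch_callbacks:
            callback(self, kernel)
        for instr in kernel.instrs:
            if kernel.instrumented_enabled:
                for fn, args in instr.injected:
                    fn(*args)
            # The original instruction would execute here.


# --- A tool built on the toy framework: count executed instructions ---
counter = {"instrs": 0}
already_instrumented = set()


def count(n):
    counter["instrs"] += n


def tool_callback(fw, kernel):
    if kernel.name in already_instrumented:       # instrument each kernel once
        return
    already_instrumented.add(kernel.name)
    for instr in fw.get_instrs(kernel):
        fw.insert_call(instr, count, 1)


fw = ToyFramework()
fw.on_launch(tool_callback)
kernel = Kernel("vecadd", [Instr("LDG"), Instr("LDG"), Instr("FADD"), Instr("STG")])

history = []
fw.launch(kernel)                        # instrumented run
history.append(counter["instrs"])
fw.enable_instrumented(kernel, False)
fw.launch(kernel)                        # original code only
history.append(counter["instrs"])
fw.enable_instrumented(kernel, True)
fw.launch(kernel)                        # instrumented again
history.append(counter["instrs"])
```

The counter only advances on launches where the instrumented version is enabled, which is the behavior the Control API is described as providing.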

Core Components

Driver Interposer
Tool Functions Loader
- Responsible for loading all the device functions within the dynamic library of the NVBit tool itself
Hardware Abstraction Layer (HAL)
- Handles the differences between hardware families and versions
Instruction Lifter
- Disassembles the GPU binary and converts it into the form exposed through the NVBit API
Code Loader/Unloader
- At run-time, the user can decide to enable or disable instrumentation for a particular CUfunction.
Code Generator
- Produces the final instrumented code (illustrated by a figure in the paper)

Evaluation

JIT-Compilation Overhead

Six parts of JIT-compilation overhead:

(1) retrieving the original GPU code
(2) disassembling the GPU program
(3) converting the binary into the format presented to the developer via the NVBit API
(4) executing the user code to inject instrumentation functions and arguments
(5) running the Code Generator to produce the final instrumented code
(6) swapping the original code with the instrumented code

While the components (1), (2), (3) and (6) depend on the characteristics of the application, the components (4) and (5) depend on how much of the application is being instrumented.

The authors ran the SPEC ACCEL OpenACC benchmarks with an instrumentation function that counts executed instructions. The JIT-compilation overhead shown in Figure 5 is acceptable; the paper discusses it in more detail. In fact, most of the total overhead comes from executing the body of the instrumentation function, which the next subsection addresses.

Reduce Overhead by Sampling

Observation: Some kernels repeat many times in some applications (e.g. some matrix multiplication in DL).

Sample: Launch the instrumented version only once for each set of unique grid dimension values.

Instrumentation function: performs an analysis of all the instructions executed to construct a histogram of the Top-5 instructions.

Sampling achieves high efficiency with low error. The remaining error comes from dynamic, input-dependent execution paths (i.e., branches and loops that depend on input data), which can differ between launches that share the same grid dimensions.
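
The sampling scheme can be sketched as follows. This is a simulation under stated assumptions, not the authors' implementation: the launch list, kernel names, and opcode counts are made up, and reusing the cached histogram for skipped launches is an assumed way of extrapolating.

```python
from collections import Counter


def profile(launches, sample):
    """Accumulate an opcode histogram over kernel launches.

    With sample=True, only the first launch of each (kernel, grid) pair runs
    instrumented; later launches reuse that cached histogram.
    """
    total = Counter()
    cache = {}
    instrumented_runs = 0
    for name, grid, true_hist in launches:
        key = (name, grid)
        if sample and key in cache:
            total.update(cache[key])      # native run, extrapolated from the sample
            continue
        instrumented_runs += 1            # instrumented run observes the real counts
        cache[key] = Counter(true_hist)
        total.update(true_hist)
    return total, instrumented_runs


# Hypothetical workload: one kernel repeats three times with the same grid,
# but its third launch takes a slightly different data-dependent path.
gemm = {"FFMA": 100, "LDG": 20, "STS": 10, "IADD": 8, "BRA": 4}
gemm_longer_loop = {"FFMA": 100, "LDG": 20, "STS": 10, "IADD": 10, "BRA": 6}
relu = {"LDG": 8, "FMNMX": 8, "STG": 8}
launches = [
    ("gemm", (64, 1, 1), gemm),
    ("gemm", (64, 1, 1), gemm),
    ("gemm", (64, 1, 1), gemm_longer_loop),
    ("relu", (32, 1, 1), relu),
]

full, full_runs = profile(launches, sample=False)
sampled, sampled_runs = profile(launches, sample=True)

top5_full = [op for op, _ in full.most_common(5)]
top5_sampled = [op for op, _ in sampled.most_common(5)]

ops = set(full) | set(sampled)
relative_error = sum(abs(full[op] - sampled[op]) for op in ops) / sum(full.values())
```

In this made-up workload the sampled run instruments two launches instead of four and still yields the same Top-5 opcodes; the small nonzero error comes entirely from the launch whose loop ran longer.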

Dynamic Instrumentation Is Necessary

Instrumentation function: Address divergence analysis

In the figure, Green ≈ Dynamic instrumentation and Orange ≈ Static instrumentation.

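In some scenarios, static instrumentation is inaccurate. This is consistent with the limitation noted earlier: compile-time instrumentation cannot cover external libraries linked into the application.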
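
For intuition about what an address divergence analysis measures, the sketch below assumes the metric is the number of distinct cache lines touched by one warp-level memory instruction. The 128-byte line size, the 32-thread warp, and the example address patterns are assumptions made for illustration.

```python
WARP_SIZE = 32          # threads per warp (assumed)
LINE_BYTES = 128        # cache line size in bytes (assumed)


def unique_cache_lines(addresses, line_bytes=LINE_BYTES):
    """Number of distinct cache lines touched by one warp-level memory access."""
    return len({addr // line_bytes for addr in addresses})


base = 0x1000  # aligned to LINE_BYTES

# Each thread reads a consecutive 4-byte element: one line serves the whole warp.
coalesced = [base + 4 * t for t in range(WARP_SIZE)]

# Each thread reads from its own line: fully divergent.
strided = [base + LINE_BYTES * t for t in range(WARP_SIZE)]

# Gather through an index array whose contents are only known at run time.
indices = [0, 1, 2, 3, 64, 65, 66, 67] * 4
gathered = [base + 4 * i for i in indices]

divergence = {
    "coalesced": unique_cache_lines(coalesced),
    "strided": unique_cache_lines(strided),
    "gathered": unique_cache_lines(gathered),
}
print(divergence)
```

The gather case depends on the values stored in `indices`, so its divergence can only be measured from the addresses actually issued at run time.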