MICRO'19 | NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs
Introduction
Static instrumentation
Characteristics
- Add extra code to the source code at compile time
- Range from simple manual coding techniques to automated compiler or assembler-based instrumentation code editing
- Requires full access to the source code and its build environment as the system needs to be recompiled
- Executing the augmented system dumps report data
Limitations
- Increases both source code size and size of application binaries
- Cannot instrument external libraries, modules, and subsystems that are linked to the application
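As a rough illustration of the "simple manual coding" end of static instrumentation, the sketch below adds a counter by hand to the source of a function and dumps the report after the run. The names `saxpy`, `call_counts`, and `dump_report` are hypothetical and only serve the example.

```python
from collections import Counter

# Hypothetical report store filled in by the hand-inserted instrumentation.
call_counts = Counter()

def saxpy(a, xs, ys):
    call_counts["saxpy"] += 1  # instrumentation line added to the source itself
    return [a * x + y for x, y in zip(xs, ys)]

def dump_report():
    """Return the report data gathered by the augmented program."""
    return dict(call_counts)

result = saxpy(2.0, [1.0, 2.0], [10.0, 20.0])
saxpy(1.0, [0.0], [0.0])
report = dump_report()
```

The counter only exists because the source of `saxpy` was edited; code that is linked in without source (for example a prebuilt library) cannot be covered this way, which is the limitation listed above.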
Dynamic Code Insertion / Instrumentation
Characteristics
- Enables injecting customized analysis routines into arbitrary locations within a system binary to record a wide variety of performance data. Alterations can be inserted while the system is running.
- Instrumentation alterations can be focused on relevant parts or execution time frames so that highly accurate and focused statistics can be gathered.
- Increases the breadth of behavior information that can be collected, e.g. it also covers library functions
Limitations
- Increases execution time of the instrumented applications, which may cause different system behavior
- A “random” insertion of code into a binary can affect the flow of instructions through a processor pipeline, thus modifying the performance characteristics of the application
- Analysis and instrumentation routines created with one tool are often incompatible with all others
- GPU architectures currently only have limited support for similar capabilities through static compile-time tools
- This work presents NVBit, a fast, dynamic, and portable binary instrumentation framework
Design
User-level API
- Callback API: triggered at particular events.
- Inspection API:
- functions used to retrieve instructions and related CUfunctions.
- Class Instr used to abstract machine level SASS instruction.
- Instrumentation API: function injection and arguments passing.
- Control API: enable/disable running of the instrumented function and reset instrumentation.
- Device API: read or write any register from injected functions.
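The way these API groups fit together can be sketched with a toy model. This is plain Python, not NVBit's actual API: `Instr`, `Function`, `get_instrs`, `insert_call`, `on_kernel_launch`, and `run` are hypothetical stand-ins for the inspection, instrumentation, callback, and control roles described above.

```python
from dataclasses import dataclass, field

@dataclass
class Instr:
    """Toy stand-in for an abstracted machine instruction."""
    opcode: str
    injected: list = field(default_factory=list)  # (callable, args) pairs run before it

@dataclass
class Function:
    """Toy stand-in for a GPU function made of instructions."""
    name: str
    instrs: list
    instrumented_enabled: bool = True  # control role: toggle the instrumented version

counters = {"executed": 0}

def count_instr(n):
    """Injected analysis routine: count executed instructions."""
    counters["executed"] += n

def get_instrs(func):
    """Inspection role: return the instructions of a function."""
    return func.instrs

def insert_call(instr, fn, *args):
    """Instrumentation role: inject a call with arguments before an instruction."""
    instr.injected.append((fn, args))

def on_kernel_launch(func):
    """Callback role: triggered at an event, here a kernel launch."""
    for instr in get_instrs(func):
        insert_call(instr, count_instr, 1)

def run(func):
    """Pretend to execute the function, calling injected routines if enabled."""
    for instr in func.instrs:
        if func.instrumented_enabled:
            for fn, args in instr.injected:
                fn(*args)
        # the original instruction would execute here

kernel = Function("vec_add", [Instr("LDG"), Instr("LDG"), Instr("FADD"), Instr("STG")])
on_kernel_launch(kernel)
run(kernel)
count_with_instrumentation = counters["executed"]

kernel.instrumented_enabled = False  # disable the instrumented version
run(kernel)
count_after_disable = counters["executed"]
```

With instrumentation enabled the counter advances once per instruction; after disabling it, a second run leaves the counter unchanged.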
Core Components
- Driver Interposer
- Tool Functions Loader
- Responsible for loading all the device functions within the dynamic library of the NVBit tool itself
- Hardware Abstraction Layer (HAL)
- Handles the different hardware family versions
- Instruction Lifter
- Disassemble
- Code Loader/Unloader
- At run-time, the user can decide to enable or disable instrumentation for a particular CUfunction.
- Code Generator
- shown in the below figure
Evaluation
JIT-Compilation Overhead
Six parts of JIT-compilation overhead:
- (1) retrieving the original GPU code
- (2) disassembling the GPU program
- (3) converting the binary into the format presented to the developer via the NVBit API
- (4) executing the user code to inject instrumentation functions and arguments
- (5) running the Code Generator to produce the final instrumented code
- (6) swapping the original code with the instrumented code
While the components (1), (2), (3) and (6) depend on the characteristics of the application, the components (4) and (5) depend on how much of the application is being instrumented.
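One way to read this split is as a simple cost model with a part that is always paid and a part that grows with the instrumented fraction. The sketch below assumes a linear model; the function name and the per-instruction cost parameters are hypothetical and not taken from the paper.

```python
def jit_overhead(num_instrs, instrumented_fraction, fixed_cost, instrument_cost):
    """Toy linear model of JIT-compilation overhead (arbitrary time units).

    fixed_cost:      assumed per-instruction cost of components (1), (2), (3), (6)
    instrument_cost: assumed per-instruction cost of components (4), (5)
    """
    application_part = num_instrs * fixed_cost
    instrumentation_part = num_instrs * instrumented_fraction * instrument_cost
    return application_part + instrumentation_part

# Same application, instrumented fully vs. not at all (made-up cost values).
full = jit_overhead(1000, 1.0, fixed_cost=2.0, instrument_cost=5.0)
none = jit_overhead(1000, 0.0, fixed_cost=2.0, instrument_cost=5.0)
```

Under this assumption, instrumenting less of the application only shrinks the second term; the first term is paid regardless.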
The authors used the OpenACC SPEC ACCEL benchmarks and injected an instruction-counting instrumentation function. The overhead shown in Figure 5 is acceptable. More discussion can be found in the paper.
In fact, most of the overhead comes from the body of the instrumentation. The details can be found in the next subsection.
Reduce Overhead by Sampling
Observation: In some applications, the same kernels are launched many times (e.g. matrix multiplications in deep learning).
Sample: Launch the instrumented version only once for each set of unique grid dimension values.
Instrumentation function: performs an analysis of all the instructions executed to construct a histogram of the Top-5 instructions.
Sampling achieves high efficiency with low error. The error comes from the dynamic path of code execution (i.e. branches and loops that depend on input data).
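The sampling policy above can be sketched as follows. This is a toy Python model, not NVBit code: `launch`, `instrumented_run`, and the opcode trace are hypothetical, and a kernel launch is reduced to a list of opcodes.

```python
from collections import Counter

seen_grid_dims = set()   # (kernel name, grid dimensions) pairs already sampled
histogram = Counter()    # opcode counts gathered by the instrumented runs

def instrumented_run(opcodes):
    """Stand-in for the instrumented kernel: record every executed opcode."""
    histogram.update(opcodes)

def launch(kernel_name, grid_dim, opcodes):
    """Run the instrumented version only once per unique grid dimension."""
    key = (kernel_name, grid_dim)
    if key not in seen_grid_dims:
        seen_grid_dims.add(key)
        instrumented_run(opcodes)
        return "instrumented"
    return "original"  # later launches with the same grid run uninstrumented

trace = ["FFMA", "FFMA", "LDG", "FFMA", "STG"]
modes = [launch("gemm", (64, 1, 1), trace) for _ in range(3)]
modes.append(launch("gemm", (128, 1, 1), trace))
top5 = histogram.most_common(5)
```

Only two of the four launches pay the instrumentation cost. If a later launch with an already-seen grid dimension took a different branch or loop count, its instructions would not be counted, which is the source of the sampling error mentioned above.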
Dynamic is necessary
Instrumentation function: Address divergence analysis
In the figure, Green ≈ Dynamic instrumentation and Orange ≈ Static instrumentation.
In some scenarios, static instrumentation is inaccurate.
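A minimal sketch of an address divergence metric, assuming it is measured as the number of distinct cache lines touched by the 32 threads of a warp for one memory instruction. The 128-byte line size and the helper name `unique_cache_lines` are assumptions made for this example.

```python
def unique_cache_lines(addresses, line_bytes=128):
    """Count the distinct cache lines touched by one warp-level memory access."""
    return len({addr // line_bytes for addr in addresses})

# 32 lanes reading consecutive 4-byte words from an aligned base: one line.
coalesced = [0x1000 + 4 * lane for lane in range(32)]

# 32 lanes reading with a 128-byte stride: every lane hits its own line.
strided = [0x1000 + 128 * lane for lane in range(32)]

coalesced_lines = unique_cache_lines(coalesced)
strided_lines = unique_cache_lines(strided)
```

The metric depends on the actual addresses each lane computes. When those addresses come from input data (for example an index array), they are only known at run time, which is one reason a static estimate can be inaccurate.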