All Posts
2026
Breaking GPU Hardware Limits: Micro-benchmark Methodology, PTX Assembly, and Hopper Architecture
04-27
Cornerstones of CUDA Performance Profiling: Toolchains, Warp Scheduling, and Nsight Compute
04-26
Math Agents: Mathematical Reasoning and Formal Proofs in LLMs
04-08
Tool Agents: Empowering LLMs to Use Tools and Explore Environments
04-08
Coding Agents: Evaluation, Frameworks, and Code LLMs
04-08
When LLMs Learn Memory, Reasoning, and Planning: The Three Core Capabilities of Language Agents
03-12
LLM Reasoning: Prompting, Multi-Path Search, and Iterative Self-Improvement
03-08
RLHF and Test-Time Compute: Reinforcement Learning and Inference-Time Optimization for LLMs
03-08
LLM Basics: Pretraining, Prompting, Fine-tuning, and Reinforcement Learning
03-08
2025
The Evolution of Attention: From MHA to MLA and KV Cache Optimization
12-30
Computational and Communication Modeling of LLM Serving Systems
11-18
2022
PPoPP'21 | A Fast Work-Efficient SSSP Algorithm for GPUs
11-14
TACO'22 | Performance and Power Prediction for Concurrent Execution on GPUs
06-17
OSDI'20 | AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
01-12
OSDI'18 | Gandiva: Introspective Cluster Scheduling for Deep Learning
01-12
RTSS'17 | GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed
01-10
NeurIPS'20 | BRP-NAS: Prediction-based NAS using GCNs
01-03
2021
SoCC'20 | InferLine: latency-aware provisioning and scaling for prediction serving pipelines
12-27
Docker Containers and Images
12-22
MobiSys'21 | nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices
12-20