All Posts
2026
Breaking GPU Hardware Limits: Micro-benchmark Methodology, PTX Assembly, and Hopper Architecture
04-27
Cornerstones of CUDA Performance Profiling: Toolchains, Warp Scheduling, and Nsight Compute
04-26
Math Agents: Mathematical Reasoning and Formal Proofs in LLMs
04-08
Tool Agents: Empowering LLMs to Use Tools and Explore Environments
04-08
Coding Agents: Evaluation, Frameworks, and Code LLMs
04-08
When LLMs Learn Memory, Reasoning, and Planning: The Three Core Capabilities of Language Agents
03-12
LLM Reasoning: Prompting, Multi-Path Search, and Iterative Self-Improvement
03-08
RLHF and Test-Time Compute: Reinforcement Learning and Inference-Time Optimization for LLMs
03-08
LLM Basics: Pretraining, Prompting, Fine-tuning, and Reinforcement Learning
03-08
2025
The Evolution of Attention: From MHA to MLA and KV Cache Optimization
12-30
Computational and Communication Modeling of LLM Serving Systems
11-18
2022
PPoPP'21 | A Fast Work-Efficient SSSP Algorithm for GPUs
11-14
TACO'22 | Performance and Power Prediction for Concurrent Execution on GPUs
06-17
OSDI'20 | AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
01-12
OSDI'18 | Gandiva: Introspective Cluster Scheduling for Deep Learning
01-12
RTSS'17 | GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed
01-10
NeurIPS'20 | BRP-NAS: Prediction-based NAS using GCNs
01-03
2021
SoCC'20 | InferLine: latency-aware provisioning and scaling for prediction serving pipelines
12-27
Docker Containers and Images
12-22
MobiSys'21 | nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices
12-20