A comprehensive analysis of the architectural evolution in LLM inference systems: from early Continuous Batching and Chunked Prefill, to Prefill-Decoding (PD) Disaggregation in DistServe and Mooncake, Attention-Expert (AE) Disaggregation for MoE models, and the extreme Attention Offloading mechanism in Adrenaline.
From vision architecture basics (ViT, CLIP) to Large Multimodal Models (LMMs), and finally to Multimodal Agents capable of visual grounding and tree search in real-world web environments: a comprehensive analysis of the evolution and challenges of multimodal agents.
From Agentic Search to Full-Stack AI Scientists, a comprehensive breakdown of the four core components of Deep Research: Query Planning, Information Acquisition, Memory Management, and Answer Generation, featuring detailed explanations of cutting-edge methods like RAG-Star, HippoRAG, and Self-RAG.
CUDA Micro-benchmark Series (Part 2): Exploring how to benchmark peak GPU compute and bandwidth, understanding Memory Consistency, and mastering PTX inline assembly and Hopper (H100) TMA/WGMMA asynchronous features.
CUDA Micro-benchmark Series (Part 1): An in-depth exploration of CUDA compilation workflows, binary analysis tools, GPU Warp scheduling mechanisms, and how to conduct deep performance and stall profiling using Nsight Compute.
A deep dive into the frontier of mathematical LLMs: from the current SFT and GRPO recipes to the introduction of formal mathematics (Lean), dissecting the AlphaProof workflow, symbolic-reasoning pruning (LIPS), and the evaluation challenges in autoformalization.