From vision architecture basics (ViT, CLIP) to Large Multimodal Models (LMMs), and finally to Multimodal Agents capable of visual grounding and tree search in real-world web environments. A comprehensive analysis of the evolution and challenges of multimodal agents.
From Agentic Search to Full-Stack AI Scientists: a comprehensive breakdown of the four core components of Deep Research (Query Planning, Information Acquisition, Memory Management, and Answer Generation), with detailed explanations of cutting-edge methods such as RAG-Star, HippoRAG, and Self-RAG.
CUDA Micro-benchmark Series (Part 2): Exploring how to benchmark peak GPU compute throughput and memory bandwidth, understanding Memory Consistency, and mastering PTX inline assembly and Hopper (H100) TMA/WGMMA asynchronous features.
CUDA Micro-benchmark Series (Part 1): An in-depth exploration of CUDA compilation workflows, binary analysis tools, GPU Warp scheduling mechanisms, and how to conduct deep performance and stall profiling using Nsight Compute.
A deep dive into the frontier of mathematical LLMs: from current SFT and GRPO recipes to formal mathematics in Lean, covering the AlphaProof workflow, symbolic-reasoning pruning (LIPS), and the evaluation challenges of autoformalization.
A comprehensive deep dive into Tool Agents: from Toolken vocabulary injection and CodeAct execution, to DocPrompting, Toolformer self-learning, visual Set-of-Mark grounding, and autonomous environment exploration.