From CoT and analogical prompting to self-consistency, ORM/PRM verification, tree-of-thoughts, multi-round self-reflection and token budget allocation, with the Bitter Lesson in mind.
From reward design, policy gradient, and PPO to RLHF/RLVR, then inference-time sampling and verification, Archon architecture search, and when to use RL vs test-time scaling.
An overview of core methods for training and using large language models: compute and scaling, prompting, fine-tuning, and reinforcement learning.
A detailed breakdown of how attention mechanisms in large language models evolved: from the original MHA, to MQA and GQA, which shrink the KV cache, to DeepSeek's MLA (Multi-head Latent Attention), which relieves the memory bottleneck via low-rank projections while remaining compatible with RoPE.
This article analyzes the computation and communication overhead of LLM serving systems under different parallelism strategies.
In this paper, the authors introduce a new work scheduler that improves both work efficiency and parallelism for single-source shortest path (SSSP) search.