RLHF and Test-Time Compute: Reinforcement Learning and Inference-Time Optimization for LLMs

1. RL Recap: Why Do We Need “Training-Time Reinforcement Learning”?

1.1 Three Core Issues

  • Task mismatch: The language model only maximizes $p(\text{response} \mid \text{prompt})$, i.e. it learns “the most likely next token.” What we actually want is:
    • Helpful answers
    • Non-offensive answers
    • Correct solutions, code that passes tests, etc.
  • Data mismatch: Pretraining data contains many undesirable outputs (toxic comments, buggy code), while high-quality “good reasoning chains and answers” are scarce.
  • Exposure bias: During training the model always sees “correct prefixes” and rarely sees its own errors; at test time, once one step is wrong, later steps drift further because it never learned how to recover from mistakes.

1.2 Basic RL Idea

Use reinforcement learning to turn the “task metric” directly into a reward and let the model learn in a loop where it produces its own outputs. The core is the MDP quadruple $(S, A, E, R)$: state, action, environment, reward.

  • Example: language generation
    • State: current prompt and tokens generated so far, $s_t = (x, y_{<t})$
    • Action: next token $a_t = y_t$
    • Policy: $p_\theta(y_t \mid y_{<t}, x)$
    • Environment: append the token to the sequence
    • Reward: score the full sequence $r(x, y)$, e.g. whether the answer is correct.
  • Alternatively use a “one-step MDP”: generate the whole response at once and score it with a single reward.
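The token-level loop above can be sketched directly. The names here (`rollout`, `toy_policy`) and the reward function are illustrative stand-ins, not any particular library's API:

```python
# A runnable sketch of the token-level MDP for language generation.

def rollout(prompt_tokens, policy, reward_fn, max_new_tokens=10, eos=0):
    """One episode: state s_t = prompt + tokens so far; action a_t = next token."""
    state = list(prompt_tokens)
    for _ in range(max_new_tokens):
        action = policy(state)   # sample the next token y_t
        state.append(action)     # environment transition: append the token
        if action == eos:
            break
    return state, reward_fn(state)  # sequence-level reward r(x, y)

# Toy deterministic policy: emit token 1 until the sequence reaches length 4, then EOS.
def toy_policy(state):
    return 1 if len(state) < 4 else 0

seq, r = rollout([7, 8], toy_policy, reward_fn=lambda s: float(s[-1] == 0))
# seq == [7, 8, 1, 1, 0]; r == 1.0
```

In the “one-step MDP” variant, `rollout` collapses to a single call that returns the whole response, and only the final `reward_fn(state)` matters.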

2. Reward Design: Rule-Based, Model-Based, and Preference

2.1 Rule-based rewards (verifiable rules)

These rewards can be checked automatically for correctness.

  • Math: Have the model output a final answer and implement a checker against the reference. Define $r(x, y) = 1$ (correct) or $0$ (wrong).
  • Code: Run unit tests; reward = fraction of tests passed: $r(x, y) = \text{fraction of passed tests}$.
  • Generate a 5-line poem: Compute the number of lines $\text{num\_lines}$ and design a reward that penalizes deviation from 5, e.g. $r(x, y)= -|\text{num\_lines} - 5|$.

Such rewards are simple and objective, and work well for math, coding, and other tasks with clear ground truth.
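The three rules above are easy to implement directly. These sketches assume plain string answers and a pre-computed test count:

```python
# Minimal sketches of the rule-based rewards described above.

def math_reward(answer: str, reference: str) -> float:
    """1 if the final answer matches the reference exactly, else 0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def code_reward(passed: int, total: int) -> float:
    """Fraction of unit tests passed."""
    return passed / total if total else 0.0

def poem_reward(poem: str) -> float:
    """Penalize deviation from the 5-line target: r = -|num_lines - 5|."""
    num_lines = len(poem.strip().splitlines())
    return -abs(num_lines - 5)
```

In practice the math checker usually also normalizes equivalent forms (e.g. `0.5` vs `1/2`), which is where most of the engineering effort goes.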

2.2 Model-based rewards: scoring models

When rules are hard to write, use a separately trained “scoring model” as the reward function.

  1. Direct assessment model
    • Input: prompt $x$ + model output $y$.
    • Output: a scalar $R$ (or a 0–1 probability) for “how helpful / safe” the output is.
    • Example: In NVIDIA’s Aegis content-safety dataset, each sample is labeled safe or not; train a safety classifier and use it as the reward model.
  2. Preference model / reward model
    • Collect human preference data: for the same prompt, two responses $y_+, y_-$ with a label indicating which is better.
    • Training objective: make the reward model $r_\theta$ assign higher score to the better response:
$$ \mathcal{L} = - \sum_{(y_+, y_-) \in D} \log \sigma\big(r_\theta(y_+) - r_\theta(y_-)\big) $$
    • This encourages $r_\theta(y_+) \gg r_\theta(y_-)$.
    • Examples: Anthropic’s HH-RLHF dialogue dataset (chosen/rejected pairs for the preference model); UltraRM-13B is an open-source dialogue reward model trained this way.
  3. Code reward model
    • CodeScaler proposes an execution-free reward model to score code without running tests, used in RL training and test-time Best-of-N selection to save compute.
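The pairwise preference loss can be sketched with scores as plain floats; a real reward model would produce $r_\theta(y_+)$ and $r_\theta(y_-)$:

```python
import math

# Pairwise (Bradley-Terry style) preference loss:
# L = -sum log sigma(r(y+) - r(y-)) over labeled pairs.

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def preference_loss(pairs) -> float:
    """pairs: list of (score_chosen, score_rejected) from the reward model."""
    return -sum(math.log(sigmoid(sp - sm)) for sp, sm in pairs)
```

When the two scores are equal the per-pair loss is $\log 2$; it shrinks toward 0 as the margin $r_\theta(y_+) - r_\theta(y_-)$ grows.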

3. Policy Gradient: How to Update the LLM with the Reward

The goal is to maximize expected reward $J(\theta)$. In the “generate once” setting with fixed input $x$, the policy gradient is:

$$ \nabla_{\theta} J(\theta) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\left[ r(x, y)\, \nabla_\theta \log p_\theta(y \mid x)\right] $$
  • Intuition: high-reward samples push $\log p_\theta(y|x)$ up (increase probability); low-reward samples push it down.

Approximate the expectation by sampling one or more $\hat{y}$ and form the loss:

$$ \mathcal{L}_{PG} = - r(x, \hat{y}) \log p_\theta(\hat{y} \mid x) $$

Gradient descent on this loss corresponds to the policy gradient.
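A one-sample version of this loss, with per-token log-probs supplied as plain floats (in practice they come from the model's forward pass):

```python
# One-sample REINFORCE loss: L_PG = -r(x, y_hat) * log p_theta(y_hat | x).

def pg_loss(reward: float, token_logprobs) -> float:
    """The sequence log-prob is the sum of per-token log-probs."""
    log_p = sum(token_logprobs)
    return -reward * log_p
```

Note the sign: a positive reward makes the loss decrease as `log_p` increases, so gradient descent pushes the sampled sequence's probability up.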

Credit assignment for multi-step sequences

When reward is only given at the end (e.g. game win/loss), it’s unclear which actions mattered. A common approach is discounting:

$$ \hat{r}_t = \gamma^{T-t} r_T, \quad \gamma \in (0,1) $$

Actions closer to the end get larger $\hat{r}$; earlier actions are discounted more.
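A minimal sketch of this discounting scheme, returning $\hat{r}_t$ for $t = 1, \dots, T$:

```python
# Spread a terminal reward r_T backwards: r_hat_t = gamma^(T - t) * r_T.
# Later actions (t close to T) receive more credit.

def discounted_credits(terminal_reward: float, T: int, gamma: float = 0.9):
    return [gamma ** (T - t) * terminal_reward for t in range(1, T + 1)]
```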


4. Three Techniques to Stabilize RL Training

4.1 Preventing reward hacking: KL penalty

Reward hacking: The model exploits loopholes in the reward instead of learning the true task.

  • Example: If the reward only encourages “non-offensive,” the model may learn to always output an empty string.
  • Analogy: When training “wolf catches sheep,” bad reward design can make the wolf learn to “hit the wall and end early” to minimize penalty.

Fix: Add a KL penalty to the objective so the new policy doesn’t drift too far from the reference:

$$ \arg\max_\theta \mathbb{E}[r(x,y)] - \beta D_{KL}(p_\theta || p_0) $$

In practice, the KL term is often approximated as an extra reward:

$$ r^{KL} = -\beta \log \frac{p_\theta(y|x)}{p_0(y|x)} $$

where $p_0$ is the reference (e.g. initial or supervised) policy.
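Folding the KL term into the reward is a one-liner given the two log-probs, treated here as per-sequence scalars (implementations typically apply it per token):

```python
# KL-penalized reward: r' = r - beta * (log p_theta(y|x) - log p_0(y|x)).

def kl_penalized_reward(reward: float, logp_policy: float, logp_ref: float,
                        beta: float = 0.1) -> float:
    """Subtract the scaled log-ratio to the reference policy from the task reward."""
    return reward - beta * (logp_policy - logp_ref)
```

When the policy assigns a sample more probability than the reference does, the penalty is positive and the effective reward shrinks.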

4.2 Reward scaling: advantage

Using $r(x,y)$ directly as the coefficient of $\log p_\theta$ is noisy and can destabilize learning. A common approach is to use a baseline:

$$ \mathcal{L} = - (r(x, y) - b(x)) \log p_\theta(y|x) $$
  • $b(x)$ estimates the expected reward (the “average” level) for prompt $x$; it must not depend on the sampled $y$, otherwise the gradient estimate becomes biased.
  • What matters is the advantage $A = r - b$: $A > 0$ means better than expected → increase probability; $A < 0$ → decrease probability.

Ways to implement the baseline:

  • Output average: Sample multiple $y$ for the same $x$ and use mean reward as baseline (e.g. GRPO).
  • Running mean: Maintain a running mean of rewards across batches.
  • Learned: Train a value network to predict expected reward.
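The output-average baseline can be sketched as group-mean normalization; dividing by the group standard deviation, as is common in GRPO-style setups, is optional:

```python
# GRPO-style advantages: sample a group of responses for one prompt,
# subtract the group mean reward, and normalize by the group std.

def group_advantages(rewards, eps: float = 1e-8):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

With binary 0/1 rewards this makes correct samples get positive advantage and incorrect ones negative, regardless of the prompt's difficulty.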

4.3 Controlling update size: PPO

Large updates can break the policy. PPO limits the ratio between new and old policy so each step doesn’t change too much:

  • Ratio:
$$ \text{ratio}(x,y) = \frac{p_\theta(y|x)}{p_{\theta_{\text{old}}}(y|x)} $$
  • PPO objective (maximized; the training loss is its negative):
$$ L_{PPO} = \min\big(\text{ratio} \cdot A,\ \text{clip}(\text{ratio}, 1-\epsilon,1+\epsilon)\cdot A\big) $$

If the ratio goes outside $[1-\epsilon,1+\epsilon]$, it is clipped to avoid overly large updates.
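The clipped objective for a single sample; in practice `ratio` is computed as `exp(logp_new - logp_old)`:

```python
# Clipped PPO objective for one sample: min(ratio * A, clip(ratio) * A).

def ppo_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

Taking the `min` means clipping only ever makes the objective smaller, so the update has no incentive to push the ratio far outside $[1-\epsilon, 1+\epsilon]$.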

The open-source library verl implements PPO, approximating KL with log-prob differences and averaging over tokens with EOS masking.


5. RLHF and RLVR: How RL Is Used with LLMs

5.1 Classic RLHF pipeline

The standard three-step pipeline used in InstructGPT / ChatGPT:

  1. Collect demonstration data and do supervised fine-tuning (SFT) to get an initial policy.
  2. Collect comparison data (same prompt, multiple response pairs, human preference labels) and train a reward model.
  3. Run PPO against the reward model to get the final policy.

5.2 RLVR: Verifiable reward

RLVR replaces a complex reward model with verifiable 0/1 reward:

  • Applicable to: arithmetic, math competitions, instruction-following tasks with clear checks.
  • Flow: policy generates chain-of-thought + final answer; a rule/program checks correctness → reward = 0 or 1; optimize with PPO + output-average baseline (e.g. GRPO).
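A sketch of such a verifiable reward; the `Answer:` marker is a hypothetical output format chosen for illustration, not a standard:

```python
import re

# RLVR-style 0/1 reward: extract the final answer from the chain-of-thought
# (here, whatever follows an "Answer:" marker) and compare with the reference.

def rlvr_reward(completion: str, reference: str) -> float:
    m = re.search(r"Answer:\s*(\S+)", completion)
    if m is None:
        return 0.0  # no parseable final answer counts as wrong
    return 1.0 if m.group(1) == reference else 0.0
```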

DeepSeek-R1 follows an RLVR-style setup: the model exhibits “aha moments” during training, catching its own logic errors and reallocating compute to harder subproblems.

5.3 “Does RL really improve reasoning?” Two follow-up works

  1. NeurIPS 2025: Does RL Really Incentivize Reasoning Capacity in LLMs?
    • Main finding: For large sampling counts (large $k$), a base model with heavy sampling (pass@k) can outperform the RL model.
    • RLVR solutions largely stay within the base model’s output distribution; RL acts more like “improving sampling efficiency” than “expanding the reasoning frontier.”
    • Different RL algorithms (PPO, GRPO, Reinforce++) perform similarly and are still far from “optimal sampling efficiency.”
    • In contrast, knowledge distillation can inject new knowledge and expand the set of solvable problems.
  2. ICLR 2026: Beyond Pass@1 – Self-Play with Variational Problem Synthesis Sustains RLVR
    • Proposes self-play + variational problem synthesis (SvS); on AIME and other math benchmarks, RL training can clearly improve pass@1 and pass@32.
    • Conclusion: With the right setup and data design, RL can push the reasoning frontier, but it requires careful design.

6. Test-Time Scaling: Spending Compute at Inference

The main theme here: instead of scaling pretraining indefinitely, use “more sampling, search, and verification at inference” to improve performance.

6.1 Repeated sampling + verification: Large Language Monkeys

The Large Language Monkeys work (playing on “monkeys at typewriters”) proposes:

  1. Generate many candidate solutions (coverage)
    • From the same prompt, use temperature sampling to get $k$ candidates.
    • If at least one is correct, we “cover” the problem.
  2. Use a verifier to pick the correct one (precision)
    • Code: unit tests.
    • Math: proof checker.
    • General: reward model, voting, etc.

Key observation: In domains with automatic verifiers (math/code), repeated sampling alone can greatly improve effective capability; e.g. Llama-3-8B with enough samples can surpass GPT-4o. pass@k vs $k$ also follows a scaling-law-like curve (power law), so we can predict how much sampling is needed for a given accuracy.
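pass@k is usually estimated from $n$ sampled solutions ($c$ of them correct) with the standard unbiased estimator rather than by naive averaging:

```python
from math import comb

# Unbiased pass@k estimator: pass@k = 1 - C(n - c, k) / C(n, k),
# i.e. one minus the probability that a random size-k subset is all-wrong.

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Plotting `pass_at_k` against $k$ on log axes is what yields the power-law-like curves described above.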

6.2 Prerequisite for inference-time scaling: Reliable verifiers

  • For math/code, automatic verification is relatively reliable.
  • For general NL tasks (open QA, writing):
    • Simple majority vote or reward-model ranking can differ a lot from the ideal “did we ever generate the correct answer” (coverage).
    • Reason: The correct answer may appear in only a tiny fraction of samples (e.g. 1–2 out of 10k); majority vote can be dominated by wrong samples.

Conclusion: Test-time scaling works best when paired with strong verification or ranking.


7. Test-Time Compute vs Pretraining Compute

The GDM 2024 work “Scaling LLM Test-Time Compute Optimally…” compares two ways to spend compute:

  • More pretraining FLOPs: larger model or longer training.
  • More test-time FLOPs: more sampling / search / revision with the same model.

On the MATH benchmark:

  • For easy and medium problems: with a fixed total FLOP budget, shifting some compute to test-time often beats only scaling pretraining.
  • For the hardest problems: test-time scaling has limited returns; a stronger base model is needed.

8. Test-Time Strategies: Parallel Sampling vs Sequential Revision

8.1 Two paradigms

  1. Parallel sampling (Best-of-N)
    • Generate $N$ full solutions at once; use verifier or reward model to pick the best.
    • Good for hard problems: explore many different solution paths in parallel.
  2. Sequential revisions
    • Generate one solution; if the verifier says it’s wrong, ask the model to revise given the error; repeat for several rounds.
    • For easier problems the base model is often close; multiple self-corrections can converge to the right answer.

Empirically:

  • Easy problems: Pure sequential revision works best.
  • Hard problems: A mix of parallel and sequential is best; purely sequential can get stuck in wrong paths, purely parallel wastes compute on bad samples.
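Best-of-N fits in a few lines; `generate` and `score` are caller-supplied hooks (a sampler and a verifier or reward model):

```python
# Best-of-N: draw N candidates and keep the one the scorer likes most.

def best_of_n(generate, score, n: int):
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)
```

Sequential revision would instead loop: generate, score, and if the score is poor, feed the candidate plus feedback back into the next `generate` call.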

For finer-grained search, use a Process Reward Model (PRM):

  • Score each step of the reasoning process, not only the final answer.
  • Use the PRM in beam search: generate multiple partial steps; PRM scores and keeps top-$k$ paths; extend each path with the next step; repeat. This prunes “bad” branches and keeps “good” ones on the reasoning tree.

This “process reward + search” is more efficient than plain Best-of-N on MATH.
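The step-level beam search described above, sketched with caller-supplied `expand` (propose next steps) and `prm_score` (score a partial path) functions:

```python
# Beam search over reasoning steps guided by a process reward model (PRM):
# at each step, expand every kept path and retain the top-k by PRM score.

def prm_beam_search(expand, prm_score, init, steps: int, k: int):
    """expand(path) -> list of extended paths; prm_score(path) -> float."""
    beam = [init]
    for _ in range(steps):
        candidates = [p for path in beam for p in expand(path)]
        beam = sorted(candidates, key=prm_score, reverse=True)[:k]
    return beam
```

With `k = 1` this degenerates to greedy step selection; larger `k` keeps several promising branches of the reasoning tree alive.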


9. Archon: Systematic Search over Test-Time Architectures

To systematically design “inference-time pipelines (multiple models + multi-stage ops),” ICML 2025’s Archon proposes an architecture search framework.

9.1 Basic operations (inference-time ops)

Each step is an “op”:

| Type | Role | Input → Output | Typical use |
| --- | --- | --- | --- |
| Generator | Generate candidate responses from instruction | prompt → candidates | All tasks |
| Fuser | Merge multiple candidates into one | prompt + candidates → fused response | Multi-response fusion |
| Critic | Write pros/cons for each candidate | prompt + candidates → critiques | For Ranker/Fuser |
| Ranker | Rank and select top-$k$ | prompt + candidates → ranked list | Choose best few |
| Verifier | Verify correctness of reasoning | prompt + candidates → verified set | Math/code etc. |
| Unit Test Generator | Generate unit tests | prompt → tests | Code tasks |
| Unit Test Evaluator | Run tests and score | prompt + tests + candidates → scores | Code tasks |

9.2 Archon architecture rules

  • The first layer must be a Generator (generate candidates first).
  • Each layer has exactly one op type.
  • Critic must come before Ranker/Fuser; Unit Test Generator must be followed by Evaluator.
  • Multiple layers can be stacked into a “deep reasoning architecture.”

Bayesian optimization is used to search for the best architecture under a given set of models, ops, and call budget (much more efficient than random or greedy search).
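An Archon-style stack can be modeled as layers of ops, each mapping (prompt, candidates) to a new candidate list; the ops below are toy stand-ins, not Archon's actual interface:

```python
# Layered inference-time pipeline: run ops in order, threading candidates through.

def run_pipeline(layers, prompt):
    candidates = []  # the first layer (a Generator) ignores this and creates candidates
    for op in layers:
        candidates = op(prompt, candidates)
    return candidates

# Toy ops obeying the rules above: Generator first, then a selector.
generator = lambda p, c: [p + " v1", p + " v2"]   # produce candidate responses
ranker = lambda p, c: sorted(c)[:1]               # toy ranking: keep lexicographic top-1
```

An architecture search then amounts to choosing which ops fill which layers under a call budget, which is what the Bayesian optimization explores.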

9.3 Results

  • On several benchmarks (MT-Bench, AlpacaEval, Arena-Hard, MixEval, MATH, Code Contests), Archon’s task-specific architectures significantly outperform single GPT-4o, Claude 3.5 Sonnet, and existing LM systems (e.g. MoA, ADAS, AFlow) in pass@1 / win-rate, while keeping call count and cost reasonable.

10. When to Use RL vs Test-Time Scaling?

Summary from the notes:

  • Use RL when
    • The goal is to optimize sequence-level task metrics (e.g. “Is the full response helpful?” “Is the full chain-of-thought + answer correct?”).
    • There is a non-trivial MDP (multi-turn dialogue, web interaction, robot control, etc.) and reward comes from the full interaction.
  • Use test-time scaling when
    • The base model is already good but you want to squeeze more out of it on verifiable domains (math, code);
    • You’d rather not spend heavily on new pretraining and instead use more inference budget plus a good verifier and architecture search;
    • Especially when you have automatic verification (unit tests, theorem provers, compilers, NL test generators, etc.).