LLM Reasoning: Prompting, Multi-Path Search, and Iterative Self-Improvement

1 Overall Framework: Three Ways to Add Compute

The notes open by recalling “reasoning-oriented” models like o1, o3, and DeepSeek-R1, and a central fact: performance increasingly depends on test-time compute. For example, o3 can spend over $1000 in inference cost per ARC-AGI problem.

The course is organized in three parts:

  1. Prompting: Given one problem, spend more tokens so the model produces a longer, more detailed single reasoning path (depth along one path).
  2. Search & Selection: For the same problem, generate many candidate paths and search/select over this “width” across paths.
  3. Iterative Self-Improvement: Let the model reflect and revise over multiple rounds, forming “depth over multiple attempts.”

The rest of the content follows these three themes.


2 Prompting: Thinking Deeper Along One Path

2.1 Standard Prompting vs Chain-of-Thought (CoT)

  • Traditional few-shot prompts only show “final answer” examples; the model learns the output format but not the style of intermediate reasoning, and does poorly on reasoning benchmarks.
  • CoT prompts include full reasoning in the examples (e.g. first count tennis balls in two cans, then add, then get 11); the model then learns “reason step by step, then conclude.”

This gives two main benefits:

  1. Adaptive compute: Harder problems naturally get longer CoT traces.
  2. Explicit strategy: We can ask in the prompt for “decompose the problem, then solve step by step,” steering the model toward decomposition, planning, etc.
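The difference between standard and CoT prompting is just prompt construction. A minimal sketch, using the tennis-ball exemplar from the original CoT paper (the helper name `cot_prompt` is illustrative):

```python
# A minimal few-shot CoT prompt builder. The exemplar shows intermediate
# reasoning, not just a final answer, so the model imitates the style.
COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 balls. 5 + 6 = 11. The answer is 11.\n"
)

def cot_prompt(question: str) -> str:
    # Prepend the worked exemplar before the new question.
    return COT_EXEMPLAR + f"Q: {question}\nA:"
```

A standard few-shot prompt would contain only "The answer is 11." in the exemplar; including the reasoning is the entire intervention.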

2.2 Analogical Prompting: Let the Model Find “Analogous Examples”

Humans often think “have I seen a similar problem?” Analogical prompting mimics this by having the LLM generate relevant examples and knowledge before solving the new problem.

Rough flow:

  1. Give a hard problem (e.g. a Codeforces-style task) and instruct the model to:
    • Explain the core algorithmic idea;
    • Write a short tutorial;
    • Provide 3 related examples (problem statement, solution, code);
    • Then return to the original problem and give solution and code.
  2. The model produces high-level knowledge (e.g. what “prefix-product” is, its complexity) and several structurally similar examples with code—self-generated exemplars.

Experiments (GSM8K, Codeforces, BIG-Bench temporal reasoning) show: analogical prompting with self-generated exemplars and knowledge > hand-crafted few-shot CoT > 0-shot CoT > 0-shot, especially for weaker models like text-davinci-003 (e.g. 61% vs 54%).
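The prompt-assembly step can be sketched as follows; the wording is an illustrative paraphrase of the paper's instructions, not its verbatim prompt:

```python
def analogical_prompt(problem: str) -> str:
    # Ask the model to recall knowledge and generate its own exemplars
    # before attempting the target problem (self-generated exemplars).
    return (
        f"Problem: {problem}\n\n"
        "1. Explain the core algorithmic idea behind this problem and "
        "write a short tutorial.\n"
        "2. Recall 3 relevant example problems; for each, give the "
        "statement, a solution sketch, and code.\n"
        "3. Return to the original problem and give a full solution "
        "with code.\n"
    )
```

Everything after the problem statement is generated by the model itself, so no hand-crafted exemplars are needed.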

2.3 The “Strategy Space” Opened by CoT

CoT is not only longer text; it supports more reasoning patterns:

  • Least-to-Most Prompting:
    • Stage 1: Have the model decompose the problem into subproblems (e.g. “how many minutes per slide,” then “how many slides in 15 minutes”).
    • Stage 2: Solve subproblems in order; earlier answers are fed into context for later ones.
    • On compositional generalization (SCAN, CFQ), Least-to-Most can generalize to longer, more complex command sequences with only 0.1% of training data.
  • Dynamic Least-to-Most: For complex semantic parsing (e.g. CFQ), use the LLM to do syntactic/constituent parsing, pick the best exemplars per substructure, then generate the full SPARQL query in order; outperforms various fully supervised T5 models on MCD splits.
  • Self-Discover: Instead of hand-picking a strategy, list a “strategy library” (CoT, decomposition, reflection, etc.) and let the LLM decide “what structure fits this task,” then compose a task-specific reasoning flow and adapt the steps; overall better than plain CoT on MATH, BIG-Bench-Hard, etc.
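The two-stage Least-to-Most flow can be sketched with a stubbed model call (`llm` here is a canned stand-in for a real LLM API, so the control flow runs standalone):

```python
def llm(prompt: str) -> str:
    # Stub standing in for a real model call; returns canned answers
    # keyed off the most recent question in the prompt.
    p = prompt.lower()
    if "decompose" in p:
        return ("1. How many minutes per slide?\n"
                "2. How many slides in 15 minutes?")
    question = [line for line in p.splitlines() if line.startswith("q:")][-1]
    return ("30 slides" if "slides in 15 minutes" in question
            else "0.5 minutes per slide")

def least_to_most(problem: str) -> list[tuple[str, str]]:
    # Stage 1: decompose the problem into ordered subproblems.
    plan = llm(f"Decompose into subproblems: {problem}")
    subproblems = [line.split(". ", 1)[1] for line in plan.splitlines()]

    # Stage 2: solve subproblems in order; earlier Q/A pairs are fed
    # back into the context for later ones.
    context, transcript = problem, []
    for sub in subproblems:
        answer = llm(f"{context}\nQ: {sub}\nA:")
        context += f"\nQ: {sub}\nA: {answer}"
        transcript.append((sub, answer))
    return transcript
```

The key design point is the growing `context`: each subproblem's answer becomes a premise for the next, which is what lets the method compose solutions to longer problems than the exemplars show.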

Summary: The prompting thread is—use CoT and its variants (analogical, least-to-most, self-discovered strategy) to lengthen and structure a single reasoning path, so the same model spends more tokens thinking on one sample.


3 Search & Selection: Many Paths, One Best

A single CoT path is still error-prone: the LLM often goes wrong on the first try. The remedy: generate many candidate paths, then pick the best.

3.1 Self-Consistency: Majority Vote on Final Answers

Self-Consistency:

  1. Use CoT prompting and sample multiple times to get different reasoning paths (temperature or nucleus sampling for diversity).
  2. Look only at the “final answer” of each path; ignore whether intermediate steps agree.
  3. Take the most frequent answer as the final one.

Experiments show:

  • Unlike “take only the single highest log-prob path” (Sample-and-Rank), Self-Consistency keeps improving with more samples;
  • On AddSub/MultiArith, GSM8K, etc., it gives roughly 5–20+ point gains across models (e.g. GPT-3 code-davinci-002 on GSM8K: 60.1% → 78.0%).

Intuition: If many random paths agree, that answer is more likely correct. Higher consistency means the model is more “confident.”
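The whole procedure reduces to a majority vote over sampled final answers; a minimal sketch:

```python
from collections import Counter

def self_consistency(sample_fn, n_samples: int = 10) -> str:
    # Sample n independent reasoning paths, keep only each path's final
    # answer, and return the most frequent one (majority vote).
    answers = [sample_fn() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

In practice `sample_fn` would call the model with temperature > 0 and extract the final answer from each sampled CoT; disagreement among intermediate steps is deliberately ignored.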

3.2 Clustering by Execution Consistency: AlphaCode

For programming contests, AlphaCode:

  1. Sample at scale to generate very many solution programs.
  2. Have the model generate many new test inputs; run all candidates on these tests.
  3. Cluster programs that produce identical outputs (same “semantics”).
  4. From the largest clusters, pick one representative each and run on the real judge; choose the one that passes.

This “cluster by execution consistency + pick representative” beats simple filtering (e.g. only drop timeout/compile-error), but still falls short of an oracle that always selects the truly correct program.
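The clustering step amounts to hashing each program by its output vector on the generated tests; a sketch, where `run(program, test_input)` is an assumed callable that executes a candidate:

```python
from collections import defaultdict

def cluster_by_execution(programs, run, test_inputs):
    # Group candidates whose outputs agree on every generated test input;
    # members of one cluster are behaviorally equivalent on these inputs.
    clusters = defaultdict(list)
    for prog in programs:
        signature = tuple(run(prog, x) for x in test_inputs)
        clusters[signature].append(prog)
    # One representative per cluster, largest clusters first.
    return [group[0] for group in
            sorted(clusters.values(), key=len, reverse=True)]
```

Submitting one representative per large cluster spends the limited judge attempts on semantically distinct programs rather than near-duplicates.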

3.3 Verifier-Based Selection: ORM / PRM as Scorers

Search & selection needs a “judge,” which brings in the Outcome Reward Model (ORM) and Process Reward Model (PRM) introduced in a previous post. Here the focus is on verifiers for math reasoning.

3.3.1 “Training Verifiers to Solve Math Word Problems” (OpenAI 2021)

  • Build GSM8K: ~8.5K grade-school math problems, human-written, high quality, multi-step, with natural-language solution and final answer.
  • Method 1: Fine-tune a generator only on GSM8K to output solutions directly.
  • Method 2: Fine-tune + verify:
    1. Use the fine-tuned generator to produce many candidates per problem (e.g. 100).
    2. Train a “verifier” to judge each solution right/wrong.
    3. At inference, score candidates with the verifier and pick the highest.

Findings:

  • With very small training sets, the verifier overfits and can hurt;
  • With enough data, “fine-tune + verification” clearly beats fine-tune only;
  • For a fixed total parameter budget, “large generator + small verifier” beats “small generator + large verifier”—generation capacity matters more;
  • More candidates at inference keep improving performance, with clear gains up to around 400 samples per problem.
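At inference time, the "fine-tune + verify" pipeline is just best-of-n selection under the verifier's score; a minimal sketch with stubbed `generate`/`verify` callables:

```python
def best_of_n(generate, verify, n: int = 100) -> str:
    # Sample n candidate solutions from the generator, score each with
    # the verifier, and return the highest-scoring one.
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=verify)
```

Unlike Self-Consistency, this works even when candidates rarely agree exactly, because the verifier scores each solution independently.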

3.3.2 PRM800K & “Let’s Verify Step by Step” (OpenAI 2024)

This work argues: Process supervision (PRM) is stronger than outcome supervision (ORM).

  • PRM800K dataset:
    • Use GPT-4 to generate many solution traces.
    • Use active learning to select samples where “the model thinks it’s right but the answer is wrong,” then have humans label each line as positive/negative/neutral;
    • This is 2.6× more data-efficient than random selection.
  • Train two verifiers:
    • ORM: Does the full solution lead to the correct answer?
    • PRM: Is each step a “good” step toward the correct answer?

Results:

  • On GSM8K/MATH, PRM as verifier generally beats ORM and Self-Consistency, especially with more samples;
  • PRM generalizes better out-of-distribution;
  • But PRM needs heavy step-by-step human labeling and is costly to scale.

3.3.3 Math-Shepherd (DeepSeek 2024): Automatic Step Labels from the Model

Math-Shepherd aims to train a PRM-like process verifier without manual step-level labels.

Idea:

  1. For each intermediate step $s_i$, continue generation from it with N different continuations and see what fraction reach the correct answer.
  2. Use this “probability of eventually being correct” as the step label: the hard estimate (HE) labels a step 1 if any continuation reaches the correct answer; the soft estimate (SE) uses the fraction of continuations that do.
  3. Train a PRM on these automatic labels; then use it for:
    • Verification: score 256 candidate paths and pick the best;
    • RL: step-by-step PPO, rewarding each step in the generator.
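The automatic labeling for one step $s_i$ can be sketched as follows, where `rollout` is an assumed callable that completes the solution from $s_i$ and checks the final answer:

```python
def step_labels(rollout, n: int = 8):
    # `rollout()` continues the solution from the current step s_i and
    # returns True if the completed solution reaches the correct answer.
    outcomes = [rollout() for _ in range(n)]
    hard = 1 if any(outcomes) else 0    # HE: 1 if any continuation succeeds
    soft = sum(outcomes) / n            # SE: fraction that succeed
    return hard, soft
```

Running this per step replaces the per-line human annotation that PRM800K required, trading labeling cost for extra generation compute.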

Results (LLaMA2/Mistral fine-tuned on MetaMATH):

  • As verifier, Math-Shepherd beats ORM and Self-Consistency on GSM8K/MATH500;
  • As RL reward, step-by-step PPO further improves greedy decoding;
  • Combining RL and verification works best (e.g. Mistral-7B 43.5% on MATH500).

3.3.4 AutoPSV: Upgrading ORM to PRM Automatically

AutoPSV trades off cost and quality:

  • First train an outcome-supervised verifier (OSV, like ORM) that only checks the final answer.
  • Use the OSV to score “correctness probability” at each step and look at how confidence changes between steps:
    • Big increase → good step;
    • Big drop → bad step.
  • Use these as automatic labels to train a process-supervised verifier (PSV).
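The confidence-change rule can be sketched directly; the threshold value is an illustrative assumption, not the paper's exact setting:

```python
def label_steps(confidences, threshold: float = 0.2):
    # confidences[i] is the OSV's estimated probability of eventual
    # correctness after step i; confidences[0] is the prior before step 1.
    labels = []
    for prev, cur in zip(confidences, confidences[1:]):
        delta = cur - prev
        if delta >= threshold:
            labels.append("good")      # big confidence increase -> good step
        elif delta <= -threshold:
            labels.append("bad")       # big confidence drop -> bad step
        else:
            labels.append("neutral")
    return labels
```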

On GSM8K/MATH and HellaSwag/Winogrande/ANLI, OSV+PSV almost always beats OSV alone or Self-Consistency, showing that “label steps by confidence change” captures PRM-style granularity at ORM-like cost.

3.4 Tree-of-Thoughts (ToT): Tree Search with Step-Level Scores

Earlier selection mostly scored after generating a full solution. ToT goes further: explore and prune during generation.

Example: 24-game:

  1. At each intermediate state, have the LLM generate several possible next “thoughts.”
  2. Use another prompt to have the LLM score each state (thought evaluation).
  3. Use BFS or DFS on this “thought tree,” expanding the top-$b$ nodes by score and pruning low-scoring branches.
  4. Find a path that reaches 24.

Experiments: With the same token budget, ToT ($b=5$) reaches ~74% success on 24-game, far above IO prompt / CoT / CoT-Self-Consistency (single digits to ~10%).
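The BFS variant is essentially beam search over thought states; a generic sketch where `propose` and `evaluate` stand in for the LLM's generation and scoring prompts:

```python
def tot_bfs(root, propose, evaluate, beam: int = 5, depth: int = 3):
    # Breadth-first search over a tree of "thoughts": at each level,
    # expand every frontier state with proposed next thoughts, score the
    # resulting states, and keep only the top-`beam` of them.
    frontier = [root]
    for _ in range(depth):
        candidates = [s for state in frontier for s in propose(state)]
        frontier = sorted(candidates, key=evaluate, reverse=True)[:beam]
    return max(frontier, key=evaluate)
```

For the 24-game, a state would be the set of remaining numbers, `propose` would suggest arithmetic combinations, and `evaluate` would prompt the model to rate whether 24 is still reachable.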

Summary: Search & selection is—many paths + a better judge (Self-Consistency / ORM / PRM / AutoPSV) + tree search to exploit more width and local depth of the base model.


4 Iterative Self-Improvement: Multi-Round Self-Revision

Even with search and verifiers, LLMs still err. This part asks: how to let the model keep “reflecting + rewriting” at inference time.

4.1 Reflexion / Self-Refine: Explicit “Self-Evaluate + Revise”

Basic loop:

  1. The model outputs an answer (with or without CoT).
  2. The model produces a “reflection” on what might be wrong and how to improve.
  3. Feed the original answer + reflection into context and generate a new version; repeat for several rounds.

With external feedback (e.g. HotPotQA correctness, ALFWorld reward), Reflexion steadily improves success rates on decision-making, reasoning, and coding tasks.
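The loop can be sketched generically; `generate`, `reflect`, and `is_good` are stand-ins for the model's answer prompt, its reflection prompt, and the (possibly external) feedback signal:

```python
def reflexion_loop(generate, reflect, is_good, max_rounds: int = 3) -> str:
    # Alternate answer -> reflection -> revised answer until the answer
    # passes the check or the round budget runs out.
    answer, memory = generate(""), ""
    for _ in range(max_rounds):
        if is_good(answer):
            break
        memory += "\n" + reflect(answer)   # accumulate verbal reflections
        answer = generate(memory)          # regenerate with them in context
    return answer
```

The accumulated `memory` of reflections is what distinguishes this from plain resampling: each retry is conditioned on an explicit diagnosis of earlier failures.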

4.2 Self-Debugging for Code (+ Effi-Learner)

Code is especially suited to self-improvement because:

  • Unit tests give clear error messages;
  • We can also measure performance (time/memory) for efficiency optimization.

Feedback can be:

  • Simple: “Your code failed the tests, please fix”;
  • With test output (which input failed);
  • Line-level explanation;
  • Execution trace (intermediate state per line).

On Spider, TransCoder, MBPP, etc., richer feedback leads to higher correctness after repair; Codex, GPT-3.5/4, StarCoder all benefit. Effi-Learner uses performance profiles (time, memory hotspots) to drive multi-round optimization so the model can “get it right and fast.”
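The core execute-and-repair loop can be sketched as follows; `fix` stands in for the model call that receives the error message and returns revised code:

```python
def self_debug(code: str, run_tests, fix, max_rounds: int = 3) -> str:
    # Execute the candidate, run the unit tests, and on failure feed the
    # error message back to the model for a repair attempt.
    for _ in range(max_rounds):
        try:
            namespace = {}
            exec(code, namespace)     # define the candidate function(s)
            run_tests(namespace)      # raises on a failing test
            return code               # all tests passed
        except Exception as err:
            code = fix(code, f"{type(err).__name__}: {err}")
    return code
```

Richer feedback corresponds to enriching the second argument of `fix`: the failing input, a line-level explanation, or a full execution trace instead of a bare exception string.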

4.3 Failures and Improvements in Self-Correction

  • “LLMs Cannot Self-Correct Reasoning Yet” shows that without oracle feedback, the model’s own judgment of right vs. wrong often makes things worse: it may reject correct answers or accept wrong ones, and multi-agent debate without a strong evaluator does not reliably beat Self-Consistency.
  • LeCo (Learning from Correctness) has the model assign confidence to each reasoning step, find low-confidence wrong steps, then regenerate the rest using “high-confidence earlier steps” as fixed premises, gradually approaching the correct answer with less reliance on hand-crafted prompts and external tools.

4.4 How to Allocate Token Budget?

Together with test-time scaling:

  • For a given FLOP budget, smaller models can use more samples / more revisions; larger models can spend more reasoning steps per sample.
  • The “best strategy” depends heavily on model and task: some models are good at sequential revision; others suit parallel sampling + verifier selection better.

5 Big Picture: Bitter Lesson and Reasoning Design

The notes close with Sutton’s Bitter Lesson:

  • Prefer methods that scale with compute over encoding human knowledge in hand-crafted rules and prompts;
  • CoT, Self-Consistency, Verifiers, Tree-of-Thoughts, Test-Time Scaling all fit “more compute → better performance” and are central to reasoning-oriented LLM research.