LLM Basics: Pretraining, Prompting, Fine-tuning and Reinforcement Learning

I. Pretraining and Compute

1. What is Compute?

In plain terms: how much “computational work” it takes to train a large model.

Formula: $C \approx 6ND$

  • N: Number of model parameters (e.g., 7 billion parameters)
  • D: Number of tokens (chunks of text, roughly word pieces) fed to the model for training (e.g., 2 trillion tokens)
  • C: Total compute (measured in FLOPs—floating-point operations)

Example:

  • Llama 2 (7B) model:
    • 7B parameters × 2T tokens × 6 = $8.4 \times 10^{22}$ FLOPs
    • This is a huge number and requires many GPUs running for many days

Key point: We can increase compute in two ways:

  1. Increase parameter count N (make the model larger)
  2. Increase training data size D (feed more text)
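The arithmetic is easy to sanity-check in code; a minimal sketch in Python (the function name is our own):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via C ≈ 6 * N * D, in FLOPs."""
    return 6 * n_params * n_tokens

# Llama 2 (7B) trained on 2T tokens:
flops = training_flops(7e9, 2e12)
print(f"{flops:.1e}")  # → 8.4e+22
```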

2. Scaling Laws — The “Formula” for Predicting Model Performance

Core finding: There is a predictable relationship between model performance (lower loss is better) and compute.

How to use it?

Scenario 1: Predicting large model performance

  • You don’t want to directly train a 70B parameter model (too expensive)
  • You first train several smaller models (1B, 3B, 7B parameters)
  • Plot “compute” vs “loss” for each small model
  • Fit a straight line in log-log space to get the power law: $\text{loss} = a \times \text{compute}^{-b}$
  • Use this to predict: if you train a 70B model, what will the loss be?
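As a sketch, the fit is just linear regression in log-log space; the (compute, loss) pairs below are made-up illustrative numbers, not real measurements:

```python
import math

# Hypothetical small-model runs: (training compute in FLOPs, final loss).
points = [(1e20, 3.2), (3e20, 2.9), (1e21, 2.6), (3e21, 2.35)]

xs = [math.log(c) for c, _ in points]
ys = [math.log(loss) for _, loss in points]
n = len(points)
mx, my = sum(xs) / n, sum(ys) / n

# Ordinary least squares on the log-log points.
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
a, b = math.exp(my - slope * mx), -slope   # loss = a * compute^(-b)

# Extrapolate to a much larger training run:
predicted = a * (1e23) ** (-b)
print(round(predicted, 2))
```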

Scenario 2: Choosing the optimal configuration (Chinchilla paper contribution)

Question: Given a fixed compute budget (e.g., $\approx 6 \times 10^{23}$ FLOPs), how should you allocate it?

  • Option A: Train a large model (e.g., 280B parameters, like Gopher) but feed less data (300B tokens)
  • Option B: Train a medium model (e.g., 70B parameters, like Chinchilla) but feed more data (1.4T tokens)

Chinchilla’s answer:

  • Experiments with scaling laws showed: Option B is better
  • Conclusion: Model size and data size should be “balanced”; don’t only scale up the model

In practice:

  • DeepSeek used scaling laws to choose optimal batch size (how much data per training step) and learning rate
  • This avoids blind trial-and-error and saves money and time

II. Prompting — Change the Input, Not the Model

1. What is Prompting?

Essence: Use carefully designed “input text” to get the model to do what you want.

Example:

  • Task: Sentiment classification (decide if a movie review is positive or negative)
  • Without prompting: You’d need to train a separate classifier
  • With prompting: Just ask the model:
Please classify this review as 'positive' or 'negative':
"This movie is amazing!"

Model answer: positive


2. Three Steps of Prompting

Step 1: Fill the template (Prompt Template)

Plug your input into a “template”:

Template: "Translate this to French: [x]"
Input x: "Hello"
→ Prompt: "Translate this to French: Hello"

Step 2: Predict the answer (Answer Prediction)

Feed the prompt to the model; the model generates the response:

Prompt: "Translate this to French: Hello"
→ Model output: "Bonjour"

Step 3: Post-processing

Extract the information you need from the model output:

  • Formatting: Render results as tables, JSON, etc.
  • Keyword extraction: e.g., extract “positive” from “The answer is positive”
  • Mapping: Map “fantastic, great, awesome” to the “Positive” category
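Put together, the three steps might look like this sketch, where `call_model` is a hypothetical stand-in for a real LLM API call:

```python
import re

def call_model(prompt: str) -> str:
    """Stand-in for an LLM API call; returns a canned response."""
    return "The answer is positive"

# Step 1: fill the template
template = "Please classify this review as 'positive' or 'negative': {x}"
prompt = template.format(x="This movie is amazing!")

# Step 2: predict the answer
raw = call_model(prompt)

# Step 3: post-processing: extract the label keyword
match = re.search(r"\b(positive|negative)\b", raw.lower())
label = match.group(1) if match else "unknown"
print(label)  # → positive
```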

3. Few-shot Prompting

Definition: Include a few examples in the prompt so the model “understands” what you want.

Example:

Instruction: Classify movie reviews as 'positive' or 'negative'.

Examples:
Input: I really don't like this movie.
Output: negative

Input: This movie is great!
Output: positive

Now classify this:
Input: This movie is a banger.
Output: ???

The model infers from the examples that the answer is positive.
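Assembling such a prompt programmatically is mechanical; a sketch (the `build_prompt` helper is our own):

```python
def build_prompt(instruction: str, examples: list, query: str) -> str:
    """Assemble a few-shot prompt from (input, output) example pairs."""
    parts = [instruction, "", "Examples:"]
    for inp, out in examples:
        parts += [f"Input: {inp}", f"Output: {out}", ""]
    parts += ["Now classify this:", f"Input: {query}", "Output:"]
    return "\n".join(parts)

prompt = build_prompt(
    "Classify movie reviews as 'positive' or 'negative'.",
    [("I really don't like this movie.", "negative"),
     ("This movie is great!", "positive")],
    "This movie is a banger.",
)
print(prompt)
```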


4. “Strange” Few-shot Phenomena

Phenomenon 1: Sometimes omitting answers works better

  • Finding: On some tasks, giving only inputs (no outputs) works better
  • Example:
# Full examples (input + output)
Accuracy: 75%

# Input only
Accuracy: 80%
  • Reason: The model may be “retrieving” the task rather than learning the pattern

Phenomenon 2: More examples can hurt

  • Experiment: On some tasks, 4 examples work best; 10 examples hurt performance
  • Reason: Too many examples can “confuse” the model

Phenomenon 3: Very sensitive to example order

  • Experiment:
    • Order A (positive, negative, positive, negative): 85% accuracy
    • Order B (positive, positive, negative, negative): 50% accuracy (near random)
  • Takeaways:
    • Label balance: Balance positive and negative examples
    • Label coverage: In multi-class tasks, cover all classes
    • Example order: Different orderings can change performance a lot

5. Prompt Engineering — How to Design Good Prompts?

Manual design tips

Principle 1: Format should match the model’s training format

  • If the model was trained with chat format (system, user, assistant), use chat format
  • If it was plain text completion, use plain text

Principle 2: Instructions should be clear and specific

  • ❌ Bad: Explain prompt engineering. Keep it short.
  • ✅ Better: Use 2-3 sentences to explain prompt engineering to a high school student.

Automatic optimization methods

Method 1: Use an LLM to generate prompts

  • Have GPT-4 write a better prompt for you
  • Example (math task):
    • Hand-written prompt: Let's think step by step. (71.8% accuracy)
    • LLM-generated: Take a deep breath and work on this problem step-by-step. (80.2% accuracy)

Method 2: Prompt Tuning

  • Don’t optimize “words”; optimize the “embeddings (vector representations)” of the prompt
  • Freeze model parameters; only train the prompt vectors

Method 3: Prefix Tuning

  • Prepend trainable “prefix” vectors to the keys and values in each Transformer attention layer
  • More flexible than prompt tuning

6. Advanced Prompting Techniques

Chain-of-Thought (CoT)

Idea: Have the model “explain its reasoning” before giving the final answer.

Example:

Question: Roger has 5 tennis balls. He buys 2 more. How many does he have now?

# Without CoT
Answer: 7

# With CoT
Answer: Let's think step by step.
- Roger starts with 5 balls
- He buys 2 more balls
- 5 + 2 = 7
So Roger has 7 balls.

Effect: Large gains on complex reasoning (math, logic) tasks.

Zero-shot CoT: You don’t even need examples; adding Let's think step by step. often triggers step-by-step reasoning.


Program-aided Language Models (PAL)

Idea: Have the model generate code to compute the answer instead of a direct numeric answer.

Example:

Question: A store has 23 apples. They sell 17. How many left?

# Direct answer (easy to get wrong)
Answer: 6

# PAL approach
Code:
apples = 23
sold = 17
remaining = apples - sold
print(remaining)

Execute code → Output: 6

Advantage: More accurate on numeric computation, especially for complex calculations.
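The execution step can be sketched in a few lines; here the `code` string stands in for real model output, and a production system would sandbox this `exec` call:

```python
import io
import contextlib

code = """
apples = 23
sold = 17
remaining = apples - sold
print(remaining)
"""

buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    exec(code, {})            # run the model-generated program
answer = buf.getvalue().strip()
print(answer)  # → 6
```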


Self-Ask — Decomposing questions

Idea: Break a complex question into sub-questions and query a search engine step by step.

Example:

Question: Who was the president when the iPhone was released?

Sub-question 1: When was the iPhone released?
→ Search → Answer: 2007

Sub-question 2: Who was the US president in 2007?
→ Search → Answer: George W. Bush

Final Answer: George W. Bush

Prompt Chains

Idea: Chain multiple model calls together.

Example:

Step 1: Use LLM to extract key information
Input: "I want to book a flight to Paris next Monday"
→ Output: {"destination": "Paris", "date": "2026-03-10"}

Step 2: Call flight API
Query API with extracted info → Get flight options

Step 3: Use LLM to generate reply
Input: Flight options
→ Output: "I found 3 flights to Paris on March 10th..."
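The chain above can be sketched as three plain functions; `extract_request`, `search_flights`, and `compose_reply` are hypothetical stubs standing in for the LLM and API calls:

```python
def extract_request(utterance: str) -> dict:
    # Step 1: in practice an LLM call parses the request; hard-coded here.
    return {"destination": "Paris", "date": "2026-03-10"}

def search_flights(query: dict) -> list:
    # Step 2: in practice a flight API; stubbed with fake options.
    return [{"flight": "A1"}, {"flight": "B2"}, {"flight": "C3"}]

def compose_reply(options: list) -> str:
    # Step 3: in practice an LLM call phrases the reply; templated here.
    return f"I found {len(options)} flights to Paris on March 10th..."

query = extract_request("I want to book a flight to Paris next Monday")
options = search_flights(query)
reply = compose_reply(options)
print(reply)  # → I found 3 flights to Paris on March 10th...
```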

III. Fine-tuning — Changing Model Parameters

1. What is Standard Fine-tuning?

Flow:

  1. Take a pretrained model (e.g., GPT-3)
  2. Continue training on your task data
  3. Model parameters are updated to fit your task better

Formula: $\min_\theta \sum_{(x,y) \in D} -\log p_\theta(y|x)$

  • $x$: input (e.g., paper text)
  • $y$: output (e.g., summary)
  • Goal: Maximize the probability of generating the correct output
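As a worked example of the objective (with made-up model probabilities):

```python
import math

# Each entry: (input x, target y, model probability p_theta(y|x)).
dataset = [
    ("paper text A", "summary A", 0.80),
    ("paper text B", "summary B", 0.50),
]

# The fine-tuning loss sums -log p over the dataset; training lowers it
# by pushing each p toward 1.
loss = sum(-math.log(p) for _, _, p in dataset)
print(round(loss, 4))  # → 0.9163
```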

2. Effects of Fine-tuning

Benefits:

  • Data efficient: Start from a pretrained model; only need a small amount of task-specific data
  • Strong performance: Can surpass general models on the target task

Drawbacks (distribution narrowing):

  • The model becomes “specialized”; generalization drops
  • Symptoms:
    • A summarization model can’t translate anymore
    • The model enforces a specific format (the one seen during training)
    • Few-shot ability disappears (can’t learn new tasks from a few examples)

3. Instruction Tuning — Making the Model “Versatile”

Core idea: Don’t train on one task; train on many tasks, each in an “instruction” format.

Data format:

{
  "instruction": "Translate this sentence to French",
  "input": "Hello, how are you?",
  "output": "Bonjour, comment ça va?"
}

{
  "instruction": "Classify the sentiment",
  "input": "This movie is terrible!",
  "output": "negative"
}

Key finding (FLAN paper):

  • Trained on 62 NLP tasks (translation, classification, QA, etc.)
  • At test time, on unseen new tasks, the model still does well
  • Conclusion: Instruction tuning teaches the model to “understand task instructions”

4. Where Does Instruction Data Come From?

Method 1: Adapt existing datasets (FLAN)

  • Take existing datasets (e.g., translation data)
  • Use templates to form instructions: "Translate this to French: [input]"

Method 2: Human-written (SuperNaturalInstructions)

  • Crowdsourced: 1,600 tasks, with human-written instructions and examples per task

Method 3: Model-generated (Self-Instruct)

  • Use GPT-3 to generate 50,000+ instruction examples
  • Flow:
  1. Give the model a few seed instructions
  2. Have the model generate new instructions
  3. Have the model generate inputs and outputs for each instruction
  4. Train the model on this generated data

5. Chat Tuning

Goal: Train a “chatbot”.

Data format:

[
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What's the capital of France?"},
  {"role": "assistant", "content": "The capital of France is Paris."},
  {"role": "user", "content": "What's the population?"},
  {"role": "assistant", "content": "Paris has about 2.1 million people."}
]

System prompt example (Claude 3.5):

- The assistant is Claude, created by Anthropic.
- Current date: Nov 22, 2024
- Knowledge cutoff: April 2024
- When solving math problems, think step by step
- If request is harmful, politely decline and suggest alternatives
- Use Markdown formatting for clear responses

6. Knowledge Distillation — A “Strong Teacher” Teaches a “Weak Student”

Core idea: Use a strong model (teacher, e.g., GPT-4) to train a small model (student, e.g., 7B).

Token-level distillation

  • Goal: Student learns the teacher’s “probability distribution”
  • Formula: $\min KL(q(y|x) \| p_\theta(y|x))$
    • $q(y|x)$: teacher’s output distribution
    • $p_\theta(y|x)$: student’s output distribution
  • Effect: Student gets “soft labels” (not just right/wrong, but probabilities)
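A toy version of the token-level objective, over a made-up three-token vocabulary:

```python
import math

teacher_q = [0.7, 0.2, 0.1]   # teacher's next-token distribution (soft labels)
student_p = [0.5, 0.3, 0.2]   # student's next-token distribution

# KL(q || p): what the student minimizes at this position.
kl = sum(q * math.log(q / p) for q, p in zip(teacher_q, student_p))
print(round(kl, 4))  # → 0.0851
```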

Sequence-level distillation

  • Goal: Student is trained on data generated by the teacher
  • Flow:
  1. Use GPT-4 to generate many high-quality answers
  2. Fine-tune the small model on this data
  • Examples:
    • Alpaca: GPT-3.5 generated 52k instruction examples; used to train a 7B model
    • Vicuna: Real ChatGPT conversations from ShareGPT; used to train a 13B model

7. Efficient Fine-tuning (Save Money and VRAM)

Problem: Full-parameter fine-tuning is too expensive

  • Example: Fine-tuning a 65B model (16-bit) needs:
    • Parameters: 130 GB
    • Gradients: 130 GB
    • Optimizer state: 260 GB
    • Total: 520 GB VRAM!
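A back-of-envelope check of those numbers (assuming 2 bytes per 16-bit value and, as above, optimizer state at twice the parameter memory):

```python
n = 65e9                           # 65B parameters

params_gb = n * 2 / 1e9            # 16-bit weights: 2 bytes each
grads_gb = n * 2 / 1e9             # 16-bit gradients
optimizer_gb = n * 4 / 1e9         # e.g., two 16-bit Adam moment buffers
total_gb = params_gb + grads_gb + optimizer_gb
print(params_gb, optimizer_gb, total_gb)  # → 130.0 260.0 520.0
```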

Solution: LoRA (Low-Rank Adaptation)

Idea: Don’t train all parameters; only train a “small correction matrix”.

Formula: $W' = W + A \cdot B$

  • $W$: original weight matrix (frozen)
  • $A$: small matrix ($d \times r$), $r$ small (e.g., 8)
  • $B$: small matrix ($r \times d$)
  • $A \cdot B$: low-rank matrix; far fewer parameters than $W$

Example:

  • Original $W$: 4096 × 4096 = 16M parameters
  • LoRA $A, B$: (4096 × 8) + (8 × 4096) = 65k parameters
  • About 250× fewer parameters!

After training: Add $A \cdot B$ to $W$ to get the new model; no extra cost at inference.
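The parameter arithmetic from the example checks out exactly:

```python
d, r = 4096, 8

full = d * d              # parameters in the original W (4096 × 4096)
lora = d * r + r * d      # parameters in A (d × r) plus B (r × d)
print(full, lora, full // lora)  # → 16777216 65536 256
```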


HydraLoRA (multi-task variant)

Problem: One LoRA for multiple tasks can lead to “task interference”.

Solution:

  • Shared A matrix: Captures what’s common across tasks
  • Multiple B matrices: One B per task for task-specific behavior
  • Effect: Fewer parameters and better multi-task performance

IV. Reinforcement Learning

Why Isn’t Fine-tuning Enough?

Problem 1: Task Mismatch

Language model objective: Predict “the next token most likely to appear”

p(probable response | prompt)

What we actually want:

  • Is the answer helpful?
  • Is the answer safe (non-toxic)?
  • Does the code pass tests (correct)?

Mismatch:

  • “Most likely” ≠ “most useful”
  • Many answers on the web are wrong or toxic, but the model learns from them

Problem 2: Data Mismatch

Issues with training data:

  • Reddit: Lots of toxic, aggressive content
  • GitHub code: Many snippets have bugs
  • Web text: Lots of incorrect information

Data we lack:

  • High-quality reasoning (chain-of-thought)
  • Perfect answers for all questions
  • Fully correct code

Problem 3: Exposure Bias

Issue: During training, the model never sees “its own mistakes”.

Example:

  • Training: Every step is given the correct answer
Question: 5 + 3 = ?
Teacher forcing: "8" (correct)
  • Testing: The model generates on its own; it might be wrong from step one
Step 1: 5 + 3 = 9 (wrong!)
Step 2: 9 + 2 = ... (builds on error; errors compound)

Result: The model doesn’t know how to “recover”; small errors snowball.


How Does RL Help?

Core idea: Have the model generate answers, then use a “reward signal” to tell it what’s good or bad.

Three advantages of RL:

  1. Directly optimize the task objective
    • No longer “predict next token”; instead “maximize reward”
    • Reward can be: code passes tests, answer gets human upvote, dialogue goal achieved
  2. Data is generated by the model
    • No need for a fixed dataset
    • Model explores, errs, learns; generates its own training data
  3. Training sees errors
    • Model generates bad answer → gets low reward → learns to avoid
    • At test time, it can handle similar situations better

RL Flow (RLHF details)

1. Model generates answer
   Input: "Write a poem about AI"
   Output: [Generated poem]

2. Reward function scores it
   - Human rating: 7/10 (not bad)
   - Or automatic: toxicity check pass +1, relevance +0.8

3. Update model parameters
   - High-reward generations → increase probability
   - Low-reward generations → decrease probability

4. Repeat
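A toy caricature of step 3, reweighting two candidate answers by reward. This is a hand-rolled update for intuition only; real RLHF uses policy-gradient methods such as PPO, and all numbers here are made up:

```python
import math

probs = {"good answer": 0.4, "bad answer": 0.6}     # model's current preferences
rewards = {"good answer": 1.0, "bad answer": -1.0}  # reward signal

lr = 0.1  # step size
for ans in probs:
    probs[ans] *= math.exp(lr * rewards[ans])  # up-weight high reward

total = sum(probs.values())
probs = {a: p / total for a, p in probs.items()}   # renormalize
print(round(probs["good answer"], 3))
```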

V. Summary: Comparing the Three Core Approaches

| Method | What changes | Pros | Cons | When to use |
| --- | --- | --- | --- | --- |
| Prompting | Input text only | No training; fast and flexible | Limited performance; fragile | Quick prototypes, general use |
| Fine-tuning | Model parameters | Strong performance; data efficient | Needs data & compute; weaker generalization | Domain- or task-specific needs |
| RL | Parameters, optimized via reward | Directly optimizes the objective; handles complex tasks | Unstable training; reward design is hard | Interactive, multi-step tasks |

Quick Reference

Prompting tips

  • ✅ Clear, specific instructions
  • ✅ Format matches model training
  • ✅ Few-shot: balance labels and watch order
  • ✅ CoT: add Let's think step by step
  • ✅ PAL: have the model generate code to compute

Fine-tuning tips

  • ✅ Instruction tuning: multi-task generalization
  • ✅ Knowledge distillation: strong model teaches weak model
  • ✅ LoRA: save VRAM (often 100–250× fewer parameters)
  • ✅ Watch out for distribution narrowing

RL tips

  • ✅ Design a good reward function
  • ✅ Use for tasks that need interactive feedback
  • ✅ RLHF: combine with human feedback