LLM Basics: Pretraining, Prompting, Fine-tuning and Reinforcement Learning
I. Pretraining and Compute
1. What is Compute?
Simple understanding: How much “computational work” is needed to train a large model.
Formula: $C \approx 6ND$
- N: Number of model parameters (e.g., 7 billion parameters)
- D: Number of tokens (words) fed to the model for training (e.g., 2 trillion tokens)
- C: Total compute (measured in FLOPs—floating-point operations)
Example:
- Llama 2 (7B) model:
- 7B parameters × 2T tokens × 6 = $8.4 \times 10^{22}$ FLOPs
- This is a huge number and requires many GPUs running for many days
Key point: We can increase compute in two ways:
- Increase parameter count N (make the model larger)
- Increase training data size D (feed more text)
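The formula is simple enough to apply directly; here is a minimal sketch using the Llama 2 numbers from the example above:

```python
# Rough training-compute estimate via C ≈ 6·N·D.
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6 * n_params * n_tokens

# Llama 2 7B trained on 2T tokens:
c = training_flops(7e9, 2e12)
print(f"{c:.1e}")  # → 8.4e+22
```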
2. Scaling Laws — The “Formula” for Predicting Model Performance
Core finding: There is a predictable relationship between model performance (lower loss is better) and compute.
How to use it?
Scenario 1: Predicting large model performance
- You don’t want to directly train a 70B parameter model (too expensive)
- You first train several smaller models (1B, 3B, 7B parameters)
- Plot “compute” vs “loss” for each small model
- Fit a linear regression in log-log space to get a power law: $\text{loss} = a \times \text{compute}^{-b}$
- Use this to predict: if you train a 70B model, what will the loss be?
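The fitting step can be sketched in plain Python; the (compute, loss) points below are made up for illustration:

```python
import math

# Fit loss = a * C^(-b) by linear regression in log-log space:
# log(loss) = log(a) - b * log(C).
def fit_power_law(compute, loss):
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    b = -slope                      # exponent of the power law
    a = math.exp(my + b * mx)       # scale factor
    return a, b

# Hypothetical (compute, loss) pairs from several small training runs:
compute = [1e20, 3e20, 1e21, 3e21]
loss = [2.9, 2.7, 2.5, 2.35]
a, b = fit_power_law(compute, loss)

# Extrapolate to a much larger compute budget:
predicted = a * (1e23) ** (-b)
```

The extrapolated value is the predicted loss of the big run you have not paid for yet.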
Scenario 2: Choosing the optimal configuration (Chinchilla paper contribution)
Question: Given a fixed compute budget (e.g., $10^{23}$ FLOPs), how should you allocate it?
- Option A: Train a large model (e.g., 50B parameters) but feed less data (1T tokens)
- Option B: Train a medium model (e.g., 6.7B parameters) but feed more data (1.5T tokens)
Chinchilla’s answer:
- Experiments with scaling laws showed: Option B is better
- Conclusion: Model size and data size should be “balanced”; don’t only scale up the model
In practice:
- DeepSeek used scaling laws to choose optimal batch size (how much data per training step) and learning rate
- This avoids blind trial-and-error and saves money and time
II. Prompting — Change the Input, Not the Model
1. What is Prompting?
Essence: Use carefully designed “input text” to get the model to do what you want.
Example:
- Task: Sentiment classification (decide if a movie review is positive or negative)
- Without prompting: You’d need to train a separate classifier
- With prompting: Just ask the model:
  Review: "This movie was fantastic!"
  Sentiment:
Model answer: positive
2. Three Steps of Prompting
Step 1: Fill the template (Prompt Template)
Plug your input into a “template”:
  Review: "[input]"
  Sentiment:
Step 2: Predict the answer (Answer Prediction)
Feed the prompt to the model; the model generates the response:
  Review: "This movie was fantastic!"
  Sentiment: positive
Step 3: Post-processing
Extract the information you need from the model output:
- Formatting: Render results as tables, JSON, etc.
- Keyword extraction: e.g., extract “positive” from “The answer is positive”
- Mapping: Map “fantastic, great, awesome” to the “Positive” category
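The three steps can be sketched end to end; `fake_model` stands in for a real LLM call, and the template and label map are illustrative:

```python
# Step 1: prompt template with a slot for the input.
TEMPLATE = 'Review: "{review}"\nSentiment:'

def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM API call.
    return " The answer is positive."

# Step 3 helper: map surface words to canonical labels.
LABEL_MAP = {"positive": "Positive", "fantastic": "Positive",
             "negative": "Negative", "terrible": "Negative"}

def classify(review: str) -> str:
    prompt = fake_model.__doc__ and TEMPLATE.format(review=review)  # Step 1: fill template
    raw = fake_model(prompt)                                        # Step 2: predict
    for word, label in LABEL_MAP.items():                           # Step 3: post-process
        if word in raw.lower():
            return label
    return "Unknown"

print(classify("This movie was fantastic!"))  # → Positive
```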
3. Few-shot Prompting
Definition: Include a few examples in the prompt so the model “understands” what you want.
Example:
  Review: "A total waste of time." Sentiment: negative
  Review: "Absolutely loved it!" Sentiment: positive
  Review: "This movie was fantastic!" Sentiment:
The model infers from the examples that the answer is positive.
4. “Strange” Few-shot Phenomena
Phenomenon 1: Sometimes omitting answers works better
- Finding: On some tasks, giving only inputs (no outputs) works better
- Example:
  Review: "A total waste of time."
  Review: "Absolutely loved it!"
  Review: "This movie was fantastic!" Sentiment:
- Reason: The model may be “retrieving” the task rather than learning the pattern
Phenomenon 2: More examples can hurt
- Experiment: On some tasks, 4 examples work best; 10 examples hurt performance
- Reason: Too many examples can “confuse” the model
Phenomenon 3: Very sensitive to example order
- Experiment:
- Order A (positive, negative, positive, negative): 85% accuracy
- Order B (positive, positive, negative, negative): 50% accuracy (near random)
- Takeaways:
- Label balance: Balance positive and negative examples
- Label coverage: In multi-class tasks, cover all classes
- Example order: Different orderings can change performance a lot
5. Prompt Engineering — How to Design Good Prompts?
Manual design tips
Principle 1: Format should match the model’s training format
- If the model was trained with chat format (system, user, assistant), use chat format
- If it was plain text completion, use plain text
Principle 2: Instructions should be clear and specific
- ❌ Bad: Explain prompt engineering. Keep it short.
- ✅ Better: Use 2-3 sentences to explain prompt engineering to a high school student.
Automatic optimization methods
Method 1: Use an LLM to generate prompts
- Have GPT-4 write a better prompt for you
- Example (math task):
  - Hand-written prompt: Let's think step by step. (71.8% accuracy)
  - LLM-generated prompt: Take a deep breath and work on this problem step-by-step. (80.2% accuracy)
Method 2: Prompt Tuning
- Don’t optimize “words”; optimize the “embeddings (vector representations)” of the prompt
- Freeze model parameters; only train the prompt vectors
Method 3: Prefix Tuning
- Optimize the key and value vectors in the Transformer attention layers
- More flexible than prompt tuning
6. Advanced Prompting Techniques
Chain-of-Thought (CoT)
Idea: Have the model “explain its reasoning” before giving the final answer.
Example:
  Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. How many tennis balls does he have now?
  A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.
Effect: Large gains on complex reasoning (math, logic) tasks.
Zero-shot CoT: You don’t even need examples; adding Let's think step by step. often triggers step-by-step reasoning.
Program-aided Language Models (PAL)
Idea: Have the model generate code to compute the answer instead of a direct numeric answer.
Example:
  Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. How many tennis balls does he have now?
  A: tennis_balls = 5
     bought_balls = 2 * 3
     answer = tennis_balls + bought_balls
Advantage: More accurate on numeric computation, especially for complex calculations.
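The PAL loop can be sketched as follows: the model (stubbed here as a hard-coded string) emits a small program, and the host executes it to get an exact answer:

```python
# PAL sketch: instead of asking the model for the number directly, it
# emits Python, and we execute that program to compute the answer.
generated = """
tennis_balls = 5       # Roger starts with 5 balls
bought_balls = 2 * 3   # 2 cans of 3 balls each
answer = tennis_balls + bought_balls
"""

namespace = {}
exec(generated, namespace)   # run the model-generated program
print(namespace["answer"])   # → 11
```

In a real system the generated code should of course be sandboxed before execution.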
Self-Ask — Decomposing questions
Idea: Break a complex question into sub-questions and query a search engine step by step.
Example:
  Question: Who was president of the U.S. when superconductivity was discovered?
  Follow-up: When was superconductivity discovered? → 1911
  Follow-up: Who was U.S. president in 1911? → William Howard Taft
  Final answer: William Howard Taft
Prompt Chains
Idea: Chain multiple model calls together.
Example:
  Call 1: "Summarize this paper: [paper text]" → summary
  Call 2: "Write an email describing this summary: [summary]" → email
III. Fine-tuning — Changing Model Parameters
1. What is Standard Fine-tuning?
Flow:
- Take a pretrained model (e.g., GPT-3)
- Continue training on your task data
- Model parameters are updated to fit your task better
Formula: $\min_\theta \sum_{(x,y) \in D} -\log p_\theta(y|x)$
- $x$: input (e.g., paper text)
- $y$: output (e.g., summary)
- Goal: Maximize the probability of generating the correct output
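The objective can be illustrated numerically: the loss for one example is the sum of per-token negative log-probabilities, summed over the dataset. The probabilities below are made up:

```python
import math

# Toy illustration of the fine-tuning loss: sum of -log p(y|x) over the
# dataset, where p(y|x) is the product of the per-token probabilities
# the model assigns to the reference output.
def sequence_nll(token_probs):
    """-log p(y|x) for one (x, y) pair, given per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)

dataset = [
    [0.9, 0.8, 0.95],   # model is fairly confident on this target
    [0.5, 0.4],         # less confident here, so higher loss
]
loss = sum(sequence_nll(probs) for probs in dataset)
```

Gradient descent on this quantity pushes the per-token probabilities of the reference outputs up, which is exactly "maximize the probability of generating the correct output".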
2. Effects of Fine-tuning
Benefits:
- Data efficient: Start from a pretrained model; only need a small amount of task-specific data
- Strong performance: Can surpass general models on the target task
Drawbacks (distribution narrowing):
- The model becomes “specialized”; generalization drops
- Symptoms:
- A summarization model can’t translate anymore
- The model enforces a specific format (the one seen during training)
- Few-shot ability disappears (can’t learn new tasks from a few examples)
3. Instruction Tuning — Making the Model “Versatile”
Core idea: Don’t train on one task; train on many tasks, each in an “instruction” format.
Data format:
  Instruction: "Translate the following sentence to French."
  Input: "Hello, how are you?"
  Output: "Bonjour, comment allez-vous ?"
Key finding (FLAN paper):
- Trained on 62 NLP tasks (translation, classification, QA, etc.)
- At test time, on unseen new tasks, the model still does well
- Conclusion: Instruction tuning teaches the model to “understand task instructions”
4. Where Does Instruction Data Come From?
Method 1: Adapt existing datasets (FLAN)
- Take existing datasets (e.g., translation data)
- Use templates to form instructions:
"Translate this to French: [input]"
Method 2: Human-written (SuperNaturalInstructions)
- Crowdsourced: 1,600 tasks, with human-written instructions and examples per task
Method 3: Model-generated (Self-Instruct)
- Use GPT-3 to generate 50,000+ instruction examples
- Flow:
- Give the model a few seed instructions
- Have the model generate new instructions
- Have the model generate inputs and outputs for each instruction
- Train the model on this generated data
5. Chat Tuning
Goal: Train a “chatbot”.
Data format:
  System: "You are a helpful assistant."
  User: "How do I sort a list in Python?"
  Assistant: "Use sorted(my_list), or my_list.sort() to sort in place."
System prompt example (Claude 3.5):
  (excerpt) The assistant is Claude, created by Anthropic. ...
6. Knowledge Distillation — Using a “Strong Teacher” to Teach a “Weak Student”
Core idea: Use a strong model (teacher, e.g., GPT-4) to train a small model (student, e.g., 7B).
Token-level distillation
- Goal: Student learns the teacher’s “probability distribution”
- Formula:
$\min KL(q(y|x) \| p_\theta(y|x))$
- $q(y|x)$: teacher’s output distribution
- $p_\theta(y|x)$: student’s output distribution
- Effect: Student gets “soft labels” (not just right/wrong, but probabilities)
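The token-level objective can be computed for a toy next-token distribution; the vocabulary and both distributions below are made up:

```python
import math

# Token-level distillation signal: KL(q || p) between the teacher's and
# student's next-token distributions over a toy 4-token vocabulary.
def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

teacher = [0.7, 0.2, 0.05, 0.05]   # soft labels: full probabilities
student = [0.4, 0.3, 0.2, 0.1]

print(kl_divergence(teacher, student))
```

Minimizing this over all positions pulls the student's distribution toward the teacher's, which is richer supervision than a single hard label per token.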
Sequence-level distillation
- Goal: Student is trained on data generated by the teacher
- Flow:
- Use GPT-4 to generate many high-quality answers
- Fine-tune the small model on this data
- Examples:
- Alpaca: GPT-3.5 generated 52k instruction examples; used to train a 7B model
- Vicuna: Real ChatGPT conversations from ShareGPT; used to train a 13B model
7. Efficient Fine-tuning (Save Money and VRAM)
Problem: Full-parameter fine-tuning is too expensive
- Example: Fine-tuning a 65B model (16-bit) needs:
- Parameters: 130 GB
- Gradients: 130 GB
- Optimizer state: 260 GB
- Total: 520 GB VRAM!
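The breakdown above is simple arithmetic (assuming an Adam-style optimizer whose state takes twice the parameter memory):

```python
# Back-of-envelope VRAM for full fine-tuning of a 65B model in 16-bit.
n_params = 65e9
bytes_per_param = 2                                   # fp16 / bf16

params_gb    = n_params * bytes_per_param / 1e9       # 130 GB
grads_gb     = params_gb                              # 130 GB (one grad per param)
optimizer_gb = 2 * params_gb                          # 260 GB (two moments per param)
total_gb     = params_gb + grads_gb + optimizer_gb    # 520 GB

print(total_gb)  # → 520.0
```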
Solution: LoRA (Low-Rank Adaptation)
Idea: Don’t train all parameters; only train a “small correction matrix”.
Formula: $W' = W + A \cdot B$
- $W$: original weight matrix (frozen)
- $A$: small matrix ($d \times r$), $r$ small (e.g., 8)
- $B$: small matrix ($r \times d$)
- $A \cdot B$: low-rank matrix; far fewer parameters than $W$
Example:
- Original $W$: 4096 × 4096 = 16M parameters
- LoRA $A, B$: (4096 × 8) + (8 × 4096) = 65k parameters
- About 250× fewer parameters!
After training: Add $A \cdot B$ to $W$ to get the new model; no extra cost at inference.
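The parameter accounting for the example above is easy to check:

```python
# LoRA bookkeeping for one 4096×4096 weight matrix with rank r = 8.
d, r = 4096, 8

full_params = d * d            # parameters in W (frozen)
lora_params = d * r + r * d    # parameters in A and B (trained)

print(full_params)                  # → 16777216
print(lora_params)                  # → 65536
print(full_params // lora_params)   # → 256  (~250× fewer)
```

The same ratio, 2·d·r versus d², holds for every weight matrix LoRA is applied to, so the overall savings depend only on the chosen rank.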
HydraLoRA (multi-task variant)
Problem: One LoRA for multiple tasks can lead to “task interference”.
Solution:
- Shared A matrix: Captures what’s common across tasks
- Multiple B matrices: One B per task for task-specific behavior
- Effect: Fewer parameters and better multi-task performance
IV. Reinforcement Learning
Why Isn’t Fine-tuning Enough?
Problem 1: Task Mismatch
Language model objective: Predict “the next token most likely to appear”
  "The cat sat on the ___" → predict "mat" (the highest-probability continuation)
What we actually want:
- Is the answer helpful?
- Is the answer safe (non-toxic)?
- Does the code pass tests (correct)?
Mismatch:
- “Most likely” ≠ “most useful”
- Many answers on the web are wrong or toxic, but the model learns from them
Problem 2: Data Mismatch
Issues with training data:
- Reddit: Lots of toxic, aggressive content
- GitHub code: Many snippets have bugs
- Web text: Lots of incorrect information
Data we lack:
- High-quality reasoning (chain-of-thought)
- Perfect answers for all questions
- Fully correct code
Problem 3: Exposure Bias
Issue: During training, the model never sees “its own mistakes”.
Example:
- Training: Every step is given the correct answer
  Step 1: given "The cat", predict "sat" (the correct prefix is always supplied)
  Step 2: given "The cat sat", predict "on" (still conditioned on the correct prefix)
- Testing: The model generates on its own; it might be wrong from step one
  Step 1: the model outputs a wrong first token
  Step 2: it must continue from its own wrong prefix, and the errors compound
Result: The model doesn’t know how to “recover”; small errors snowball.
How Does RL Help?
Core idea: Have the model generate answers, then use a “reward signal” to tell it what’s good or bad.
Three advantages of RL:
- Directly optimize the task objective
- No longer “predict next token”; instead “maximize reward”
- Reward can be: code passes tests, answer gets human upvote, dialogue goal achieved
- Data is generated by the model
- No need for a fixed dataset
- Model explores, errs, learns; generates its own training data
- Training sees errors
- Model generates bad answer → gets low reward → learns to avoid
- At test time, it can handle similar situations better
RL Flow (RLHF details)
  1. Supervised fine-tuning (SFT): fine-tune the base model on human-written demonstrations
  2. Reward model: collect human preference rankings over model outputs; train a model to score answers
  3. RL optimization: maximize the reward model's score (e.g., with PPO), with a KL penalty to stay close to the SFT model
V. Summary: Comparing the Three Core Approaches
| Method | What changes | Pros | Cons | When to use |
|---|---|---|---|---|
| Prompting | Input text only | No training, fast, flexible | Limited performance, fragile | Quick prototypes, general |
| Fine-tuning | Model parameters | Strong performance, data efficient | Needs data & compute, less generalization | Domain/task-specific |
| RL | Optimize via reward | Direct objective, complex tasks | Unstable training, reward design | Interactive, multi-step |
Quick Reference
Prompting tips
- ✅ Clear, specific instructions
- ✅ Format matches model training
- ✅ Few-shot: balance labels and watch order
- ✅ CoT: add Let's think step by step.
- ✅ PAL: have the model generate code to compute
Fine-tuning tips
- ✅ Instruction tuning: multi-task generalization
- ✅ Knowledge distillation: strong model teaches weak model
- ✅ LoRA: save VRAM (often 100–250× fewer parameters)
- ✅ Watch out for distribution narrowing
RL tips
- ✅ Design a good reward function
- ✅ Use for tasks that need interactive feedback
- ✅ RLHF: combine with human feedback