When LLMs Learn Memory, Reasoning, and Planning: The Three Core Capabilities of Language Agents

If you ask what the hottest AI topic of 2024 was, the answer always circles back to “Agent.” Bill Gates calls it the biggest revolution in computing since the command line and GUI; Sam Altman declared “2025 will be the year of the Agent.” At the same time, some dismiss it: “Today’s Agent is just a thin wrapper around an LLM—far from useful.”

Who is right? This article helps you understand from the ground up: what a language agent is, how it remembers, reasons, and plans—and what it can do today and what still falls short.


What Is an “Agent,” and What Is a Language Agent?

The simplest definition comes from Russell & Norvig’s AI: A Modern Approach: “An entity that perceives its environment through sensors and acts through actuators.” By that definition, DQN on Atari, AlphaGo, and Siri are all agents.

Early text agents (e.g. ELIZA in 1966) used hand-written rules to simulate a therapist and only worked in a tiny domain. Around 2015, neural agents like LSTM-DQN could play text games with RL but still needed heavy task-specific training and generalized poorly.

What we call a Language Agent today is a new kind of agent that uses language as the main medium—both reasoning in language and talking to the outside world in language. The difference from a “plain LLM” is: an LLM is a powerful text predictor, while a Language Agent embeds that ability in a full loop of perception–action–reflection.

Three Generations of Agents

| | Logical Agent | Neural Agent | Language Agent |
|---|---|---|---|
| Expressiveness | Low: limited by formal languages | Medium: range of small NNs | High: almost everything sayable about the world |
| Reasoning | Logical deduction, rigid | Parametric inference, stochastic, implicit | Linguistic inference, fuzzy but flexible |
| Adaptability | Low: manual knowledge bases | Medium: data-driven but sample-inefficient | High: strong LLM priors + linguistic generalization |

Worth noting: inside the LLM there is only one mechanism—next-token generation—doing perception, intuitive judgment, and symbolic reasoning that in humans belong to different systems. Unlike humans, who instantly grasp emotion and causality from a scene, the LLM must generate word by word—each step is forward computation and costly.


First Core Capability: Memory

“Memory is everything. Without it we have nothing.” — Eric Kandel (Nobel Prize, neuroscience of memory)

Why Do Agents Need Memory?

Current LLM “memory” is just the context window—a working memory of the last few thousand tokens. Real tasks span long time horizons and large event streams; the window cannot hold them all. Even when it can, the model struggles to retrieve “where I put that key three days ago” from thousands of tokens.

We need long-term memory (LTM): persistent across sessions, readable and writable on demand.

Three Kinds of Long-Term Memory

Inspired by cognitive science, LTM can be split by what is stored:

1. Episodic Memory
“What happened to me.”

In Stanford’s Generative Agents, 25 LLM-driven NPCs in a simulated town each have a growing event stream (“8:00 wake up, 8:30 breakfast, 9:00 write paper…”). Because the context cannot hold all history, the authors score each memory by recency, importance (emotional weight), and relevance (semantic similarity to the current query) and inject the top memories into context. NPCs can then naturally “recall” things from days ago.
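The three-signal ranking above can be sketched in a few lines. This is a toy version with an assumed exponential recency decay and made-up weights, not the paper's exact implementation:

```python
import math

def memory_score(hours_since_access, importance, relevance,
                 decay=0.995, weights=(1.0, 1.0, 1.0)):
    """Combine the three signals used to rank memories.

    hours_since_access: time since the memory was last retrieved
    importance: model-assigned emotional weight, scaled to [0, 1]
    relevance: semantic similarity to the current query, in [0, 1]
    """
    recency = decay ** hours_since_access  # decays toward 0 over time
    w_rec, w_imp, w_rel = weights
    return w_rec * recency + w_imp * importance + w_rel * relevance

def top_memories(memories, k=3):
    """memories: list of (text, hours_since_access, importance, relevance)."""
    ranked = sorted(memories, key=lambda m: memory_score(*m[1:]), reverse=True)
    return [text for text, *_ in ranked[:k]]
```

An old but important and relevant event (losing a key days ago) can then outrank a recent but trivial one, which is exactly the "natural recall" behavior described above.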

2. Semantic Memory
“What I know”—abstraction and summarization over events.

In Generative Agents, the agent periodically reflects: the LLM turns scattered events into higher-level insights (“Klaus is very passionate about his research”) and stores them. These abstractions often matter more for later decisions than raw events.

3. Procedural Memory
“What I can do”—executable skills.

Voyager (an LLM that plays Minecraft autonomously) stores each learned skill as JavaScript in a skill library. For a new task, it retrieves the most relevant skills by embedding and composes or edits them. This is procedural memory carried by code.
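A minimal sketch of such a skill library, with a toy bag-of-words embedding standing in for the learned encoder Voyager actually uses (all names here are illustrative):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a real system would use a learned encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SkillLibrary:
    """Procedural memory: code snippets keyed by an embedded description."""
    def __init__(self):
        self.skills = []  # (description, code) pairs

    def add(self, description, code):
        self.skills.append((description, code))

    def retrieve(self, task, k=2):
        """Return the k skills whose descriptions best match the task."""
        q = embed(task)
        ranked = sorted(self.skills, key=lambda s: cosine(q, embed(s[0])),
                        reverse=True)
        return [code for _, code in ranked[:k]]
```

The retrieved snippets are then placed in context for the LLM to compose or edit, which is the "write once, reuse forever" property that makes this memory procedural.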

RAG Is the De Facto Standard Today—but Not Enough

Retrieval-Augmented Generation (RAG) is the most common LTM pattern today: treat an external corpus (Wikipedia, docs) as read-only long-term memory and retrieve relevant passages into context at query time.

RAG has two fundamental limits:

  • It only reads others’ memories—the corpus may not match the agent’s own experience or style.
  • Standard retrieval fails on multi-hop questions—e.g. “Which Stanford professor studies Alzheimer’s neuroscience?” requires linking two entities; vector similarity finds “relevant paragraphs,” not cross-entity reasoning.

HippoRAG: Long-Term Memory Inspired by the Brain

NeurIPS 2024’s HippoRAG is inspired by hippocampal indexing theory and maps three brain regions to the system:

  • Neocortex (LLM): Perception and reasoning; uses Open IE to extract triples (“Thomas studies Alzheimer’s,” “Stanford employs Thomas”).
  • Perirhinal cortex (retrieval encoder): Encodes entities and relations as vectors—a bridge between neocortex and hippocampus.
  • Hippocampus (knowledge graph + Personalized PageRank): Builds a graph from triples; random walks on the graph do pattern completion—from “Stanford” and “Alzheimer’s” walking to “Thomas.”
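The pattern-completion step can be illustrated with a self-contained Personalized PageRank over a toy triple graph. This is a simplification (plain power iteration, undirected edges) rather than HippoRAG's actual pipeline:

```python
def personalized_pagerank(edges, seeds, alpha=0.85, iters=50):
    """Power iteration for Personalized PageRank.

    edges: (src, dst) pairs, treated as undirected here
    seeds: query entities the random walk teleports back to
    """
    nodes = sorted({n for e in edges for n in e})
    adj = {n: [] for n in nodes}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    # teleport distribution concentrated on the seed entities
    p = {n: (1 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(p)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * p[n] for n in nodes}
        for n in nodes:
            share = alpha * rank[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        rank = nxt
    return rank

# Toy graph from extracted triples (illustrative entities)
edges = [("Stanford", "Thomas"), ("Thomas", "Alzheimer's"),
         ("Stanford", "Alice"), ("Alice", "genomics")]
scores = personalized_pagerank(edges, seeds={"Stanford", "Alzheimer's"})
```

Because "Thomas" is connected to both seed entities while "Alice" touches only one, the walk concentrates mass on him: the cross-entity link that flat vector similarity misses.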

Results: On multi-hop QA, HippoRAG Recall@5 reaches 72.9%, ~7 points above strong dense retrieval (ColBERT 65.6%); with IRCoT multi-step retrieval it reaches 78.2%.

A Unified View: Learning Is Writing

Memory is not only external databases. Broadly, “the agent learned something” can be written in many forms:

  • Fine-tuning weights: Burn knowledge into parameters (strongest but most expensive to change).
  • Prompt templates: Codify experience into reusable system prompts.
  • Agent codebase updates: Skill libraries (as in Voyager).
  • Append-only event logs: Episodic streams like Generative Agents.

Second Core Capability: Reasoning

“THINK.” — Thomas J. Watson, IBM founder

What Is “Reasoning” for an Agent?

For an LLM, reasoning is intermediate generation—tokens produced before the final answer that form an “inner monologue.” For an agent, that generation is internal action—it does not change the world, only the internal context.

A key philosophical question: Does reasoning help, and why?

Reasoning Helps the Agent Act

Imagine a cooking agent that finds the salt empty. Without reasoning, it maps state directly to action. With reasoning, it might think: “This dish needs saltiness; salt is gone; soy sauce can substitute; soy sauce is in the right cabinet…”—then opens the cabinet.

Reasoning makes observation-to-action mappings generalizable and explainable instead of brittle reflexes.

Action Improves Reasoning

Pure reasoning has blind spots. Ask a model trained through 2021 “Who is the UK prime minister now?” and it will answer confidently but wrongly. Hallucination in a closed loop cannot self-correct.

Action updates reasoning: search, DB lookup, tool use give fresh facts and correct the premises of reasoning in real time.

ReAct: Interleaving Reasoning and Action

ReAct (Reason + Act), ICLR 2023, formalizes this: the prompt alternates Thought and Action steps in a loop.

Example: “Besides Apple Remote, what device can control the program it was originally designed to interact with?”

  • CoT only: The model infers “Apple Remote was for Apple TV” and says iPhone/iPad can control Apple TV—wrong (it misremembers; the remote was for Front Row, not Apple TV).
  • ReAct:
    • Thought 1: I need to search what program Apple Remote was designed for.
    • Action 1: Search[Apple Remote]
    • Obs 1: “…originally designed to control Front Row…”
    • Thought 2: It was for Front Row. Search Front Row.
    • … eventually Obs 3: Front Row can be controlled by Apple Remote or keyboard function keys.
    • Thought 4: Answer—keyboard function keys. Correct.

ReAct’s essence: reasoning decides what to search next (avoids blind action); search updates the knowledge base (avoids stacking hallucinations).
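The loop itself is short; the interesting part is what goes into context. A minimal sketch, where `llm` and the `tools` mapping are placeholders for a real model and real search APIs:

```python
def react_loop(llm, tools, question, max_steps=8):
    """Minimal ReAct: alternate LLM Thought/Action with tool observations.

    llm(prompt) -> next Thought/Action text
    tools: maps action names (e.g. "Search") to callables
    """
    context = f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        move = llm(context)  # e.g. "Thought: ...\nAction: Search[Apple Remote]"
        context += move + "\n"
        if move.strip().startswith("Finish["):
            return move.strip()[len("Finish["):-1]  # final answer
        if "Action:" in move:
            name, arg = parse_action(move)          # ("Search", "Apple Remote")
            obs = tools[name](arg)                  # act on the environment
            context += f"Observation {step}: {obs}\n"
    return None  # ran out of steps

def parse_action(move):
    """Parse 'Action: Name[argument]' from the model's output."""
    line = [l for l in move.splitlines() if l.startswith("Action:")][0]
    name, arg = line[len("Action:"):].strip().split("[", 1)
    return name, arg.rstrip("]")
```

Each observation is appended to the growing context, so the next Thought reasons over fresh facts instead of stale parametric memory.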

Formal View: Expanded Action Space

Classically, action space $A$ is defined by the environment (click, type, scroll). ReAct extends it to:

$$ \hat{A} = A \cup \mathcal{L} $$

where $\mathcal{L}$ is all possible language sequences. Each “Thought” $\hat{a}_t \in \mathcal{L}$ only updates internal context.

That makes the reasoning action space infinite, but LLMs bring strong priors from human text to navigate it—why reasoning matters so much in the LLM era.


Third Core Capability: Planning

“If you fail to plan, you are planning to fail.” — Benjamin Franklin

What Is Planning for a Language Agent?

Classic AI planning (e.g. robot motion) often has formal goals, finite discrete actions, and algorithms (e.g. A*) that can guarantee optimality.

Language Agent planning is different:

  • Goals are fuzzy natural language—e.g. “Plan a 5-day Seattle–California trip, $6000 budget, pet-friendly lodging.”
  • Action space is open—what buttons, APIs, or searches exist varies per site.
  • Success is hard to verify automatically—unlike chess, “done” often needs human judgment.

Three Planning Paradigms and Trade-offs

(a) Reactive
Decide each step on the fly with no explicit lookahead.

  • Pros: Fast, simple.
  • Cons: Short-sighted, local optima, errors compound.

(b) Tree search with real interactions
Branch in the real environment (like MCTS), explore, backtrack, pick the best path.

  • Pros: Systematic exploration.
  • Cons: Many web actions are irreversible (e.g. confirm purchase); privacy/safety risk; every probe is slow.

(c) Model-based planning
Use an LLM to predict “if I do this, what happens next,” search in that virtual model, then execute the chosen path in the real world.

  • Pros: Explore without real execution—faster, safer, more systematic.
  • Cons: Needs a good world model that predicts state transitions accurately.

WebDreamer: The LLM as a World Model for the Web

Who can be that world model? The web is too dynamic to hand-code.

TMLR 2025 WebDreamer argues: The LLM itself can model the web.

Pretraining has seen huge amounts of screenshots, demos, and HTML; it can predict fairly well “what happens if I click this.”

Workflow:

  • Stage I – Simulation: For the task (“find cheapest 512GB drive in data storage”), the LLM predicts follow-on states for candidate action paths and scores them (e.g. “Office products” 0.4, “Electronics” → “Computer accessories” 0.8).
  • Stage II – Execution: Execute the highest-scoring path in a real browser.

This replaces expensive real-world trial-and-error with fast “imagination” in the LLM, improving efficiency and safety.
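A minimal sketch of the two-stage loop, with `simulate`, `score`, and `execute` as hypothetical callables standing in for LLM calls and a browser driver:

```python
def plan_with_world_model(task, candidate_actions, simulate, score, execute):
    """Two-stage model-based planning in the spirit described above.

    simulate(task, action) -> predicted next-state description (LLM call)
    score(task, state)     -> progress estimate in [0, 1]      (LLM call)
    execute(action)        -> perform the winner in a real browser
    """
    # Stage I: imagine each action's outcome and score it; no real clicks
    scored = []
    for action in candidate_actions:
        predicted = simulate(task, action)
        scored.append((score(task, predicted), action))
    # Stage II: execute only the most promising action for real
    best_score, best_action = max(scored)
    execute(best_action)
    return best_action, best_score
```

Irreversible or slow steps happen only in Stage II, after the cheap imagined rollouts have pruned the alternatives.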

Tools for Planning: From Search to Travel Assistants

Real planning needs many tools. In ICML 2024’s TravelPlanner, a query like “Seattle to California, Nov 6–10, $6000, all-suite, pet-friendly” requires:

  • CitySearch and FlightSearch (no direct SEA–SFO flight; find a $120 route via LAX),
  • AccommodationSearch with filters,
  • RestaurantSearch, AttractionSearch, etc.,
  • plus commonsense constraints (sensible routes, variety, no schedule conflicts).

This is planning plus tool use, not a single API call.
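One piece of such a pipeline, a hard-constraint checker over a candidate itinerary, can be sketched like this (the plan schema is invented for illustration):

```python
def check_plan(plan, budget, required_filters=("pet-friendly", "all-suite")):
    """Validate a candidate itinerary against hard constraints.

    plan: list of dicts like {"kind": "flight", "cost": 120, "tags": [...]}
    Returns (ok, reasons); a real agent would feed the failure
    reasons back into the LLM's context and replan.
    """
    reasons = []
    total = sum(item["cost"] for item in plan)
    if total > budget:
        reasons.append(f"over budget: ${total} > ${budget}")
    for item in plan:
        if item["kind"] == "lodging":
            missing = [f for f in required_filters
                       if f not in item.get("tags", [])]
            if missing:
                reasons.append(f"lodging missing {missing}")
    return (not reasons), reasons
```

TravelPlanner's finding is that today's agents fail far more often on exactly these verifiable constraints than on fluent-sounding prose, which is why an explicit checker helps.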

Putting It Together: A Panorama of Language Agents

Short-term memory (context) is the stage for reasoning—the agent thinks there, alternating Thought and Action in an internal chain.

Long-term memory (external store / weights) is the vault—RAG, HippoRAG, etc. inject history, knowledge, and skills when needed.

Planning organizes future action sequences: what to do next, in what order, and how to recover from failure. Strong models can be more reactive (one prompt per step); weaker ones benefit more from explicit tree search or world-model assist.

Beyond these three, Language Agents are growing in multimodal perception, tools, multi-agent collaboration, continual learning, safety, embodiment…


We Are Still at the Beginning

In the spirit of Sutton’s Bitter Lesson: don’t hard-code human-specific knowledge; find general methods that scale with compute. Memory, reasoning, and planning are evolving that way—fewer hand-written rules, more adaptive mechanisms.

From ELIZA to ReAct, from RAG to HippoRAG, from hand-written PDDL to WebDreamer’s LLM world model—that path spans decades, yet Language Agents have really taken off only in the last few years. As Gates suggests, this may be computing’s biggest revolution—and it is still immature, waiting for the next generation to fill in the map of those core capabilities.