Tool Agents: Empowering LLMs to Use Tools and Explore Environments
Relying solely on the parametric knowledge acquired during pre-training, Large Language Models (LLMs) struggle to answer real-time queries, perform complex mathematical calculations, or interact with external systems. This is where we need to upgrade LLMs into Tool Agents—giving them the ability to call search engines, execute code, query APIs, and even interact with graphical user interfaces.
This post reviews the core paradigms of modern Tool Agents, from how tools are executed and how models learn to use them, to how agents understand and explore complex environments.
1. Tool Execution Paradigms
How exactly does an LLM execute a tool? The community has explored several innovative pathways:
1.1 Tool Tokens (ToolkenGPT)
Conventional tool usage requires the model to generate a JSON-formatted string. ToolkenGPT introduced a highly creative alternative: treating each tool as a special vocabulary token (a “Toolken”).
- Mechanism: Each tool is assigned a learned embedding (a “Toolken”) that is appended to the LLM’s output vocabulary head, so tools compete with ordinary tokens at decoding time.
- Execution: The model decodes normal tokens; once a Toolken is predicted, it switches to a “tool mode” to generate the required arguments. The tool’s result is then injected back into the context so reasoning can continue.
- Advantages: It requires only the lightweight learning of Toolken embeddings without fine-tuning the entire LLM, and it supports dynamically adding massive amounts of new tools simply by expanding the vocabulary.
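The core scoring step can be sketched in a few lines. This is a toy illustration, not the ToolkenGPT implementation: the vocabulary, tool names, and 2-d embeddings below are made-up placeholders, and a real system learns the tool embeddings while keeping the base LLM frozen.

```python
import math

VOCAB = ["the", "answer", "is"]          # regular tokens (toy)
TOOLS = ["<calculator>", "<search>"]     # "toolkens" appended to the vocabulary

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def next_token(hidden, word_emb, tool_emb):
    """Score the hidden state against word AND tool embeddings jointly."""
    logits = [sum(h * w for h, w in zip(hidden, e)) for e in word_emb + tool_emb]
    probs = softmax(logits)
    tokens = VOCAB + TOOLS
    return tokens[probs.index(max(probs))]

# Toy 2-d embeddings: this hidden state happens to point toward <calculator>.
word_emb = [[1.0, 0.0], [0.5, 0.1], [0.2, 0.3]]
tool_emb = [[0.0, 2.0], [-1.0, 0.5]]
hidden = [0.1, 1.5]

print(next_token(hidden, word_emb, tool_emb))  # -> <calculator>
```

When the toolken wins the argmax, decoding switches into tool mode; because tools are just extra rows in the output head, adding a new tool is a vocabulary expansion rather than a retraining job.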
1.2 Code Tags (CodeAct)
Pure text formats like JSON have severe limitations in expressiveness: they natively lack support for loops, conditionals, exception handling, and state persistence. Therefore, CodeAct proposes a bold shift: Python code itself is the most general, Turing-complete “universal tool.” Instead of calling discrete, isolated tools, the model writes a snippet of Python code and executes it in a secure sandbox. Upon receiving errors or results, the model can debug, revise, and re-execute, forming a closed loop of “Generate -> Run -> Observe -> Revise”.
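The closed loop above can be sketched as follows. This is a minimal illustration, assuming a stubbed “model” and using plain `exec` as a stand-in for the secure sandbox (real deployments isolate execution properly); `fake_model` and its two canned drafts are invented for the example.

```python
import contextlib
import io
import traceback

def run_in_sandbox(code: str):
    """Toy stand-in for a real sandbox: run code, capture stdout or the error."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return True, buf.getvalue()
    except Exception:
        return False, traceback.format_exc(limit=1)

def fake_model(observation):
    """Stub LLM: the first draft has a bug; after observing the error, it revises."""
    if observation is None:
        return "print(1 / 0)"            # buggy first attempt
    return "print(sum(range(10)))"       # revised after seeing ZeroDivisionError

# Generate -> Run -> Observe -> Revise
observation = None
for step in range(3):
    code = fake_model(observation)
    ok, observation = run_in_sandbox(code)
    if ok:
        break

print(observation.strip())  # -> 45
```

The error message is fed back as the observation, which is exactly what lets the model debug instead of failing silently — something a one-shot JSON call cannot do.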
2. How Do Models Learn to Use Tools?
2.1 Prompting: DocPrompting via Retrieval
Given the constantly growing and evolving landscape of code repositories and APIs, no model can memorize them all during training. DocPrompting offers a general-purpose Natural Language to Code (NL→Code) framework:
- Retriever: Given a user’s intent, it retrieves relevant API documentation snippets from a library.
- Generator: It feeds these documents into the prompt to help the model generate the correct tool-calling code. This is essentially a specialized RAG (Retrieval-Augmented Generation) pipeline for APIs.
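A minimal sketch of this retrieve-then-generate pipeline is below. The doc pool and the term-overlap scorer are toy placeholders; DocPrompting itself uses trained retrievers, but the plumbing is the same.

```python
# Toy documentation pool; a real system indexes full library/API docs.
DOC_POOL = [
    "os.makedirs(path, exist_ok=True): create a directory tree, ignoring existing dirs",
    "shutil.rmtree(path): delete an entire directory tree",
    "json.dumps(obj, indent=2): serialize obj to a JSON string",
]

def retrieve(query: str, k: int = 1):
    """Naive retriever: rank docs by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(DOC_POOL, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def build_prompt(query: str) -> str:
    """Prepend retrieved docs so the generator can ground its code in them."""
    docs = "\n".join(retrieve(query))
    return f"Relevant docs:\n{docs}\n\nTask: {query}\nCode:"

print(build_prompt("create a directory tree"))
```

Because the docs live outside the model, new or updated APIs are handled by refreshing the index — no retraining needed.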
2.2 Self-Learning: Toolformer
If we want to avoid writing lengthy prompts every time, can the model “teach itself” to use tools? Toolformer introduces an elegant self-supervised learning method:
- Sampling: Provide the model with a few tool-use examples. The model scans large text corpora and generates candidate API calls at positions where tools might be helpful (e.g., <API>Calculator(2+2)</API>).
- Execution: Actually call the API to obtain the result.
- Filtering (Core): Compare the next-token prediction loss in two scenarios:
- Loss without the tool result ($L^-$)
- Loss with the tool result ($L^+$)
- Only keep the call if $L^- - L^+ \ge \text{threshold}$, meaning the tool’s output significantly improved the model’s prediction.

Finally, the model is fine-tuned on this high-quality “text + tool call” augmented data.
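The filtering criterion is simple to state in code. The loss values and threshold below are made-up placeholders (in the paper, $L^-$ and $L^+$ are weighted next-token losses over the continuation); the sketch only shows the keep/drop decision.

```python
THRESHOLD = 0.5  # placeholder; the paper tunes this per API

candidates = [
    # (api_call, loss_without_result L-, loss_with_result L+)
    ("Calculator(2+2)",   2.1, 0.4),   # tool helps a lot  -> keep
    ("Search('weather')", 1.0, 0.9),   # negligible gain   -> drop
]

def keep(loss_without: float, loss_with: float) -> bool:
    """Keep the call only if the tool result reduces loss by at least THRESHOLD."""
    return loss_without - loss_with >= THRESHOLD

kept = [call for call, l_minus, l_plus in candidates if keep(l_minus, l_plus)]
print(kept)  # -> ['Calculator(2+2)']
```

The elegance is that the model’s own perplexity acts as the labeler: no human ever annotates which tool calls are useful.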
2.3 Tool Induction: When Tools are Unavailable
What if no ready-made toolset is provided? The TroVE framework enables LLMs to automatically induce reusable high-level functions from a stream of tasks. It builds compact, efficient, and automatically verifiable toolboxes without the need for additional training or human supervision.
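The verify-before-admitting idea behind such toolboxes can be sketched as below. Everything here is a toy stand-in — TroVE induces candidate functions with an LLM; the `percent` helper and the solved examples are invented for illustration — but the gate is the same: a candidate only enters the toolbox if it reproduces already-solved tasks.

```python
toolbox = {}  # name -> verified reusable function

def verify_and_add(name, fn, examples):
    """Admit fn into the toolbox only if it reproduces every solved example."""
    if all(fn(*args) == expected for args, expected in examples):
        toolbox[name] = fn
        return True
    return False

# Hypothetical candidate induced from several tasks that all computed percentages.
def percent(part, whole):
    return round(100 * part / whole, 1)

solved = [((1, 4), 25.0), ((3, 8), 37.5)]
verify_and_add("percent", percent, solved)
print(sorted(toolbox))  # -> ['percent']
```

Execution-based verification is what makes the toolbox trustworthy without any human supervision.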
3. Environment Representation and Exploration
Before an agent can use tools effectively, it must first understand the “environment” it inhabits.
3.1 Representing the Environment
- Text: e.g., ALFWorld translates embodied environments into pure text descriptions.
- Images: e.g., Touchdown performs spatial reasoning in visual street environments.
- Textual Web: e.g., WebArena abstracts the DOM trees of web pages into text sequences.
- Set-of-Mark (SoM): Tailored for visual models (like GPT-4V), this method overlays numbered tags (marks) on the interactive elements in an image (such as a webpage or screen). The model simply outputs the corresponding number to target an element, which substantially improves visual grounding in multi-modal LLMs.
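The bookkeeping behind SoM-style marking is straightforward. In the real method the numbers are drawn onto the screenshot itself; this sketch only shows the number-to-element mapping, with a made-up element list standing in for a parsed page.

```python
# Hypothetical interactive elements extracted from a page/screen.
elements = [
    {"role": "button", "name": "Submit"},
    {"role": "link",   "name": "Pricing"},
    {"role": "input",  "name": "Search box"},
]

# 1) Assign marks 1..N; these numbers would be rendered onto the image.
marks = {i + 1: el for i, el in enumerate(elements)}

# 2) The model answers with a bare mark number instead of coordinates.
def resolve(model_output: str) -> dict:
    return marks[int(model_output.strip())]

print(resolve(" 2 ")["name"])  # -> Pricing
```

Reducing the action space from pixel coordinates to a small set of integers is what makes the grounding problem tractable for the model.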
3.2 Environment Exploration and Memory
Models are not omniscient; they must learn on the fly.
- Environment-specific Prompts (SteP): Manually crafted prompts that provide navigation guidelines specific to an environment.
- Unsupervised Workflow Memory: The agent remembers successful workflows and automatically prompts itself with them in future tasks.
- Exploration & Relabeling (BAGEL): Similar to “curiosity” mechanisms in Reinforcement Learning, the model is rewarded for exploring unpredictable parts of the state space. BAGEL bootstraps agents by sampling instructions, exploring the environment, and then re-labeling those exploration trajectories with more accurate instructions.
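The workflow-memory idea above can be sketched with a simple store-and-recall loop. The task-signature function below is a naive placeholder (real systems induce workflows from trajectories and match tasks far more robustly); it only illustrates the mechanism of self-prompting with past successes.

```python
memory = {}  # task signature -> successful action trace

def signature(task: str) -> str:
    """Naive task signature: the first two words of the instruction (placeholder)."""
    return " ".join(task.lower().split()[:2])

def remember(task: str, actions: list) -> None:
    """After a successful episode, store its action trace."""
    memory[signature(task)] = actions

def recall(task: str) -> list:
    """On a new task, retrieve a matching workflow to inject into the prompt."""
    return memory.get(signature(task), [])

remember("book a flight to Paris",
         ["open site", "search flights", "pick cheapest", "pay"])

# A later, similar task retrieves the stored workflow as guidance.
print(recall("book a flight to Tokyo"))
```

The agent thus turns its own successes into reusable prompts, with no labels or extra training involved.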
4. Conclusion
Tool Agents are undergoing a paradigm shift from “rigid JSON API filling” to “writing Turing-complete code to interact with environments.” Whether it’s synthesizing data via self-play like Toolformer, or interpreting GUIs using visual grounding like Set-of-Mark, the boundaries of what an “armed” agent can do are constantly expanding. Coupled with Memory and Reasoning, proper Tools are what finally allow Language Agents to break out of the chatbox and become active executors in the digital world.