Deep Research: How LLMs Evolve into Full-Stack AI Scientists

With the recent launches of Deep Research features by OpenAI and Google, the capability boundaries of Large Language Models (LLMs) have been drastically expanded once again. Deep Research is not just a “chatbot that can browse the web”; it is a closed-loop autonomous workflow.

It accepts a complex research question, autonomously plans queries, acquires and filters evidence from large, heterogeneous data sources, maintains and revises a working memory, and ultimately synthesizes a long-form research report with explicit citations. This post breaks down the technical evolution of Deep Research and its core components (query planning, information acquisition, and memory management), along with the representative algorithms behind each.


1. What is Deep Research?

The evolution of Deep Research can be divided into three phases:

  1. Agentic Search: Specializes in finding the correct sources and extracting answers with minimal synthesis (e.g., multi-hop QA like HotpotQA). The model needs the ability to persistently browse and discern facts on the internet.
  2. Integrated Research: Moves beyond isolated facts to produce coherent, structured reports by integrating heterogeneous evidence through an iterative loop of “sub-question planning - retrieval - synthesis” (e.g., market analysis, complex itinerary planning).
  3. Full-stack AI Scientist: Advances beyond mere information aggregation. It can generate hypotheses, conduct experimental validation or ablation studies, critique existing claims, and propose novel perspectives (e.g., automated paper reviewing, or scientific equation discovery as benchmarked by LLM-SRBench).

Differences from Traditional RAG (Retrieval-Augmented Generation):

  • Flexible Interaction with the Digital World: Traditional RAG relies on static, pre-indexed corpora, whereas Deep Research actively interacts with dynamic environments (search engines, web APIs, code executors) through multi-step tool use.
  • Long-horizon Planning: Capable of autonomously planning, revising, and optimizing workflows, managing complex task contexts over extended periods.
  • Reliable Language Interfaces: Introduces verifiable mechanisms that align natural language outputs with grounded evidence, significantly reducing hallucinations in open-ended tasks.

2. Core Component 1: Query Planning

When faced with a complex problem, the first step in Deep Research is Query Planning: transforming a logically intricate question into a structured sequence of executable sub-queries (or sub-tasks).

2.1 Parallel Planning

Parallel planning rewrites or decomposes the original query into multiple sub-questions in a single pass and processes them simultaneously. It is efficient because the sub-queries run concurrently, but it has no iterative feedback mechanism and ignores logical dependencies between sub-queries.

  • Least-to-Most Prompting: Guides the LLM via few-shot examples to decompose a complex task into an ordered sequence of simpler, self-contained sub-queries.
  • Chain-of-Verification: Drafts an initial answer, generates a set of independent verification questions in a single pass, answers each one separately so that errors in the draft are not simply repeated, and then revises the answer against those checks.
  • DeepRetrieval: Abandons the traditional “retrieve-then-read” paradigm by training a small language model as a Rewriter, fine-tuning it via Reinforcement Learning (incorporating downstream metrics like Recall and NDCG@k).
  • MMOA-RAG: Treats query rewriting, document retrieval, selection, and generation as individual agents in a cooperative multi-agent system, using Multi-Agent PPO (MAPPO) to align all components toward a global reward (e.g., final answer accuracy).
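
To make the retrieval metrics mentioned for DeepRetrieval concrete, here is a minimal sketch of Recall@k and binary-relevance NDCG@k combined into one scalar reward. The function names and the mixing weight `w_recall` are assumptions for illustration; the actual reward design in the paper may differ.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def rewrite_reward(ranked_ids: Sequence[str], relevant: Set[str],
                   k: int = 10, w_recall: float = 0.5) -> float:
    """Hypothetical reward for a rewritten query: a weighted mix of both metrics."""
    return (w_recall * recall_at_k(ranked_ids, relevant, k)
            + (1.0 - w_recall) * ndcg_at_k(ranked_ids, relevant, k))
```

A rewriter trained this way is scored on what the retriever returns for its query, not on how closely the query matches a reference rewrite.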

2.2 Sequential Planning

Sequential planning decomposes the query over multiple iterative steps, where each round of decomposition builds upon the outputs of previous rounds. It suits tasks that require stepwise disambiguation, but overly long reasoning chains are expensive to compute and let early errors propagate.

  • LLatrieval: An iterative query planner. When retrieved documents fail verification, it asks the LLM to pinpoint missing knowledge and generate a new query, repeating until the context fully supports a verifiable answer.
  • DRAGIN: Dynamic retrieval based on attention mechanisms. It utilizes self-attention scores to select the most context-relevant tokens from the entire generation history to formulate a concise and focused query.
  • RAISE: Designed specifically for scientific reasoning. It sequentially decomposes scientific questions, generates logic-aware queries, and retrieves step-specific knowledge to drive planning.
  • Search-R1: Trained with reinforcement learning, it interleaves Chain-of-Thought (CoT) reasoning with search calls, deciding when and what to search. It applies “retrieved token masking” so that tokens copied from search results are excluded from the training loss, and it bases the reward solely on the final outcome.
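
The retrieved token masking idea can be sketched in a few lines. This is an illustration under stated assumptions, not Search-R1's training code: the `<information>` tag names, the choice to mask the tags themselves, and both function names are hypothetical.

```python
from typing import List


def retrieved_token_mask(tokens: List[str],
                         open_tag: str = "<information>",
                         close_tag: str = "</information>") -> List[int]:
    """Return 1 for tokens the policy generated and 0 for retrieved tokens.

    Assumption: retrieved text is wrapped in open_tag/close_tag, and the
    tags themselves are masked along with their content.
    """
    mask: List[int] = []
    inside = False
    for tok in tokens:
        if tok == open_tag:
            inside = True
            mask.append(0)
        elif tok == close_tag:
            mask.append(0)
            inside = False
        else:
            mask.append(0 if inside else 1)
    return mask


def masked_mean(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the loss over unmasked (model-generated) tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, mask)) / kept
```

With the mask applied, a large per-token loss on a copied passage contributes nothing to the update, so the policy is only shaped by what it wrote itself.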

2.3 Tree-Based Planning

Tree-based planning combines the strengths of parallel and sequential planning by structuring sub-queries as a Tree or Directed Acyclic Graph (DAG), and uses search algorithms such as Monte Carlo Tree Search to explore and prune candidate paths.

  • RAG-Star: Leverages Monte Carlo Tree Search (MCTS). It selects the most promising node using the UCT criterion, evaluates the expansion quality via a reward model, and back-propagates the score to grow a reasoning tree.
  • DeepSieve: Decomposes complex queries into a DAG. The LLM acts as a knowledge router, selecting data sources for each path and triggering re-routing or re-decomposition upon failure.
  • DeepRAG: Structures the decision-making process as a binary tree (retrieve vs. reason internally). It introduces a “Chain of Calibration” to evaluate the reliability of its own internal reasoning, triggering retrieval only when self-assessment reveals high uncertainty.
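
The select-then-back-propagate loop used by MCTS-style planners can be sketched with the standard UCT formula (mean reward plus an exploration bonus). The `Node` class, the exploration constant `c`, and the reward values are assumptions for illustration, not RAG-Star's actual implementation.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    query: str
    visits: int = 0
    total_reward: float = 0.0
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)


def uct_score(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """UCT = mean reward + c * sqrt(ln(parent visits) / child visits)."""
    if child.visits == 0:
        return math.inf  # always try unvisited children first
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def select_child(node: Node, c: float = 1.4) -> Node:
    """Pick the child with the highest UCT score."""
    return max(node.children, key=lambda ch: uct_score(ch, node.visits, c))


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Add the reward to the node and every ancestor up to the root."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent
```

In a planner of this kind, the reward passed to `backpropagate` would come from a reward model scoring the newly expanded sub-query and its answer.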

3. Core Component 2: Information Acquisition

Once the system knows what to search for, the next step is acquiring and filtering the information.

3.1 Retrieval Tools

Beyond traditional lexical retrieval (BM25), semantic retrieval (Dense Retrieval), and commercial web search (e.g., WebGPT, which issues its queries through the Bing search API), Multimodal Retrieval is becoming a frontier trend:

  • LayoutLM / Donut (Text-Aware Retrieval with Layout): LayoutLM integrates text, layout bounding boxes, and Faster R-CNN visual features; Donut is an OCR-free end-to-end framework that maps document images directly to JSON structures.
  • CLIP / BLIP (Visual Retrieval via Text-Image Similarity): CLIP achieves zero-shot transfer through contrastive learning on 400M image-text pairs; BLIP introduces the CapFilt (Captioning and Filtering) mechanism to bootstrap pre-training data quality.
  • ChartFormer (Structure-Aware Retrieval): Employs end-to-end instance segmentation to identify chart elements (bars, lines, legends) and uses question-guided deformable co-attention to focus on relevant chart regions.
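
For reference, the lexical baseline named at the start of this subsection, BM25, fits in a short function. This is a minimal sketch of the common Okapi BM25 variant with a smoothed IDF; the parameter defaults `k1=1.5` and `b=0.75` are conventional choices, and the input is assumed to be a non-empty, pre-tokenized corpus.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores.append(score)
    return scores
```

Dense and multimodal retrievers replace this term-matching score with a similarity between learned embeddings, but the ranking interface (query in, scored documents out) stays the same.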

3.2 Adaptive Retrieval (Retrieval Timing)

Every retrieval incurs computational overhead, and low-quality documents can mislead the model. Adaptive Retrieval aims to make the model “know what it doesn’t know,” triggering retrieval only when its internal knowledge is insufficient.

  • FLARE (Probabilistic Strategy): Iteratively generates a temporary next sentence and triggers retrieval if it contains low-probability (low-confidence) tokens, then regenerates.
  • Rowen (Consistency-based Strategy): Evaluates the semantic consistency of responses generated in different languages or across multiple passes; if inconsistencies arise, external retrieval is triggered to rectify hallucinations.
  • CtrlA (Internal States Probing): Extracts “Honesty” and “Confidence” feature directions from the LLM’s internal states, monitoring confidence levels during inference to trigger retrieval.
  • Self-RAG / Search-o1 (Verbalized Strategy): Self-RAG trains the model via teacher-student distillation to explicitly generate special reflection tokens (e.g., [Retrieve], [IsSup]); Search-o1 pauses when it detects a knowledge gap, triggers a search, and uses “Reason-in-Documents” to distill facts.
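
A FLARE-style confidence trigger can be sketched as follows. The three callables (`draft`, `retrieve`, `regenerate`), the function name, and the threshold value are hypothetical stand-ins for model and search calls; forming the query from the confident tokens only is one simple choice, not the only one.

```python
from typing import Callable, List, Tuple


def flare_step(draft: Callable[[], List[Tuple[str, float]]],
               retrieve: Callable[[str], List[str]],
               regenerate: Callable[[List[str]], str],
               threshold: float = 0.4) -> str:
    """Generate one sentence, retrieving only if the draft is uncertain.

    draft() returns (token, probability) pairs for a tentative sentence.
    """
    tentative = draft()
    if all(prob >= threshold for _, prob in tentative):
        # Every token is confident: accept the draft, no retrieval needed.
        return " ".join(tok for tok, _ in tentative)
    # Drop low-confidence tokens so they do not steer the search query.
    query = " ".join(tok for tok, prob in tentative if prob >= threshold)
    return regenerate(retrieve(query))
```

The threshold controls the cost trade-off: a higher value retrieves more often, a lower one trusts the model's internal knowledge more.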

3.3 Information Filtering

The internet is full of noise, and uncleaned content can easily induce hallucinations in LLMs.

  • Document Selection: Re-ranks candidate documents, retaining only the Top-k most helpful ones.
  • RECOMP / xRAG (Content Compression): RECOMP compresses long documents into textual summaries; xRAG achieves extreme context compression by projecting the entire document’s Dense Embedding into a single “Soft Token” fed directly into the LLM, bypassing long-text processing.
  • HtmlRAG (Rule-based Cleaning): Strips out semantically empty CSS and JavaScript code from web pages, and combines this with a two-stage block-tree pruning strategy to preserve structured HTML for the model.
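
The rule-based cleaning step can be approximated with Python's standard `html.parser`. This sketch covers only the first stage (dropping scripts, styles, and comments); it does not implement block-tree pruning, and discarding all tag attributes is an assumption of this example, not necessarily what HtmlRAG does.

```python
from html.parser import HTMLParser
from typing import List


class _Cleaner(HTMLParser):
    """Keeps tags and visible text; drops script/style content and comments."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out: List[str] = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(f"<{tag}>")  # attributes are dropped (assumption)

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep text verbatim unless it is whitespace-only or inside a skipped tag.
        if self.skip_depth == 0 and data.strip():
            self.out.append(data)

    # handle_comment is not overridden, so HTML comments are discarded.


def clean_html(html: str) -> str:
    cleaner = _Cleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)
```

Keeping the tag skeleton instead of flattening to plain text preserves structure such as headings, lists, and tables for the model.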

4. Core Component 3: Memory Management

During a lengthy research process (which may take hours), the Agent must maintain a coherent context. This relies on robust memory management mechanisms.

4.1 Memory Consolidation and Indexing

  • MemoryBank (Unstructured Consolidation): Stores past conversations, summarized events, and user portraits to build AI companions like SiliconFriend.
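  • HippoRAG (Structured Consolidation): Inspired by neurobiology, it uses the LLM (playing the role of the Neocortex) to extract knowledge graph triplets, then runs Personalized PageRank, a random walk with restarts seeded at the query's entities, over that graph (playing the role of the Hippocampus) to achieve pattern completion.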
  • A-Mem (Graph-based Indexing): Adopts a graph index where the Agent autonomously links related memory nodes, progressively growing a flexible access network.
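
The graph walk behind HippoRAG's retrieval is Personalized PageRank, which can be sketched with plain power iteration. The toy graph, the node names, and the parameter defaults are illustrative assumptions; every neighbor is assumed to appear as a key in `graph`.

```python
from typing import Dict, List


def personalized_pagerank(graph: Dict[str, List[str]],
                          seeds: Dict[str, float],
                          damping: float = 0.85,
                          iters: int = 50) -> Dict[str, float]:
    """Power iteration for a random walk that restarts at the seed nodes."""
    nodes = list(graph)
    total = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            neighbors = graph[n]
            if not neighbors:
                # Dangling node: send its mass back to the seed distribution.
                for m in nodes:
                    nxt[m] += damping * rank[n] * restart[m]
                continue
            share = damping * rank[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        rank = nxt
    return rank
```

Seeding the walk at the entities found in the query concentrates probability mass on nodes connected to all of them, which is how a bridging entity surfaces even though the query never mentions it.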

4.2 Memory Updating and Forgetting

  • Mem0 (Non-Parametric Updating): Modifies memories in an external database via explicit Tool Calls (e.g., DELETE, ADD), offering high flexibility.
  • Memory-R1 (Parametric Updating): Trains a memory manager via Reinforcement Learning to consolidate facts using a single UPDATE instruction, avoiding the memory fragmentation issues common in vanilla managers.
  • MemGPT (Passive Forgetting): Forgetting is a critical mechanism for filtering noise. MemGPT employs a FIFO (First-In-First-Out) queue, simulating the natural decay of human memory by automatically moving the oldest messages out of the main context into long-term storage.
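
The FIFO eviction idea can be sketched with a `deque`. The class name, the whitespace word count standing in for a real tokenizer, and the plain list used as long-term storage are all simplifying assumptions, not MemGPT's actual implementation.

```python
from collections import deque
from typing import Deque, List


class FifoContext:
    """Main context with a token budget; overflow moves to an archive."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.main: Deque[str] = deque()   # messages visible to the model
        self.archive: List[str] = []      # stand-in for long-term storage

    @staticmethod
    def _cost(message: str) -> int:
        # Assumption: whitespace word count approximates the token count.
        return len(message.split())

    def used(self) -> int:
        return sum(self._cost(m) for m in self.main)

    def append(self, message: str) -> None:
        self.main.append(message)
        # Evict the oldest messages first until the budget is met again,
        # always keeping at least the newest message in context.
        while self.used() > self.max_tokens and len(self.main) > 1:
            self.archive.append(self.main.popleft())
```

Evicted messages are moved, not deleted, so a later retrieval step can still bring them back into context when they become relevant.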

5. Conclusion

From simple Agentic Search to Full-stack AI Scientists capable of autonomously proposing and validating hypotheses, Deep Research is redefining the upper limits of AI productivity.

It breaks down complex problems through Query Planning, acquires information using Adaptive and Multimodal Retrieval, sustains long-horizon reasoning via Dynamic Memory Management, and ultimately generates reports with explicit citations. As reinforcement learning (as in Search-R1) becomes more tightly integrated into search and reasoning, Deep Research is poised to become an indispensable tool for researchers and professionals alike.