Coding Agents: Evaluation, Frameworks, and Code LLMs

AI-assisted coding is transforming how we write software. From chatbots like ChatGPT to in-IDE copilots (GitHub Copilot, Cursor, Trae), and now fully autonomous agents that resolve issues independently (SWE-Agent, Devin), AI is shifting from mere Code Generation to full-scale Software Engineering automation. Software engineering involves far more than writing code: it demands execution, debugging, vulnerability repair, tracing, documentation, and secure, trustworthy deployment.

This post breaks down the core technologies and challenges of Coding Agents across three key dimensions: Evaluation, Agentic Frameworks, and Code LLMs.


1. Evaluating Coding Agents

How do we objectively measure an LLM’s ability to engineer software? This remains a highly debated topic.

1.1 Early Benchmarks and Data Contamination

Early datasets like HumanEval and MBPP (2021) focused on completing Python functions given docstrings and unit tests. However, Data Contamination is a fatal flaw: researchers found that 65.4% of instances in the MBPP test set already existed on sites like GeeksforGeeks. If a model memorized these during pre-training, the benchmark is no longer a valid test of reasoning.
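
Contamination checks like this are often implemented as token n-gram overlap between benchmark instances and pre-training documents. A minimal sketch (the 13-gram window is a common but arbitrary choice, not the method used in the MBPP study):

```python
def ngrams(text: str, n: int = 13) -> set:
    """Token-level n-grams; 13-gram overlap is a common contamination heuristic."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_solution: str, corpus_doc: str, n: int = 13) -> bool:
    """Flag a benchmark instance if any n-gram also appears in a corpus document."""
    return bool(ngrams(benchmark_solution, n) & ngrams(corpus_doc, n))
```

Real pipelines normalize whitespace and identifiers first, since trivial edits otherwise defeat exact n-gram matching.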

1.2 Finer-Grained and Anti-Contamination Benchmarks

To reflect true capabilities, new benchmarks evaluate deeper reasoning:

  • MHPP: Evaluates models across 7 fine-grained challenge types, including Commonsense, Distraction (lengthy redundant info), Redefinition, and Cornercases. This demands rigorous text and code joint reasoning.
  • LiveCodeBench: Built specifically to thwart contamination. It collects competitive programming problems (LeetCode, AtCoder) released after a specific date. Notably, testing showed a stark performance drop in some open-source models on problems published after their training cutoff (e.g., September 2023), whereas GPT models remained stable, exposing the reality of data memorization.
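
LiveCodeBench's anti-contamination idea reduces to filtering problems by release date relative to a model's training cutoff. A minimal sketch with hypothetical problem records:

```python
from datetime import date

# Hypothetical problem records; LiveCodeBench tags each problem with its
# public release date on the source platform.
problems = [
    {"id": "contest-3005", "released": date(2024, 1, 7)},
    {"id": "contest-2100", "released": date(2021, 9, 12)},
]

def uncontaminated(problems: list[dict], cutoff: date) -> list[dict]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["released"] > cutoff]

eval_set = uncontaminated(problems, cutoff=date(2023, 9, 1))
```

Re-running the same model against successive cutoff windows is what exposes the memorization cliff described above.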

1.3 From Algorithms to Real Repositories: SWE-Bench

SWE-Bench pushed evaluation into real-world software engineering, requiring the model to resolve real GitHub issues, navigate complex architectures, and generate pull requests. OpenAI subsequently introduced SWE-bench Verified, a human-validated subset of 500 problems. Auditing the problems models commonly failed, they found that 59.4% of those failures were actually caused by flawed test cases rejecting correct submissions. They also noted that frontier models are so heavily trained on open-source data that they can often reproduce the exact human-written bug fixes verbatim, highlighting that even “real-world” tasks are somewhat contaminated.

1.4 Metrics and Broader Domains

Besides the classic Pass@K (generating K samples and counting a task as solved if at least one passes the tests), evaluations have expanded:

  • Semantic Overlap: When code isn’t easily executable, metrics like CodeBLEU and CodeBERTScore (evaluating syntax and semantic flow) assess similarity to human logic.
  • Data Science: Interactive, incremental code generation in Jupyter Notebooks.
  • Multimodal (Design2Code): Translating UI images into frontend code, measuring high-level visual similarity and low-level element recall.
  • Code Efficiency (EffiBench / Mercury): Studies reveal that ChatGPT-generated code can take up to 3.12x the execution time of human-written solutions. Writing correct code is not enough; writing optimal code is the new frontier.
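
Pass@K is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per task, count the c correct ones, and estimate the probability that a random draw of k samples contains at least one success:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k), i.e. one minus
    the probability that all k drawn samples come from the n-c failures."""
    if n - c < k:
        return 1.0  # too few failures to fill a draw of k; success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Naively sampling exactly k generations and checking for a pass gives a high-variance estimate; this closed form averages over all size-k subsets of the n generations.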

2. Agentic Frameworks

A capable Coding Agent must understand repository structures, modify files, run code, and debug. The workflow generally falls into two paradigms: Dynamic Control and Procedural Control.

2.1 Dynamic Control: SWE-Agent and ReAct

SWE-Agent epitomizes dynamic control, employing the ReAct (Reason + Act) loop. The LLM dynamically generates a Thought and decides which Tool to call next, appending the environment’s observation to its trajectory until success or timeout.

Its core innovation is the Agent-Computer Interface (ACI). Handing an LLM a raw Linux shell is error-prone; the ACI instead encapsulates purpose-built tool commands, making the agent’s trajectories more compact and its error feedback more informative, while providing guardrails against cascading errors.
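
The ReAct loop itself is only a few lines of control flow. In this sketch, `llm` and the tool registry are hypothetical stand-ins, not SWE-Agent’s actual interface:

```python
def react_loop(llm, tools: dict, task: str, max_steps: int = 30):
    """Minimal ReAct-style loop: the model alternates Thought/Action,
    the environment appends an Observation, until submit or timeout."""
    trajectory = f"Task: {task}\n"
    for _ in range(max_steps):
        thought, action, args = llm(trajectory)   # model emits Thought + Action
        if action == "submit":
            return args                           # final answer / patch
        observation = tools[action](args)         # ACI executes the chosen tool
        trajectory += (f"Thought: {thought}\n"
                       f"Action: {action}({args})\n"
                       f"Observation: {observation}\n")
    return None                                   # timeout without success
```

The key property is that the trajectory grows monotonically: every observation the environment returns becomes context for the next decision.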

2.2 Procedural Control: Agentless

Agentless argues that if a workflow is relatively standard, letting the LLM dynamically guess the next step leads to derailment. Instead, it uses a hard-coded Python control flow, using the LLM merely as a processing node:

  1. Localization: Hierarchically narrowing the search from files, to classes/functions, down to specific lines.
  2. Repair: Directly generating a patch at the localized site.
  3. Validation: Testing the patch against the repository’s test suite.

By removing the LLM’s agency over the control flow, Agentless eliminates tool-use syntax errors and prevents trajectories from spiraling out of control due to initial mistakes.
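
The three stages above can be sketched as plain Python control flow, with the LLM called only as a subroutine. Every function name here is illustrative, not the project’s real API:

```python
def agentless_pipeline(llm, repo, issue: str):
    """Hard-coded control flow in the style of Agentless: the LLM ranks and
    generates at each stage but never chooses what happens next."""
    files = llm.rank_files(repo.file_tree(), issue)            # 1. localize files
    elements = llm.rank_elements(repo.skeleton(files), issue)  # ...then classes/functions
    lines = llm.rank_lines(repo.snippets(elements), issue)     # ...then concrete lines
    for patch in llm.generate_patches(lines, issue):           # 2. repair (sampled patches)
        if repo.run_tests(patch):                              # 3. validate
            return patch
    return None
```

Because the loop structure is fixed, a bad LLM output at one stage can only degrade that stage’s ranking, never redirect the whole workflow.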

2.3 Other Notable Frameworks

  • CodeAct: Instead of calling discrete tools, the model directly writes and executes Bash/Python scripts, leading to faster resolutions and higher success rates.
  • OpenHands: An open platform defining an “event stream” for coding and execution, turning SWE-agent style actions into callable “skills.”
  • AutoCodeRover: Combines powerful search tools with procedural control, separating phases into distinct trajectories and instructions.
  • Passerine (Google): Google’s internal agent framework. It uses dynamic ReAct-style loops but replaces command lines with Google’s actual build infrastructure (like Bazel and internal code search).

Core Trade-off: Do complex bugs require dynamic flexibility, or does procedural rigidity prevent catastrophic failures? Regarding test-time compute: when a trajectory fails, is it better to extend it and try to fix the error (SWE-Agent), or abandon it and start fresh? These questions are actively shaping the design space.


3. Code LLMs & File Localization

3.1 Training Pipelines and Prompt Strategy

The modern Code LLM pipeline: Code Corpus (e.g., The Stack v2) -> Next Token Prediction -> Post-training (SFT/DPO/RL).

  • Code Infilling: Developers frequently edit code in the middle of a file. Models like InCoder are explicitly trained on infilling tasks to master this context.
  • Copilot Prompting Strategy: Auto-completion requires immense background work. Copilot extensions extract the current document and cursor position, identify the language, grab the 20 most recently accessed related files, and include imported files and path metadata to assemble the final prompt context.
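
Infilling is typically trained and served with sentinel tokens in a prefix-suffix-middle layout. The sentinel names below are illustrative; the exact tokens vary by model family:

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt: the model sees the code before
    and after the cursor, then generates the missing middle span."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = fim_prompt(
    prefix="def circle_area(r):\n    return ",
    suffix="  # uses math.pi\n",
)
# The model is trained to continue after <fim_middle> with the span that
# belongs between prefix and suffix (e.g. "math.pi * r * r").
```

Training on this transformation of ordinary left-to-right data is what lets a model condition on code *after* the cursor, which pure next-token prediction cannot do.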

3.2 Localization

Before fixing a bug, the agent must find the correct file in repositories with thousands of files.

  1. Tool Search: Providing grep or file-search tools for the agent to use dynamically (SWE-Agent).
  2. Repository Maps: Aider uses a repomap—a tree-structured map of the codebase containing function/class signatures—to give the LLM an overarching view.
  3. Hierarchical Search: Agentless uses procedural narrowing.
  4. Retrieval-Augmented Generation (RAG): Retrieving similar code and documentation, though timing the RAG trigger within an agent loop remains a challenge.
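
A repomap-style outline can be approximated with Python’s `ast` module. This is a simplified sketch, not Aider’s actual implementation (which parses many languages and ranks entries by relevance):

```python
import ast

def repo_map(files: dict[str, str]) -> str:
    """Condense a {path: source} mapping into a signature-only outline,
    giving an LLM a cheap overview of a large codebase."""
    lines = []
    for path, source in sorted(files.items()):
        lines.append(path)
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                lines.append(f"  def {node.name}({args})")
            elif isinstance(node, ast.ClassDef):
                lines.append(f"  class {node.name}")
    return "\n".join(lines)
```

The payoff is token economy: signatures for thousands of files fit in a context window where full sources never could, and the agent requests full text only for files it decides matter.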

4. Vulnerabilities and Safety

Giving models execution privileges carries inherent risks.

  • Accidental Harm: An agent might accidentally push garbage code to the main branch. Or, instructed to “make tests pass,” it might decide the easiest route is to delete the test suite entirely.
  • Intentional Exploitation: Agents possess code analysis capabilities that hackers could exploit to write malware.

Industry Mitigations:

  1. Sandboxing: Executing all actions in isolated environments (e.g., OpenHands runs everything strictly inside Docker sandboxes).
  2. Credentialing: Adhering to the principle of least privilege, severely restricting GitHub access tokens and cloud secrets.
  3. Post-hoc Auditing: Using separate security analyzer models or static tools to intercept potentially harmful behavior before it commits.
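
A sandboxed runner in the spirit of mitigation 1 can be sketched as a restricted `docker run` invocation, assuming Docker is available on the host. The specific flags and image are illustrative choices, not any framework’s actual configuration:

```python
import subprocess

def sandbox_argv(command: str, image: str = "python:3.12-slim") -> list[str]:
    """Docker invocation for a throwaway, locked-down execution environment:
    no network, read-only root filesystem, capped memory, auto-removed."""
    return ["docker", "run", "--rm", "--network=none", "--read-only",
            "--memory=512m", image, "sh", "-c", command]

def run_sandboxed(command: str, timeout: int = 60) -> str:
    """Execute an agent-proposed command inside the sandbox and return stdout."""
    result = subprocess.run(sandbox_argv(command), capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout
```

Combined with a scratch token that can only push to a fork (mitigation 2), even a derailed agent cannot touch the real main branch or exfiltrate secrets.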

As the underlying models advance in logical reasoning and Agent-Computer Interfaces refine further, AI will evolve from a helpful Copilot into an autonomous engineer capable of reliably delivering complex software.