Original: Anthropic Engineering · 21/02/2026
Summary
Our latest model, the upgraded Claude 3.5 Sonnet, achieved 49% on SWE-bench Verified. SWE-bench is an AI evaluation benchmark that assesses a model’s ability to complete real-world software engineering tasks.
Key Insights
“Our latest model, the upgraded Claude 3.5 Sonnet, achieved 49% on SWE-bench Verified.” — Highlighting the achievement of Claude 3.5 Sonnet on the SWE-bench Verified benchmark.
“SWE-bench is an AI evaluation benchmark that assesses a model’s ability to complete real-world software engineering tasks.” — Explaining what SWE-bench evaluates in AI models.
“The performance of an agent on SWE-bench can vary significantly based on this scaffolding.” — Discussing the importance of the software scaffolding around an AI model.
Full Article
Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet
Author: Anthropic Engineering
Published: 2026-02-21
Source: https://www.anthropic.com/engineering/swe-bench-sonnet
Published Jan 06, 2025

SWE-bench is an AI evaluation benchmark that assesses a model’s ability to complete real-world software engineering tasks. Our latest model, the upgraded Claude 3.5 Sonnet, achieved 49% on SWE-bench Verified, beating the previous state-of-the-art model’s 45%. This post explains the “agent” we built around the model, and is intended to help developers get the best possible performance out of Claude 3.5 Sonnet.

Specifically, SWE-bench tests how well a model can resolve GitHub issues from popular open-source Python repositories. For each task in the benchmark, the AI model is given a pre-configured Python environment and a checkout (a local working copy) of the repository from just before the issue was resolved. The model then needs to understand, modify, and test the code before submitting its proposed solution. Each solution is graded against the real unit tests from the pull request that closed the original GitHub issue. This tests whether the AI model was able to achieve the same functionality as the original human author of the PR.

SWE-bench doesn’t just evaluate the AI model in isolation, but rather an entire “agent” system. In this context, an “agent” refers to the combination of an AI model and the software scaffolding around it. This scaffolding is responsible for generating the prompts that go into the model, parsing the model’s output to take action, and managing the interaction loop where the result of the model’s previous action is incorporated into its next prompt. The performance of an agent on SWE-bench can vary significantly based on this scaffolding, even when using the same underlying AI model; a minimal sketch of such a loop appears after the list below.

There are many other benchmarks for the coding abilities of Large Language Models, but SWE-bench has gained popularity for several reasons:
- It uses real engineering tasks from actual projects, rather than competition- or interview-style questions;
- It is not yet saturated—there’s plenty of room for improvement. No model has yet crossed 50% completion on SWE-bench Verified (though the updated Claude 3.5 Sonnet is, at the time of writing, at 49%);
- It measures an entire “agent”, rather than a model in isolation. Open-source developers and startups have had great success optimizing scaffolding to greatly improve performance with the same underlying model.
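To make the distinction between the model and its scaffolding concrete, here is a minimal sketch of the kind of interaction loop such scaffolding runs, written against the Anthropic Python SDK. This is an illustration of the pattern, not the benchmark harness itself; `execute_tool` is a hypothetical dispatcher you would implement for your own tools.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def execute_tool(name: str, tool_input: dict) -> str:
    """Hypothetical dispatcher: run the named tool and return its output."""
    raise NotImplementedError  # e.g. route to a bash runner or a file editor


def run_agent(task_prompt: str, tools: list[dict], max_turns: int = 50) -> str:
    """Minimal agent loop: prompt the model, execute its tool calls,
    and feed each result back in until it stops asking for tools."""
    messages = [{"role": "user", "content": task_prompt}]
    for _ in range(max_turns):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # The model decided it is finished; return its final text.
            return "".join(b.text for b in response.content if b.type == "text")
        # Run every requested tool and hand the results back to the model.
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": execute_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    return "Stopped: turn limit reached."
```

Everything that makes one agent score differently from another on the same model lives in a loop like this: the prompt, the tool set, and how results are surfaced back to the model.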
Achieving state-of-the-art
Tool-Using Agent
Our design philosophy when creating the agent scaffold optimized for the updated Claude 3.5 Sonnet was to give as much control as possible to the language model itself and to keep the scaffolding minimal. The agent has a prompt, a Bash Tool for executing bash commands, and an Edit Tool for viewing and editing files and directories. We continue to sample until the model decides that it is finished, or until it exceeds its 200k context length. This scaffold allows the model to use its own judgment about how to pursue the problem, rather than being hardcoded into a particular pattern or workflow.

The prompt outlines a suggested approach for the model, but it is not overly long or detailed for this task. The model is free to choose how it moves from step to step, rather than having strict and discrete transitions. If you are not token-sensitive, it can help to explicitly encourage the model to produce a long response.

The Edit Tool’s str_replace command takes an old_str to replace with a new_str in the given file. The replacement will only occur if there is exactly one match of old_str. If there are more or fewer matches, the model is shown an appropriate error message for it to retry.
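To illustrate that exact-match rule, here is a minimal sketch of how such a replacement check could be implemented. This is an assumed implementation for illustration, not Anthropic’s actual Edit Tool code:

```python
def str_replace(path: str, old_str: str, new_str: str) -> str:
    """Replace old_str with new_str in the file at path, but only
    if old_str occurs exactly once; otherwise report an error."""
    with open(path) as f:
        text = f.read()
    count = text.count(old_str)
    if count == 0:
        # No match: the model should retry with a corrected old_str.
        return f"Error: old_str was not found in {path}."
    if count > 1:
        # Ambiguous match: the model should add more surrounding context.
        return f"Error: old_str matched {count} times in {path}; make it unique."
    with open(path, "w") as f:
        f.write(text.replace(old_str, new_str, 1))
    return f"Edited {path}: replaced exactly one occurrence."
```

Returning the error text as an observation, rather than raising an exception, is what lets the model read the failure and retry with a more specific old_str.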
The spec for our Edit Tool is shown below:
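The exact spec is not reproduced verbatim here; the following is an illustrative tool definition in the JSON-schema format that the Anthropic Messages API uses for tools. The command names mirror Anthropic’s published text-editor tool, but the descriptions are paraphrased:

```python
# Illustrative Edit Tool definition for the Anthropic Messages API.
# Commands follow the published text-editor tool (view / create /
# str_replace / insert / undo_edit); descriptions are paraphrased.
edit_tool = {
    "name": "str_replace_editor",
    "description": "View, create, and edit files and directories.",
    "input_schema": {
        "type": "object",
        "properties": {
            "command": {
                "type": "string",
                "enum": ["view", "create", "str_replace", "insert", "undo_edit"],
                "description": "The operation to run.",
            },
            "path": {
                "type": "string",
                "description": "Absolute path to the file or directory.",
            },
            "file_text": {
                "type": "string",
                "description": "Contents for the create command.",
            },
            "old_str": {
                "type": "string",
                "description": "Exact text to replace; must match exactly once.",
            },
            "new_str": {
                "type": "string",
                "description": "Replacement or inserted text.",
            },
            "insert_line": {
                "type": "integer",
                "description": "Line after which to insert, for the insert command.",
            },
            "view_range": {
                "type": "array",
                "items": {"type": "integer"},
                "description": "Optional [start, end] line range for view.",
            },
        },
        "required": ["command", "path"],
    },
}
```

A definition like this is passed in the tools list of each messages.create call, as in the loop sketch above.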
Results
In general, the upgraded Claude 3.5 Sonnet demonstrates stronger reasoning, coding, and mathematical abilities than our prior models and the previous state-of-the-art model. It also demonstrates improved agentic capabilities: the tools and scaffolding help put those improved abilities to their best use.

| Model | Claude 3.5 Sonnet (new) | Previous SOTA | Claude 3.5 Sonnet (old) | Claude 3 Opus |
|---|---|---|---|---|
| SWE-bench Verified score | 49% | 45% | 33% | 22% |
Examples of agent behavior
For running the benchmark, we used the SWE-Agent framework as a foundation for our agent code. In the logs below, we render the agent’s text output, tool calls, and tool responses as THOUGHT, ACTION, and OBSERVATION, even though we don’t constrain the model to a fixed ordering. The code blocks below walk through a typical case of Claude 3.5 Sonnet solving a SWE-bench problem. In this first block, you can see part of the initial prompt given to the model, with {pr_description} filled in with the real value from a SWE-bench task. Importantly, this task contains steps to reproduce the issue, which give the model a valuable starting point for its investigation.
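The original prompt is not reproduced verbatim here; the following is a representative sketch of that kind of task prompt, with {pr_description} and the repository path as placeholders:

```
I've uploaded a Python code repository in the directory /repo.
Consider the following PR description:

<pr_description>
{pr_description}
</pr_description>

Can you help me implement the necessary changes to the repository so that
the requirements in the <pr_description> are met? Make minimal changes to
non-test files. A suggested approach:
1. Explore the repo to familiarize yourself with its structure.
2. Create a script to reproduce the error and run it to confirm the issue.
3. Edit the source code to resolve the issue.
4. Rerun your reproduce script to confirm the fix.
5. Think about edge cases and make sure your fix handles them as well.
```

Note how the suggested steps nudge the model toward a reproduce-then-fix workflow without hard-coding that workflow into the scaffolding itself.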
Challenges
SWE-bench Verified is a powerful evaluation, but it’s also more complex to run than simple, single-turn evals. These are some of the challenges we faced in using it, challenges that other AI developers might also encounter.
- Duration and high token costs. The examples above are from a case that was successfully completed in 12 steps. However, many successful runs took hundreds of turns and more than 100k tokens for the model to resolve. The updated Claude 3.5 Sonnet is tenacious: it can often find its way around a problem given enough time, but that can be expensive.
- Grading. While inspecting failed tasks, we found cases where the model behaved correctly but the run failed anyway because of environment setup issues or install patches being applied twice. Resolving these system issues is crucial for getting an accurate picture of an AI agent’s performance.
- Hidden tests. Because the model cannot see the tests it’s being graded against, it often “thinks” it has succeeded when the task has actually failed. Some of these failures occur because the model solved the problem at the wrong level of abstraction (applying a bandaid instead of a deeper refactor). Other failures feel a little less fair: the model solves the problem, but its solution does not match the unit tests from the original task.
- Multimodal. Despite the updated Claude 3.5 Sonnet having excellent vision and multimodal capabilities, we did not implement a way for it to view files saved to the filesystem or referenced as URLs. This made debugging certain tasks (especially those from Matplotlib) especially difficult, and also prone to model hallucinations. There is definitely low-hanging fruit here for developers to improve upon—and SWE-bench has launched a new evaluation focused on multi-modal tasks. We look forward to seeing developers achieve higher scores on this eval with Claude in the near future.
Acknowledgements
Erik Schluntz optimized the SWE-bench agent and wrote this blog post. Simon Biggs, Dawn Drain, and Eric Christiansen helped implement the benchmark. Shauna Kravec, Dawn Drain, Felipe Rosso, Nova DasSarma, Ven Chandrasekaran, and many others contributed to training Claude 3.5 Sonnet to be excellent at agentic coding.
Related Topics
- [[topics/prompt-engineering]]
- [[topics/ai-agents]]
- [[topics/claude-code]]
Related Articles
SWE-bench February 2026 leaderboard update
Simon Willison · reference · 76% similar
Designing AI-resistant technical evaluations
Anthropic Engineering · explanation · 73% similar
[AINews] Anthropic's Agent Autonomy study
Swyx · explanation · 73% similar
Originally published at https://www.anthropic.com/engineering/swe-bench-sonnet.