We rebuilt our tool use benchmark from scratch. Instead of LLM-as-judge, every command now runs in a deterministic Docker sandbox and is validated against ground truth. The results tell a different story: the top models achieve 63-69% correctness on atomic bash generation. The gap between models is real, measurable, and reproducible.

Update: This is v2 of our January benchmark. The original used GPT-4o as a judge and reported 9-13% correctness rates. Those numbers reflected a flawed evaluation methodology, not model capability. We replaced subjective judging with deterministic execution and exact output matching. The benchmark now has four validation modes: exact stdout match, regex pattern match, artifact verification commands, and exit code checks.

The Models

We tested 10 models via OpenRouter's unified API, spanning frontier, open source, and reasoning categories:

Frontier: Claude Opus 4.6, Claude Sonnet 4.5, GPT-4o, Gemini 2.5 Pro

Open Source: Llama 3.3 70B, DeepSeek V3.1, Mistral Large 2512, Qwen 2.5 72B

Reasoning: DeepSeek R1, GLM-4.7

Quality Rankings (Docker-Validated)

Each model generates a bash command for a given goal. That command runs in an isolated Docker container (Ubuntu 22.04, no network). The output is validated against canonical expected results—exact stdout match, regex pattern, or artifact verification depending on the test type.
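
The four validation modes can be sketched as a single dispatch function. This is an illustrative sketch, not the benchmark's actual code; the `result`/`spec` dict schema and field names here are hypothetical.

```python
import re
import subprocess

def validate(result, spec):
    """Validate one sandboxed run against its test spec (hypothetical schema).

    result: dict with 'stdout' and 'exit_code' from the container run.
    spec:   dict with 'mode' plus the expected value for that mode.
    """
    mode = spec["mode"]
    if mode == "exact":
        # Byte-for-byte comparison against the canonical stdout.
        return result["stdout"] == spec["expected_stdout"]
    if mode == "regex":
        # Pattern match for outputs with nondeterministic parts.
        return re.search(spec["pattern"], result["stdout"]) is not None
    if mode == "artifact":
        # Run a follow-up verification command (e.g. check that a file
        # was created with the right contents); success means pass.
        check = subprocess.run(["bash", "-c", spec["verify_cmd"]],
                               capture_output=True, text=True)
        return check.returncode == 0
    if mode == "exit_code":
        return result["exit_code"] == spec["expected_exit_code"]
    raise ValueError(f"unknown validation mode: {mode}")
```

Keeping validation a pure function of the run result and a declarative spec is what makes the benchmark deterministic: the same command output always scores the same way.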

| Rank | Model | Correct | Rate | Avg Latency |
|------|-------|---------|------|-------------|
| 1 | Claude Opus 4.6 | 320/464 | 69.0% | 1,926ms |
| 2 | Claude Sonnet 4.5 | 313/464 | 67.5% | 1,275ms |
| 3 | DeepSeek V3.1 | 304/464 | 65.5% | 1,764ms |
| 4 | Qwen 2.5 72B | 298/464 | 64.2% | 1,102ms |
| 5 | Llama 3.3 70B | 293/464 | 63.1% | 728ms |
| 5 | GPT-4o | 293/464 | 63.1% | 1,119ms |
| 7 | Gemini 2.5 Pro | 292/464 | 62.9% | 2,791ms |
| 8 | GLM-4.7 | 253/464 | 54.5% | 1,170ms |
| 9 | DeepSeek R1 | 232/464 | 50.0% | 1,351ms |
| 10 | Mistral Large | 231/464 | 49.8% | 2,058ms |

What Changed From v1

The original benchmark reported all models between 9% and 13%. That compression was an artifact of the evaluation method—GPT-4o judging produced noisy, inconsistent scores that failed to differentiate between models. The actual spread is 49.8-69.0%, a 19-point gap that was invisible to LLM-as-judge.

Key changes in v2:

Deterministic Docker execution replaces GPT-4o judging.

Ground-truth expected outputs, captured by running each canonical command in the sandbox rather than hand-written.

Four validation modes: exact stdout match, regex pattern match, artifact verification commands, and exit code checks.

A fresh, network-isolated container for every test.

Claude Opus 4.6 Leads

Claude Opus 4.6 achieves 69.0%, a clear lead over Sonnet 4.5 (67.5%) and DeepSeek V3.1 (65.5%). The top 7 models cluster between 62.9-69.0%—all capable of generating correct bash commands roughly two-thirds of the time.

The bottom three—GLM-4.7 (54.5%), DeepSeek R1 (50.0%), and Mistral Large (49.8%)—fall sharply behind. At 50% correctness, you need retry logic for every other command.
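
The retry cost follows directly from the geometric distribution: assuming independent attempts, the expected number of tries until the first correct command is 1/p. A quick check:

```python
def expected_attempts(p):
    # Geometric distribution: mean number of independent attempts
    # until the first success, where each succeeds with probability p.
    return 1.0 / p

# At 50.0% (DeepSeek R1) you average 2 attempts per correct command;
# at 69.0% (Claude Opus 4.6) about 1.45.
r1   = expected_attempts(0.50)
opus = expected_attempts(0.69)
```

In other words, the gap between 50% and 69% correctness is not 19 points of cost; it is roughly 38% more generation calls per correct command.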

Reasoning Hurts Here

DeepSeek R1 ranks 9th at 50.0%—below every non-reasoning model except Mistral. The chain-of-thought overhead (high latency, massive token usage) produces worse results than direct generation for atomic command tasks. R1 overthinks simple commands, adding unnecessary flags, quoting, or rewriting the goal as a multi-step pipeline when a single command suffices.

This is consistent with the emerging pattern: reasoning models excel at complex multi-step problems but underperform on tasks where the correct answer is a single, direct action.

Speed vs Quality

| Model | Rate | Avg Latency | Quality/Second |
|-------|------|-------------|----------------|
| Llama 3.3 70B | 63.1% | 728ms | 0.87 |
| Qwen 2.5 72B | 64.2% | 1,102ms | 0.58 |
| GPT-4o | 63.1% | 1,119ms | 0.56 |
| Claude Sonnet 4.5 | 67.5% | 1,275ms | 0.53 |
| GLM-4.7 | 54.5% | 1,170ms | 0.47 |
| DeepSeek V3.1 | 65.5% | 1,764ms | 0.37 |
| DeepSeek R1 | 50.0% | 1,351ms | 0.37 |
| Claude Opus 4.6 | 69.0% | 1,926ms | 0.36 |
| Mistral Large | 49.8% | 2,058ms | 0.24 |
| Gemini 2.5 Pro | 62.9% | 2,791ms | 0.23 |

Quality/Second is the correctness rate (as a fraction) divided by average latency in seconds: correct commands per second of generation time.
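
The Quality/Second figures can be reproduced as correctness rate divided by latency in seconds, a definition consistent with every row of the table:

```python
def quality_per_second(rate_pct, latency_ms):
    # Correct commands per second of generation latency:
    # correctness rate (as a fraction) / average latency (in seconds).
    return (rate_pct / 100.0) / (latency_ms / 1000.0)

# Llama 3.3 70B: 63.1% at 728ms
fast = round(quality_per_second(63.1, 728), 2)    # 0.87
# Claude Opus 4.6: 69.0% at 1,926ms
best = round(quality_per_second(69.0, 1926), 2)   # 0.36
```

The metric rewards latency aggressively: a 2.4x latency penalty (Llama to Opus) wipes out a 6-point quality advantage.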

Llama 3.3 70B dominates the speed/quality tradeoff at 728ms average latency with 63.1% correctness. For high-volume workloads, it delivers the most correct commands per second. Qwen 2.5 72B offers slightly higher quality (64.2%) at 1.5x the latency.

If quality is the only metric, Claude Opus 4.6 justifies its 1.9s latency: it leads Sonnet 4.5 by 1.5 points and the fastest models (Llama 3.3 70B and GPT-4o, both 63.1%) by roughly 6 points.

Methodology

464 operations derived from a first-principles taxonomy of bash/shell tool use: filesystem, text processing, network (mocked), process management, security, containers (mocked), version control, data transformation, archiving, system administration, and time operations.

Environment: Ubuntu 22.04 Docker container. No network access. Pre-baked fixture files (test data, git repos, mock binaries). Each test runs in a fresh container.
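
The isolation described above maps onto standard `docker run` flags. A minimal sketch of the invocation builder; the image tag and working directory here are hypothetical, not the benchmark's actual names:

```python
def sandbox_argv(image, command, workdir="/work"):
    """Build the `docker run` argv for one isolated test run.

    Every test gets a fresh container with no network access, removed
    after it exits. `image` is whatever tag the fixture files (test
    data, git repos, mock binaries) were baked into.
    """
    return [
        "docker", "run",
        "--rm",                  # discard the container after the test
        "--network", "none",     # no network access inside the sandbox
        "-w", workdir,           # start in the fixture directory
        image,
        "bash", "-c", command,   # the model-generated command under test
    ]

argv = sandbox_argv("toolbench/ubuntu22.04-fixtures", "wc -l data.csv")
```

Because `--rm` plus a fresh container per test means no state survives between runs, a destructive command in one test cannot contaminate the next.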

Generation: Each model generates commands with temperature=0 via OpenRouter. 10 concurrent requests per model. No retries on valid responses.
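
The 10-requests-in-flight pattern is a standard semaphore-bounded fan-out. A runnable sketch with the OpenRouter call stubbed out (the real `generate` callable would POST a temperature=0 chat completion; the stub here exists only so the example runs offline):

```python
import asyncio

MAX_CONCURRENCY = 10  # 10 concurrent requests per model

async def generate_all(goals, generate):
    """Fan out one generation request per goal, at most 10 in flight.

    `generate` is an async callable wrapping the model API call;
    results come back in the same order as `goals`.
    """
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def one(goal):
        async with sem:
            return await generate(goal)

    return await asyncio.gather(*(one(g) for g in goals))

# Offline stub standing in for the real API call.
async def fake_generate(goal):
    await asyncio.sleep(0)
    return f"echo {goal}"

commands = asyncio.run(generate_all(["a", "b", "c"], fake_generate))
```

`asyncio.gather` preserves input order, so each generated command stays aligned with its goal regardless of which request finishes first.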

Validation: Four types, selected per-test based on output characteristics:

Exact stdout match, for fully deterministic output.

Regex pattern match, for output with variable parts.

Artifact verification, a follow-up command that checks the files or state the command should have produced.

Exit code check, for commands validated by success or failure alone.

All expected values were generated by running the canonical command in Docker and capturing the actual output. No hand-written expectations, no LLM judging.
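
The capture step amounts to running the canonical command and recording whatever it actually produced. A sketch (in the real benchmark this runs inside the Docker sandbox; here it runs on the host purely to illustrate):

```python
import subprocess

def capture_expected(canonical_cmd):
    """Run the canonical command and record its ground-truth output."""
    proc = subprocess.run(["bash", "-c", canonical_cmd],
                          capture_output=True, text=True)
    return {"expected_stdout": proc.stdout,
            "expected_exit_code": proc.returncode}

spec = capture_expected("printf 'a\\nb\\n' | wc -l")
```

Deriving expectations from execution rather than writing them by hand means the benchmark can never disagree with the shell about what correct output looks like.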

The full benchmark code and results are available at github.com/agentiagency/tool-use-benchmark.