We rebuilt our tool use benchmark from scratch. Instead of LLM-as-judge, every command now runs in a deterministic Docker sandbox and is validated against ground truth. The results tell a different story: frontier models achieve 65-69% correctness on atomic bash generation. The gap between models is real, measurable, and reproducible.
Update: This is v2 of our January benchmark. The original used GPT-4o as a judge and reported 9-13% correctness rates. Those numbers reflected a flawed evaluation methodology, not model capability. We replaced subjective judging with deterministic execution and exact output matching. The benchmark now has four validation modes: exact stdout match, regex pattern match, artifact verification commands, and exit code checks.
The Models
We tested 10 models via OpenRouter's unified API, spanning frontier, open source, and reasoning categories:
Frontier: Claude Opus 4.6, Claude Sonnet 4.5, GPT-4o, Gemini 2.5 Pro
Open Source: Llama 3.3 70B, DeepSeek V3.1, Mistral Large 2512, Qwen 2.5 72B
Reasoning: DeepSeek R1, GLM-4.7
Quality Rankings (Docker-Validated)
Each model generates a bash command for a given goal. That command runs in an isolated Docker container (Ubuntu 22.04, no network). The output is validated against canonical expected results—exact stdout match, regex pattern, or artifact verification depending on the test type.
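The run step can be sketched in a few lines of Python. This is a hypothetical harness, not the benchmark's actual code; the image tag `bench-ubuntu-22.04` is a placeholder:

```python
import subprocess

def sandbox_argv(command: str, image: str = "bench-ubuntu-22.04") -> list[str]:
    # --network none mirrors the no-network constraint; a fresh container
    # per invocation (--rm) keeps tests from leaking state into each other.
    return ["docker", "run", "--rm", "--network", "none", image,
            "bash", "-lc", command]

def run_in_sandbox(command: str, timeout: int = 30) -> tuple[int, str]:
    proc = subprocess.run(sandbox_argv(command), capture_output=True,
                          text=True, timeout=timeout)
    return proc.returncode, proc.stdout
```

The returned exit code and stdout are then compared against the canonical expected results.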
| Rank | Model | Correct | Rate | Avg Latency |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 320/464 | 69.0% | 1,926ms |
| 2 | Claude Sonnet 4.5 | 313/464 | 67.5% | 1,275ms |
| 3 | DeepSeek V3.1 | 304/464 | 65.5% | 1,764ms |
| 4 | Qwen 2.5 72B | 298/464 | 64.2% | 1,102ms |
| 5 | Llama 3.3 70B | 293/464 | 63.1% | 728ms |
| 5 | GPT-4o | 293/464 | 63.1% | 1,119ms |
| 7 | Gemini 2.5 Pro | 292/464 | 62.9% | 2,791ms |
| 8 | GLM-4.7 | 253/464 | 54.5% | 1,170ms |
| 9 | DeepSeek R1 | 232/464 | 50.0% | 1,351ms |
| 10 | Mistral Large | 231/464 | 49.8% | 2,058ms |
What Changed From v1
The original benchmark reported every model between 9% and 13%. That compression was an artifact of the evaluation method: GPT-4o judging produced noisy, inconsistent scores that failed to differentiate the models. The actual spread is 49.8-69.0%, a roughly 20-point gap that was invisible to LLM-as-judge.
Key changes in v2:
- Deterministic validation: Every expected output was captured by running the canonical command in Docker, not guessed or hand-written
- Four validation types: Exact stdout match (273 tests), regex patterns for nondeterministic output (88 tests), artifact verification commands (49 tests), exit code checks (54 tests)
- Mock infrastructure: Docker, systemctl, and other tools mocked with fixture scripts so container/service tests run without real infrastructure
- Reproducible: Same Docker image, same fixtures, same tests. Run it yourself and get the same numbers
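To illustrate the mock-infrastructure idea, a fixture script can stand in for a real tool on `PATH`. This sketch is hypothetical (not the benchmark's actual fixtures): it installs a fake `systemctl` that deterministically reports nginx as active:

```python
import os

# A minimal fixture: a shell script that answers one query deterministically.
MOCK_SYSTEMCTL = """#!/bin/sh
case "$1 $2" in
  "is-active nginx") echo active; exit 0 ;;
  *) echo "unknown" >&2; exit 1 ;;
esac
"""

def install_mock(bindir: str) -> str:
    # Writing the script into a directory that is prepended to PATH
    # lets service-management tests run with no real systemd present.
    path = os.path.join(bindir, "systemctl")
    with open(path, "w") as f:
        f.write(MOCK_SYSTEMCTL)
    os.chmod(path, 0o755)
    return path
```

In the real benchmark, such fixtures are baked into the Docker image so every container sees identical mock behavior.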
Claude Opus 4.6 Leads
Claude Opus 4.6 achieves 69.0%, a clear lead over Sonnet 4.5 (67.5%) and DeepSeek V3.1 (65.5%). The top 7 models cluster between 62.9-69.0%—all capable of generating correct bash commands roughly two-thirds of the time.
The bottom three—GLM-4.7 (54.5%), DeepSeek R1 (50.0%), and Mistral Large (49.8%)—fall off sharply. At 50% correctness, you need retry logic for every other command.
Reasoning Hurts Here
DeepSeek R1 ranks 9th at 50.0%—below every non-reasoning model except Mistral. The chain-of-thought overhead (high latency, massive token usage) produces worse results than direct generation for atomic command tasks. R1 overthinks simple commands, adding unnecessary flags, quoting, or rewriting the goal as a multi-step pipeline when a single command suffices.
This is consistent with the emerging pattern: reasoning models excel at complex multi-step problems but underperform on tasks where the correct answer is a single, direct action.
Speed vs Quality
| Model | Rate | Avg Latency | Quality/Second |
|---|---|---|---|
| Llama 3.3 70B | 63.1% | 728ms | 0.87 |
| Qwen 2.5 72B | 64.2% | 1,102ms | 0.58 |
| GPT-4o | 63.1% | 1,119ms | 0.56 |
| Claude Sonnet 4.5 | 67.5% | 1,275ms | 0.53 |
| GLM-4.7 | 54.5% | 1,170ms | 0.47 |
| DeepSeek V3.1 | 65.5% | 1,764ms | 0.37 |
| DeepSeek R1 | 50.0% | 1,351ms | 0.37 |
| Claude Opus 4.6 | 69.0% | 1,926ms | 0.36 |
| Mistral Large | 49.8% | 2,058ms | 0.24 |
| Gemini 2.5 Pro | 62.9% | 2,791ms | 0.23 |
Llama 3.3 70B dominates the speed/quality tradeoff at 728ms average latency with 63.1% correctness. For high-volume workloads, it delivers the most correct commands per second. Qwen 2.5 72B offers slightly higher quality (64.2%) at 1.5x the latency.
If quality is the only metric, Claude Opus 4.6 justifies its 1.9s latency: a 6-point lead over the 63% cluster and 1.5 points over Sonnet 4.5.
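The Quality/Second column is simply correctness rate divided by mean latency in seconds:

```python
def quality_per_second(rate: float, latency_ms: float) -> float:
    # Expected correct commands per second of model latency.
    return rate / (latency_ms / 1000.0)

llama = quality_per_second(0.631, 728)    # ≈ 0.87
opus = quality_per_second(0.690, 1926)    # ≈ 0.36
```

It is a throughput-oriented metric: it rewards fast, mostly-correct models and penalizes slow ones regardless of accuracy.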
Methodology
464 operations derived from a first-principles taxonomy of bash/shell tool use: filesystem, text processing, network (mocked), process management, security, containers (mocked), version control, data transformation, archiving, system administration, and time operations.
Environment: Ubuntu 22.04 Docker container. No network access. Pre-baked fixture files (test data, git repos, mock binaries). Each test runs in a fresh container.
Generation: Each model generates commands with temperature=0 via OpenRouter. 10 concurrent requests per model. No retries on valid responses.
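A generation call of this shape, against OpenRouter's OpenAI-compatible chat endpoint, might look like the following. The system prompt is illustrative, not the benchmark's actual prompt:

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model: str, goal: str) -> dict:
    # temperature=0 for (near-)deterministic generation, per the methodology.
    return {
        "model": model,
        "temperature": 0,
        "messages": [
            {"role": "system",
             "content": "Reply with a single bash command and nothing else."},
            {"role": "user", "content": goal},
        ],
    }

def generate(model: str, goal: str, api_key: str) -> str:
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(build_request(model, goal)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip()
```

Concurrency (10 in-flight requests per model) would sit above this in an async or thread-pooled wrapper.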
Validation: Four types, selected per-test based on output characteristics:
- output (273 tests): Exact stdout match after stripping whitespace
- regex (88 tests): Stdout matches a pattern (for nondeterministic output like timestamps, PIDs, hashes)
- cmd (49 tests): A verification command runs after the test command in the same container (for artifact-producing operations like compression, file creation, permission changes)
- exit_code (54 tests): Exit code matches expected value
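The four modes reduce to a small dispatch. This sketch uses illustrative field names (`stdout`, `returncode`), not the benchmark's actual schema:

```python
import re

def validate(kind: str, result, expected: str) -> bool:
    # `result` carries .stdout and .returncode; for "cmd" tests we assume
    # the verification command's exit status has been folded into `result`.
    if kind == "output":
        return result.stdout.strip() == expected.strip()
    if kind == "regex":
        return re.search(expected, result.stdout) is not None
    if kind == "exit_code":
        return result.returncode == int(expected)
    if kind == "cmd":
        return result.returncode == 0
    raise ValueError(f"unknown validation type: {kind}")
```

Because every branch compares against a pre-captured ground truth, a pass/fail verdict is reproducible bit-for-bit across runs.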
All expected values were generated by running the canonical command in Docker and capturing the actual output. No hand-written expectations, no LLM judging.
The full benchmark code and results are available at github.com/agentiagency/tool-use-benchmark.