We rebuilt our tool use benchmark from scratch. Instead of LLM-as-judge, every command now runs in a deterministic Docker sandbox and is validated against ground truth. The results tell a different story: frontier models achieve 65-69% correctness on atomic bash generation. The gap between models is real, measurable, and reproducible.
Update: This is v2 of our January benchmark. The original used GPT-4o as a judge and reported 9-13% correctness rates. Those numbers reflected a flawed evaluation methodology, not model capability. We replaced subjective judging with deterministic execution and exact output matching. The benchmark now has four validation modes: exact stdout match, regex pattern match, artifact verification commands, and exit code checks.
The Models
We tested 10 models via OpenRouter's unified API, spanning frontier, open source, and reasoning categories:
Frontier: Claude Opus 4.6, Claude Sonnet 4.5, GPT-4o, Gemini 2.5 Pro
Open Source: Llama 3.3 70B, DeepSeek V3.1, Mistral Large 2512, Qwen 2.5 72B
Reasoning: DeepSeek R1, GLM-4.7
Quality Rankings (Docker-Validated)
Each model generates a bash command for a given goal. That command runs in an isolated Docker container (Ubuntu 22.04, no network). The output is validated against canonical expected results—exact stdout match, regex pattern, or artifact verification depending on the test type.
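The run step can be sketched in a few lines of Python. This is a hypothetical harness, not the benchmark's actual code; the image tag `bench-ubuntu-22.04` is a placeholder:

```python
import subprocess

def sandbox_argv(command: str, image: str = "bench-ubuntu-22.04") -> list[str]:
    # --network none mirrors the no-network constraint; a fresh container
    # per invocation (--rm) keeps tests from leaking state into each other.
    return ["docker", "run", "--rm", "--network", "none", image,
            "bash", "-lc", command]

def run_in_sandbox(command: str, timeout: int = 30) -> tuple[int, str]:
    proc = subprocess.run(sandbox_argv(command), capture_output=True,
                          text=True, timeout=timeout)
    return proc.returncode, proc.stdout
```

The returned exit code and stdout are then compared against the canonical expected results.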
| Rank | Model | Correct | Rate | Avg Latency |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 320/464 | 69.0% | 1,926ms |
| 2 | Claude Sonnet 4.5 | 313/464 | 67.5% | 1,275ms |
| 3 | DeepSeek V3.1 | 304/464 | 65.5% | 1,764ms |
| 4 | Qwen 2.5 72B | 298/464 | 64.2% | 1,102ms |
| 5 | Llama 3.3 70B | 293/464 | 63.1% | 728ms |
| 5 | GPT-4o | 293/464 | 63.1% | 1,119ms |
| 7 | Gemini 2.5 Pro | 292/464 | 62.9% | 2,791ms |
| 8 | GLM-4.7 | 253/464 | 54.5% | 1,170ms |
| 9 | DeepSeek R1 | 232/464 | 50.0% | 1,351ms |
| 10 | Mistral Large | 231/464 | 49.8% | 2,058ms |
What Changed From v1
The original benchmark reported every model between 9% and 13%. That compression was an artifact of the evaluation method: GPT-4o judging produced noisy, inconsistent scores that failed to differentiate the models. The actual spread is 49.8-69.0%, a roughly 20-point gap that was invisible to LLM-as-judge.
Key changes in v2:
- Deterministic validation: Every expected output was captured by running the canonical command in Docker, not guessed or hand-written
- Four validation types: Exact stdout match (273 tests), regex patterns for nondeterministic output (88 tests), artifact verification commands (49 tests), exit code checks (54 tests)
- Mock infrastructure: Docker, systemctl, and other tools mocked with fixture scripts so container/service tests run without real infrastructure
- Reproducible: Same Docker image, same fixtures, same tests. Run it yourself and get the same numbers
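To illustrate the mock-infrastructure idea, a fixture script can stand in for a real tool on `PATH`. This sketch is hypothetical (not the benchmark's actual fixtures): it installs a fake `systemctl` that deterministically reports nginx as active:

```python
import os

# A minimal fixture: a shell script that answers one query deterministically.
MOCK_SYSTEMCTL = """#!/bin/sh
case "$1 $2" in
  "is-active nginx") echo active; exit 0 ;;
  *) echo "unknown" >&2; exit 1 ;;
esac
"""

def install_mock(bindir: str) -> str:
    # Writing the script into a directory that is prepended to PATH
    # lets service-management tests run with no real systemd present.
    path = os.path.join(bindir, "systemctl")
    with open(path, "w") as f:
        f.write(MOCK_SYSTEMCTL)
    os.chmod(path, 0o755)
    return path
```

In the real benchmark, such fixtures are baked into the Docker image so every container sees identical mock behavior.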
Claude Opus 4.6 Leads
Claude Opus 4.6 achieves 69.0%, a clear lead over Sonnet 4.5 (67.5%) and DeepSeek V3.1 (65.5%). The top 7 models cluster between 62.9-69.0%—all capable of generating correct bash commands roughly two-thirds of the time.
The bottom three—GLM-4.7 (54.5%), DeepSeek R1 (50.0%), and Mistral Large (49.8%)—fall off sharply. At 50% correctness, you need retry logic for every other command.
Reasoning Hurts Here
DeepSeek R1 ranks 9th at 50.0%—below every non-reasoning model except Mistral. The chain-of-thought overhead (high latency, massive token usage) produces worse results than direct generation for atomic command tasks. R1 overthinks simple commands, adding unnecessary flags, quoting, or rewriting the goal as a multi-step pipeline when a single command suffices.
This is consistent with the emerging pattern: reasoning models excel at complex multi-step problems but underperform on tasks where the correct answer is a single, direct action.
Speed vs Quality
| Model | Rate | Avg Latency | Quality/Second |
|---|---|---|---|
| Llama 3.3 70B | 63.1% | 728ms | 0.87 |
| Qwen 2.5 72B | 64.2% | 1,102ms | 0.58 |
| GPT-4o | 63.1% | 1,119ms | 0.56 |
| Claude Sonnet 4.5 | 67.5% | 1,275ms | 0.53 |
| GLM-4.7 | 54.5% | 1,170ms | 0.47 |
| DeepSeek V3.1 | 65.5% | 1,764ms | 0.37 |
| DeepSeek R1 | 50.0% | 1,351ms | 0.37 |
| Claude Opus 4.6 | 69.0% | 1,926ms | 0.36 |
| Mistral Large | 49.8% | 2,058ms | 0.24 |
| Gemini 2.5 Pro | 62.9% | 2,791ms | 0.23 |
Llama 3.3 70B dominates the speed/quality tradeoff at 728ms average latency with 63.1% correctness. For high-volume workloads, it delivers the most correct commands per second. Qwen 2.5 72B offers slightly higher quality (64.2%) at 1.5x the latency.
If quality is the only metric, Claude Opus 4.6 justifies its 1.9s latency: a 6-point lead over the 63% cluster and 1.5 points over Sonnet 4.5.
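The Quality/Second column is simply correctness rate divided by mean latency in seconds:

```python
def quality_per_second(rate: float, latency_ms: float) -> float:
    # Expected correct commands per second of model latency.
    return rate / (latency_ms / 1000.0)

llama = quality_per_second(0.631, 728)    # ≈ 0.87
opus = quality_per_second(0.690, 1926)    # ≈ 0.36
```

It is a throughput-oriented metric: it rewards fast, mostly-correct models and penalizes slow ones regardless of accuracy.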
Methodology
464 operations derived from a first-principles taxonomy of bash/shell tool use: filesystem, text processing, network (mocked), process management, security, containers (mocked), version control, data transformation, archiving, system administration, and time operations.
Environment: Ubuntu 22.04 Docker container. No network access. Pre-baked fixture files (test data, git repos, mock binaries). Each test runs in a fresh container.
Generation: Each model generates commands with temperature=0 via OpenRouter. 10 concurrent requests per model. No retries on valid responses.
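A generation call of this shape, against OpenRouter's OpenAI-compatible chat endpoint, might look like the following. The system prompt is illustrative, not the benchmark's actual prompt:

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model: str, goal: str) -> dict:
    # temperature=0 for (near-)deterministic generation, per the methodology.
    return {
        "model": model,
        "temperature": 0,
        "messages": [
            {"role": "system",
             "content": "Reply with a single bash command and nothing else."},
            {"role": "user", "content": goal},
        ],
    }

def generate(model: str, goal: str, api_key: str) -> str:
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(build_request(model, goal)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip()
```

Concurrency (10 in-flight requests per model) would sit above this in an async or thread-pooled wrapper.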
Validation: Four types, selected per-test based on output characteristics:
- output (273 tests): Exact stdout match after stripping whitespace
- regex (88 tests): Stdout matches a pattern (for nondeterministic output like timestamps, PIDs, hashes)
- cmd (49 tests): A verification command runs after the test command in the same container (for artifact-producing operations like compression, file creation, permission changes)
- exit_code (54 tests): Exit code matches expected value
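The four modes reduce to a small dispatch. This sketch uses illustrative field names (`stdout`, `returncode`), not the benchmark's actual schema:

```python
import re

def validate(kind: str, result, expected: str) -> bool:
    # `result` carries .stdout and .returncode; for "cmd" tests we assume
    # the verification command's exit status has been folded into `result`.
    if kind == "output":
        return result.stdout.strip() == expected.strip()
    if kind == "regex":
        return re.search(expected, result.stdout) is not None
    if kind == "exit_code":
        return result.returncode == int(expected)
    if kind == "cmd":
        return result.returncode == 0
    raise ValueError(f"unknown validation type: {kind}")
```

Because every branch compares against a pre-captured ground truth, a pass/fail verdict is reproducible bit-for-bit across runs.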
All expected values were generated by running the canonical command in Docker and capturing the actual output. No hand-written expectations, no LLM judging.
The full benchmark code and results are available at github.com/agentiagency/tool-use-benchmark.