Same outcome.
10.5× the bill.
Three Dropstone tiers ran the same three real May 2026 open-source bug-fix tasks. Every run produced a diff matching the merged human PR. The flagship tier, Dropstone Heavy, cost 10.5× as much in dollars and 7.5× as much in joules as the cheapest tier, Dropstone Fast, for identical engineering output.
Verified entries publish a full observational trace. Costs are computed at each model's published list price as of May 2026 and reproduced from billed token counts. Anyone can re-score.
Cost × energy, same outcome
Chart: dollars per task against joules per task for each tier. Lower-left is better. Brighter cyan marks a cheaper Dropstone tier; a hollow dashed marker marks an Unverified entry.
The headline numbers
| # | Tier | Company | Runs | Attention F1 | $ / task | J / task | Joule Score | Disclosure |
|---|---|---|---|---|---|---|---|---|
| 1 | Dropstone Fast (current leader) | Blankline via Dropstone CLI | 3 | 1.000 | $0.0820 | 224 | 0.883 | Verified |
| 2 | Claude Haiku 4.5 | Anthropic via Claude Code | 3 | 1.000 | $0.318 | 146 | 0.825 | Verified |
| 3 | Dropstone Pro | Blankline via Dropstone CLI | 3 | 1.000 | $0.362 | 233 | 0.778 | Verified |
| 4 | Claude Opus 4.7 | Anthropic via Claude Code | 3 | 1.000 | $1.03 | 511 | 0.703 | Verified |
| 5 | Gemini 3.1 Flash | Google via Gemini CLI | 3 | 0.667 | $0.0607 | 69 | 0.646 | Verified |
| 6 | Claude Sonnet 4.6 | Anthropic via Claude Code | 3 | 0.727 | $0.523 | 180 | 0.638 | Verified |
| 7 | Gemini 3.1 Pro | Google via Gemini CLI | 3 | 0.821 | $0.912 | 402 | 0.629 | Verified |
| 8 | Dropstone Heavy | Blankline via Dropstone CLI | 3 | 1.000 | $0.857 | 1693 | 0.563 | Verified |
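The headline $ / task and J / task columns appear consistent, to within rounding, with a simple mean over each tier's three per-task rows. The sketch below re-derives the Dropstone figures and the Heavy-to-Fast ratios under that assumption. The per-task values are copied from the per-task table on this page; the function and variable names are illustrative, not part of any published Joule Index tooling.

```python
# Per-task (dollars, joules) for joule-001, joule-002, joule-004,
# copied from the per-task table on this page.
RUNS = {
    "Dropstone Fast":  [(0.0117, 44), (0.0297, 74), (0.204, 555)],
    "Dropstone Pro":   [(0.249, 107), (0.319, 149), (0.517, 444)],
    "Dropstone Heavy": [(0.276, 545), (0.245, 484), (2.05, 4051)],
}


def mean_per_task(rows):
    """Return (mean dollars, mean joules) over a tier's per-task rows.

    Assumption: the headline columns are plain arithmetic means.
    """
    n = len(rows)
    dollars = sum(d for d, _ in rows) / n
    joules = sum(j for _, j in rows) / n
    return dollars, joules


means = {tier: mean_per_task(rows) for tier, rows in RUNS.items()}

fast_usd, fast_j = means["Dropstone Fast"]
heavy_usd, heavy_j = means["Dropstone Heavy"]
cost_ratio = heavy_usd / fast_usd
energy_ratio = heavy_j / fast_j

for tier, (usd, joules) in means.items():
    print(f"{tier}: ${usd:.3f} / task, {joules:.0f} J / task")
print(f"Heavy vs Fast: {cost_ratio:.1f}x dollars, {energy_ratio:.1f}x joules")
```

Because the published per-task values are themselves rounded, a recomputed mean can differ from the headline figure in the last displayed digit.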
Every cell, every number
| Task | Tier | F1 | Prec | Rec | Tokens in / out | $ / task | J / task | Wall s | Source |
|---|---|---|---|---|---|---|---|---|---|
| joule-001 | Dropstone Fast | 1.000 | 1.000 | 1.000 | 239,079 / 3,297 | $0.0117 | 44 | 147 | billed |
| joule-001 | Gemini 3.1 Flash | 1.000 | 1.000 | 1.000 | 433,737 / 1,180 | $0.0356 | 36 | 44 | billed |
| joule-001 | Claude Haiku 4.5 | 1.000 | 1.000 | 1.000 | 880,778 / 4,783 | $0.166 | 70 | 299 | billed |
| joule-001 | Dropstone Heavy | 1.000 | 1.000 | 1.000 | 358,394 / 3,973 | $0.276 | 545 | 186 | billed |
| joule-001 | Claude Opus 4.7 | 1.000 | 1.000 | 1.000 | 113,883 / 588 | $0.132 | 62 | 16 | billed |
| joule-001 | Dropstone Pro | 1.000 | 1.000 | 1.000 | 325,882 / 2,545 | $0.249 | 107 | 161 | billed |
| joule-001 | Gemini 3.1 Pro | 1.000 | 1.000 | 1.000 | 154,547 / 5,047 | $0.217 | 101 | 75 | billed |
| joule-001 | Claude Sonnet 4.6 | 1.000 | 1.000 | 1.000 | 103,889 / 657 | $0.117 | 40 | 19 | billed |
| joule-002 | Dropstone Fast | 1.000 | 1.000 | 1.000 | 376,501 / 7,982 | $0.0297 | 74 | 3108 | billed |
| joule-002 | Gemini 3.1 Flash | 1.000 | 1.000 | 1.000 | 229,787 / 1,872 | $0.0264 | 31 | 31 | billed |
| joule-002 | Claude Haiku 4.5 | 1.000 | 1.000 | 1.000 | 997,281 / 5,419 | $0.159 | 72 | 375 | billed |
| joule-002 | Dropstone Heavy | 1.000 | 1.000 | 1.000 | 316,201 / 4,036 | $0.245 | 484 | 195 | billed |
| joule-002 | Claude Opus 4.7 | 1.000 | 1.000 | 1.000 | 173,397 / 1,807 | $0.200 | 97 | 39 | billed |
| joule-002 | Dropstone Pro | 1.000 | 1.000 | 1.000 | 419,502 / 5,756 | $0.319 | 149 | 346 | billed |
| joule-002 | Gemini 3.1 Pro | 1.000 | 1.000 | 1.000 | 247,638 / 4,601 | $0.250 | 107 | 87 | billed |
| joule-002 | Claude Sonnet 4.6 | 1.000 | 1.000 | 1.000 | 138,780 / 2,562 | $0.207 | 47 | 95 | billed |
| joule-004 | Dropstone Fast | 1.000 | 1.000 | 1.000 | 3,062,902 / 39,620 | $0.204 | 555 | 3798 | billed |
| joule-004 | Gemini 3.1 Flash | 0.000 | 0.000 | 0.000 | 1,275,389 / 7,679 | $0.120 | 140 | 100 | billed |
| joule-004 | Claude Haiku 4.5 | 1.000 | 1.000 | 1.000 | 4,387,841 / 20,647 | $0.628 | 297 | 328 | billed |
| joule-004 | Dropstone Heavy | 1.000 | 1.000 | 1.000 | 2,692,119 / 24,243 | $2.05 | 4051 | 1503 | billed |
| joule-004 | Claude Opus 4.7 | 1.000 | 1.000 | 1.000 | 2,820,013 / 17,985 | $2.74 | 1375 | 403 | billed |
| joule-004 | Dropstone Pro | 1.000 | 1.000 | 1.000 | 1,273,631 / 15,672 | $0.517 | 444 | 806 | billed |
| joule-004 | Gemini 3.1 Pro | 0.462 | 0.600 | 0.375 | 2,538,983 / 27,130 | $2.27 | 998 | 369 | billed |
| joule-004 | Claude Sonnet 4.6 | 0.182 | 0.333 | 0.125 | 1,622,105 / 23,668 | $1.24 | 453 | 563 | billed |
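The F1 column in the table above can be cross-checked against its Prec and Rec columns. The sketch below assumes F1 is the standard harmonic mean of precision and recall; the page does not state what unit precision and recall are counted over, so this checks only the arithmetic. The function name and the tolerance are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, published F1) for a few joule-004 rows from the table.
JOULE_004 = {
    "Dropstone Fast":    (1.000, 1.000, 1.000),
    "Gemini 3.1 Flash":  (0.000, 0.000, 0.000),
    "Gemini 3.1 Pro":    (0.600, 0.375, 0.462),
    "Claude Sonnet 4.6": (0.333, 0.125, 0.182),
}

# Published values are rounded to three decimals, so allow a small tolerance.
TOLERANCE = 0.001

mismatches = []
for tier, (prec, rec, published) in JOULE_004.items():
    recomputed = f1_score(prec, rec)
    if abs(recomputed - published) > TOLERANCE:
        mismatches.append(tier)
    print(f"{tier}: recomputed F1 = {recomputed:.3f}, published = {published:.3f}")
```

An empty `mismatches` list means every listed row agrees with the harmonic-mean definition at the published precision.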
Flagship AI pricing has begun to outrun the benchmark designed to measure it.
Of the 6 commercial frontier models in market this month with no Verified Joule Index entry, 2 are priced such that a single attempt at the reference task, at the vendor's own published per-token rate, would exceed the benchmark's $10 per-task cost ceiling. The chart below applies that price to the token budget a Verified agent consumed on the same task.
Both the calculation and the cap are reproducible from each vendor's published pricing. Under the methodology, any vendor may submit a Verified run at any time, which would move that row from the projection panel into the leaderboard above.
The reference task is joule-004, an 8-file refactor of Mozilla Common Voice. On smaller tasks the picture is less extreme. On long-horizon work, flagship-tier list pricing from several vendors places them above the cap and out of contention. The chart is computed from live pricing data and updates whenever the Pricing Preview is rebuilt.
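The cap test described above reduces to one multiplication per price tier. The sketch below is a minimal illustration under stated assumptions: the $10 ceiling comes from this page, the reference token budget is taken from the Dropstone Pro joule-004 row (the page does not say which Verified agent's budget the chart uses), and the two price points are hypothetical placeholders, not any vendor's published rates. It also ignores cache discounts and other billing adjustments.

```python
CAP_USD = 10.00  # per-task cost ceiling stated on this page

# Assumed reference budget: Dropstone Pro on joule-004 (tokens in, tokens out).
REF_TOKENS_IN = 1_273_631
REF_TOKENS_OUT = 15_672

# HYPOTHETICAL list prices in USD per million tokens (input, output).
# Replace with a vendor's published rates to reproduce a real projection.
HYPOTHETICAL_PRICES = {
    "example-mid-tier": (3.00, 15.00),
    "example-flagship": (15.00, 75.00),
}


def projected_cost(tokens_in, tokens_out, usd_per_m_in, usd_per_m_out):
    """Apply per-million-token list prices to a fixed token budget."""
    return (tokens_in / 1_000_000) * usd_per_m_in \
        + (tokens_out / 1_000_000) * usd_per_m_out


projections = {
    name: projected_cost(REF_TOKENS_IN, REF_TOKENS_OUT, p_in, p_out)
    for name, (p_in, p_out) in HYPOTHETICAL_PRICES.items()
}

for name, cost in projections.items():
    verdict = "over cap" if cost > CAP_USD else "within cap"
    print(f"{name}: ${cost:.2f} projected ({verdict})")
```

Input tokens dominate the projection on this task, so the input price largely decides whether a tier clears the cap.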
How this leaderboard was produced, and how it can be challenged.
- 01 The full methodology, pre-registered
Six forms of validity (construct, reliability, discriminant, convergent, predictive, gaming-resistance). Six contamination defenses. Statistical reporting requirements. Reviewer calibration protocol. Read the protocol before reading the table.
- 02 Per-run trace index and retired tasks
Every Verified run links to a sanitized public_trace.json. Retired tasks (currently joule-003) are kept on disk for audit, with the retirement reason documented in the open.
- 03 Submitting a model
Verified submissions are ranked above all projection rows. Vendor IP (source code, system prompts, internal reasoning) is never required. Only the observational trace is.
Editor's note
The V0 release of the Joule Index reports evaluated runs across three open-source bug-fix tasks and eight Verified model tiers (Dropstone Fast / Pro / Heavy, Claude Haiku 4.5 / Sonnet 4.6 / Opus 4.7, Gemini 3.1 Flash / Pro), with one further task retired under published methodology rules. All Verified cost figures are computed at each model's list price as of May 2026 from billed token counts. Projection rows shown in The Cap Test section apply published list prices to a reference token budget; no capability claim is made and no Joule Score is computed. All energy figures for closed APIs are estimated from token counts and published per-token rates; direct measurement is scheduled for the V1 release.