Leaderboard · Reported May 2026

Same outcome.
12.5× the bill.

Eight agent tiers from four vendors ran the same three real May 2026 open-source bug-fix tasks. Five tiers produced a diff matching the merged human PR on every task. Among them, the cheapest (Dropstone Fast) cost $0.082 per task and the flagship (Claude Opus 4.7) cost $1.025, 12.5× more in dollars for identical engineering output. Energy per task spanned 11.6× across the same five tiers, from 146 J (Claude Haiku 4.5) to 1,693 J (Dropstone Heavy).

Verified entries publish a full observational trace. Costs are computed at each model's published list price as of May 2026 and reproduced from billed token counts. Anyone can re-score.

Current leaderboard · May 2026

Cost × energy, same outcome

Lower-left is better. Brighter cyan = cheaper Dropstone tier. Hollow dashed marker = Unverified.

Source · The Joule Index · Blankline · joule.blankline.org · Reported May 2026

Legend: Dropstone Fast, Dropstone Pro, Dropstone Heavy, Claude Haiku 4.5, Claude Sonnet 4.6, Claude Opus 4.7, Gemini 3.1 Flash, Gemini 3.1 Pro, OpenAI (projection), xAI (projection), Pricing preview

Per-tier averages

The headline numbers

#	Tier	Company	Runs	Attention F1	$ / task	J / task	Joule Score	Disclosure
1	Dropstone Fast (current leader)	Blankline via Dropstone CLI	3	1.000	$0.0820	224	0.883	Verified
2	Claude Haiku 4.5	Anthropic via Claude Code	3	1.000	$0.318	146	0.825	Verified
3	Dropstone Pro	Blankline via Dropstone CLI	3	1.000	$0.362	233	0.778	Verified
4	Claude Opus 4.7	Anthropic via Claude Code	3	1.000	$1.03	511	0.703	Verified
5	Gemini 3.1 Flash	Google via Gemini CLI	3	0.667	$0.0607	69	0.646	Verified
6	Claude Sonnet 4.6	Anthropic via Claude Code	3	0.727	$0.523	180	0.638	Verified
7	Gemini 3.1 Pro	Google via Gemini CLI	3	0.821	$0.912	402	0.629	Verified
8	Dropstone Heavy	Blankline via Dropstone CLI	3	1.000	$0.857	1693	0.563	Verified
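
The per-tier averages can be reproduced from the per-run detail table. The following is a minimal sketch, assuming each tier average is a plain arithmetic mean over its three runs (an assumption that matches the published figures for the tiers shown). The names `RUNS` and `tier_average` are illustrative and not part of any Joule Index tooling.

```python
from statistics import mean

# (cost in USD, energy in joules) per run, copied from the per-run detail
# table for tasks joule-001, joule-002 and joule-004.
RUNS = {
    "Dropstone Fast": [(0.0117, 44), (0.0297, 74), (0.204, 555)],
    "Claude Haiku 4.5": [(0.166, 70), (0.159, 72), (0.628, 297)],
    "Claude Opus 4.7": [(0.132, 62), (0.200, 97), (2.74, 1375)],
    "Dropstone Heavy": [(0.276, 545), (0.245, 484), (2.05, 4051)],
}


def tier_average(runs):
    """Return (mean cost, mean joules) over a tier's runs."""
    costs, joules = zip(*runs)
    return mean(costs), mean(joules)


AVERAGES = {tier: tier_average(runs) for tier, runs in RUNS.items()}

# Flagship versus cheapest tier, in dollars.
cost_ratio = AVERAGES["Claude Opus 4.7"][0] / AVERAGES["Dropstone Fast"][0]
# Highest versus lowest average energy among the tiers listed here.
energy_ratio = AVERAGES["Dropstone Heavy"][1] / AVERAGES["Claude Haiku 4.5"][1]

for tier, (cost, joules) in AVERAGES.items():
    print(f"{tier:18s} ${cost:.4f}  {joules:.0f} J")
print(f"cost ratio, Opus / Fast: {cost_ratio:.1f}x")
print(f"energy ratio, Heavy / Haiku: {energy_ratio:.1f}x")
```

Under that assumption the cost ratio comes out at 12.5× and the energy ratio between Dropstone Heavy and Claude Haiku 4.5 at 11.6×.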

Per-run detail

Every cell, every number

Task	Tier	F1	Prec	Rec	Tokens in / out	$ / task	J / task	Wall s	Source
joule-001	Dropstone Fast	1.000	1.000	1.000	239,079 / 3,297	$0.0117	44	147	billed
joule-001	Gemini 3.1 Flash	1.000	1.000	1.000	433,737 / 1,180	$0.0356	36	44	billed
joule-001	Claude Haiku 4.5	1.000	1.000	1.000	880,778 / 4,783	$0.166	70	299	billed
joule-001	Dropstone Heavy	1.000	1.000	1.000	358,394 / 3,973	$0.276	545	186	billed
joule-001	Claude Opus 4.7	1.000	1.000	1.000	113,883 / 588	$0.132	62	16	billed
joule-001	Dropstone Pro	1.000	1.000	1.000	325,882 / 2,545	$0.249	107	161	billed
joule-001	Gemini 3.1 Pro	1.000	1.000	1.000	154,547 / 5,047	$0.217	101	75	billed
joule-001	Claude Sonnet 4.6	1.000	1.000	1.000	103,889 / 657	$0.117	40	19	billed
joule-002	Dropstone Fast	1.000	1.000	1.000	376,501 / 7,982	$0.0297	74	3108	billed
joule-002	Gemini 3.1 Flash	1.000	1.000	1.000	229,787 / 1,872	$0.0264	31	31	billed
joule-002	Claude Haiku 4.5	1.000	1.000	1.000	997,281 / 5,419	$0.159	72	375	billed
joule-002	Dropstone Heavy	1.000	1.000	1.000	316,201 / 4,036	$0.245	484	195	billed
joule-002	Claude Opus 4.7	1.000	1.000	1.000	173,397 / 1,807	$0.200	97	39	billed
joule-002	Dropstone Pro	1.000	1.000	1.000	419,502 / 5,756	$0.319	149	346	billed
joule-002	Gemini 3.1 Pro	1.000	1.000	1.000	247,638 / 4,601	$0.250	107	87	billed
joule-002	Claude Sonnet 4.6	1.000	1.000	1.000	138,780 / 2,562	$0.207	47	95	billed
joule-004	Dropstone Fast	1.000	1.000	1.000	3,062,902 / 39,620	$0.204	555	3798	billed
joule-004	Gemini 3.1 Flash	0.000	0.000	0.000	1,275,389 / 7,679	$0.120	140	100	billed
joule-004	Claude Haiku 4.5	1.000	1.000	1.000	4,387,841 / 20,647	$0.628	297	328	billed
joule-004	Dropstone Heavy	1.000	1.000	1.000	2,692,119 / 24,243	$2.05	4051	1503	billed
joule-004	Claude Opus 4.7	1.000	1.000	1.000	2,820,013 / 17,985	$2.74	1375	403	billed
joule-004	Dropstone Pro	1.000	1.000	1.000	1,273,631 / 15,672	$0.517	444	806	billed
joule-004	Gemini 3.1 Pro	0.462	0.600	0.375	2,538,983 / 27,130	$2.27	998	369	billed
joule-004	Claude Sonnet 4.6	0.182	0.333	0.125	1,622,105 / 23,668	$1.24	453	563	billed
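
The F1 column can be checked against the Prec and Rec columns. This sketch assumes F1 is the standard harmonic mean of precision and recall; the page does not state the formula, but the assumption reproduces the two partial-credit rows for joule-004. The function name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for the joule-004 rows that did not score 1.000.
JOULE_004 = {
    "Gemini 3.1 Pro": (0.600, 0.375),
    "Claude Sonnet 4.6": (0.333, 0.125),
    "Gemini 3.1 Flash": (0.000, 0.000),
}

for tier, (prec, rec) in JOULE_004.items():
    print(f"{tier:18s} F1 = {f1_score(prec, rec):.3f}")

# Tier-level F1 as a mean over the three tasks (the other two runs scored 1.000).
sonnet_tier_f1 = (1.0 + 1.0 + f1_score(0.333, 0.125)) / 3
print(f"Claude Sonnet 4.6 tier F1 = {sonnet_tier_f1:.3f}")
```

The results round to 0.462, 0.182 and 0.000 per run, and to 0.727 for the Claude Sonnet 4.6 tier, in line with the tables.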

§ The Cap Test · Pricing preview

Flagship AI pricing has begun to outrun the benchmark designed to measure it.

Of the 6 commercial frontier models in market this month with no Verified Joule Index entry, 2 are priced such that a single attempt at the reference task, at the vendor's own published per-token rate, would exceed the benchmark's $10 per-task cost ceiling. The chart below applies that price to the token budget a Verified agent consumed on the same task.

Both the calculation and the cap are reproducible from each vendor's published pricing. Under the methodology, any vendor may submit a Verified run at any time, which would move that row from the projection panel into the leaderboard above.

The Cap Test · joule-004

Cost to attempt one Joule Index task at the Dropstone Fast token budget, priced at each model's May 2026 list rate (uncached).

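Cap: $10 per task

Model	$ / task	Row type	Cap
Gemini 3.1 Flash	$0.120	Verified	under
Dropstone Fast	$0.204	Verified	under
Dropstone Pro	$0.517	Verified	under
Claude Haiku 4.5	$0.628	Verified	under
Grok 4.1 Fast	$0.632	List price	under
GPT-5 Mini	$0.845	List price	under
Claude Sonnet 4.6	$1.24	Verified	under
Dropstone Heavy	$2.05	Verified	under
Gemini 3.1 Pro	$2.27	Verified	under
Claude Opus 4.7	$2.74	Verified	under
Grok 4.3	$3.93	List price	under
GPT-5.4	$8.25	List price	under
GPT-5.5	$16.50	List price	✕ busts cap
GPT-5.5 Pro	$99.02	List price	✕ busts cap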

Verified rows (filled bars) carry full observational traces. List-price rows (hollow bars) apply the vendor's published per-token price to the token budget a Verified agent used on this task. The figures are a calculation, not a measurement, and never enter the Joule Score. Models above the $10 cap are ineligible to compete under the published rules.
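
The list-price calculation described above can be sketched as follows. The token budget is the Dropstone Fast run on joule-004 from the per-run table, and the $10 cap is from the published rules. The per-million-token prices in the example are hypothetical placeholders, not any vendor's actual rates, and the function names are illustrative. The sketch assumes a single uncached input rate and a single output rate.

```python
CAP_USD = 10.00

# Reference token budget: Dropstone Fast on joule-004 (tokens in / out).
TOKENS_IN = 3_062_902
TOKENS_OUT = 39_620


def projected_cost(tokens_in, tokens_out, usd_per_m_in, usd_per_m_out):
    """Price a token budget at per-million-token list rates (uncached)."""
    return (tokens_in * usd_per_m_in + tokens_out * usd_per_m_out) / 1_000_000


def busts_cap(cost, cap=CAP_USD):
    """True when a single attempt would exceed the per-task cost ceiling."""
    return cost > cap


# Hypothetical list prices in USD per million tokens (input, output).
# These are illustrative values only, not real vendor pricing.
EXAMPLE_PRICES = {
    "example-mid-tier": (2.00, 8.00),
    "example-flagship": (4.00, 20.00),
}

for name, (price_in, price_out) in EXAMPLE_PRICES.items():
    cost = projected_cost(TOKENS_IN, TOKENS_OUT, price_in, price_out)
    status = "busts cap" if busts_cap(cost) else "under cap"
    print(f"{name:18s} ${cost:.2f}  {status}")
```

Because the reference budget is dominated by input tokens, the input rate largely decides whether a model stays under the cap.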

The reference task is joule-004, an 8-file refactor of Mozilla Common Voice. On smaller tasks the picture is less extreme. On long-horizon work, list pricing places two flagship-tier models, GPT-5.5 and GPT-5.5 Pro, above the cap and out of contention. The chart is computed from live pricing data and updates whenever the Pricing Preview is rebuilt.

Methods · Further reading

How this leaderboard was produced, and how it can be challenged.

01
The full methodology, pre-registered
Six forms of validity (construct, reliability, discriminant, convergent, predictive, gaming-resistance). Six contamination defenses. Statistical reporting requirements. Reviewer calibration protocol. Read the protocol before reading the table.
02
Per-run trace index and retired tasks
Every Verified run links to a sanitized public_trace.json. Retired tasks (currently joule-003) are kept on disk for audit, with the retirement reason documented in the open.
03
Submitting a model
Verified submissions are ranked above all projection rows. Vendor IP (source code, system prompts, internal reasoning) is never required. Only the observational trace is.

Editor's note

The V0 release of the Joule Index reports evaluated runs across three open-source bug-fix tasks and eight Verified model tiers (Dropstone Fast / Pro / Heavy, Claude Haiku 4.5 / Sonnet 4.6 / Opus 4.7, Gemini 3.1 Flash / Pro), with one further task retired under published methodology rules. All Verified cost figures are computed at each model's list price as of May 2026 from billed token counts. Projection rows shown in The Cap Test section apply published list prices to a reference token budget; no capability claim is made and no Joule Score is computed. All energy figures for closed APIs are estimated from token counts and published per-token rates; direct measurement is scheduled for the V1 release.