The Joule Index
An Independent Benchmark · Reported May 2026
Leaderboard · Reported May 2026

Same outcome.
10.5× the bill.

Three Dropstone tiers ran the same three real May 2026 open-source bug-fix tasks. Every run produced a diff matching the merged human PR. The flagship tier paid 10.5× more in dollars and 7.5× more in joules for identical engineering output.


Verified entries publish a full observational trace. Costs are computed at each model's published list price as of May 2026 and reproduced from billed token counts. Anyone can re-score.

Current leaderboard · May 2026

Cost × energy, same outcome

Lower-left is better. Brighter cyan = cheaper Dropstone tier. Hollow dashed marker = Unverified.

[Scatter plot. Axes: dollars per merge-ready PR (log scale, ticks at $0.01, $0.10, $1) and joules per task (log scale, ticks at 100 J, 1k J). Points: Dropstone Fast (current leader), Dropstone Pro, Dropstone Heavy, Claude Haiku 4.5, Claude Sonnet 4.6, Claude Opus 4.7, Gemini 3.1 Flash, Gemini 3.1 Pro. The legend also lists OpenAI (projection), xAI (projection) and Pricing preview.]
Source · The Joule Index · Blankline · joule.blankline.org · Reported May 2026

Per-tier averages

The headline numbers

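| # | Tier | Company | Runs | Attention F1 | $ / task | J / task | Joule Score | Disclosure |
|---|------|---------|------|--------------|----------|----------|-------------|------------|
| 1 | Dropstone Fast (current leader) | Blankline (via Dropstone CLI) | 3 | 1.000 | $0.0820 | 224 | 0.883 | Verified |
| 2 | Claude Haiku 4.5 | Anthropic (via Claude Code) | 3 | 1.000 | $0.318 | 146 | 0.825 | Verified |
| 3 | Dropstone Pro | Blankline (via Dropstone CLI) | 3 | 1.000 | $0.362 | 233 | 0.778 | Verified |
| 4 | Claude Opus 4.7 | Anthropic (via Claude Code) | 3 | 1.000 | $1.03 | 511 | 0.703 | Verified |
| 5 | Gemini 3.1 Flash | Google (via Gemini CLI) | 3 | 0.667 | $0.0607 | 69 | 0.646 | Verified |
| 6 | Claude Sonnet 4.6 | Anthropic (via Claude Code) | 3 | 0.727 | $0.523 | 180 | 0.638 | Verified |
| 7 | Gemini 3.1 Pro | Google (via Gemini CLI) | 3 | 0.821 | $0.912 | 402 | 0.629 | Verified |
| 8 | Dropstone Heavy | Blankline (via Dropstone CLI) | 3 | 1.000 | $0.857 | 1693 | 0.563 | Verified |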
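The per-tier averages and the headline ratios can be re-derived from the per-run figures. The sketch below assumes each leaderboard value is a plain arithmetic mean over the tier's three runs (the page does not spell out the averaging rule), and it uses the rounded per-run values as published, so the last digit can differ slightly from the leaderboard. The names `runs` and `tier_average` are illustrative only.

```python
from statistics import mean

# (dollars per task, joules per task) for joule-001, joule-002, joule-004,
# copied from the per-run detail table. Values are the published, rounded ones.
runs = {
    "Dropstone Fast":  [(0.0117, 44), (0.0297, 74), (0.204, 555)],
    "Dropstone Heavy": [(0.276, 545), (0.245, 484), (2.05, 4051)],
    "Gemini 3.1 Pro":  [(0.217, 101), (0.250, 107), (2.27, 998)],
}

def tier_average(tier: str) -> tuple[float, float]:
    """Assumed rule: unweighted mean of dollars and joules across runs."""
    usd = mean(u for u, _ in runs[tier])
    joules = mean(j for _, j in runs[tier])
    return usd, joules

fast_usd, fast_j = tier_average("Dropstone Fast")
heavy_usd, heavy_j = tier_average("Dropstone Heavy")

# Headline comparison: flagship (Heavy) against the cheapest tier (Fast).
cost_ratio = heavy_usd / fast_usd
energy_ratio = heavy_j / fast_j

print(f"Heavy vs Fast: {cost_ratio:.1f}x dollars, {energy_ratio:.1f}x joules")
pro_usd, pro_j = tier_average("Gemini 3.1 Pro")
print(f"Gemini 3.1 Pro: ${pro_usd:.3f} / task, {pro_j:.0f} J / task")
```

Under these assumptions the ratios round to 10.5× and 7.5×, and the Gemini 3.1 Pro row comes out at $0.912 and 402 J, matching the table.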

Per-run detail

Every cell, every number

| Task | Tier | F1 | Prec | Rec | Tokens in / out | $ / task | J / task | Wall s | Source |
|------|------|----|------|-----|-----------------|----------|----------|--------|--------|
| joule-001 | Dropstone Fast | 1.000 | 1.000 | 1.000 | 239,079 / 3,297 | $0.0117 | 44 | 147 | billed |
| joule-001 | Gemini 3.1 Flash | 1.000 | 1.000 | 1.000 | 433,737 / 1,180 | $0.0356 | 36 | 44 | billed |
| joule-001 | Claude Haiku 4.5 | 1.000 | 1.000 | 1.000 | 880,778 / 4,783 | $0.166 | 70 | 299 | billed |
| joule-001 | Dropstone Heavy | 1.000 | 1.000 | 1.000 | 358,394 / 3,973 | $0.276 | 545 | 186 | billed |
| joule-001 | Claude Opus 4.7 | 1.000 | 1.000 | 1.000 | 113,883 / 588 | $0.132 | 62 | 16 | billed |
| joule-001 | Dropstone Pro | 1.000 | 1.000 | 1.000 | 325,882 / 2,545 | $0.249 | 107 | 161 | billed |
| joule-001 | Gemini 3.1 Pro | 1.000 | 1.000 | 1.000 | 154,547 / 5,047 | $0.217 | 101 | 75 | billed |
| joule-001 | Claude Sonnet 4.6 | 1.000 | 1.000 | 1.000 | 103,889 / 657 | $0.117 | 40 | 19 | billed |
| joule-002 | Dropstone Fast | 1.000 | 1.000 | 1.000 | 376,501 / 7,982 | $0.0297 | 74 | 3108 | billed |
| joule-002 | Gemini 3.1 Flash | 1.000 | 1.000 | 1.000 | 229,787 / 1,872 | $0.0264 | 31 | 31 | billed |
| joule-002 | Claude Haiku 4.5 | 1.000 | 1.000 | 1.000 | 997,281 / 5,419 | $0.159 | 72 | 375 | billed |
| joule-002 | Dropstone Heavy | 1.000 | 1.000 | 1.000 | 316,201 / 4,036 | $0.245 | 484 | 195 | billed |
| joule-002 | Claude Opus 4.7 | 1.000 | 1.000 | 1.000 | 173,397 / 1,807 | $0.200 | 97 | 39 | billed |
| joule-002 | Dropstone Pro | 1.000 | 1.000 | 1.000 | 419,502 / 5,756 | $0.319 | 149 | 346 | billed |
| joule-002 | Gemini 3.1 Pro | 1.000 | 1.000 | 1.000 | 247,638 / 4,601 | $0.250 | 107 | 87 | billed |
| joule-002 | Claude Sonnet 4.6 | 1.000 | 1.000 | 1.000 | 138,780 / 2,562 | $0.207 | 47 | 95 | billed |
| joule-004 | Dropstone Fast | 1.000 | 1.000 | 1.000 | 3,062,902 / 39,620 | $0.204 | 555 | 3798 | billed |
| joule-004 | Gemini 3.1 Flash | 0.000 | 0.000 | 0.000 | 1,275,389 / 7,679 | $0.120 | 140 | 100 | billed |
| joule-004 | Claude Haiku 4.5 | 1.000 | 1.000 | 1.000 | 4,387,841 / 20,647 | $0.628 | 297 | 328 | billed |
| joule-004 | Dropstone Heavy | 1.000 | 1.000 | 1.000 | 2,692,119 / 24,243 | $2.05 | 4051 | 1503 | billed |
| joule-004 | Claude Opus 4.7 | 1.000 | 1.000 | 1.000 | 2,820,013 / 17,985 | $2.74 | 1375 | 403 | billed |
| joule-004 | Dropstone Pro | 1.000 | 1.000 | 1.000 | 1,273,631 / 15,672 | $0.517 | 444 | 806 | billed |
| joule-004 | Gemini 3.1 Pro | 0.462 | 0.600 | 0.375 | 2,538,983 / 27,130 | $2.27 | 998 | 369 | billed |
| joule-004 | Claude Sonnet 4.6 | 0.182 | 0.333 | 0.125 | 1,622,105 / 23,668 | $1.24 | 453 | 563 | billed |
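The F1, Prec and Rec columns can be cross-checked against one another. The sketch below assumes F1 is the usual harmonic mean of precision and recall; the page labels the metric "Attention F1" without defining it here, so that reading is an assumption. It recomputes the three joule-004 rows that scored below 1.000 and compares them with the published values. The helper name `f1` and the tolerance are illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (tier, precision, recall, published F1) for the joule-004 rows below 1.000.
rows = [
    ("Gemini 3.1 Pro",    0.600, 0.375, 0.462),
    ("Claude Sonnet 4.6", 0.333, 0.125, 0.182),
    ("Gemini 3.1 Flash",  0.000, 0.000, 0.000),
]

for tier, precision, recall, published in rows:
    computed = f1(precision, recall)
    # Inputs are rounded to three decimals, so allow a small tolerance.
    status = "ok" if abs(computed - published) < 1e-3 else "mismatch"
    print(f"{tier}: computed {computed:.3f}, published {published:.3f} ({status})")
```

All three rows agree with the published F1 to three decimals under this definition.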

§ The Cap Test · Pricing preview

Flagship AI pricing has begun to outrun the benchmark designed to measure it.

Of the 6 commercial frontier models in market this month with no Verified Joule Index entry, 2 are priced such that a single attempt at the reference task, at the vendor's own published per-token rate, would exceed the benchmark's $10 per-task cost ceiling. The chart below applies that price to the token budget a Verified agent consumed on the same task.

Both the calculation and the cap are reproducible from each vendor's published pricing. Under the methodology, any vendor may submit a Verified run at any time, which would move that row from the projection panel into the leaderboard above.
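A list-price projection of this kind reduces to one line of arithmetic. The sketch below applies a per-million-token rate to the token budget of the Dropstone Fast run on joule-004 (3,062,902 input and 39,620 output tokens, from the per-run table) and compares the result with the $10 cap. The rates passed in are hypothetical placeholders chosen for illustration; they are not taken from this page or presented as any vendor's published price, and the function name `projected_cost` is likewise an assumption.

```python
CAP_USD = 10.00             # per-task cost ceiling stated on the page
INPUT_TOKENS = 3_062_902    # Dropstone Fast, joule-004, tokens in
OUTPUT_TOKENS = 39_620      # Dropstone Fast, joule-004, tokens out

def projected_cost(usd_per_m_input: float, usd_per_m_output: float,
                   tokens_in: int = INPUT_TOKENS,
                   tokens_out: int = OUTPUT_TOKENS) -> float:
    """Uncached list-price cost: tokens times rate, rates quoted per million."""
    return (tokens_in * usd_per_m_input + tokens_out * usd_per_m_output) / 1_000_000

# Hypothetical rates: $5 per million input tokens, $30 per million output tokens.
cost = projected_cost(5.00, 30.00)
verdict = "busts cap" if cost > CAP_USD else "under cap"
print(f"${cost:.2f} {verdict}")  # prints: $16.50 busts cap
```

Because input tokens dominate the budget on this task, the input rate drives almost all of the projected cost.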

The Cap Test · joule-004
Cost to attempt one Joule Index task at the Dropstone Fast token budget, priced at each model's May 2026 list rate (uncached). Cap: $10 per task.

| Model | Disclosure | Cost | Cap status |
|-------|------------|------|------------|
| Gemini 3.1 Flash | Verified run | $0.120 | |
| Dropstone Fast | Verified run | $0.204 | |
| Dropstone Pro | Verified run | $0.517 | |
| Claude Haiku 4.5 | Verified run | $0.628 | |
| Grok 4.1 Fast | List-price projection | $0.632 | |
| GPT-5 Mini | List-price projection | $0.845 | |
| Claude Sonnet 4.6 | Verified run | $1.24 | |
| Dropstone Heavy | Verified run | $2.05 | |
| Gemini 3.1 Pro | Verified run | $2.27 | |
| Claude Opus 4.7 | Verified run | $2.74 | |
| Grok 4.3 | List-price projection | $3.93 | |
| GPT-5.4 | List-price projection | $8.25 | |
| GPT-5.5 | List-price projection | $16.50 | ✕ busts cap |
| GPT-5.5 Pro | List-price projection | $99.02 | ✕ busts cap |

Verified run (filled bars): a measured run with a sanitized public observational trace. Tokens, costs, joules and the final diff are published. Any third party can re-score from the trace.

List-price projection (hollow bars): no measured run has been submitted. The cost shown is the vendor's published per-token price applied to the token budget a Verified agent used on this task. The figures are a calculation, not a measurement, and never enter the Joule Score.

Models above the $10 cap are ineligible to compete under the published rules.
Source · The Joule Index · Blankline · joule.blankline.org · Reported May 2026

The reference task is joule-004, an 8-file refactor of Mozilla Common Voice. On smaller tasks the picture is less extreme. On long-horizon work, list pricing places two flagship-tier projections, GPT-5.5 and GPT-5.5 Pro, above the cap and out of contention. The chart is computed from live pricing data and updates whenever the Pricing Preview is rebuilt.


Methods · Further reading

How this leaderboard was produced, and how it can be challenged.

  1. The full methodology, pre-registered

    Six forms of validity (construct, reliability, discriminant, convergent, predictive, gaming-resistance). Six contamination defenses. Statistical reporting requirements. Reviewer calibration protocol. Read the protocol before reading the table.

  2. Per-run trace index and retired tasks

    Every Verified run links to a sanitized public_trace.json. Retired tasks (currently joule-003) are kept on disk for audit, with the retirement reason documented in the open.

  3. Submitting a model

    Verified submissions are ranked above all projection rows. Vendor IP (source code, system prompts, internal reasoning) is never required. Only the observational trace is.

Editor's note

The V0 release of the Joule Index reports evaluated runs across three open-source bug-fix tasks and eight Verified model tiers (Dropstone Fast / Pro / Heavy, Claude Haiku 4.5 / Sonnet 4.6 / Opus 4.7, Gemini 3.1 Flash / Pro), with one further task retired under published methodology rules. All Verified cost figures are computed at each model's list price as of May 2026 from billed token counts. Projection rows shown in The Cap Test section apply published list prices to a reference token budget; no capability claim is made and no Joule Score is computed. All energy figures for closed APIs are estimated from token counts and published per-token rates; direct measurement is scheduled for the V1 release.