The Joule Index
An Independent Benchmark · Reported May 2026
Methodology · Pre-registered

The rules.

Stricter than SWE-bench. Stricter than ARC-AGI. Modeled on MLPerf Power's mandatory-disclosure regime, which is the source of MLPerf's authority as the gold standard of inference benchmarking.

§1

The question the benchmark measures

Given a real bug filed within the last 30 days on a real open-source repository by a real user, can an AI agent produce a Pull Request that a human maintainer would merge unmodified, and at what cost in dollars and joules?

§2

Six forms of validity

  1. Construct validity. First-Review Merge Rate is the headline. 90-day reversion and follow-up-fix rates are also tracked.
  2. Reliability. Every model is run 3× per task at temperature 0 and 3× at temperature 0.7. 95% Wilson CIs on every cell.
  3. Discriminant validity. Sanity-check that known-different capability tiers separate with statistical significance.
  4. Convergent validity. Pearson and Spearman correlations with SWE-bench Verified, Aider polyglot, and LiveCodeBench are reported.
  5. Predictive validity. Six-month post-launch audit of real-world deployment share against Joule Index scores.
  6. Gaming resistance. Blind review, public rubric, 20% audit sampling, sock-puppet defense at maintainer-recruitment stage.
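
The 95% Wilson interval named under Reliability can be sketched as follows. This is a minimal illustration of the standard Wilson score formula, not the harness's own code; the function name and the example counts are hypothetical, and how runs are pooled into a leaderboard cell is not specified here.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    The default z is the two-sided 95% normal quantile.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    centre = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Hypothetical cell: 5 merged PRs out of 10 runs.
    low, high = wilson_interval(5, 10)
    print(f"{low:.4f} .. {high:.4f}")
```

Unlike the plain normal-approximation interval, the Wilson interval stays inside 0..1 and behaves sensibly at small run counts, which matters when each cell holds only a handful of runs.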

§3

Dataset sourcing: hard rules

§4

Disclosure: the Verifiability standard

Every leaderboard entry carries one of two tags:

● Verified
Full observational trace published. Any third party can audit.
● Unverified
Score reported without trace. Sorted below all Verified entries, regardless of headline number.

Source code, system prompts, and internal reasoning are never required. Only the observational trace (tool calls, file reads, tokens, and diff) is required for a Verified score. This is the line MLPerf drew for chip vendors; the Joule Index draws it the same way for agent vendors.
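
The ordering rule above (Unverified entries sort below every Verified entry, whatever their score) amounts to a two-part sort key. The sketch below is illustrative only; the `Entry` type, its field names, and the sample entries are assumptions, not part of the published harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str        # hypothetical model label
    score: float     # headline number, higher is better
    verified: bool   # True if the full observational trace is published


def rank(entries: list[Entry]) -> list[Entry]:
    # Primary key: Verified first (False sorts before True, so negate the flag).
    # Secondary key: score, descending.
    return sorted(entries, key=lambda e: (not e.verified, -e.score))


if __name__ == "__main__":
    board = [
        Entry("model-a", 0.91, verified=False),
        Entry("model-b", 0.62, verified=True),
        Entry("model-c", 0.74, verified=True),
    ]
    for entry in rank(board):
        print(entry.name, entry.score, "Verified" if entry.verified else "Unverified")
```

In this made-up board, model-a has the highest headline number but still ranks last because it carries no trace.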

§5

Contamination defense: six layers

  1. Time-locked sourcing: issues filed after the model's training cutoff
  2. Preference for repos with low Common Crawl coverage
  3. Programmatic mutation of symbols (V1+)
  4. Canary strings probed against frontier models periodically
  5. Private Holdout never published
  6. Submission requires a declared training cutoff
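
Layers 1 and 6 combine into a simple date check: an issue is usable for a given model only if it was filed after that model's declared training cutoff. The sketch below also applies the 30-day recency window from §1. The function name, the strict "after" comparison, and the way the two conditions are combined are assumptions made for illustration.

```python
from datetime import date, timedelta


def is_eligible(issue_filed: date, declared_cutoff: date, run_date: date,
                window_days: int = 30) -> bool:
    """Return True if an issue may be used to evaluate a model.

    Assumed rules:
      * the issue was filed strictly after the model's declared training cutoff;
      * the issue was filed no more than `window_days` before the run date.
    """
    after_cutoff = issue_filed > declared_cutoff
    within_window = timedelta(0) <= (run_date - issue_filed) <= timedelta(days=window_days)
    return after_cutoff and within_window


if __name__ == "__main__":
    # Hypothetical dates.
    print(is_eligible(date(2026, 4, 20), date(2025, 12, 31), date(2026, 5, 1)))
```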

§5

The Joule Score composite (v0.1)

Joule Score = Attention F1 / (1 + 0.5 · $ + 0.5 · kJ)
range 0..1 · higher is better

Cost is in US dollars at list price; kJ is joules per task divided by one thousand. The equal 0.5 weights mean that $1 of cost and 1 kJ of inference energy penalise the score equally, a thermodynamic-economic anchor. A model scoring F1 = 1.000 at $0.10 and 200 J earns Joule Score = 1 / (1 + 0.05 + 0.10) ≈ 0.870. The same F1 at $0.85 and 1.7 kJ earns Joule Score = 1 / (1 + 0.425 + 0.85) ≈ 0.440. Capability is preserved as the multiplicative factor; cost and energy enter the denominator.
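
A direct transcription of the composite formula as written above may help when checking individual entries. The function name and input validation are illustrative assumptions; only the formula itself comes from this section.

```python
def joule_score(f1: float, cost_usd: float, energy_joules: float) -> float:
    """Joule Score = F1 / (1 + 0.5 * $ + 0.5 * kJ), with kJ = joules / 1000."""
    if not 0.0 <= f1 <= 1.0:
        raise ValueError("f1 must lie in [0, 1]")
    if cost_usd < 0 or energy_joules < 0:
        raise ValueError("cost and energy must be non-negative")
    kilojoules = energy_joules / 1000.0
    return f1 / (1.0 + 0.5 * cost_usd + 0.5 * kilojoules)


if __name__ == "__main__":
    # The two worked cases from the text: same F1, different cost and energy.
    print(round(joule_score(1.0, 0.10, 200.0), 3))
    print(round(joule_score(1.0, 0.85, 1700.0), 3))
```

Because the denominator is never below 1, the score can never exceed the underlying F1, which is what keeps the composite in the 0..1 range.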

§5.5

Cache-aware joule calculation (v0.1)

Joules per task are estimated from token counts × published per-token energy rates, with a critical refinement: cache reads are charged at 15% of fresh-input energy. This reflects the underlying physics: a cache hit skips the costly prefill forward pass and only requires loading the KV-cache from memory plus running attention.

The token energy buckets used by the harness are:

Bucket            Energy                Rationale
Fresh input       100%                  Full prefill forward pass
Cache creation    100%                  Prefill computed once, then stored
Cache read        15%                   Skip prefill; load KV-cache + run attention
Output (decode)   500% of fresh input   Decode is the dominant per-token cost

The 15% cache-read ratio sits at the conservative end of the 10–15% range observed in the research literature (Patterson 2021; Luccioni 2024; "Prefill vs Decode Bottlenecks", arXiv 2512.22066, 2025) and is consistent with the 10× pricing discount Anthropic and other vendors apply to cache reads. Direct GPU power measurement on open-weight runs is scheduled for V1.
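
The bucket weights can be applied to a task's token counts roughly as follows. This is a sketch under stated assumptions: the bucket keys, the function name, and the per-token energy rate in the example are hypothetical, and the harness's actual published rates are not reproduced here.

```python
# Energy weight of each bucket, relative to one fresh-input token.
ENERGY_WEIGHTS = {
    "fresh_input": 1.00,     # full prefill forward pass
    "cache_creation": 1.00,  # prefill computed once, then stored
    "cache_read": 0.15,      # skip prefill; load KV-cache and run attention
    "output": 5.00,          # decode
}


def joules_per_task(tokens: dict[str, int], joules_per_fresh_input_token: float) -> float:
    """Estimate task energy as weighted token count times a per-token rate."""
    unknown = set(tokens) - set(ENERGY_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown token buckets: {sorted(unknown)}")
    weighted_tokens = sum(ENERGY_WEIGHTS[bucket] * count for bucket, count in tokens.items())
    return weighted_tokens * joules_per_fresh_input_token


if __name__ == "__main__":
    # Hypothetical task profile and a hypothetical rate of 0.01 J per fresh-input token.
    profile = {"fresh_input": 1_000, "cache_creation": 2_000, "cache_read": 10_000, "output": 500}
    print(joules_per_task(profile, joules_per_fresh_input_token=0.01))
```

In this made-up profile the 10,000 cached tokens contribute the same energy as 1,500 fresh ones, so an agent that re-reads context from cache is charged far less than one that re-sends it.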

§6

Statistical reporting

§6.5

Pricing preview rows (projection-only)

The leaderboard page carries a separate Pricing Preview section listing unsubmitted frontier models with no observational trace. These rows show only what one Joule Index task would cost on each model at the vendor's own published per-token price, applied to the token budget a Verified agent actually consumed. They are a calculation, not a measurement. Attention F1 and Joule Score are blank by design; no capability claim is made; pricing preview rows never enter the leaderboard ranking. Models whose projected cost exceeds the $10 per-task cap are ineligible to compete under the published rules until either (a) they are repriced by their vendor or (b) the vendor submits a Verified run demonstrating a cache-and-decode profile that fits.
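
The projection described above is plain arithmetic: token counts from a Verified run multiplied by a vendor's list prices, then compared against the cap. The sketch below assumes prices quoted per million tokens and uses invented bucket names, prices, and token counts; none of these figures come from a real vendor or a real run.

```python
PER_TASK_CAP_USD = 10.0  # per-task cap stated in the rules


def projected_cost_usd(tokens: dict[str, int], usd_per_million_tokens: dict[str, float]) -> float:
    """Project one task's cost from token counts and per-bucket list prices."""
    return sum(count * usd_per_million_tokens[bucket] / 1_000_000
               for bucket, count in tokens.items())


def preview_row(model: str, tokens: dict[str, int],
                usd_per_million_tokens: dict[str, float]) -> dict:
    cost = projected_cost_usd(tokens, usd_per_million_tokens)
    return {
        "model": model,
        "projected_cost_usd": cost,
        "attention_f1": None,   # blank by design: no capability claim
        "joule_score": None,    # blank by design
        "eligible": cost <= PER_TASK_CAP_USD,
    }


if __name__ == "__main__":
    # Hypothetical token budget and hypothetical list prices.
    budget = {"fresh_input": 200_000, "output": 50_000}
    prices = {"fresh_input": 3.0, "output": 15.0}
    print(preview_row("unsubmitted-model", budget, prices))
```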

§7

V0 limitations declared openly

Full methodology source: methodology.md