Methodology · Pre-registered · Reported May 2026

The rules.

Stricter than SWE-bench. Stricter than ARC-AGI. Modeled on MLPerf Power's mandatory-disclosure regime, which is the source of MLPerf's authority as the gold standard of inference benchmarking.

§1

The question the benchmark measures

Given a real bug filed within the last 30 days on a real open-source repository by a real user, can an AI agent produce a Pull Request that a human maintainer would merge unmodified, and at what cost in dollars and joules?

§2

Six forms of validity

Construct validity. First-Review Merge Rate (FRMR) is the headline metric. 90-day reversion and follow-up-fix rates are also tracked.
Reliability. Every model is run 3× per task at temperature 0 and 3× at temperature 0.7. 95% Wilson CIs on every cell.
Discriminant validity. Sanity-check that known-different capability tiers separate with statistical significance.
Convergent validity. Pearson and Spearman correlations with SWE-bench Verified, Aider polyglot, and LiveCodeBench are reported.
Predictive validity. Six-month post-launch audit of real-world deployment share against Joule Index scores.
Gaming resistance. Blind review, public rubric, 20% audit sampling, sock-puppet defense at maintainer-recruitment stage.
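
The convergent-validity check can be illustrated with a short sketch. It is not the harness's code: the function names are hypothetical, and the inputs are assumed to be per-model scores on the Joule Index and on one external benchmark, listed in the same model order.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Reporting both is useful because Pearson is sensitive to how far apart the scores are, while Spearman only asks whether the two benchmarks order the models the same way.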

§3

Dataset sourcing: hard rules

Repo has ≥500 GitHub stars, active maintenance, permissive license. The 500-star floor balances real-maintainership signal against task availability; repos below this often lack the merged-PR workflow that makes Attention F1 reliable.
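Issue filed strictly after the evaluated model's declared training cutoff. This is the primary contamination defense rather than a guarantee, since a declared cutoff can be inaccurate; §5 lists the additional layers.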
Two senior-engineer reviewers must agree on each task; agreement across the task pool must reach Cohen's κ > 0.7
Inter-rater disagreement → task is discarded, not adjudicated
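
The reviewer-agreement rule can be sketched as below. This is an illustration under assumptions: the source does not say what labels the reviewers assign, so the "accept"/"reject" labels and both function names are hypothetical. Cohen's κ is computed over the whole set of rated tasks, while the discard rule is applied task by task.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same tasks in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each rater's own label frequencies.
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

def keep_agreed_tasks(task_ids, labels_a, labels_b):
    """Discard (rather than adjudicate) every task the two reviewers disagree on."""
    return [t for t, a, b in zip(task_ids, labels_a, labels_b) if a == b]

# Hypothetical ratings for eight candidate tasks.
reviewer_a = ["accept"] * 4 + ["reject"] * 4
reviewer_b = ["accept"] * 3 + ["reject"] * 5
kappa = cohens_kappa(reviewer_a, reviewer_b)
kept = keep_agreed_tasks(list(range(8)), reviewer_a, reviewer_b)
```

In this made-up example the reviewers disagree on one task out of eight, so seven tasks survive and κ works out to 0.75, just above the 0.7 floor.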

§4

Disclosure: the Verifiability standard

Every leaderboard entry carries one of two tags:

● Verified

Full observational trace published. Any third party can audit.

● Unverified

Score reported without trace. Sorted below all Verified entries, regardless of headline number.

Source code, system prompts, and internal reasoning are never required. Only the observational trace (tool calls, file reads, tokens, and diff) is required for a Verified score. This is the line MLPerf drew for chip vendors; the Joule Index draws it the same way for agent vendors.
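
The sorting rule for the two tags can be sketched as follows. The `Entry` type, its field names, and the use of a single `headline_score` number are assumptions made for illustration; only the rule that Unverified entries sort below all Verified entries comes from the text above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    model: str             # hypothetical model identifier
    headline_score: float  # higher is better
    verified: bool         # True only if the full observational trace is published

def rank(entries):
    """Verified entries first; within each group, higher headline score first."""
    return sorted(entries, key=lambda e: (not e.verified, -e.headline_score))

board = rank([
    Entry("model-a", 0.30, verified=True),
    Entry("model-b", 0.90, verified=False),
    Entry("model-c", 0.55, verified=True),
])
order = [e.model for e in board]
```

Here `model-b` has the highest headline number but lands last, because it was reported without a trace.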

§5

Contamination defense: six layers

Time-locked sourcing: issues filed after the model's training cutoff
Preference for repos with low Common Crawl coverage
Programmatic mutation of symbols (V1+)
Canary strings probed against frontier models periodically
Private Holdout never published
Submission requires a declared training cutoff
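
The first and last layers combine into a simple eligibility check, sketched below. The function names and the `(task_id, filed_date)` tuple shape are hypothetical; the sketch only encodes that an issue must be filed strictly after the declared cutoff and that a submission without a declared cutoff cannot be evaluated.

```python
from datetime import date
from typing import Optional

def is_time_locked(issue_filed: date, training_cutoff: Optional[date]) -> bool:
    """True if the issue was filed strictly after the model's declared cutoff."""
    if training_cutoff is None:
        return False  # no declared cutoff, so no task can be considered clean
    return issue_filed > training_cutoff

def eligible_tasks(tasks, training_cutoff):
    """Keep only the (task_id, filed_date) pairs that pass the time lock."""
    return [task_id for task_id, filed in tasks if is_time_locked(filed, training_cutoff)]

# Hypothetical tasks and cutoff.
tasks = [("t1", date(2026, 3, 1)), ("t2", date(2026, 4, 15)), ("t3", date(2026, 4, 1))]
clean = eligible_tasks(tasks, date(2026, 4, 1))
```

The comparison is strict, so an issue filed on the cutoff date itself (`t3` above) is excluded.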

§5

The Joule Score composite (v0.1)

Joule Score = Attention F1 / (1 + 0.5 · $ + 0.5 · kJ)

range 0..1 · higher is better

$ is the cost per task in US dollars at list price; kJ is the energy per task in joules divided by one thousand. The equal 0.5 weights mean that $1 of cost and 1 kJ of inference energy penalise the score equally, a thermodynamic-economic anchor. A model scoring F1 = 1.000 at $0.10 and 200 J has a denominator of 1 + 0.05 + 0.10 = 1.15 and earns Joule Score ≈ 0.870. The same F1 at $0.85 and 1.7 kJ has a denominator of 1 + 0.425 + 0.85 = 2.275 and earns Joule Score ≈ 0.440. Capability is preserved as the multiplicative factor; cost and energy enter the denominator.
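
The composite can be written as a small function. This is a sketch of the v0.1 formula as stated, with hypothetical parameter names; the input validation is an added assumption, not a documented harness behaviour.

```python
def joule_score(attention_f1: float, cost_usd: float, energy_joules: float) -> float:
    """Joule Score = Attention F1 / (1 + 0.5 * $ + 0.5 * kJ)."""
    if not 0.0 <= attention_f1 <= 1.0:
        raise ValueError("Attention F1 must lie in [0, 1]")
    if cost_usd < 0 or energy_joules < 0:
        raise ValueError("cost and energy must be non-negative")
    energy_kj = energy_joules / 1000.0  # the formula takes kilojoules
    return attention_f1 / (1.0 + 0.5 * cost_usd + 0.5 * energy_kj)

# A task solved with F1 = 0.8 at $1.00 and 1000 J: 0.8 / (1 + 0.5 + 0.5) = 0.4
example = joule_score(0.8, 1.00, 1000.0)
```

Because the denominator is never below 1, the score can never exceed the Attention F1 it starts from, which is what keeps the range at 0..1.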

§5.5

Cache-aware joule calculation (v0.1)

Joules per task are estimated from token counts × published per-token energy rates, with a critical refinement: cache reads are charged at 15% of fresh-input energy. This reflects the underlying physics: a cache hit skips the costly prefill forward pass and only requires loading the KV-cache from memory plus running attention.

The token energy buckets used by the harness are:

Bucket	Energy (% of fresh input)	Rationale
Fresh input	100%	Full prefill forward pass
Cache creation	100%	Prefill computed once, then stored
Cache read	15%	Skip prefill; load KV-cache + run attention
Output (decode)	500%	Decode is the dominant per-token cost

The 15% cache-read ratio is the conservative (upper) end of the 10–15% range observed in the research literature (Patterson 2021; Luccioni 2024; "Prefill vs Decode Bottlenecks", arXiv 2512.22066, 2025) and is consistent with the 10× pricing discount Anthropic and other vendors apply to cache reads. Direct GPU power measurement on open-weight runs is scheduled for V1.
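
The bucket weights translate into a per-task estimate as sketched below. The multipliers come from the table above; the bucket keys, the function name, and the per-token rate in the example are hypothetical, and the real harness may structure this differently.

```python
# Energy of each bucket relative to one fresh-input token (from the table above).
RELATIVE_ENERGY = {
    "fresh_input": 1.00,
    "cache_creation": 1.00,
    "cache_read": 0.15,
    "output": 5.00,
}

def estimate_joules(tokens: dict, joules_per_fresh_input_token: float) -> float:
    """Estimate joules per task from token counts in each bucket."""
    unknown = set(tokens) - set(RELATIVE_ENERGY)
    if unknown:
        raise ValueError(f"unknown token buckets: {sorted(unknown)}")
    weighted_tokens = sum(RELATIVE_ENERGY[b] * count for b, count in tokens.items())
    return weighted_tokens * joules_per_fresh_input_token

# Hypothetical task: the 10,000 cache-read tokens count as 1,500 fresh-input
# equivalents, so the weighted total is 1000 + 2000 + 1500 + 2500 = 7000.
usage = {"fresh_input": 1000, "cache_creation": 2000, "cache_read": 10000, "output": 500}
estimate = estimate_joules(usage, joules_per_fresh_input_token=0.01)  # rate is made up
```

Without the cache-aware adjustment the same task would be charged for 15,500 weighted tokens, so the refinement matters most for agents that reread a large cached context.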

§6

Statistical reporting

95% Wilson intervals on every proportion
"Model A > Model B" requires non-overlapping CIs and p < 0.05 on paired bootstrap
Minimum sample size for a published score: n ≥ 30 per (model × category)
Effect sizes (Cohen's h) reported alongside p-values
Pre-registered on the Open Science Framework before public launch
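
The three statistics named above can be sketched in plain Python. This is an illustration, not the pre-registered analysis code: the function names are hypothetical, and the one-sided bootstrap shown is only one of several ways a paired bootstrap p-value can be defined.

```python
import random
from math import asin, sqrt

def wilson_interval(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a proportion (z = 1.96 gives 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for the difference between two proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

def paired_bootstrap_p(a, b, resamples: int = 10000, seed: int = 0) -> float:
    """Share of resamples in which model A does not beat model B.

    a and b are per-task outcomes (e.g. 1 = success, 0 = failure) on the same tasks.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty outcome lists")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    not_better = 0
    for _ in range(resamples):
        resampled_total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if resampled_total <= 0:
            not_better += 1
    return not_better / resamples

low, high = wilson_interval(3, 3)  # a cell with 3 successes in 3 runs
```

With only three runs in a cell, the Wilson interval for a perfect 3/3 still starts below 0.5, which shows why scores built on such small cells are only indicative.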

§6.5

Pricing preview rows (projection-only)

The leaderboard page carries a separate Pricing Preview section listing unsubmitted frontier models with no observational trace. These rows show only what one Joule Index task would cost on each model at the vendor's own published per-token price, applied to the token budget a Verified agent actually consumed. They are a calculation, not a measurement. Attention F1 and Joule Score are blank by design; no capability claim is made; pricing preview rows never enter the leaderboard ranking. Models whose projected cost exceeds the $10 per-task cap are ineligible to compete under the published rules until either (a) they are repriced by their vendor or (b) the vendor submits a Verified run demonstrating a cache-and-decode profile that fits.
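
A pricing preview row reduces to one multiplication per token bucket, as the sketch below shows. The bucket names, prices, and token counts are all hypothetical; only the $10 per-task cap and the "vendor price applied to a Verified agent's token budget" rule come from the text above.

```python
PER_TASK_CAP_USD = 10.0  # per-task cap stated in the rules

def projected_cost_usd(tokens: dict, price_per_million_usd: dict) -> float:
    """Project the cost of one task from a token budget and per-million-token prices."""
    missing = set(tokens) - set(price_per_million_usd)
    if missing:
        raise ValueError(f"no price for buckets: {sorted(missing)}")
    return sum(tokens[b] * price_per_million_usd[b] for b in tokens) / 1_000_000

def within_cap(cost_usd: float) -> bool:
    """A model is eligible only if its projected cost does not exceed the cap."""
    return cost_usd <= PER_TASK_CAP_USD

# Hypothetical token budget from a Verified run, and made-up list prices.
budget = {"fresh_input": 200_000, "cache_read": 1_000_000, "output": 50_000}
prices = {"fresh_input": 3.00, "cache_read": 0.30, "output": 15.00}
cost = projected_cost_usd(budget, prices)  # 0.60 + 0.30 + 0.75
```

The result is a calculation only: it says what the task would cost, not whether the unsubmitted model could solve it.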

§7

V0 limitations declared openly

V0 ships n = 3 per cell. Numbers are indicative; V1 will publish n ≥ 30.
Energy is estimated from token counts × published Wh/token rates with a v0.1 cache-aware adjustment (§5.5). Direct GPU measurement on open-weight runs is scheduled for V1.
FRMR is reported as "eligible" rather than "achieved" until external maintainer review is wired in for V1.
Wall-time has measurement noise from prep-vs-execution gaps; cost and joules come from billing records and are reliable.

Full methodology source: methodology.md