The rules.
Stricter than SWE-bench. Stricter than ARC-AGI. Modeled on MLPerf Power's mandatory-disclosure regime, which is the source of MLPerf's authority as the gold standard of inference benchmarking.
The question the benchmark measures
Given a real bug filed within the last 30 days on a real open-source repository by a real user, can an AI agent produce a Pull Request that a human maintainer would merge unmodified, and at what cost in dollars and joules?
Six forms of validity
- Construct validity. First-Review Merge Rate is the headline. 90-day reversion and follow-up-fix rates are also tracked.
- Reliability. Every model is run 3× per task at temperature 0 and 3× at temperature 0.7. 95% Wilson CIs on every cell.
- Discriminant validity. Sanity-check that known-different capability tiers separate with statistical significance.
- Convergent validity. Pearson and Spearman correlations with SWE-bench Verified, Aider polyglot, and LiveCodeBench are reported.
- Predictive validity. Six-month post-launch audit of real-world deployment share against Joule Index scores.
- Gaming resistance. Blind review, public rubric, 20% audit sampling, sock-puppet defense at maintainer-recruitment stage.
Dataset sourcing: hard rules
- Repo has ≥500 GitHub stars, active maintenance, permissive license. The 500-star floor balances real-maintainership signal against task availability; repos below this often lack the merged-PR workflow that makes Attention F1 reliable.
- Issue filed strictly after the evaluated model's declared training cutoff (this blocks direct pre-training exposure to the issue, provided the declared cutoff is accurate)
- Two senior-engineer reviewers must agree (Cohen's κ > 0.7)
- Inter-rater disagreement → task is discarded, not adjudicated
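The reviewer-agreement rule can be illustrated with a minimal sketch. It assumes κ is computed over the two reviewers' labels across a pool of candidate tasks, and that disagreement is handled per task by dropping the task. The function names and the `accept`/`reject` labels are hypothetical and are not taken from the harness.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

def keep_agreed(tasks, labels_a, labels_b):
    """Discard (rather than adjudicate) every task the raters disagree on."""
    return [t for t, a, b in zip(tasks, labels_a, labels_b) if a == b]

# Hypothetical labels for ten candidate tasks.
reviewer_a = ["accept", "accept", "reject", "accept", "reject",
              "reject", "accept", "accept", "reject", "accept"]
reviewer_b = ["accept", "accept", "reject", "accept", "reject",
              "reject", "accept", "accept", "reject", "reject"]
tasks = [f"task-{i}" for i in range(10)]

kappa = cohens_kappa(reviewer_a, reviewer_b)
kept = keep_agreed(tasks, reviewer_a, reviewer_b)
```

In this made-up pool the reviewers disagree on one task, so nine tasks survive and the pool-level κ clears the 0.7 threshold.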
Disclosure: the Verifiability standard
Every leaderboard entry carries one of two tags:
Source code, system prompts, and internal reasoning are never required. Only the observational trace (tool calls, file reads, tokens, and diff) is required for a Verified score. This is the line MLPerf drew for chip vendors; the Joule Index draws it the same way for agent vendors.
Contamination defense: six layers
- Time-locked sourcing: issues filed after the model's training cutoff
- Preference for repos with low Common Crawl coverage
- Programmatic mutation of symbols (V1+)
- Canary strings probed against frontier models periodically
- Private Holdout never published
- Submission requires a declared training cutoff
The Joule Score composite (v0.1)
Cost is in US dollars at list price; kJ is joules per task divided by one thousand. The 0.5 + 0.5 weights treat $1 of cost and 1 kJ of inference energy as equally penalising the score, a thermodynamic-economic anchor. A model scoring F1 = 1.000 at $0.10 and 200 J earns Joule Score ≈ 0.949. The same F1 at $0.85 and 1.7 kJ earns Joule Score ≈ 0.460. Capability is preserved as the multiplicative factor; cost and energy enter the denominator.
Cache-aware joule calculation (v0.1)
Joules per task are estimated from token counts × published per-token energy rates, with a critical refinement: cache reads are charged at 15% of fresh-input energy. This reflects the underlying physics: a cache hit skips the costly prefill forward pass and only requires loading the KV-cache from memory plus running attention.
The token energy buckets used by the harness are:
| Bucket | Energy | Rationale |
|---|---|---|
| Fresh input | 100% | Full prefill forward pass |
| Cache creation | 100% | Prefill computed once, then stored |
| Cache read | 15% | Skip prefill; load KV-cache + run attention |
| Output (decode) | 500% of fresh input | Decode is the dominant per-token cost |
The 15% cache-read ratio is the conservative end of the 10–15% range observed in the research literature (Patterson 2021; Luccioni 2024; "Prefill vs Decode Bottlenecks", arXiv:2512.22066, 2025) and consistent with the 10× pricing discount Anthropic and other vendors apply to cache reads. Direct GPU power measurement on open-weight runs is scheduled for V1.
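As a rough illustration of the bucket table, the sketch below weights each token bucket relative to fresh-input energy and multiplies by a per-token rate. The weights come from the table; the `TokenUsage` type, the function name, and the 0.01 J/token rate are hypothetical placeholders, not harness values.

```python
from dataclasses import dataclass

# Energy weight of each bucket relative to one fresh-input token (from the table).
BUCKET_WEIGHTS = {
    "fresh_input": 1.00,
    "cache_creation": 1.00,
    "cache_read": 0.15,
    "output": 5.00,
}

@dataclass
class TokenUsage:
    fresh_input: int
    cache_creation: int
    cache_read: int
    output: int

def joules_per_task(usage: TokenUsage, joules_per_fresh_input_token: float) -> float:
    """Estimate task energy from token counts and a per-token energy rate."""
    weighted_tokens = sum(
        getattr(usage, bucket) * weight for bucket, weight in BUCKET_WEIGHTS.items()
    )
    return weighted_tokens * joules_per_fresh_input_token

# Hypothetical task: the rate below is an assumed placeholder, not a published figure.
usage = TokenUsage(fresh_input=1_000, cache_creation=2_000, cache_read=10_000, output=500)
energy_j = joules_per_task(usage, joules_per_fresh_input_token=0.01)
energy_kj = energy_j / 1000
```

With these made-up counts the 10,000 cached tokens contribute the same energy as 1,500 fresh-input tokens, while the 500 output tokens contribute as much as 2,500.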
Statistical reporting
- 95% Wilson intervals on every proportion
- "Model A > Model B" requires non-overlapping CIs and p < 0.05 on paired bootstrap
- Minimum sample size for a published score: n ≥ 30 per (model × category)
- Effect sizes (Cohen's h) reported alongside p-values
- Pre-registered on the Open Science Framework before public launch
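Two of the quantities above have closed forms that are easy to sketch. The following is a minimal illustration of the Wilson score interval and Cohen's h using their standard textbook formulas; the function names are hypothetical and the harness's own implementation may differ.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% normal quantile

def wilson_interval(successes: int, n: int, z: float = Z_95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# 15 merges out of 30 attempts versus 0 out of 3.
lo_30, hi_30 = wilson_interval(15, 30)
lo_3, hi_3 = wilson_interval(0, 3)
h = cohens_h(0.5, 0.0)
```

The second call shows why small cells are only indicative: with n = 3 and zero successes, the upper bound of the interval is still above 0.5.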
Pricing preview rows (projection-only)
The leaderboard page carries a separate Pricing Preview section listing unsubmitted frontier models with no observational trace. These rows show only what one Joule Index task would cost on each model at the vendor's own published per-token price, applied to the token budget a Verified agent actually consumed. They are a calculation, not a measurement. Attention F1 and Joule Score are blank by design; no capability claim is made; pricing preview rows never enter the leaderboard ranking. Models whose projected cost exceeds the $10 per-task cap are ineligible to compete under the published rules until either (a) they are repriced by their vendor or (b) the vendor submits a Verified run demonstrating a cache-and-decode profile that fits.
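The projection described here is plain arithmetic, sketched below under stated assumptions. The token budget, the per-million-token prices, and the function names are all hypothetical; only the $10 per-task cap comes from the rules.

```python
PER_TASK_CAP_USD = 10.00  # cap stated in the rules

def projected_cost_usd(tokens: dict, usd_per_million_tokens: dict) -> float:
    """Apply a vendor's list prices to the token budget of a Verified run."""
    return sum(
        count * usd_per_million_tokens[bucket] for bucket, count in tokens.items()
    ) / 1_000_000

def is_eligible(cost_usd: float, cap_usd: float = PER_TASK_CAP_USD) -> bool:
    """A model is ineligible only when its projected cost exceeds the cap."""
    return cost_usd <= cap_usd

# Hypothetical token budget consumed by a Verified agent on one task.
verified_budget = {
    "fresh_input": 200_000,
    "cache_creation": 100_000,
    "cache_read": 2_000_000,
    "output": 50_000,
}
# Hypothetical list prices in USD per million tokens (not any real vendor's).
prices = {"fresh_input": 10.0, "cache_creation": 12.5, "cache_read": 1.0, "output": 50.0}

cost = projected_cost_usd(verified_budget, prices)
eligible = is_eligible(cost)
```

Doubling every price in this example pushes the projected cost past the cap, which is the situation the repricing and Verified-run clauses address.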
V0 limitations declared openly
- V0 ships n = 3 per cell. Numbers are indicative; V1 will publish n ≥ 30.
- Energy is estimated from token counts × published Wh/token rates with a v0.1 cache-aware adjustment (§5.5). Direct GPU measurement on open-weight runs is planned for V1.
- First-Review Merge Rate (FRMR) is reported as "eligible" rather than "achieved" until external maintainer review is wired in for V1.
- Wall-time has measurement noise from prep-vs-execution gaps; cost and joules come from billing records and are reliable.
Full methodology source: methodology.md