Runs · Reported May 2026
Every Verified run, every observational trace, every billed number.
The current release contains 24 Verified runs across 3 real OSS bug-fix tasks and 8 model tiers from Anthropic, Blankline, and Google, plus one retired task kept on disk for full audit. Sanitized public_trace.json files are published per the disclosure policy.
Per-run detail · 24 Verified runs
| Task | Tier | Vendor | F1 | $ / task | J / task | Tokens in / out | Wall s | Trace |
|---|---|---|---|---|---|---|---|---|
| joule-001 | Claude Haiku 4.5 | Anthropic | 1.000 | $0.1660 | 70 | 880,778 / 4,783 | 299 | public_trace.json · pending host |
| joule-001 | Claude Opus 4.7 | Anthropic | 1.000 | $0.1318 | 62 | 113,883 / 588 | 16 | public_trace.json · pending host |
| joule-001 | Claude Sonnet 4.6 | Anthropic | 1.000 | $0.1169 | 40 | 103,889 / 657 | 19 | public_trace.json · pending host |
| joule-001 | Dropstone Fast (current leader) | Blankline | 1.000 | $0.0117 | 44 | 239,079 / 3,297 | 147 | public_trace.json · pending host |
| joule-001 | Dropstone Heavy | Blankline | 1.000 | $0.2755 | 545 | 358,394 / 3,973 | 186 | public_trace.json · pending host |
| joule-001 | Dropstone Pro | Blankline | 1.000 | $0.2494 | 107 | 325,882 / 2,545 | 161 | public_trace.json · pending host |
| joule-001 | Gemini 3.1 Flash | Google | 1.000 | $0.0356 | 36 | 433,737 / 1,180 | 44 | public_trace.json · pending host |
| joule-001 | Gemini 3.1 Pro | Google | 1.000 | $0.2172 | 101 | 154,547 / 5,047 | 75 | public_trace.json · pending host |
| joule-002 | Claude Haiku 4.5 | Anthropic | 1.000 | $0.1592 | 72 | 997,281 / 5,419 | 375 | public_trace.json · pending host |
| joule-002 | Claude Opus 4.7 | Anthropic | 1.000 | $0.2001 | 97 | 173,397 / 1,807 | 39 | public_trace.json · pending host |
| joule-002 | Claude Sonnet 4.6 | Anthropic | 1.000 | $0.2075 | 47 | 138,780 / 2,562 | 95 | public_trace.json · pending host |
| joule-002 | Dropstone Fast (current leader) | Blankline | 1.000 | $0.0297 | 74 | 376,501 / 7,982 | 3108 | public_trace.json · pending host |
| joule-002 | Dropstone Heavy | Blankline | 1.000 | $0.2449 | 484 | 316,201 / 4,036 | 195 | public_trace.json · pending host |
| joule-002 | Dropstone Pro | Blankline | 1.000 | $0.3190 | 149 | 419,502 / 5,756 | 346 | public_trace.json · pending host |
| joule-002 | Gemini 3.1 Flash | Google | 1.000 | $0.0264 | 31 | 229,787 / 1,872 | 31 | public_trace.json · pending host |
| joule-002 | Gemini 3.1 Pro | Google | 1.000 | $0.2500 | 107 | 247,638 / 4,601 | 87 | public_trace.json · pending host |
| joule-004 | Claude Haiku 4.5 | Anthropic | 1.000 | $0.6279 | 297 | 4,387,841 / 20,647 | 328 | public_trace.json · pending host |
| joule-004 | Claude Opus 4.7 | Anthropic | 1.000 | $2.7440 | 1375 | 2,820,013 / 17,985 | 403 | public_trace.json · pending host |
| joule-004 | Claude Sonnet 4.6 | Anthropic | 0.182 | $1.2448 | 453 | 1,622,105 / 23,668 | 563 | public_trace.json · pending host |
| joule-004 | Dropstone Fast (current leader) | Blankline | 1.000 | $0.2045 | 555 | 3,062,902 / 39,620 | 3798 | public_trace.json · pending host |
| joule-004 | Dropstone Heavy | Blankline | 1.000 | $2.0507 | 4051 | 2,692,119 / 24,243 | 1503 | public_trace.json · pending host |
| joule-004 | Dropstone Pro | Blankline | 1.000 | $0.5172 | 444 | 1,273,631 / 15,672 | 806 | public_trace.json · pending host |
| joule-004 | Gemini 3.1 Flash | Google | 0.000 | $0.1200 | 140 | 1,275,389 / 7,679 | 100 | public_trace.json · pending host |
| joule-004 | Gemini 3.1 Pro | Google | 0.462 | $2.2700 | 998 | 2,538,983 / 27,130 | 369 | public_trace.json · pending host |
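Readers who want to cross-check the table can aggregate it themselves. The following is a minimal sketch, not the leaderboard's scoring method: it copies the F1 and $ / task values from the eight joule-004 rows above and computes a run count, the number of perfect-F1 runs, an unweighted mean F1, and the summed cost. The function name `summarize` and the tuple layout are illustrative assumptions, not part of any published tooling.

```python
from collections import defaultdict

# (task, tier, f1, usd_per_task), copied by hand from the joule-004 rows above.
RUNS = [
    ("joule-004", "Claude Haiku 4.5", 1.000, 0.6279),
    ("joule-004", "Claude Opus 4.7", 1.000, 2.7440),
    ("joule-004", "Claude Sonnet 4.6", 0.182, 1.2448),
    ("joule-004", "Dropstone Fast", 1.000, 0.2045),
    ("joule-004", "Dropstone Heavy", 1.000, 2.0507),
    ("joule-004", "Dropstone Pro", 1.000, 0.5172),
    ("joule-004", "Gemini 3.1 Flash", 0.000, 0.1200),
    ("joule-004", "Gemini 3.1 Pro", 0.462, 2.2700),
]


def summarize(runs):
    """Group runs by task and return simple per-task aggregates.

    Hypothetical helper: the unweighted mean is an illustrative choice,
    not a statistic the benchmark itself reports.
    """
    by_task = defaultdict(list)
    for task, _tier, f1, usd in runs:
        by_task[task].append((f1, usd))

    summary = {}
    for task, rows in by_task.items():
        f1_values = [f1 for f1, _ in rows]
        summary[task] = {
            "runs": len(rows),
            "perfect": sum(1 for f1 in f1_values if f1 == 1.0),
            "mean_f1": sum(f1_values) / len(f1_values),
            "total_usd": sum(usd for _, usd in rows),
        }
    return summary


if __name__ == "__main__":
    for task, stats in summarize(RUNS).items():
        print(task, stats)
```

Extending `RUNS` with the joule-001 and joule-002 rows gives the same aggregates for the full set of 24 runs.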
Retired tasks
Excluded from leaderboard, kept for audit
Methodology §3.3 specifies that tasks with inter-rater disagreement are discarded rather than adjudicated. A benchmark that documents its retired tasks publicly is more trustworthy than one that hides them.
| Task | Reason for retirement |
|---|---|
| joule-003 | The merged Pull Request's description listed scope (a backend proxy endpoint) that did not appear in the merged diff. Every evaluated tier followed the description in good faith and produced changes the diff did not reward. The Blankline Research Team identified the mismatch during inter-rater review and retired the task per methodology §3.3. |