Runs · Reported May 2026

Every Verified run, every observational trace, every billed number.

The current release contains 24 Verified runs: 3 real OSS bug-fix tasks, each run on 8 model tiers from Anthropic, Blankline, and Google. One retired task is kept on disk for full audit. Sanitized public_trace.json files are published per the disclosure policy.

Per-run detail · 24 Verified runs

Task	Tier	Vendor	F1	$ / task	J / task	Tokens (in / out)	Wall time (s)	Trace
joule-001	Claude Haiku 4.5	Anthropic	1.000	$0.1660	70	880,778 / 4,783	299	public_trace.json · pending host
joule-001	Claude Opus 4.7	Anthropic	1.000	$0.1318	62	113,883 / 588	16	public_trace.json · pending host
joule-001	Claude Sonnet 4.6	Anthropic	1.000	$0.1169	40	103,889 / 657	19	public_trace.json · pending host
joule-001	Dropstone Fast (current leader)	Blankline	1.000	$0.0117	44	239,079 / 3,297	147	public_trace.json · pending host
joule-001	Dropstone Heavy	Blankline	1.000	$0.2755	545	358,394 / 3,973	186	public_trace.json · pending host
joule-001	Dropstone Pro	Blankline	1.000	$0.2494	107	325,882 / 2,545	161	public_trace.json · pending host
joule-001	Gemini 3.1 Flash	Google	1.000	$0.0356	36	433,737 / 1,180	44	public_trace.json · pending host
joule-001	Gemini 3.1 Pro	Google	1.000	$0.2172	101	154,547 / 5,047	75	public_trace.json · pending host
joule-002	Claude Haiku 4.5	Anthropic	1.000	$0.1592	72	997,281 / 5,419	375	public_trace.json · pending host
joule-002	Claude Opus 4.7	Anthropic	1.000	$0.2001	97	173,397 / 1,807	39	public_trace.json · pending host
joule-002	Claude Sonnet 4.6	Anthropic	1.000	$0.2075	47	138,780 / 2,562	95	public_trace.json · pending host
joule-002	Dropstone Fast (current leader)	Blankline	1.000	$0.0297	74	376,501 / 7,982	3108	public_trace.json · pending host
joule-002	Dropstone Heavy	Blankline	1.000	$0.2449	484	316,201 / 4,036	195	public_trace.json · pending host
joule-002	Dropstone Pro	Blankline	1.000	$0.3190	149	419,502 / 5,756	346	public_trace.json · pending host
joule-002	Gemini 3.1 Flash	Google	1.000	$0.0264	31	229,787 / 1,872	31	public_trace.json · pending host
joule-002	Gemini 3.1 Pro	Google	1.000	$0.2500	107	247,638 / 4,601	87	public_trace.json · pending host
joule-004	Claude Haiku 4.5	Anthropic	1.000	$0.6279	297	4,387,841 / 20,647	328	public_trace.json · pending host
joule-004	Claude Opus 4.7	Anthropic	1.000	$2.7440	1375	2,820,013 / 17,985	403	public_trace.json · pending host
joule-004	Claude Sonnet 4.6	Anthropic	0.182	$1.2448	453	1,622,105 / 23,668	563	public_trace.json · pending host
joule-004	Dropstone Fast (current leader)	Blankline	1.000	$0.2045	555	3,062,902 / 39,620	3798	public_trace.json · pending host
joule-004	Dropstone Heavy	Blankline	1.000	$2.0507	4051	2,692,119 / 24,243	1503	public_trace.json · pending host
joule-004	Dropstone Pro	Blankline	1.000	$0.5172	444	1,273,631 / 15,672	806	public_trace.json · pending host
joule-004	Gemini 3.1 Flash	Google	0.000	$0.1200	140	1,275,389 / 7,679	100	public_trace.json · pending host
joule-004	Gemini 3.1 Pro	Google	0.462	$2.2700	998	2,538,983 / 27,130	369	public_trace.json · pending host
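
Readers who want to check per-tier aggregates against the rows above can do so with a few lines of Python. The sketch below is illustrative only: it copies six rows from the table (Claude Opus 4.7 and Claude Sonnet 4.6 on all three tasks), assumes the columns are tab-separated in the order shown in the header, and assumes "J / task" is a plain integer. The names parse_row and summarize are hypothetical helpers, not part of any published tooling.

```python
from collections import defaultdict

# Six rows copied from the per-run table (tab-separated, trace column omitted).
ROWS = """\
joule-001\tClaude Opus 4.7\tAnthropic\t1.000\t$0.1318\t62\t113,883 / 588\t16
joule-001\tClaude Sonnet 4.6\tAnthropic\t1.000\t$0.1169\t40\t103,889 / 657\t19
joule-002\tClaude Opus 4.7\tAnthropic\t1.000\t$0.2001\t97\t173,397 / 1,807\t39
joule-002\tClaude Sonnet 4.6\tAnthropic\t1.000\t$0.2075\t47\t138,780 / 2,562\t95
joule-004\tClaude Opus 4.7\tAnthropic\t1.000\t$2.7440\t1375\t2,820,013 / 17,985\t403
joule-004\tClaude Sonnet 4.6\tAnthropic\t0.182\t$1.2448\t453\t1,622,105 / 23,668\t563
"""


def parse_row(line):
    """Parse one tab-separated table row into a dict of typed fields."""
    task, tier, vendor, f1, cost, joules, tokens, wall = line.split("\t")[:8]
    # "113,883 / 588" -> (113883, 588)
    tok_in, tok_out = (int(t.strip().replace(",", "")) for t in tokens.split("/"))
    return {
        "task": task,
        "tier": tier,
        "vendor": vendor,
        "f1": float(f1),
        "usd": float(cost.lstrip("$")),
        "joules": int(joules),
        "tokens_in": tok_in,
        "tokens_out": tok_out,
        "wall_s": int(wall),
    }


def summarize(rows):
    """Group parsed rows by tier and compute simple aggregates."""
    by_tier = defaultdict(list)
    for row in rows:
        by_tier[row["tier"]].append(row)
    return {
        tier: {
            "runs": len(runs),
            "mean_f1": sum(r["f1"] for r in runs) / len(runs),
            "total_usd": sum(r["usd"] for r in runs),
            "total_joules": sum(r["joules"] for r in runs),
        }
        for tier, runs in by_tier.items()
    }


summary = summarize([parse_row(line) for line in ROWS.splitlines()])
for tier, stats in sorted(summary.items()):
    print(f"{tier}: {stats['runs']} runs, mean F1 {stats['mean_f1']:.3f}, "
          f"${stats['total_usd']:.4f}, {stats['total_joules']} J")
```

With only three tasks per tier, a single low F1 (such as Claude Sonnet 4.6 on joule-004) moves the mean substantially, so these aggregates are best read alongside the per-run rows rather than in place of them.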

Retired tasks

Excluded from leaderboard, kept for audit

Methodology §3.3 specifies that tasks with inter-rater disagreement are discarded rather than adjudicated. A benchmark that documents its retired tasks publicly is more trustworthy than one that hides them.

Task	Reason for retirement
joule-003	The merged pull request's description listed scope (a backend proxy endpoint) that did not appear in the merged diff. Every evaluated tier followed the description in good faith and produced changes that the diff did not reward. The Blankline Research Team identified the mismatch during inter-rater review and retired the task per methodology §3.3.