What is the Joule Index?

The Joule Index is an independent benchmark that measures the dollar cost, joule cost, and human-merge-readiness of frontier AI coding agents on real open-source bug fixes filed within the last 30 days. Every leaderboard entry is published under a mandatory Verified-disclosure regime modelled on MLPerf Power.

How is the Joule Index different from SWE-bench?

SWE-bench reports task-completion percentages. The Joule Index reports task completion, dollar cost, joule cost, wall-clock time, and merge-readiness together on one chart, and requires every score to be accompanied by a sanitized observational trace that any third party can audit. In addition, SWE-bench tasks are drawn from historical issues; Joule Index tasks are filed strictly after the evaluated model's training cutoff to prevent contamination.

Who runs the Joule Index?

The Joule Index is built and maintained by the Blankline Research Team, an independent research organisation. Blankline also builds Dropstone CLI, the reference coding agent used to validate the evaluation harness. Vendor submissions are reviewed under the same Verified-disclosure rules that apply to Blankline's own runs.

What does the Joule Score formula mean?

Joule Score = Attention F1 / (1 + 0.5 · $ + 0.5 · kJ). Attention F1 captures whether the model touched the same files a human maintainer would. $ is the cost per task in US dollars at vendor list price. kJ is joules per task divided by one thousand. The score is bounded between zero and one. Higher is better. Capability sits in the numerator; cost and energy enter the denominator, so either one can only lower the score.

How are energy and cost calculated?

Energy is estimated from token counts multiplied by published per-token Wh rates from peer-reviewed sources, with a cache-aware adjustment that charges cache-read tokens at 15 percent of fresh-input energy. Cost is computed from billed token counts against each vendor's published list price as of the report date. Both numbers are reproducible from each run's public trace.
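
As a rough illustration of the estimate described above, the following Python sketch computes energy and cost for one billed API call from its token counts. Only the 15 percent cache-read factor comes from the text. The names TokenUsage, Rates, estimate_joules and estimate_cost_usd are hypothetical, the rate values must be supplied by the caller, and the split into separate input, cache-read and output rates is an assumption rather than the harness's actual interface.

```python
from dataclasses import dataclass

# From the text: cache-read tokens are charged at 15% of fresh-input energy.
CACHE_READ_ENERGY_FACTOR = 0.15
JOULES_PER_WH = 3600.0  # 1 watt-hour = 3600 joules


@dataclass
class TokenUsage:
    """Billed token counts for one API call (hypothetical structure)."""
    fresh_input: int
    cache_read: int
    output: int


@dataclass
class Rates:
    """Per-token energy and list-price rates (hypothetical structure).

    Values are placeholders supplied by the caller; none are taken
    from the Joule Index methodology.
    """
    wh_per_input_token: float
    wh_per_output_token: float
    usd_per_million_input: float
    usd_per_million_cache_read: float
    usd_per_million_output: float


def estimate_joules(usage: TokenUsage, rates: Rates) -> float:
    """Estimate energy in joules from token counts and per-token Wh rates."""
    wh = (
        usage.fresh_input * rates.wh_per_input_token
        + usage.cache_read * rates.wh_per_input_token * CACHE_READ_ENERGY_FACTOR
        + usage.output * rates.wh_per_output_token
    )
    return wh * JOULES_PER_WH


def estimate_cost_usd(usage: TokenUsage, rates: Rates) -> float:
    """Compute cost in US dollars from billed tokens and list prices."""
    return (
        usage.fresh_input * rates.usd_per_million_input
        + usage.cache_read * rates.usd_per_million_cache_read
        + usage.output * rates.usd_per_million_output
    ) / 1_000_000
```

For a multi-turn run, the per-call results would be summed across every billed call in the trace.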

What does Verified disclosure require?

A Verified score requires publication of the full observational trace: every tool call, file read, billed token, and final diff produced by the agent. Source code, system prompts, and internal reasoning are never required. The line is drawn at what is necessary to audit the result, not at what is necessary to reproduce the agent.
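
One way to picture the requirement is as a completeness check over a published trace. The sketch below is illustrative only: the ObservationalTrace structure, its field names and the missing_for_verified helper are hypothetical and do not describe the actual trace format. It assumes the four items named above (tool calls, file reads, billed tokens, final diff) are the required parts, and that an empty list is a valid record while an absent field is not.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ObservationalTrace:
    """Hypothetical container for the four published trace components.

    None means "not published"; an empty list means "published, and
    nothing happened" (for example, an agent that read no files).
    """
    tool_calls: Optional[List[str]] = None
    files_read: Optional[List[str]] = None
    billed_tokens: Optional[int] = None
    final_diff: Optional[str] = None


def missing_for_verified(trace: ObservationalTrace) -> List[str]:
    """Return the names of required components that were not published."""
    required = {
        "tool_calls": trace.tool_calls,
        "files_read": trace.files_read,
        "billed_tokens": trace.billed_tokens,
        "final_diff": trace.final_diff,
    }
    return [name for name, value in required.items() if value is None]
```

Source code, system prompts and internal reasoning are deliberately absent from the structure, since the text says they are never required.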

About · The Joule Index

The benchmark that asks what intelligence costs.

A research note by the Blankline Research Team · 18 May 2026

Blankline, an independent research organisation in Chennai, has published the first AI leaderboard that scores autonomous coding agents on dollars, joules, and human-merge-readiness on the same chart. The team says it built the chart because no one else had.

The race for artificial general intelligence is, by most public estimates, the most expensive engineering programme humanity has ever attempted. Power-purchase agreements signed by frontier laboratories now run into multiple gigawatts. Datacenter electricity demand is on track to roughly double by the end of the decade. One of the largest single training runs is estimated to consume more energy than seventy thousand American households use in a year. Yet to the simplest possible question about any of it ("how much did this answer cost?"), the AI benchmarking field has, until now, offered no public chart.

The Joule Index, released today in preview by Blankline, is an attempt to provide one. Built outside the academy, outside the frontier laboratories, and outside the major standards consortia, the leaderboard scores autonomous coding agents on three co-equal axes (capability, dollars, and joules) using live open-source bug-fix tasks filed within the last thirty days. Every score on the leaderboard is paired with a sanitized, publicly auditable observational trace, modelled on the disclosure regime MLCommons established for hardware vendors in MLPerf Power.

§1

The question the field stopped asking

Every public AI benchmark in active use as of May 2026 reports the same kind of number: a percentage representing how often a model solves a defined task. The practice has held since at least HumanEval in 2021. What the practice does not report is the cost of producing the percentage. Not in dollars. Not in joules. Not in carbon. Not in the human engineering hour the agent is being measured against.

Blankline's research team argues that this omission has become unsustainable. Cognitive labor is becoming a commodity. The price of that commodity, denominated in money and electricity, will determine which jurisdictions write the next layer of software, which medical questions get answered in the global south, which engineering problems get solved at the bottom of the income pyramid. Treated this way, AI accessibility is not a marketing variable. It is a developmental one.

A benchmark score that cannot be verified is not a benchmark. It is marketing.

§2

How the Joule Index is scored

Every entry on the leaderboard reports four numbers: Attention F1, dollars per task, joules per task, and a composite the team calls the Joule Score. The formula is short by design.

Joule Score · v0.1

Joule Score = Attention F1 / (1 + 0.5 · $ + 0.5 · kJ)

higher is better · range 0 to 1 · $1 of cost and 1 kJ of inference energy are penalised equally
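
A minimal Python sketch of the formula as stated, for readers who want to check a score by hand. The function name joule_score and its argument names are illustrative and not taken from the published harness; the input validation is an added assumption.

```python
def joule_score(attention_f1: float, usd: float, joules: float) -> float:
    """Joule Score v0.1: Attention F1 / (1 + 0.5 * $ + 0.5 * kJ).

    attention_f1: value in [0, 1]
    usd:          cost per task in US dollars
    joules:       energy per task in joules (converted to kJ below)
    """
    if not 0.0 <= attention_f1 <= 1.0:
        raise ValueError("attention_f1 must lie in [0, 1]")
    if usd < 0 or joules < 0:
        raise ValueError("cost and energy must be non-negative")
    kilojoules = joules / 1000.0
    return attention_f1 / (1.0 + 0.5 * usd + 0.5 * kilojoules)


# A free, zero-energy run with perfect Attention F1 hits the upper bound.
print(joule_score(1.0, usd=0.0, joules=0.0))

# One dollar and one kilojoule are penalised equally.
print(joule_score(0.9, usd=1.0, joules=0.0))
print(joule_score(0.9, usd=0.0, joules=1000.0))
```

Because the denominator is never below one, the score can never exceed Attention F1 itself, which is what keeps it in the range 0 to 1.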

Attention F1 is the harmonic mean of two ratios: how many of the files an agent touched were actually relevant to the task ("precision"), and how many of the relevant files the agent touched at all ("recall"). The ground truth is the file set a human maintainer merged into production. According to the published methodology, the benchmark authors do not write the test. The grader is the production repository itself.
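
The definition above can be sketched in a few lines of Python. The function name attention_f1 and the example file paths are hypothetical; treating a run with no overlap as scoring zero is an assumption made here to avoid division by zero.

```python
from typing import Iterable


def attention_f1(touched: Iterable[str], relevant: Iterable[str]) -> float:
    """Harmonic mean of file-level precision and recall.

    touched:  files the agent modified
    relevant: files in the diff the human maintainer merged (ground truth)
    """
    touched_set, relevant_set = set(touched), set(relevant)
    overlap = len(touched_set & relevant_set)
    if overlap == 0:
        # Assumption: no shared files (or an empty set) scores zero.
        return 0.0
    precision = overlap / len(touched_set)   # touched files that were relevant
    recall = overlap / len(relevant_set)     # relevant files that were touched
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the agent touches three files, two of which
# appear in a four-file maintainer diff.
agent_files = ["src/a.py", "src/b.py", "src/c.py"]
merged_files = ["src/a.py", "src/b.py", "src/d.py", "tests/test_e.py"]
print(attention_f1(agent_files, merged_files))
```

In this example precision is 2/3 and recall is 1/2, so the harmonic mean is 4/7.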

Cost is taken at each model's published list price as of May 2026 and summed across every billed API call the agent made during its full multi-turn loop. Joules are currently estimated from token counts and published per-token energy rates; direct measurement on open-weight runs is scheduled for the V1 release. The team has published an open-source evaluation harness on GitHub.

The composite score is deliberately ungenerous both to capability bought with money and to capability bought with electricity. Under the formula, a model that scores a perfect Attention F1 at two dollars and three kilojoules a task earns a Joule Score of about 0.29, less than half the 0.68 earned by a model scoring 0.85 at twenty cents and three hundred joules. Energy and cost are weighted equally, on the rough thermodynamic anchor that one kilojoule of inference work and one dollar of inference cost are both penalties the planet pays at comparable scale. That math, the team argues, is the one a procurement officer, a climate scientist, and a worker on the median global wage would each recognise as fair.

§3

Intelligence as a thermodynamic phenomenon

Every joule of inference electricity drawn by an AI model comes off a power grid. Rolf Landauer's 1961 result placed a physical lower bound on the energy required to erase a single bit of information at finite temperature. Practical frontier inference today runs roughly nine orders of magnitude above that bound. The gap is enormous, and the next half-century of computer architecture is, in effect, a long effort to close it.
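
For reference, Landauer's bound is k_B · T · ln 2 per erased bit, where k_B is the Boltzmann constant and T the absolute temperature. The short Python sketch below evaluates it; the choice of 300 K as a representative room temperature and the helper names are assumptions made for illustration.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # exact value in the 2019 SI definition


def landauer_limit_joules(temperature_kelvin: float) -> float:
    """Minimum energy to erase one bit at the given temperature."""
    if temperature_kelvin <= 0:
        raise ValueError("temperature must be positive")
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


def orders_above_limit(joules_per_bit: float, temperature_kelvin: float) -> float:
    """How many orders of magnitude a per-bit energy sits above the bound."""
    return math.log10(joules_per_bit / landauer_limit_joules(temperature_kelvin))


# Assumed room temperature of 300 K: roughly 2.87e-21 joules per bit.
print(landauer_limit_joules(300.0))
```

At that temperature, a process running nine orders of magnitude above the bound would spend on the order of a few picojoules per bit operation.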

Until that gap closes, the energy cost of cognition is real. According to the Joule Index V0 data, an eight-file Mozilla Common Voice fix consumed roughly four thousand joules of inference energy on Blankline's most expensive Dropstone tier (Kimi K2.6) and roughly twelve hundred joules on its cheapest (DeepSeek V4-Flash). The merged diff was the same in both cases. The bill differed by more than three to one in energy and by more than ten to one in dollars.

Scaling that delta is what makes the chart consequential. The world has roughly twenty-seven million professional software developers. If each substituted one engineering hour per day with autonomous AI on the Joule Index's worst case, the daily delta between the cheapest and most expensive tier would exceed seventy megawatt-hours, comparable to the daily output of a small utility-scale wind farm.

The flagship premium is paying for compute, not capability.

§4

The benchmarks that came close

Six prior attempts have been made at parts of what the Joule Index does. Blankline cites all of them in its methodology and says the foundation each laid is what made the new chart possible. The team's argument is that the foundation was never completed.

Princeton, in SWE-bench Verified and SWE-bench Pro, established real GitHub-issue evaluation as the field standard. Its leaderboards publish a single percentage. They do not include cost. They do not include energy.
Microsoft Research, in SWE-bench Live, demonstrated that contamination-resistant monthly task refresh is feasible at scale. Cost and energy remain absent from the public leaderboard.
The ARC Prize Foundation (drawing on long collaboration with MIT researchers) shipped the most durable, hardest-to-saturate benchmark in the field. ARC-AGI caps inference compute in its prize rules, but its tasks are symbolic puzzles, not production engineering, and the public chart shows neither dollars nor joules.
Stanford CRFM, in HELM and HELM-Efficiency, built the cleanest academic framework for reporting efficiency alongside capability. The efficiency axis covers generic prompts rather than agentic engineering, and industry treats the efficiency tab as optional.
MLCommons created the gold standard of transparent power-disclosure in MLPerf Power. The regime governs hardware vendors; no equivalent exists for agent vendors. The Joule Index is, on Blankline's reading, the first public application of MLPerf Power's mandatory-disclosure philosophy to autonomous coding work.
Hugging Face and Salesforce, in the AI Energy Score and TokenPowerBench projects, made direct energy measurement on real models a published practice. The measurements run on synthetic prompts, not on autonomous coding workloads, and so do not yet inform procurement decisions about real engineering tasks.

Each project contributed a piece. None published a leaderboard in which (a) tasks are real OSS bug fixes filed within the last thirty days, (b) the agent runs end-to-end multi-turn loops, (c) every billed token is accounted for, (d) joules appear on the same chart as dollars and capability, and (e) every score carries a sanitized public observational trace that any third party can re-score. The Joule Index is the first leaderboard, on Blankline's account, that does all five.

Asked why the chart had not appeared from a larger institution, Blankline's CEO Santosh Arron pointed to incentive structure. Academic benchmark teams publish capability because capability is what gets cited. Frontier laboratories publish capability because cost is a procurement disadvantage. Standards consortia publish what their vendor members agree to disclose. The constituency the Joule Index addresses, Arron said, is none of those: it is the worker who pays for intelligence at retail, the maintainer whose code is being evaluated, the climate scientist who counts the joules, and the public administrator who needs to know whether AI is affordable enough to plan a country on.

§5

Blankline's stated mission

Blankline was founded, the company says, on a single conviction: intelligence should be available to everyone in a civilization at a price they can pay. Not as a luxury good. Not as enterprise pricing tiered against productivity. The company frames AI as a utility that should approach the affordability of electricity or running water, engineered down through transparency, competition, and unrelenting attention to the cost of every token.

That conviction has a number attached to it. The global median individual income is approximately ten dollars per day. At Dropstone Fast list pricing, a human earning the global median can afford roughly one hundred and twenty real engineering tasks per day. At Dropstone Heavy pricing, the same person can afford eleven. At flagship-tier closed-API list pricing, fewer than half a task. Those are the practical stakes of model pricing for the majority of the world's working population.
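
The arithmetic behind those figures is a single division, sketched here in Python. The income and task counts come from the paragraph above; the per-task prices printed are implied by them rather than quoted from a price list, the 25-dollar flagship figure is a purely hypothetical example, and the function names are illustrative.

```python
def implied_cost_per_task(daily_income_usd: float, tasks_per_day: float) -> float:
    """Per-task price implied by an income and an affordable task count."""
    return daily_income_usd / tasks_per_day


def affordable_tasks(daily_income_usd: float, cost_per_task_usd: float) -> float:
    """Number of tasks a daily income buys at a given per-task price."""
    return daily_income_usd / cost_per_task_usd


MEDIAN_DAILY_INCOME_USD = 10.0  # approximate figure stated in the text

# 120 tasks per day implies roughly 8 cents per task.
print(implied_cost_per_task(MEDIAN_DAILY_INCOME_USD, 120))

# 11 tasks per day implies roughly 91 cents per task.
print(implied_cost_per_task(MEDIAN_DAILY_INCOME_USD, 11))

# "Fewer than half a task" corresponds to any price above 20 dollars;
# 25 dollars is a hypothetical value used only to show the calculation.
print(affordable_tasks(MEDIAN_DAILY_INCOME_USD, 25.0))
```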

An AGI that is only economically accessible to the top decile of the OECD, the company argues, is not the AGI worth building. The engineering challenge of the coming decade, Blankline says, is not whether AI can think, but whether AI can think for a farmer in Kenya, a clinic worker in Bangladesh, and a small-press journalist in Argentina on the same day, for the same price, at the same quality.

§6

The Blankline Standard

Industry transparency, on most published measures, is in decline. The Stanford Foundation Model Transparency Index reports that average vendor transparency scores fell from fifty-eight out of one hundred in 2024 to forty in 2025. Several of the largest frontier laboratories now publish neither training-data summaries nor compute estimates nor energy reports, and at least two declined to publish complete technical reports for their most recent flagship models.

The Joule Index publishes against that current under what Blankline calls the Blankline Standard, a four-point disclosure regime.

Verified disclosure or nothing. Every leaderboard entry carries either a Verified tag (full observational trace published) or an Unverified tag (sorted below all Verified entries regardless of headline score). Source code, system prompts, and internal reasoning remain private; the observational trace, the record of what the agent did to the task workspace, is published in full.
Real tasks, not curated ones. Tasks are live GitHub issues filed within the last thirty days. Issues are accepted only if filed strictly after the evaluated model's training cutoff, a single rule that, the team says, mathematically prevents contamination.
Cost and energy on the same chart as capability. A leaderboard that publishes accuracy without publishing cost is, on the methodology's reading, a leaderboard whose numbers cannot survive a procurement review or a climate audit.
Methodology pre-registered, retirements documented openly. The full protocol is to be pre-registered on the Open Science Framework before the V1 public launch; the deposit is currently pending. When a task is retired, the reason is published verbatim and the retired data remains on disk for audit.
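
The first two points of the regime reduce to simple rules, sketched below in Python. Entry, rank_entries and task_is_eligible are hypothetical names, not the harness's API. Ordering entries by Joule Score within each group is an assumption; the text only states that Unverified entries sort below all Verified ones.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Entry:
    """One leaderboard row (hypothetical structure)."""
    name: str
    joule_score: float
    verified: bool


def rank_entries(entries: List[Entry]) -> List[Entry]:
    """Verified entries first; within each group, higher score first."""
    return sorted(entries, key=lambda e: (not e.verified, -e.joule_score))


def task_is_eligible(
    issue_filed: date,
    training_cutoff: date,
    evaluation_date: date,
    window_days: int = 30,
) -> bool:
    """Issue must be filed strictly after the model's training cutoff
    and within the last `window_days` days before evaluation."""
    age = evaluation_date - issue_filed
    within_window = timedelta(0) <= age <= timedelta(days=window_days)
    return issue_filed > training_cutoff and within_window
```

Under this ordering, an Unverified entry with the highest headline score on the board still appears below the weakest Verified entry.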

§7

What V0 found

The V0 release reports nine evaluated runs across three real OSS bug-fix tasks and three Dropstone model tiers, with one further task retired under the published methodology. Every evaluated run produced a pull request matching the diff a human maintainer had merged. The flagship tier paid 10.4 times more in dollars and 3.6 times more in joules for that identical outcome. On the largest task (an eight-file change to Mozilla's Common Voice bundler), the gap reached more than two dollars per Heavy run against less than ten cents per Fast run.

Blankline says the V0 finding is one preliminary observation rather than a final result. Over the next twelve months, the Joule Index is scheduled to scale to thirty tasks per model per month, with a rotating private holdout to defeat contamination and an open invitation for every frontier laboratory operating today to submit. The team identifies three findings it expects to emerge with statistical force at the larger sample size.

The cost spread within a single coding workload is likely not three or four times. The team estimates it is closer to fifty or one hundred times once frontier closed models are included on the same chart.
Models without prompt caching compound cost across multi-turn agent loops by factors that are invisible to single-call benchmarks. The team argues this becomes a procurement and climate variable rather than a purely engineering one.
The inversion thesis, that flagship tiers buy compute rather than capability, will hold across more than half of practical bug-fix work. Where it breaks, the failures themselves will become the most informative signal the field has produced on what flagship inference is actually for.

§8

The civilizational stake

Every architectural choice frontier labs make, whether to cache prompts, how to price tiers, whether to publish energy figures, shapes the affordability of the next decade of cognitive labor. Every dollar a procurement team spends on AI is a dollar that does not go to housing, healthcare, or scientific research. Every joule consumed by an inefficient model is, on a finite grid, a joule a more efficient model would not have asked for. These are not abstractions: they appear as line items in budgets, on electricity bills, in carbon ledgers, and in the development plans of low- and middle-income countries that will live or die by what intelligence costs when they need it.

The Joule Index is, in its authors' view, the only public benchmark that today holds those choices accountable on the open record. Whether other research bodies and laboratories adopt the same disclosure regime will, the team says, be a better measure of the field's seriousness than any one model's leaderboard ranking.

If intelligence has a price, this is the place where the price is paid in public.

§8.5

The Cap Test

Alongside the leaderboard, Blankline publishes a Pricing Preview covering every commercial frontier model that has not yet returned a Verified run. The benchmark is open for submission. Each preview row shows what one Joule Index task would cost on that model at the vendor's own published per-token rate, applied to the token budget a Verified agent consumed on the same task. The figures are a calculation, not a measurement. They carry no capability score and never enter the leaderboard ranking.

On the long-horizon reference task, several flagship-tier models from major commercial vendors are priced such that a single attempt at the Verified-agent token budget exceeds the benchmark's published per-task cap, in the most extreme case by close to an order of magnitude. Models above the cap are ineligible to compete under the published rules until their pricing changes or the vendor submits a Verified run demonstrating a cache and decode profile that fits within the cap. The judgment is not editorial. It is arithmetic on the vendor's own pricing page, and it is recomputed every time the Pricing Preview is rebuilt.
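
The calculation described above can be sketched as follows in Python. Every number in the example is hypothetical, including the token budget, the per-million-token prices and the cap; the function names and the three token categories are likewise assumptions and not the published Pricing Preview code.

```python
from typing import Dict


def preview_cost_usd(
    token_budget: Dict[str, int],
    usd_per_million: Dict[str, float],
) -> float:
    """Price a Verified agent's token budget at another vendor's list rates.

    token_budget:    tokens by category, e.g. input / cache_read / output
    usd_per_million: the vendor's published price per million tokens,
                     keyed by the same categories
    """
    return sum(
        tokens * usd_per_million[kind] / 1_000_000
        for kind, tokens in token_budget.items()
    )


def exceeds_cap(cost_usd: float, per_task_cap_usd: float) -> bool:
    """True if a single attempt would cost more than the per-task cap."""
    return cost_usd > per_task_cap_usd


# Hypothetical inputs, for illustration only.
budget = {"input": 2_000_000, "cache_read": 0, "output": 100_000}
prices = {"input": 3.0, "cache_read": 0.3, "output": 15.0}
cost = preview_cost_usd(budget, prices)
print(cost, exceeds_cap(cost, per_task_cap_usd=5.0))
```

Because the inputs are a fixed token budget and a public price list, the result is a calculation that anyone can repeat, which is the sense in which the text calls the judgment arithmetic rather than editorial.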

§9

What the team asks of the field

Blankline has invited Anthropic, OpenAI, Google DeepMind, xAI, Meta, Mistral, DeepSeek, Moonshot, Alibaba, and every other frontier laboratory operating today to submit a Verified score. The company will publicly track which vendors have and which have not. A vendor that declines verification, Arron said, is a vendor that has decided the rest of the world is not entitled to know what its intelligence costs.

The team has also invited academic researchers, climate scientists, procurement professionals, and software maintainers to audit, challenge, and contribute. The methodology is open. The evaluation harness is published under a permissive open license. The traces are public. According to Arron, disagreement is the point.

Operated by

The Blankline Research Team

Blankline is an independent research organisation building tools and benchmarks for cost-efficient autonomous intelligence. The reference agent used to produce the V0 Joule Index measurements is Dropstone CLI, Blankline's proprietary coding agent. The evaluation harness, scoring scripts, and methodology are independent of any frontier laboratory and published in the open. Hosting is independent.

Editorial authority

Blankline's Chief Executive Officer, Santosh Arron, personally directed the V0 release and ordered it public ahead of the company's own scheduled model launch. The sequencing, Arron said, was deliberate. He wanted the chart on the open record before any Blankline product could be perceived as a beneficiary of its findings.

Methodology pre-registration: pending OSF deposit prior to V1 launch. Evaluation harness license: permissive open-source. Reference agent: Dropstone CLI, proprietary, by Blankline. Editorial authority: Santosh Arron, Chief Executive Officer, Blankline. Contact: joule@blankline.org.