The Joule Index
An Independent Benchmark · Reported May 2026

The benchmark that asks what intelligence costs.

A research note by the Blankline Research Team · 18 May 2026

Blankline, an independent research organisation in Chennai, has published the first AI leaderboard that scores autonomous coding agents on dollars, joules, and human-merge-readiness on the same chart. The team says it built the chart because no one else had.


The race for artificial general intelligence is, by most public estimates, the most expensive engineering programme humanity has ever attempted. Power-purchase agreements signed by frontier laboratories now run into multiple gigawatts. Datacenter electricity demand is on track to roughly double by the end of the decade. One of the largest training runs alone is estimated to consume more energy than seventy thousand American households use in a year. Yet to the simplest possible question about any of it ("how much did this answer cost?"), the AI benchmarking field has, until now, offered no public chart.

The Joule Index, released today in preview by Blankline, is an attempt to provide one. Built outside the academy, outside the frontier laboratories, and outside the major standards consortia, the leaderboard scores autonomous coding agents on three co-equal axes (capability, dollars, and joules) using live open-source bug-fix tasks filed within the last thirty days. Every score on the leaderboard is paired with a sanitized, publicly auditable observational trace, modelled on the disclosure regime MLCommons established for hardware vendors in MLPerf Power.

§1

The question the field stopped asking

Nearly every public AI benchmark in active use as of May 2026 reports the same kind of number: a percentage representing how often a model solves a defined task. The practice has held since at least HumanEval in 2021. What the practice does not report is the cost of producing the percentage. Not in dollars. Not in joules. Not in carbon. Not in the human engineering hour the agent is being measured against.

Blankline's research team argues that this omission has become unsustainable. Cognitive labor is becoming a commodity. The price of that commodity, denominated in money and electricity, will determine which jurisdictions write the next layer of software, which medical questions get answered in the global south, which engineering problems get solved at the bottom of the income pyramid. Treated this way, AI accessibility is not a marketing variable. It is a developmental one.

A benchmark score that cannot be verified is not a benchmark. It is marketing.

§2

How the Joule Index is scored

Every entry on the leaderboard reports four numbers: Attention F1, dollars per task, joules per task, and a composite the team calls the Joule Score. The formula is short by design.

Joule Score · v0.1
Joule Score = Attention F1 / (1 + 0.5 · $ + 0.5 · kJ)
higher is better · range 0 to 1 · $1 of cost and 1 kJ of inference energy are penalised equally
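
As a minimal sketch, the v0.1 formula as printed above can be written as a single function. The function and parameter names are illustrative assumptions, not taken from Blankline's harness, and the example inputs are hypothetical rather than leaderboard data.

```python
def joule_score(attention_f1: float, usd_per_task: float, kj_per_task: float) -> float:
    """Joule Score v0.1 as printed: F1 / (1 + 0.5 * $ + 0.5 * kJ)."""
    if not 0.0 <= attention_f1 <= 1.0:
        raise ValueError("attention_f1 must lie in [0, 1]")
    if usd_per_task < 0 or kj_per_task < 0:
        raise ValueError("cost and energy must be non-negative")
    # One dollar and one kilojoule each add 0.5 to the denominator,
    # so they are penalised equally.
    return attention_f1 / (1.0 + 0.5 * usd_per_task + 0.5 * kj_per_task)


# Hypothetical run: Attention F1 of 0.90 at $0.50 and 1.0 kJ per task.
example_score = joule_score(0.90, 0.50, 1.0)
print(f"{example_score:.3f}")
```

With zero cost and zero energy the denominator is 1 and the score equals Attention F1, which is why the composite stays within the stated range of 0 to 1.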

Attention F1 is the harmonic mean of two ratios: what fraction of the files an agent touched were actually relevant to the task (precision), and what fraction of the relevant files the agent touched at all (recall). The ground truth is the file set a human maintainer merged into production. According to the published methodology, the benchmark authors do not write the test. The grader is the production repository itself.
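
The definition above can be sketched over two sets of file paths. This is an illustration under assumptions: the function name, the choice to return 0.0 when there is no overlap, and the example paths are all hypothetical, and the published harness may treat edge cases differently.

```python
def attention_f1(touched: set[str], merged: set[str]) -> float:
    """Harmonic mean of file-level precision and recall.

    touched: files the agent modified.
    merged:  files in the diff the human maintainer merged (ground truth).
    """
    hits = len(touched & merged)
    if hits == 0:
        # No overlap (or an empty set): precision and recall are both zero.
        return 0.0
    precision = hits / len(touched)  # share of touched files that were relevant
    recall = hits / len(merged)      # share of relevant files that were touched
    return 2 * precision * recall / (precision + recall)


# Hypothetical task: the agent touched four files, two of the three merged ones.
agent_files = {"src/a.py", "src/b.py", "src/c.py", "README.md"}
maintainer_files = {"src/a.py", "src/b.py", "tests/test_a.py"}
print(round(attention_f1(agent_files, maintainer_files), 4))
```

In this example precision is 2/4 and recall is 2/3, so the harmonic mean is 4/7.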

Cost is taken at each model's published list price as of May 2026 and summed across every billed API call the agent made during its full multi-turn loop. Joules are currently estimated from token counts and published per-token energy rates; direct measurement on open-weight runs is scheduled for the V1 release. The team has published an open-source evaluation harness on GitHub.
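
A rough sketch of that accounting follows, assuming each billed call records input and output token counts, that list prices are quoted per million tokens, and that a single per-token energy rate applies. The class, the function names, and every rate below are hypothetical placeholders, not published figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiCall:
    """One billed API call in the agent's multi-turn loop (hypothetical record)."""
    input_tokens: int
    output_tokens: int


def task_cost_usd(calls: list[ApiCall], usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Sum list-price cost across every billed call."""
    total_in = sum(c.input_tokens for c in calls)
    total_out = sum(c.output_tokens for c in calls)
    return (total_in * usd_per_m_input + total_out * usd_per_m_output) / 1_000_000


def task_energy_joules(calls: list[ApiCall], joules_per_token: float) -> float:
    """Estimate energy from token counts; an estimate, not a measurement."""
    total_tokens = sum(c.input_tokens + c.output_tokens for c in calls)
    return total_tokens * joules_per_token


# Hypothetical two-call loop with placeholder rates.
loop = [ApiCall(10_000, 2_000), ApiCall(30_000, 1_000)]
cost = task_cost_usd(loop, usd_per_m_input=1.0, usd_per_m_output=4.0)
energy = task_energy_joules(loop, joules_per_token=0.05)
print(f"${cost:.3f}, {energy:.0f} J")
```

Summing over the whole loop rather than the final call is the point: retries and exploratory turns are billed too.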

The composite score is deliberately ungenerous both to capability bought with money and to capability bought with electricity. Under the v0.1 formula, a model that scores a perfect Attention F1 at two dollars and three kilojoules a task earns a Joule Score of about 0.29, well under half the 0.68 earned by a model scoring 0.85 at twenty cents and three hundred joules. Energy and cost are weighted equally, on the rough thermodynamic anchor that one kilojoule of inference work and one dollar of inference cost are both penalties the planet pays at comparable scale. That math, the team argues, is the one a procurement officer, a climate scientist, and a worker on the median global wage would each recognise as fair.

§3

Intelligence as a thermodynamic phenomenon

Every joule of inference electricity drawn by an AI model comes off a power grid. Rolf Landauer's 1961 result placed a physical lower bound on the energy required to erase a single bit of information at finite temperature. Practical frontier inference today runs roughly nine orders of magnitude above that bound. The gap is enormous, and the next half-century of computer architecture is, in effect, a long effort to close it.
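
For scale, the Landauer bound is k_B · T · ln 2 per erased bit. The sketch below evaluates it at room temperature and expresses a gap in orders of magnitude. The function names are illustrative, and the per-bit energy passed in the example is a made-up value chosen to sit nine orders above the bound, not a measurement of any model.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # exact value under the 2019 SI definition


def landauer_limit_joules(temperature_kelvin: float) -> float:
    """Minimum energy to erase one bit at the given temperature."""
    if temperature_kelvin <= 0:
        raise ValueError("temperature must be positive")
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


def orders_above_limit(energy_per_bit_joules: float, temperature_kelvin: float) -> float:
    """How many powers of ten a per-bit energy sits above the Landauer bound."""
    return math.log10(energy_per_bit_joules / landauer_limit_joules(temperature_kelvin))


limit_300k = landauer_limit_joules(300.0)
print(f"{limit_300k:.3e} J per bit at 300 K")

# Hypothetical per-bit energy, constructed to be 1e9 times the bound.
print(round(orders_above_limit(limit_300k * 1e9, 300.0), 1))
```

At 300 K the bound is a few zeptojoules per bit, which is what makes a nine-order gap plausible for present hardware.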

Until that gap closes, the energy cost of cognition is real. According to the Joule Index V0 data, an eight-file Mozilla Common Voice fix consumed roughly four thousand joules of inference energy on Blankline's most expensive Dropstone tier (Kimi K2.6) and roughly twelve hundred joules on its cheapest (DeepSeek V4-Flash). The merged diff was the same in both cases. The bill differed by more than three to one in energy and by more than ten to one in dollars.

Scaling that delta is what makes the chart consequential. The world has roughly twenty-seven million professional software developers. If each handed one engineering hour per day to autonomous AI, and that hour amounted to one task at the Common Voice figures above, the daily delta between the cheapest and most expensive tier would be roughly seventy-five gigajoules, or about twenty-one megawatt-hours, comparable to the daily output of a single utility-scale wind turbine.

The flagship premium is paying for compute, not capability.

§4

The benchmarks that came close

Six prior attempts have been made at parts of what the Joule Index does. Blankline cites all of them in its methodology and says the foundation each laid is what made the new chart possible. The team's argument is that the foundation was never completed.

Each project contributed a piece. None published a leaderboard in which (a) tasks are real OSS bug fixes filed within the last thirty days, (b) the agent runs end-to-end multi-turn loops, (c) every billed token is accounted for, (d) joules appear on the same chart as dollars and capability, and (e) every score carries a sanitized public observational trace that any third party can re-score. The Joule Index is the first leaderboard, on Blankline's account, that does all five.

Asked why the chart had not appeared from a larger institution, Blankline's CEO Santosh Arron pointed to incentive structure. Academic benchmark teams publish capability because capability is what gets cited. Frontier laboratories publish capability because cost is a procurement disadvantage. Standards consortia publish what their vendor members agree to disclose. The constituency the Joule Index addresses, Arron said, is none of those: it is the worker who pays for intelligence at retail, the maintainer whose code is being evaluated, the climate scientist who counts the joules, and the public administrator who needs to know whether AI is affordable enough to plan a country on.

§5

Blankline's stated mission

Blankline was founded, the company says, on a single conviction: intelligence should be available to everyone in a civilization at a price they can pay. Not as a luxury good. Not as enterprise pricing tiered against productivity. The company frames AI as a utility that should approach the affordability of electricity or running water, engineered down through transparency, competition, and unrelenting attention to the cost of every token.

That conviction has a number attached to it. The global median individual income is approximately ten dollars per day. At Dropstone Fast list pricing, a person earning the global median can afford roughly one hundred and twenty real engineering tasks per day. At Dropstone Heavy pricing, the same person can afford eleven. At flagship-tier closed-API list pricing, fewer than half a task. Those are the practical stakes of model pricing for the majority of the world's working population.
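
The arithmetic behind those figures is a single division. In the sketch below the per-task costs are illustrative values back-calculated from the task counts in the paragraph above; they are assumptions for the example, not published Dropstone or vendor prices.

```python
def affordable_tasks(daily_income_usd: float, cost_per_task_usd: float) -> float:
    """Tasks per day a given daily income buys, allowing fractional tasks."""
    if cost_per_task_usd <= 0:
        raise ValueError("cost per task must be positive")
    return daily_income_usd / cost_per_task_usd


MEDIAN_DAILY_INCOME_USD = 10.0  # figure stated in the text

# Assumed per-task costs, chosen only to reproduce the stated task counts.
assumed_cost_per_task = {
    "fast": 0.083,
    "heavy": 0.90,
    "closed_flagship": 22.00,
}

for tier, cost in assumed_cost_per_task.items():
    print(tier, round(affordable_tasks(MEDIAN_DAILY_INCOME_USD, cost), 2))
```

Any per-task cost above twenty dollars puts the result below half a task per day at this income.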

An AGI that is only economically accessible to the top decile of the OECD, the company argues, is not the AGI worth building. The engineering challenge of the coming decade, Blankline says, is not whether AI can think, but whether AI can think for a farmer in Kenya, a clinic worker in Bangladesh, and a small-press journalist in Argentina on the same day, for the same price, at the same quality.

§6

The Blankline Standard

Industry transparency, on most published measures, is in decline. The Stanford Foundation Model Transparency Index reports that average vendor transparency scores fell from fifty-eight out of one hundred in 2024 to forty in 2025. Several of the largest frontier laboratories now publish neither training-data summaries nor compute estimates nor energy reports, and at least two declined to publish complete technical reports for their most recent flagship models.

The Joule Index publishes against that current under what Blankline calls the Blankline Standard, a four-point disclosure regime.

  1. Verified disclosure or nothing. Every leaderboard entry carries either a Verified tag (full observational trace published) or an Unverified tag (sorted below all Verified entries regardless of headline score). Source code, system prompts, and internal reasoning remain private; the observational trace, the record of what the agent did to the task workspace, is published in full.
  2. Real tasks, not curated ones. Tasks are live GitHub issues filed within the last thirty days. Issues are accepted only if filed strictly after the evaluated model's training cutoff, a single rule that, the team says, mathematically prevents contamination.
  3. Cost and energy on the same chart as capability. A leaderboard that publishes accuracy without publishing cost is, on the methodology's reading, a leaderboard whose numbers cannot survive a procurement review or a climate audit.
  4. Methodology pre-registered, retirements documented openly. The full protocol is to be pre-registered on the Open Science Framework before public launch. When a task is retired, the reason is published verbatim and the retired data remains on disk for audit.
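
Points 1 and 2 of the list above reduce to a date filter and a sort order. The sketch below is one possible reading, with hypothetical names, dates, and scores throughout; it assumes the thirty-day window is measured back from the evaluation date and that Verified entries are ranked among themselves by Joule Score.

```python
from dataclasses import dataclass
from datetime import date, timedelta


def is_eligible_issue(filed: date, training_cutoff: date, evaluated: date,
                      window_days: int = 30) -> bool:
    """Issue must be recent and filed strictly after the model's training cutoff."""
    age = evaluated - filed
    recent = timedelta(0) <= age <= timedelta(days=window_days)
    return recent and filed > training_cutoff


@dataclass(frozen=True)
class Entry:
    name: str
    joule_score: float
    verified: bool


def rank(entries: list[Entry]) -> list[Entry]:
    """Verified entries first, then descending Joule Score within each group."""
    return sorted(entries, key=lambda e: (not e.verified, -e.joule_score))


# Hypothetical leaderboard rows.
board = rank([
    Entry("unverified-high", 0.90, False),
    Entry("verified-low", 0.40, True),
    Entry("verified-high", 0.70, True),
])
print([e.name for e in board])
```

The sort key puts an Unverified entry below every Verified one even when its headline score is higher, which is the behaviour point 1 describes.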

§7

What V0 found

The V0 release reports nine evaluated runs across three real OSS bug-fix tasks and three Dropstone model tiers, with one further task retired under the published methodology. Every evaluated run produced a pull request matching the diff a human maintainer had merged. The flagship tier paid 10.4 times more in dollars and 3.6 times more in joules than the cheapest tier for that identical outcome. On the largest task (an eight-file change to Mozilla's Common Voice bundler), the gap reached more than two dollars per Heavy run against less than ten cents per Fast run.

Blankline says the V0 finding is one preliminary observation rather than a final result. Over the next twelve months, the Joule Index is scheduled to scale to thirty tasks per model per month, with a rotating private holdout to defeat contamination and an open invitation for every frontier laboratory operating today to submit. The team identifies three findings it expects to emerge with statistical force at the larger sample size.

§8

The civilizational stake

Every architectural choice frontier labs make, whether to cache prompts, how to price tiers, whether to publish energy figures, shapes the affordability of the next decade of cognitive labor. Every dollar a procurement team spends on AI is a dollar that does not go to housing, healthcare, or scientific research. Every joule consumed by an inefficient model is, on a finite grid, a joule a more efficient model would not have asked for. These are not abstractions: they appear as line items in budgets, on electricity bills, in carbon ledgers, and in the development plans of low- and middle-income countries that will live or die by what intelligence costs when they need it.

The Joule Index is, in its authors' view, the only public benchmark that today holds those choices accountable on the open record. Whether other research bodies and laboratories adopt the same disclosure regime will, the team says, be a better measure of the field's seriousness than any one model's leaderboard ranking.

If intelligence has a price, this is the place where the price is paid in public.

§8.5

The Cap Test

Alongside the leaderboard, Blankline publishes a Pricing Preview covering every commercial frontier model that has not yet returned a Verified run. The benchmark is open for submission. Each preview row shows what one Joule Index task would cost on that model at the vendor's own published per-token rate, applied to the token budget a Verified agent consumed on the same task. The figures are a calculation, not a measurement. They carry no capability score and never enter the leaderboard ranking.

On the long-horizon reference task, several flagship-tier models from major commercial vendors are priced such that a single attempt at the Verified-agent token budget exceeds the benchmark's published per-task cap, in the most extreme case by close to an order of magnitude. Models above the cap are ineligible to compete under the published rules until their pricing changes or the vendor submits a Verified run demonstrating a cache and decode profile that fits within the cap. The judgment is not editorial. It is arithmetic on the vendor's own pricing page, and it is recomputed every time the Pricing Preview is rebuilt.
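
The Cap Test as described is a price calculation followed by a comparison. A minimal sketch, assuming per-million-token list rates and a fixed token budget taken from a Verified run; the token counts, the rates, and the cap value below are hypothetical, since the text does not state them.

```python
def preview_cost_usd(input_tokens: int, output_tokens: int,
                     usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Price a Verified agent's token budget at another vendor's list rates."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def within_cap(cost_usd: float, cap_usd: float) -> bool:
    """A model is eligible only if one attempt does not exceed the per-task cap."""
    return cost_usd <= cap_usd


# Hypothetical budget and rates; the cap is a placeholder, not the published one.
budget_cost = preview_cost_usd(2_000_000, 100_000,
                               usd_per_m_input=15.0, usd_per_m_output=75.0)
ASSUMED_CAP_USD = 5.0

print(f"${budget_cost:.2f}", within_cap(budget_cost, ASSUMED_CAP_USD),
      f"{budget_cost / ASSUMED_CAP_USD:.1f}x cap")
```

Because the inputs are a vendor's posted rates and a recorded token budget, the result changes only when the pricing page or the reference run changes.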

§9

What the team asks of the field

Blankline has invited Anthropic, OpenAI, Google DeepMind, xAI, Meta, Mistral, DeepSeek, Moonshot, Alibaba, and every other frontier laboratory operating today to submit a Verified score. The company will publicly track which vendors have submitted and which have not. A vendor that declines verification, Arron said, is a vendor that has decided the rest of the world is not entitled to know what its intelligence costs.

The team has also invited academic researchers, climate scientists, procurement professionals, and software maintainers to audit, challenge, and contribute. The methodology is open. The evaluation harness is published under a permissive open license. The traces are public. According to Arron, disagreement is the point.

Operated by

The Blankline Research Team

Blankline is an independent research organisation building tools and benchmarks for cost-efficient autonomous intelligence. The reference agent used to produce the V0 Joule Index measurements is Dropstone CLI, Blankline's proprietary coding agent. The evaluation harness, scoring scripts, and methodology are independent of any frontier laboratory and published in the open. Hosting is independent.

Editorial authority

Blankline's Chief Executive Officer, Santosh Arron, personally directed the V0 release and ordered it public ahead of the company's own scheduled model launch. The sequencing, Arron said, was deliberate. He wanted the chart on the open record before any Blankline product could be perceived as a beneficiary of its findings.

Methodology pre-registration: pending OSF deposit prior to V1 launch. Evaluation harness license: permissive open-source. Reference agent: Dropstone CLI, proprietary, by Blankline. Editorial authority: Santosh Arron, Chief Executive Officer, Blankline. Contact: joule@blankline.org.