May 24, 2026 · 6 min read

AI Tokenomics: Incentives, Noise, and the Adoption Tax

Benchmark leaderboards and token economics are creating perverse incentives across the AI stack. Most people have not figured out how to use AI well yet, and these incentives are making it harder.

ai, engineering, economics, benchmarks, product

Charlie Munger had a simple rule. Show him the incentive and he could predict the outcome. It applies to markets, organizations, and industries mid-transition. It applies to AI adoption right now.

The technology is accelerating. That is real. Capability gains over the last two years have been significant, and the pace is not obviously slowing. Even where current architectures approach physical limits, algorithmic improvements (better training strategies, distillation, new approaches to inference) keep opening new headroom. There is no shortage of smart people working on what comes next.

What is also real: most people have not figured out how to use AI well yet. That is normal. Every transformative technology takes time to absorb. The personal computer changed how organizations worked, but the productivity gains did not show up in economic data for over a decade after widespread adoption. We are early.

The problem is that being early, while the technology accelerates and the incentive structures generate noise rather than signal, makes it genuinely difficult to make good decisions. Capital is going to the wrong places. Teams are measuring the wrong things. The harm from those decisions compounds faster than it would in a slower market.

What Token Maxxing Actually Is

Early on, token consumption was a genuine signal. The people burning through the most tokens were the serious adopters: engineers running complex workflows, researchers chaining prompts, teams that had figured out how to extract real leverage from the tools. High token usage meant you were in deep. It was a reasonable gauge of who was getting the most out of AI.

Then it got tracked. Then it got leaderboarded. Then it got gamified.

Once token usage became a visible, competitive metric, the incentive shifted into what the industry now recognizes as token maxxing: you no longer needed to use tokens productively to rank highly. You needed to use a lot of them. Verbose prompts, unnecessary chain-of-thought, outputs padded well past the point of usefulness. The behavior that looks like deep adoption from the outside is often just waste optimized to appear productive.

The signal inverted. High token consumption used to mean you were getting value. Now it often means someone is optimizing for the appearance of value. The metric is the same. What it measures has changed completely.
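A minimal sketch of the inversion, using entirely made-up numbers: the same usage log ranked two ways. The user names, token counts, and the idea of an "accepted outcome" (a merged change, a resolved ticket, whatever a team actually counts as done) are assumptions for illustration, not data from any real leaderboard.

```python
# Hypothetical usage log: (user, tokens_used, accepted_outcomes).
# All values are invented for illustration.
usage = [
    ("padded_prompter", 9_000_000, 12),
    ("steady_engineer", 2_500_000, 40),
    ("light_user", 400_000, 5),
]

# Leaderboard view: rank by raw token consumption, highest first.
by_tokens = sorted(usage, key=lambda u: u[1], reverse=True)

# Outcome view: rank by tokens spent per accepted outcome, lowest first.
# Assumes every user has at least one accepted outcome (no division by zero).
by_efficiency = sorted(usage, key=lambda u: u[1] / u[2])

print("By tokens:    ", [u[0] for u in by_tokens])
print("By efficiency:", [u[0] for u in by_efficiency])
```

With these invented numbers, the user at the top of the token leaderboard lands at the bottom once the ranking is tokens per accepted outcome. The metric did not change; the denominator did.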

The Cisco Parallel

In March 2000, at the peak of the dot-com boom, Cisco was briefly the most valuable company in the world. The reasoning was tight: the internet needed to grow, growth required networking infrastructure, and Cisco built networking infrastructure.

The thesis was directionally right. The internet did grow, and the infrastructure got built. What was wrong was the assumption about value capture. Cisco's 2000 valuation implied it would own the internet. It did not. Cisco is a solid business today, but most of the value went to the companies running applications on top of the infrastructure, not the ones selling the pipes.

Tokens are being priced like Cisco routers in 1999. The logic goes: better AI means more tokens processed, tokens have a cost, therefore spending on tokens correlates with AI value creation. This confuses the pipe for what flows through it. Tokens are infrastructure. The question is what the tokens produce, and whether that production justifies the cost.

The metrics AI systems optimize for are not the metrics that determine business value.

Goodhart Hits the Leaderboard

Goodhart's Law is simple: when a measure becomes a target, it ceases to be a good measure.

SWE-bench is the clearest current case. When first published, it was a useful proxy for coding agent capability. Labs now train with it explicitly in scope. The tasks have a known distribution. Models learn the distribution. Benchmark scores improve steadily. Production coding performance on actual codebases improves more slowly and more variably, depending on whether your codebase resembles the benchmark.

This is not unique to SWE-bench. Any benchmark that becomes widely used for purchasing decisions becomes a target. The same applies to MMLU, GPQA, and AIME. The proxy was useful. The proxy got optimized. Now you need a different proxy, and the same cycle starts again.

Reasoning models make this worse. Chain-of-thought produces higher scores, often for legitimate reasons. But it also produces dramatically more output tokens. A simple factual question does not need 800 tokens of visible reasoning before the answer. The model was trained to think out loud because thinking out loud scores better. The user pays for the full output.

The Noise Problem Is Acute Right Now

Most people interacting with AI today are still in the clumsiness phase. That is not a criticism. It is the normal arc of technology adoption. The personal computer required years of organizational learning before teams figured out workflows that actually worked. We are at an equivalent stage.

The difficulty is that three things are happening at once: we are in the clumsiness phase, the evaluation metrics are noisy, and the technology is changing fast enough that last quarter's benchmark number means something different this quarter. Each of those conditions makes rational decision-making harder. Together they push capital toward impressive-sounding metrics rather than demonstrated outcomes.

Enterprise buyers are making platform decisions based on leaderboard position. A model ranked higher on SWE-bench might cost three to four times more per token. If the production performance gap is smaller than that pricing spread (and for many real workloads, it is), the difference between rank one and rank five is a number on a chart. The premium gets paid regardless.
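One way to make that comparison concrete is to price each model by what it costs to get one acceptable result, not by rank. The sketch below is illustrative only: the prices, success rates, and token counts are hypothetical, and the success rate is assumed to be something you measure on your own workload rather than read off a leaderboard.

```python
from dataclasses import dataclass


@dataclass
class ModelOption:
    name: str
    usd_per_million_tokens: float  # hypothetical list price
    success_rate: float            # fraction of YOUR tasks completed acceptably (assumed measured in-house)
    tokens_per_task: int           # average tokens consumed per attempt (hypothetical)


def cost_per_success(m: ModelOption) -> float:
    """Expected token spend per acceptable result.

    Simplifying assumption: failed attempts cost tokens and nothing else
    (no human rework, no downstream damage).
    """
    cost_per_attempt = m.tokens_per_task * m.usd_per_million_tokens / 1_000_000
    return cost_per_attempt / m.success_rate


# Invented numbers: a 3.5x price spread and an 8-point gap in production success.
rank_one = ModelOption("rank-one", 35.0, 0.80, 20_000)
rank_five = ModelOption("rank-five", 10.0, 0.72, 20_000)

for m in (rank_one, rank_five):
    print(f"{m.name}: ${cost_per_success(m):.3f} per acceptable result")
```

Under these assumptions the top-ranked model still costs roughly three times more per acceptable result. The picture changes if failures are expensive in ways this sketch ignores, which is exactly why the inputs need to come from production rather than from a chart.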

Startups building on foundation models absorb token costs that do not track with value delivered. A customer-facing agent that responds in 2,000 tokens when 400 would do equally well is not more valuable. It is more expensive. At small scale, invisible. At production scale, the margin impact is real.
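The arithmetic is simple enough to sketch. The 2,000 and 400 token figures come from the example above; the price per million output tokens and the monthly volumes are assumptions chosen only to show how the gap scales.

```python
def monthly_output_cost(tokens_per_response: int,
                        responses_per_month: int,
                        usd_per_million_output_tokens: float) -> float:
    """Output-token spend per month for a fixed average response length."""
    total_tokens = tokens_per_response * responses_per_month
    return total_tokens * usd_per_million_output_tokens / 1_000_000


PRICE = 10.0  # hypothetical USD per million output tokens

for volume in (10_000, 10_000_000):  # hypothetical small-scale and production-scale volumes
    verbose = monthly_output_cost(2_000, volume, PRICE)
    concise = monthly_output_cost(400, volume, PRICE)
    print(f"{volume:>10,} responses/month: "
          f"verbose ${verbose:,.0f}, concise ${concise:,.0f}, "
          f"difference ${verbose - concise:,.0f}")
```

The ratio between the two is 5x at any volume. What changes is whether the absolute difference is a rounding error or a line item.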

The incentive chain optimizes cleanly up to price, then breaks before value.

Acceleration Does Not Fix the Math

The counterargument is that models are improving so fast that today's inefficiencies get competed away. That is partially true. Better capability at lower prices is a real trend, and it will continue.

But the structural problem is not about capability level. It is about what gets measured and optimized. Even if capability doubles every year, organizations making purchasing decisions based on benchmark position rather than production outcomes are still making expensive mistakes. The specific benchmarks will change. The pricing will compress. The same Goodhart dynamic will apply to whatever proxy replaces the current ones.

And physical model limits cut both ways. If current architectures are approaching ceilings, the next wave of gains comes from new training approaches, better distillation, and architectural changes. Those shifts will produce new benchmarks and new leaderboards. The incentive to maximize the new metrics will be identical to the incentive to maximize the current ones. The problem migrates; it does not disappear.

The Road Gets Bumpy Before It Gets Good

Here is where I think this leads, not immediately, but over the next several years.

Productivity will reach levels that look extraordinary compared to what is possible with human hours alone. A small team with the right setup and organizational model will produce output that previously required organizations ten times their size. That potential is real. The trajectory is there.

The path there runs through significant disruption. Every system we have built for a world where human hours are the primary unit of production is going to be challenged. Org structures, pricing models, hiring frameworks, how teams measure success: none of those were designed for a world where effective output can be multiplied by an order of magnitude with the right tooling.

Some of that disruption is displacement, and it is worth being honest about it. Roles built around execution without judgment are exposed. Organizations that built headcount for tasks that AI now handles will face hard choices. That is uncomfortable, and it is coming.

The opportunity side is larger. Think of what it means when a team's budget becomes a blend of human time and token spend, not just headcount. You start planning capacity differently. You measure productivity differently. The org design question is not how many engineers you need but what combination of people and compute produces the best output per dollar. That model is not figured out yet. Neither is the pricing, the tooling, or the accounting. Each of those unsolved problems is a gap worth filling.
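As a rough illustration of what a blended budget looks like as a calculation, here is a sketch that compares two hypothetical team plans on output per dollar. Every number is invented, and "accepted outputs" stands in for whatever unit of finished work an organization actually tracks; nobody has a settled accounting model for this yet.

```python
from dataclasses import dataclass


@dataclass
class TeamPlan:
    name: str
    engineers: int
    monthly_cost_per_engineer: float  # hypothetical fully loaded cost, USD
    monthly_token_spend: float        # hypothetical, USD
    accepted_outputs: int             # hypothetical count of finished, accepted work items

    @property
    def total_cost(self) -> float:
        # Budget is a blend of human time and token spend, not just headcount.
        return self.engineers * self.monthly_cost_per_engineer + self.monthly_token_spend

    @property
    def outputs_per_thousand_usd(self) -> float:
        return self.accepted_outputs * 1_000 / self.total_cost


headcount_only = TeamPlan("headcount-only", 10, 15_000.0, 0.0, 300)
blended = TeamPlan("blended", 4, 15_000.0, 15_000.0, 300)

for plan in (headcount_only, blended):
    print(f"{plan.name}: ${plan.total_cost:,.0f}/month, "
          f"{plan.outputs_per_thousand_usd:.1f} accepted outputs per $1,000")
```

The point of the sketch is the shape of the question, not the answer: once token spend sits in the same denominator as salaries, the comparison is between combinations of people and compute, and the hard part becomes measuring the numerator honestly.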

The Munger principle cuts both ways. Right now the prevailing incentives generate noise: optimize for benchmarks, spend on context you do not use, purchase on leaderboard rank. Organizations that design their own internal incentives around production outcomes rather than proxy metrics will make better decisions, move faster, and carry lower costs than the ones following the prevailing signal.

The road is bumpy. The destination is worth building toward.
