Vu Nguyen
← Essays & Opinions
2 min read

Automated Productivity Theater

AI tooling closed the loop on vanity metrics. Now you can generate, review, and merge code at scale with no human judgment, and the costs show up in your COGS before anyone notices the outcomes haven't moved.

Tags: engineering, ai, productivity

The pipeline runs without you now. An agent writes the code, an AI reviewer flags nothing of consequence, an automated approver merges it, and somewhere a dashboard updates. PRs merged: up. Velocity: healthy. Your team shipped seventy-three pull requests this sprint.

Nobody asked whether any of them needed to be written.

The Old Game

Activity metrics have always been gameable. Velocity points get inflated in sprint planning. Commit counts spike the Friday before performance reviews. PR tallies reward small, safe changes over large, difficult ones. None of this is new.

What kept it bounded was human effort. Gaming the system required actually doing something. You had to open the PR, write at least plausible code, wait for a real reviewer to read it. The overhead of theater was high enough that most engineers didn't bother most of the time. You could fake productivity, but it cost you.

Closing the Loop

Agentic coding tools removed that overhead. A Cursor session running in automated mode can open dozens of PRs in an afternoon. Add an AI code reviewer that approves by pattern rather than judgment, and an automated merge policy, and the loop is closed. Generate, review, approve, merge. No human enters the pipeline at any point.

What used to require coordination and human time now runs headlessly. The only limit is how much you're willing to spend on tokens and compute. And the stat machine doesn't care whether a person or an agent produced the PR; it counts both the same.

[Figure: the automated pipeline. Stages: Generate (AI agent) → AI review (AI reviewer) → AI approve (AI approver) → Stat +1 (metrics). Cost labels: token cost (twice), CI + cloud. Banner: no human judgment required. Caption: every stage runs without a human in the loop.]

The Cost Is Real

The spend is not hypothetical. Consider what happens when an AI agent runs a rename or refactor across a large codebase instead of a codemod. A jscodeshift transform or a well-targeted sed script costs next to nothing, runs in seconds, and produces one deterministic commit. An agentic session doing the same work can burn tens of thousands of tokens across multiple passes, produce dozens of incremental PRs, and introduce subtle inconsistencies that the agent cannot detect.
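To make the contrast concrete, here is a minimal sketch of the deterministic path, written in Python in the spirit of a targeted sed script. The function name and the sample identifiers are hypothetical, chosen only for illustration. It matches whole words with a regular expression, so it knows nothing about scope, strings, or comments; an AST-based codemod such as a jscodeshift transform would handle those cases properly.

```python
import re


def rename_identifier(source: str, old: str, new: str) -> tuple[str, int]:
    """Rename whole-word occurrences of `old` to `new`.

    Returns the rewritten text and the number of replacements made.
    Purely textual: it does not parse the code, so it is only safe
    when `old` is unambiguous in the files it is run over.
    """
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    # A callable replacement keeps `new` literal (no backslash expansion).
    return pattern.subn(lambda _match: new, source)


# Hypothetical input: only the standalone `userId` should change.
sample = "const userId = getUserId(user);\nlog(userId, userIdentity);\n"
renamed, count = rename_identifier(sample, "userId", "accountId")
print(renamed, end="")
print(f"replacements: {count}")
```

Run over every file in a repository, a transform like this yields the same output every time and lands as a single commit, which is the property the agentic approach gives up.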

The token cost is one line item. But every merged PR also triggers your CI pipeline: test suites, build jobs, deploy previews, cloud compute. Each of those costs money. Multiply by seventy-three PRs in a sprint, and the automation is adding real overhead to your COGS without adding value to the product.

The CFO doesn't see it labeled that way. It shows up diffuse: a little more in compute, a little more in API spend, a little more in CI minutes. Nobody connects it to the PR count. The dashboard shows velocity. The P&L shows cost creep.
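The per-PR arithmetic above can be sketched roughly as follows. Every figure in this example is a placeholder, not a measurement: the token count, the price per million tokens, the CI minutes, and the per-minute rate are all assumptions to be replaced with numbers from your own invoices. Only the PR count of seventy-three comes from the scenario in this essay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrCost:
    """Hypothetical cost inputs for one automated PR (placeholder values)."""

    tokens: int                    # tokens spent generating and reviewing the PR
    usd_per_million_tokens: float  # blended API price, assumed
    ci_minutes: float              # CI minutes the PR triggers, assumed
    usd_per_ci_minute: float       # CI runner price, assumed

    def total_usd(self) -> float:
        token_cost = self.tokens / 1_000_000 * self.usd_per_million_tokens
        ci_cost = self.ci_minutes * self.usd_per_ci_minute
        return token_cost + ci_cost


def sprint_overhead_usd(per_pr: PrCost, prs_merged: int) -> float:
    """Total automation overhead for a sprint, assuming uniform PR cost."""
    return per_pr.total_usd() * prs_merged


# Placeholder inputs; swap in real billing data before drawing conclusions.
per_pr = PrCost(
    tokens=50_000,
    usd_per_million_tokens=10.0,
    ci_minutes=60,
    usd_per_ci_minute=0.008,
)
overhead = sprint_overhead_usd(per_pr, prs_merged=73)
print(f"estimated sprint overhead: ${overhead:.2f}")
```

The useful part is less the total than the join: once token spend and CI minutes are attributed per PR, the diffuse line items can be traced back to the PR count that produced them.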

The Budget Problem

Traditional budget cycles assume costs accumulate predictably. You plan headcount, infrastructure, and tooling for a year, and the spend distributes roughly as expected. AI spending doesn't work that way.

Uber rolled out Claude Code to roughly 5,000 engineers in December 2025. By March 2026, 84% were classified as agentic coding users, up from 32% the month before. Monthly API costs per engineer ran between $500 and $2,000. By April, four months into the fiscal year, the company had burned through its entire AI budget for 2026.

The signal that caught leadership's attention wasn't the cost. It was that higher token consumption wasn't translating into a proportional increase in consumer-facing features. Uber's COO put it plainly: "If you're not actually able to draw a direct line to how much useful features and functionality you're shipping to your users, that trade becomes harder to justify."

Uber is now benchmarking AI token spend against the cost of hiring engineers. The governance failure is plain: the velocity of AI spending outran the org's ability to measure whether it was working. By the time anyone asked what they were getting for it, the budget was already gone.
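A back-of-the-envelope sketch of the figures quoted above shows how quickly the run rate diverges from plan. It rests on assumptions the reporting does not confirm: that the $500 to $2,000 monthly range applies only to the 84% classified as agentic users, and that a twelve-month budget was meant to be spent evenly. The function names are hypothetical.

```python
def monthly_spend_range(
    engineers: int, active_share: float, low_usd: float, high_usd: float
) -> tuple[float, float]:
    """Low and high monthly API spend, assuming only active users incur cost."""
    active = engineers * active_share
    return active * low_usd, active * high_usd


def burn_multiple(months_to_exhaust: float, budget_months: int = 12) -> float:
    """How many times faster than an even plan the budget was consumed."""
    return budget_months / months_to_exhaust


# Figures quoted in the essay; the even-spend plan is an assumption.
low, high = monthly_spend_range(5_000, 0.84, 500, 2_000)
print(f"monthly API spend: ${low:,.0f} to ${high:,.0f}")
print(f"burn rate vs. plan: {burn_multiple(4):.1f}x")
```

Under those assumptions, a budget exhausted in four months was being spent at about three times the planned monthly rate, which is the kind of gap a yearly planning cycle has no checkpoint to catch.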

Why People Do This

This is not stupidity and it is not malice. It is rational behavior under the wrong measurement regime.

Job insecurity is real, particularly in engineering right now. The feedback loop for "I merged three PRs today" is immediate and legible. The feedback loop for "I moved retention by 0.4 points this quarter" is slow, ambiguous, and almost impossible to attribute to one person. When engineers are anxious about their standing, they optimize for the thing they can win quickly.

The incentive architecture makes token-maxing sensible. If the metric is PRs merged and the tool can generate PRs at scale, using the tool is the rational move. The problem isn't the engineer. It's that the metric was never measuring the right thing, and now it can be gamed faster than anyone can audit.

The irony is that the behavior designed to protect jobs is generating the data that argues against them. When token consumption outpaces feature delivery, leadership starts comparing AI costs to headcount costs. The activity theater doesn't just fail to signal value. It actively makes the case for reducing the people producing it.

The Actual Question

Revenue growing doesn't mean the work is working. A product can ride macro tailwinds, a pricing change, a competitor stumbling. When the numbers are up, people stop asking whether the activity is causing them. The signal disappears.

The question that cuts through is simpler than it sounds: what work moved revenue this quarter? Not PRs, not commits, not velocity. What shipped, and what measurably changed after it shipped?

That question is uncomfortable when the team has been optimizing for something else. The answer is often "we don't know" or "we didn't measure." Those answers are useful. They tell you the real productivity problem isn't the tools. It's that the system generates activity without tying any of it to outcomes.

AI tooling is fast, easy to run at scale, and very good at producing things that look like work. It will optimize for exactly what you tell it to optimize for. The question is whether you told it the right thing.

