Volume ≠ Profit: The Hidden Token Economics Behind AI Margins

Gemini 3 and the Antigravity IDE just dropped. Ever wonder why every AI vendor is chasing coding? Because input tokens are a 95%-margin business, and that same equation decides whose AI prints profit and whose burns cash. Are you building on ground where token economics work for you or against you? The answer lies in a margin secret most teams ignore.

EaseFlows AI
7 min

Why Everyone Is Chasing Coding

Google launched Gemini 3 Pro with Antigravity. OpenAI released Codex in May. Anthropic built Claude Code. When every major player enters the same race, it is because the unit economics work.

Coding is where AI commercialisation is proving itself, but the pattern reveals something broader: one structural factor explains why some AI investments generate returns sooner than others.

The way LLMs work creates natural economic advantages in certain domains and friction in others. Think of streaming video in 2010: the technology worked, but only where bandwidth, cost per bit, and content supply reached equilibrium. AI today has similar economic thresholds, and coding crossed them first. Understanding where the underlying economics create tailwinds versus headwinds separates strategic timing from premature spend.

How LLMs Earn Per Token

Why AI Margins Look Unreal

From a user's perspective, AI appears increasingly affordable. Subscription plans that previously capped usage at a few hundred requests now allow thousands, without prices increasing proportionally. The natural conclusion is that profit margins must be enormous, and at the unit level, that instinct is often correct.

To understand why, think of an AI model like a postal sorting facility that can read addresses but must hand-write responses.

Parallel In, Sequential Out

Imagine a large postal facility receives 32 letters at once, each containing a different question. The facility needs to read all the letters and then write personalised responses to each one.

Reading the letters (input processing): The facility has an optical scanning system that can photograph and digitise all 32 letters simultaneously. One pass through the scanner, and the system has "read" everything. All 32,000 words across all letters are captured in a single operation. This is parallel processing: handle everything at once, move on. Time required: one unit of effort.

Writing the responses (output generation): Now comes the hard part. The facility must write replies, but there is a critical constraint: it can only write one word at a time, and before writing each new word, it must re-read the original letter AND everything it has already written.

Why? Because each new word must be consistent with every word that came before it. If the response is 1,000 words long, then:

  • Writing word 1 requires reading the original letter once
  • Writing word 2 requires reading the original letter PLUS word 1
  • Writing word 3 requires reading the original letter PLUS words 1 and 2
  • Writing word 1,000 requires reading the original letter PLUS all 999 previous words

This is sequential processing with a growing computational burden. You cannot write word 50 until you have written word 49, and each step requires attending to an increasingly long chain of context.

By the time the facility finishes a 1,000-word reply, it has performed roughly 1,000 of those ever-growing "re-reading" operations for that response alone, and it must do the same for all 32 customers. Reading the input was one quick scan. Writing the output was a thousand careful, sequential reviews.
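To make the asymmetry concrete, here is a small back-of-the-envelope sketch in Python. The token counts are illustrative assumptions, and real inference stacks use optimisations such as KV caching that change the constants, but the sequential, ever-growing shape of the output phase remains.

```python
# Back-of-the-envelope sketch of the "re-reading" burden in the postal analogy.
# Numbers are illustrative, not measured from any real model.

PROMPT_TOKENS = 1_000      # the original "letter"
RESPONSE_TOKENS = 1_000    # the reply, written one token at a time

# Input phase: every prompt token is read once, in a single parallel pass.
input_token_reads = PROMPT_TOKENS

# Output phase: before emitting token k, the model attends to the prompt
# plus the k - 1 tokens it has already written.
output_token_reads = sum(PROMPT_TOKENS + (k - 1) for k in range(1, RESPONSE_TOKENS + 1))

print(f"Reads during input phase:  {input_token_reads:,}")     # 1,000
print(f"Reads during output phase: {output_token_reads:,}")    # 1,499,500
print(f"Output phase is ~{output_token_reads / input_token_reads:,.0f}x heavier")
```

Even in this crude count, writing the reply involves roughly 1,500 times more token-reads than scanning the letter did.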

The Batching Penalty

There is a second economic factor: opportunity cost.

When the postal facility is scanning incoming letters, it can easily batch multiple customers together. Thirty-two letters, a hundred letters, even more: the scanner handles them all in one efficient sweep. The facility is never "locked" to a single customer during the reading phase.

But once the facility starts writing a response, it is committed. The writing desk is occupied with that specific letter sequence. It cannot pause halfway through writing word 500 to quickly scan some new incoming mail. The sequential nature of writing means the facility is tied up for the full duration of that output task, which reduces overall throughput and prevents efficient batching of other requests.

In GPU terms, during input processing (the "prefill" phase) the hardware can serve many requests in parallel. During output generation (the "decode" phase) the GPU is effectively reserved for that sequence until generation completes. This opportunity cost shows up in pricing.
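A toy accounting of GPU time makes the lock-in visible. The per-step costs below are made-up placeholders chosen only to show the shape of the problem, not measurements from any real serving stack.

```python
# Toy accounting of GPU time for one batch, under assumed (made-up) step costs.
# The point is the shape of the workload, not the absolute numbers.

BATCH_SIZE = 32
PREFILL_TIME_PER_BATCH = 1.0   # one parallel pass reads every prompt in the batch
DECODE_TIME_PER_TOKEN = 0.02   # each output token needs its own sequential step
RESPONSE_TOKENS = 1_000

prefill_time = PREFILL_TIME_PER_BATCH
decode_time = RESPONSE_TOKENS * DECODE_TIME_PER_TOKEN   # steps cannot be skipped or reordered

print(f"Prefill (reading) time for the whole batch: {prefill_time:.1f} units")
print(f"Decode (writing) time for the same batch:   {decode_time:.1f} units")
print(f"Share of the job spent locked in decoding:  {decode_time / (prefill_time + decode_time):.0%}")
```

Under these assumptions the hardware spends around 95% of the job in the sequential decode phase, which is exactly the time it cannot spend absorbing other customers' requests.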

Where the Margin Hides

When you look at public pricing, output tokens typically cost a few dollars per million tokens, while input tokens cost about one-fifth to one-eighth as much. That looks like a reasonable markup on both sides.

Under the hood, the actual compute cost is wildly asymmetric. Processing input is cheap because it happens once, in parallel, with efficient batching. Generating output is expensive because it happens sequentially, with growing context, and locks the hardware.

Input processing costs almost nothing to perform, yet providers charge meaningful fees for it. That is where the extraordinary margin lives: input tokens can yield 95%+ gross margins.
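As a sketch, assume a provider lists input at $2.50 per million tokens while the prefill compute behind it costs them around $0.10 per million. Both figures are hypothetical, not any vendor's real numbers, but they show how a "95%+ margin" on input falls out of the arithmetic.

```python
# Hypothetical figures: the list price and compute cost are illustrative assumptions,
# not any provider's actual pricing or cost structure.

INPUT_PRICE_PER_M = 2.50          # $ charged per million input tokens
INPUT_COMPUTE_COST_PER_M = 0.10   # assumed $ of GPU time to prefill a million tokens

gross_margin = (INPUT_PRICE_PER_M - INPUT_COMPUTE_COST_PER_M) / INPUT_PRICE_PER_M
print(f"Gross margin on input tokens: {gross_margin:.0%}")   # 96% under these assumptions
```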

Where AI Sprints And Stalls

What Races Ahead vs What Slows Down

Once the cost difference between "reading" and "writing" is clear, use cases sort into two camps: those that will feel cheaper and more available, and those that remain flashy but constrained.

The simplest lens is the token exchange rate. Give the model many tokens and ask for a few back, or drip in a few and expect a flood. The economics could not be more different.

The Expensive Pattern: Tiny Input, Huge Output

Where the setup is "small prompt, huge response", the economics work against you. Common examples:

  • Single sentences requesting novels, scripts, or full reports
  • Images, music, and video: a 50-token description generating thousands to millions of output tokens

These feel like "one-click creativity" but are expensive ways to make the model write frame-by-frame from minimal instruction. Providers will improve them, but the cost profile means they'll stay metered or limited, not infinite-use utilities.

The Cheap Pattern: Lots In, A Little Out

Use cases with heavy context and modest output align with underlying economics. Examples teams already use:

  • Incident compression: stream logs, alerts, and tickets into the model, ask for root cause plus three corrective actions.
  • Research consolidation: dump academic papers, internal experiments, and competitor updates into context, get a "what changed and what next" memo.
  • Retrieval-augmented generation (RAG) and retrieval-augmented search: pull many documents into context and return compact, grounded answers.

A typical exchange: 10,000 input tokens for 100 or fewer output tokens. Hardware handles reading efficiently in parallel; writing is brief. These support generous usage, good margins, and rapid iteration.

This explains why summarisation, error-checking, and AI search improve quickly and spread across tools, while long-form generation carries obvious limits. The first group leans into how the system wants to work; the second leans against it.
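A quick comparison of the two patterns under one assumed price sheet shows why. The prices and token counts below are placeholders for illustration, not real quotes.

```python
# Comparing the two token exchange rates under one assumed price sheet.
# Prices are placeholders in $ per million tokens, not real vendor quotes.

INPUT_PRICE_PER_M, OUTPUT_PRICE_PER_M = 2.50, 10.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request at the assumed prices."""
    return (input_tokens / 1e6) * INPUT_PRICE_PER_M + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M

# "A little in, a lot out": a 50-token prompt asking for a long piece of content.
expensive = request_cost(input_tokens=50, output_tokens=100_000)

# "A lot in, a little out": a RAG-style query over 10,000 tokens of documents.
cheap = request_cost(input_tokens=10_000, output_tokens=100)

print(f"Tiny prompt, huge output:  ${expensive:.4f} per request")  # ~$1.00
print(f"Huge context, tiny output: ${cheap:.4f} per request")      # ~$0.03
```

At these assumed prices the flood-of-output request costs nearly 40 times more to bill than the RAG-style query, and almost all of the cheap request's revenue sits on the nearly free input side.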

Why Coding Is The Prime Real Estate

Within the "lots in, a little out" family, code generation is the single most attractive pattern. AI's advantage in coding can be understood from multiple angles. The role dynamics matter (explored in my previous article "Why AI Replaces Senior Devs Before Junior Marketers"), but so does the underlying token structure. That structural advantage is why every major model provider is fighting so hard over this territory.

On the surface, it feels like code output is large: hundreds of lines, thousands of tokens. Compared with a short summary, it is. But compared with what the model reads before writing those lines, the output is small. A realistic coding task might load:

  • Relevant libraries and frameworks.
  • API and SDK documentation.
  • Existing project files and configuration.
  • Logs, error messages, and issue descriptions.

All of that can easily add up to hundreds of thousands, even millions, of tokens of input context. Out of that, the model might generate 20,000 to 30,000 tokens of code. The ratio is often around 100 to 1 in favour of input.
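Here is an illustrative token budget for a single coding task. The individual counts are assumptions invented for the example, but the overall shape matches the ratio described above.

```python
# Illustrative token budget for one coding task; counts are assumptions, not measurements.

context_tokens = {
    "libraries_and_frameworks": 1_000_000,
    "api_and_sdk_docs": 800_000,
    "project_files_and_config": 600_000,
    "logs_errors_and_issues": 100_000,
}
generated_code_tokens = 25_000

total_input = sum(context_tokens.values())      # 2,500,000 tokens read in one parallel pass
ratio = total_input / generated_code_tokens     # 100:1 in favour of input

print(f"Input tokens:  {total_input:,}")
print(f"Output tokens: {generated_code_tokens:,}")
print(f"Input-to-output ratio: ~{ratio:.0f} : 1")
```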

So code generation sits at a rare intersection:

  • It lives in the high‑input, low‑output sweet spot that the cost structure loves.
  • The output itself is economically dense because it substitutes for or accelerates work normally done by experienced engineers.

Owning that territory at scale is effectively owning the most favourable square on the generative AI board: low serving cost per unit of business value. That is why coding agents and IDE integrations are no longer side features, but flagship products.

Who Owns The AI Margins

Choosing The Profitable Ground

Now that the cost pattern is clear, you can think of AI features as real estate on a new map: some plots sit right next to a paved highway, others are halfway up a mountain and need a tunnel before they earn a cent. The near-term profit is on the flat land where token economics and business value pull in the same direction.

The "busy streets" are use cases with high input, modest output, and high value per answer. Adding one more dimension (who owns the infrastructure) determines how margins are split. Whether using a closed-source API or self-hosting, you're seeking the same plots; the difference is how much margin you keep versus pay as rent.

High Profit, Near-Term Territory

The attractive plots: code assistance, search and RAG, summarisation, and judgement tasks on rich context. They share the same shape: lots of tokens in, few out, meaningful value attached.

This is where infrastructure and economics align, models improve faster, and usage scales without breaking costs. Whether renting or owning, this ground feels both powerful and commercially sensible.

High Cost, Slower Profit Territory

Further up the hill: long-form text from short prompts and rich media from a single sentence. Fun to demo, but sitting on the wrong side of the token exchange rate.

Vendors will invest here, but progress is cost-constrained, keeping these features metered, separately priced, or bundled into premium tiers. For most companies, these are features to use thoughtfully, not foundations of profitable AI products.

Building On Closed Models

If you build on a closed-source LLM like GPT, Claude, or Gemini, you are effectively opening your store in someone else's mall. The platform provider sets the rent on every square metre, and their pricing explicitly charges for both input and output tokens.

Because input is so cheap for them to compute, but still priced at a non-trivial rate, they capture most of the margin on high-input, low-output workloads. You see this in public price sheets where input tokens are cheaper than output tokens, but far from free, even though the true compute cost of input is much lower.

The flip side is that these providers are highly motivated to attack exactly the use cases where that structure is most profitable: coding tools, enterprise search, document summarisation, and decision copilots on top of long context. That is where you can expect the fastest model improvements and richest tooling, because every additional token through those corridors is good business for them.

So when you build on closed models, the pattern is: ride vendor-favoured domains for capability and speed, and accept that your gross margin in those areas will always sit under their token economics. It is still a good trade when what you care about is time to market, quality, and integration rather than squeezing every point of margin.

When You Own The Land

If you self-host open-source models, you move from tenant to landlord. Now you carry the capital and operational costs of GPUs, infrastructure, and engineering, but you also own one hundred percent of any margin created by the gap between your cost per token and the value you sell on top.

In that world, the high-input, low-output pattern is not just intellectually attractive; it directly improves your P&L. Every feature that feeds the model large volumes of your proprietary data and asks for short, high-value outputs is effectively letting you monetise the same "input is almost free" dynamic that benefits the big labs.

Conversely, high-output features are now doubly heavy: you pay for the hardware and for the sequential, hard-to-batch nature of those workloads, and you cannot amortise that cost across millions of customers the way a hyperscaler can. They can still make strategic sense, but only if they are rare, paid for, or tightly scoped, not positioned as the free feature you headline.

A Simple Plot Test

For any AI feature you are considering, you can ask two short questions. First, is this mostly “a lot in, a little out” or “a little in, a lot out”? Second, given whether you are on a closed API or self-hosted, who gets to capture the margin created by that pattern?

If the feature is high input, low output, and you are on a closed model, expect fast capability improvement but thinner economics, so treat it as strategic infrastructure rather than a raw margin machine. If the same pattern sits on top of your own models, that is prime land, because you are on the same side of the economics as the big providers.
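For teams that like checklists, the test fits in a few lines. The threshold and the verdict strings below are arbitrary choices for illustration, not a formal framework.

```python
# A minimal version of the two-question "plot test". The 10:1 threshold and the
# verdict wording are arbitrary illustrative choices, not a formal framework.

def plot_test(input_tokens: int, output_tokens: int, self_hosted: bool) -> str:
    """Rough read on an AI feature: which side of the token economics is it on,
    and who captures the margin that pattern creates?"""
    heavy_input = input_tokens >= 10 * output_tokens   # "a lot in, a little out"
    if heavy_input and self_hosted:
        return "Prime land: favourable economics and you keep the margin."
    if heavy_input:
        return "Strategic infrastructure: fast capability gains, margin sits with the provider."
    if self_hosted:
        return "Costly ground: you pay for sequential, hard-to-batch output yourself."
    return "Use thoughtfully: metered by the vendor, expensive at scale."

# Example: a coding assistant on a closed API, heavy context, modest output.
print(plot_test(input_tokens=500_000, output_tokens=25_000, self_hosted=False))
```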

Conclusion

In practice, most AI strategy questions collapse to three checks: is this use case high input and modest output, is the answer economically meaningful, and are we renting or owning the land underneath the margins? The companies that compound value from AI will not just chase use cases that look impressive; they will pick the ground where the economics and ownership model quietly work in their favour.
