Field Notes · Guide

How Much Does an LLM App Actually Cost?

Most teams underestimate by 2–3×. Here's the real math — retries, caching, and growth included — modeled across six models.

2026-06-10 · 8 min read

Every cost estimate starts with the same back-of-napkin math: input tokens × price + output tokens × price. That number is real — and it’s also the least useful one you’ll compute, because it assumes every call succeeds, nothing is cached, and you never grow. In practice teams underestimate their LLM bill by 2–3×, and it’s always the same three multipliers that close the gap.

The naive number, and why it lies

Take a simple chatbot: 800 input tokens, 400 output, 1,000 requests/day (30,000/month). On Claude Sonnet 4.6 ($3 / $15 per 1M) the bare math is:

base_cost = (800 × $3 + 400 × $15) / 1,000,000
          = $0.0084 per call
naive/mo  = $0.0084 × 1,000 × 30  ≈  $252/mo

That’s the Naive $/mo figure — the number you’d put in a slide. The problem is what it leaves out. Three multipliers separate it from your real invoice.
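
For reference, the bare math above can be reproduced in a few lines of Python. This is a minimal sketch: the function names are illustrative, and the 30-day month is the same simplification the formula uses.

```python
def cost_per_call(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Bare token cost of one call in dollars; prices are quoted per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def naive_monthly_cost(per_call, requests_per_day, days_per_month=30):
    """Monthly bill assuming every call succeeds and nothing is cached."""
    return per_call * requests_per_day * days_per_month


# Simple chatbot on Claude Sonnet 4.6: 800 in / 400 out, $3 / $15 per 1M tokens
per_call = cost_per_call(800, 400, 3.0, 15.0)
naive = naive_monthly_cost(per_call, 1_000)
print(f"${per_call:.4f} per call, ${naive:.0f}/mo")  # $0.0084 per call, $252/mo
```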

1 · Retries — the silent +8%

Roughly 8% of calls fail and are retried (closer to 15% for RAG and agents, where failures compound across a chain). Failed calls still burn tokens. Retries are short: you regenerate only the failing call, not the whole output, so each one adds a fraction of a call’s cost rather than a full call. It’s still a fraction you pay every month.
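
One way to fold this into the estimate is a simple multiplier on the naive bill. The sketch below is an assumption, not the calculator’s actual formula: `retry_cost_fraction` is a hypothetical knob for how much of a full call a retry costs, and retries of retries are ignored. Leaving it at 1.0 gives the upper bound of +8% at an 8% retry rate.

```python
def retry_multiplier(retry_rate, retry_cost_fraction=1.0):
    """Cost multiplier from retries.

    retry_rate: share of calls that fail and are retried (0.08 = 8%).
    retry_cost_fraction: hypothetical share of a full call's cost that one
        retry burns; 1.0 treats every retry as a full call (upper bound).
    """
    if not 0 <= retry_rate < 1:
        raise ValueError("retry_rate must be in [0, 1)")
    if not 0 <= retry_cost_fraction <= 1:
        raise ValueError("retry_cost_fraction must be in [0, 1]")
    return 1 + retry_rate * retry_cost_fraction


naive = 252.0  # naive $/mo from the chatbot example
print(f"${naive * retry_multiplier(0.08):.2f}/mo")       # upper bound at 8% retries
print(f"${naive * retry_multiplier(0.15, 0.5):.2f}/mo")  # 15% retries, half-cost retries
```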

2 · Caching — the lever that cuts the most

If your system prompt is stable, prompt caching reads it back at ~10% of the input price on OpenAI, Anthropic, and Gemini (about 2% on DeepSeek, a 98% discount). At a realistic 40% hit rate, caching on a system-prompt-heavy workload often more than offsets the retry overhead. There’s a catch: the first write costs a premium (1.25× input on Anthropic), so caching only pays off when prompts actually repeat.
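
To see why repetition matters, here is a rough sketch of the blended price of the cacheable tokens. It uses the Anthropic-style figures quoted above (0.10× to read, 1.25× to write) and makes a deliberately pessimistic assumption: every cache miss pays the write premium. Real providers keep entries alive for a while, so treat the result as a worst case rather than a prediction.

```python
def cached_input_multiplier(hit_rate, read_mult=0.10, write_mult=1.25):
    """Blended price of cacheable input tokens, relative to the normal input price.

    Assumes each hit is billed at read_mult and each miss at write_mult.
    """
    if not 0 <= hit_rate <= 1:
        raise ValueError("hit_rate must be in [0, 1]")
    return hit_rate * read_mult + (1 - hit_rate) * write_mult


def break_even_hit_rate(read_mult=0.10, write_mult=1.25):
    """Hit rate at which caching costs the same as not caching at all."""
    return (write_mult - 1) / (write_mult - read_mult)


print(f"{cached_input_multiplier(0.40):.2f}x input price at a 40% hit rate")
print(f"break-even hit rate: {break_even_hit_rate():.1%}")
```

Under these assumptions the cached portion of the prompt costs about 0.79× the normal input price at a 40% hit rate, and caching only starts saving money above a hit rate of roughly 22%.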

3 · Growth — costs aren’t linear

The model you pick at 1,000 requests/day may not be the one you want at 10,000. Modeling the bill at 3× and 10× up front tells you whether your margin survives growth — or whether you’ll be forced into a painful migration mid-scale.
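
A quick way to run that check is to project the monthly bill at each volume multiple. The sketch below is only a linear baseline: it scales the per-call cost with volume and leaves out anything nonlinear (rate limits, tier changes, a forced model switch), which you would layer on top. The function name is illustrative.

```python
def project_growth(per_call_cost, requests_per_day, multiples=(1, 3, 10), days_per_month=30):
    """Monthly bill at each volume multiple, assuming per-call cost stays flat."""
    return {
        m: per_call_cost * requests_per_day * m * days_per_month
        for m in multiples
    }


# Chatbot example: $0.0084 per call at 1,000 requests/day
for multiple, cost in project_growth(0.0084, 1_000).items():
    print(f"{multiple:>2}x  ${cost:,.0f}/mo")
```

For the chatbot example this gives $252, $756, and $2,520 per month at 1×, 3×, and 10×.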

Model this yourself

Prefilled with the Simple chatbot archetype — 800/400 tokens, 1,000 req/day. Adjust the multipliers and watch all six models update live.

Open in calculator →

What it looks like across models

Same workload, the six free-tier models, naive monthly cost (before multipliers). Even the bare number spans a wide range:

Claude Sonnet 4.6    $252/mo
GPT-5.4              $240/mo
Claude Haiku 4.5     $84/mo
GPT-5.4 mini         $72/mo
Gemini 2.5 Flash     $37/mo
DeepSeek V4 Flash    $7/mo

A ~36× spread for the same app. Apply realistic retry and cache-hit rates plus a 15% infra overhead and every row shifts: caching pulls the premium models down, infra nudges everything up. Which model is right depends on quality fit, not just price, which is exactly what the recommendation engine weighs per archetype.
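
The spread and the infra nudge can be checked directly from the naive figures listed above. This sketch applies only the flat 15% infra overhead; retry and caching effects depend on your own rates and are left out.

```python
# Naive monthly cost per model, copied from the table above
naive_monthly = {
    "Claude Sonnet 4.6": 252,
    "GPT-5.4": 240,
    "Claude Haiku 4.5": 84,
    "GPT-5.4 mini": 72,
    "Gemini 2.5 Flash": 37,
    "DeepSeek V4 Flash": 7,
}

INFRA_OVERHEAD = 0.15  # the flat 15% infra overhead mentioned in the text

# Ratio between the most and least expensive model for the same workload
spread = max(naive_monthly.values()) / min(naive_monthly.values())
print(f"spread: {spread:.0f}x")

# Same table with the infra overhead applied to every row
with_infra = {name: cost * (1 + INFRA_OVERHEAD) for name, cost in naive_monthly.items()}
for name, cost in with_infra.items():
    print(f"{name:<20} ${cost:.0f}/mo")
```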

The token math is exact; the multipliers are estimates you control. With your real retry and cache-hit rates, expect ±15% accuracy at small scale — good enough to choose a stack, which is the decision this is for.

How to actually estimate yours

Don’t start from a per-token table. Start from your architecture.

Stop guessing. Model your real architecture across every model in about 30 seconds.

Open the calculator →