Every cost estimate starts with the same back-of-napkin math: input tokens × price + output tokens × price. That number is real — and it’s also the least useful one you’ll compute, because it assumes every call succeeds, nothing is cached, and you never grow. In practice teams underestimate their LLM bill by 2–3×, and it’s always the same three multipliers that close the gap.
The naive number, and why it lies
Take a simple chatbot: 800 input tokens, 400 output, 1,000 requests/day (30,000/month). On Claude Sonnet 4.6 ($3 / $15 per 1M) the bare math is:
base_cost = (800 × $3 + 400 × $15) / 1,000,000
= $0.0084 per call
naive/mo = $0.0084 × 1,000 × 30 ≈ $252/moThat’s the Naive $/mo figure — the number you’d put in a slide. The problem is what it leaves out. Three multipliers separate it from your real invoice.
1 · Retries — the silent +8%
Roughly 8% of calls fail and retry (closer to 15% for RAG and agents, where failures compound across a chain). Failed calls still burn tokens. Retries are short — you regenerate the failing call, not the whole output — so they add a fraction, not a full call, but it’s a fraction you pay every month.
2 · Caching — the lever that cuts the most
If your system prompt is stable, prompt caching reads it back at ~10% of input price on OpenAI, Anthropic, and Gemini (98% cheaper on DeepSeek). At a realistic 40% hit rate, caching on a system-prompt-heavy workload often more than offsets the retry overhead. There’s a catch: the first write costs a premium (1.25× input on Anthropic), so caching only pays off when prompts actually repeat.
3 · Growth — costs aren’t linear
The model you pick at 1,000 requests/day may not be the one you want at 10,000. Modeling the bill at 3× and 10× up front tells you whether your margin survives growth — or whether you’ll be forced into a painful migration mid-scale.
Model this yourself
Prefilled with the Simple chatbot archetype — 800/400 tokens, 1,000 req/day. Adjust the multipliers and watch all six models update live.
Open in calculator →What it looks like across models
Same workload, the six free-tier models, naive monthly cost (before multipliers). Even the bare number spans a wide range:
Claude Sonnet 4.6 $252/mo GPT-5.4 $240/mo Claude Haiku 4.5 $84/mo GPT-5.4 mini $72/mo Gemini 2.5 Flash $37/mo DeepSeek V4 Flash $7/mo
A ~36× spread for the same app. Apply realistic retry, caching, and a 15% infra overhead and every row shifts — caching pulls the premium models down, infra nudges everything up. Which model is right depends on quality fit, not just price, which is exactly what the recommendation engine weighs per archetype.
How to actually estimate yours
Don’t start from a per-token table. Start from your architecture:
- Count calls per request. A chatbot is 1. A RAG pipeline is 3–5. A multi-step agent can be 5–12. This multiplier dwarfs token pricing.
- Estimate tokens per layer. System prompt, retrieved context, history, and the user message add up fast — and they cache differently.
- Apply honest multipliers. Your real retry rate, a conservative 40% cache hit, batch eligibility for offline work, and infra overhead.
- Model 3× and 10×. Decide once, not after the migration is forced on you.
Stop guessing. Model your real architecture across every model in about 30 seconds.
Open the calculator →