Prompt caching is the single biggest lever on most LLM bills — and the most misunderstood. The pitch is simple: if part of your prompt repeats across calls (a system prompt, a retrieved document, conversation history), the provider can cache it and charge you a fraction of the input price on the repeats. The reality has a catch that decides whether it saves you 60% or nothing.
The discount is real — usually ~90%
Cached input tokens are billed far below the normal input rate. Across the current free-tier models:
| Provider / model | Cached read price | vs. input |
| --- | --- | --- |
| OpenAI GPT-5.4 | $0.25 / 1M | 90% cheaper |
| Anthropic Sonnet 4.6 | $0.30 / 1M | 90% cheaper |
| Google Gemini 2.5 | $0.03 / 1M | 90% cheaper |
| DeepSeek V4 Flash | $0.0028 / 1M | 98% cheaper |
So a 2,000-token system prompt that normally costs full input price drops to a tenth of that on every cache hit. On a high-volume app that’s real money.
The catch: the write premium
Caching isn’t free to set up. The first time a prompt prefix is cached, some providers charge a write premium. On Anthropic it’s 1.25× the normal input price for that first write. OpenAI and Google charge no write surcharge. That single difference changes the math: on OpenAI/Google caching pays off almost immediately; on Anthropic you need enough cache hits to earn back the write premium first.
The formula
Here’s exactly how the cost engine models it — savings on hits, minus the write overhead on misses:
savings_per_hit = (in_tokens / 1M) × (in_price − cached_price)
write_overhead = (in_tokens / 1M) × (write_price − in_price) × (1 − hit_rate)
cache_savings = savings_per_hit × hit_rate − write_overhead
The variable you control — and the one people guess wrong — is hit rate: what fraction of calls actually reuse the cached prefix. A stable system prompt on a busy app hits often. A prompt that changes per user, or an app with long idle gaps (caches expire), hits rarely.
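The three formulas translate almost line for line into code. The following is a minimal sketch, not the cost engine's actual implementation: the function name and argument names are illustrative, and the example prices are assumptions derived from the table above (a $0.30 cached read at 90% off implies a $3.00 input price, and a 1.25× write premium on that is $3.75).

```python
def cache_savings(in_tokens, in_price, cached_price, write_price, hit_rate):
    """Net dollar savings from caching, following the three formulas above.

    in_tokens: cacheable input tokens over the period (for example, per month)
    in_price, cached_price, write_price: dollars per 1M tokens
    hit_rate: fraction of calls that reuse the cached prefix (0.0 to 1.0)
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    millions = in_tokens / 1_000_000
    savings_per_hit = millions * (in_price - cached_price)
    write_overhead = millions * (write_price - in_price) * (1 - hit_rate)
    return savings_per_hit * hit_rate - write_overhead


# Illustrative scenario: a 2,000-token prefix sent 30,000 times a month.
monthly_tokens = 2_000 * 30_000
print(round(cache_savings(monthly_tokens, 3.00, 0.30, 3.75, 0.40), 2))
print(round(cache_savings(monthly_tokens, 3.00, 0.30, 3.75, 0.80), 2))
```

For a provider with no write surcharge, pass `write_price` equal to `in_price`; the overhead term then drops to zero and savings scale linearly with hit rate.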
What it actually saves
Take a support chatbot with a 2,000-token reusable system prompt, 400 output tokens, 1,000 requests/day (30,000/month). That is 60M prompt tokens a month, so the input bill alone is ~$180/mo on Claude Sonnet 4.6 and ~$150/mo on GPT-5.4. Now apply caching:
Claude Sonnet 4.6 (1.25× write premium)
- 40% hit rate → save ~$38/mo (21% off input)
- 80% hit rate → save ~$121/mo (67% off input)
GPT-5.4 (no write premium)
- 40% hit rate → save ~$54/mo (36% off input)
- 80% hit rate → save ~$108/mo (72% off input)
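Two things jump out. First, hit rate dominates: on Claude, doubling it from 40% to 80% roughly triples the savings, because the write overhead on misses shrinks at the same time as the hit savings grow. On GPT-5.4, with no write premium, savings simply double. Second, at low hit rates OpenAI saves more than Anthropic despite similar prices, purely because there’s no write premium to earn back. At high hit rates the two converge in percentage terms (67% vs. 72% off input).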
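The write premium also implies a break-even hit rate, below which caching costs more than it saves. Setting cache_savings to zero in the formula and solving for hit_rate gives (write_price − in_price) / (write_price − cached_price). The sketch below computes it; the function name is illustrative, and the prices are assumptions derived from the table ($3.00 input and $3.75 write for Sonnet 4.6, $2.50 input with no surcharge for GPT-5.4).

```python
def break_even_hit_rate(in_price, cached_price, write_price):
    """Hit rate at which net cache savings are exactly zero.

    Derived from the article's formula by solving
    hit_rate * (in_price - cached_price)
        == (1 - hit_rate) * (write_price - in_price).
    All prices are dollars per 1M tokens.
    """
    premium = write_price - in_price      # extra cost paid on each miss
    discount = in_price - cached_price    # amount saved on each hit
    if premium + discount <= 0:
        raise ValueError("write_price must exceed cached_price")
    return premium / (premium + discount)


# Assumed prices: 1.25x write premium vs. no write premium.
print(break_even_hit_rate(3.00, 0.30, 3.75))
print(break_even_hit_rate(2.50, 0.25, 2.50))
```

Under these assumed prices, the 1.25× write premium needs a hit rate of roughly 22% before caching turns net positive, while a provider with no surcharge breaks even at 0%.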
Model this yourself
The Chatbot-with-history archetype, prefilled. Slide the cache-hit-rate control and watch the realistic column move.
Open in calculator →
How to actually turn it on
The economics above only land if your prompt is structured so the cache can fire. A few durable principles — these hold regardless of which provider you’re on:
- Put the stable content first. Caching keys on the prompt prefix. Order it system prompt → retrieved context → history → the user’s message, so the reusable part comes before anything that changes.
- Keep the prefix byte-identical. A timestamp, a per-user ID, or a reordered tool list in the cached region busts the cache on every call. Audit for variation that sneaks into the “stable” part.
- Mind the TTL. Caches expire (commonly ~5 minutes unless refreshed). Steady traffic keeps them warm; bursty traffic mostly pays cold-write prices.
- Know your provider’s model. OpenAI and Gemini cache automatically above a token threshold; Anthropic is explicit — you mark the cache breakpoint yourself (and pay the 1.25× write).
- Measure the real hit rate. Providers report cached vs. uncached tokens in the usage response. Read your actual rate in production and feed it back into the model instead of guessing.
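Measuring the real hit rate can be as simple as aggregating the usage data you already log. The sketch below assumes each call's usage has been normalized into a dict with two hypothetical keys, `cached_tokens` and `input_tokens` (total prompt tokens, cached or not). The real field names differ by provider; at the time of writing, for example, Anthropic reports `cache_read_input_tokens` and OpenAI reports `cached_tokens` under `prompt_tokens_details`, so check the current API reference before mapping them.

```python
def measured_hit_rate(usage_records):
    """Return (call_hit_rate, cached_token_share) for a batch of calls.

    usage_records: iterable of dicts with hypothetical normalized keys
      "cached_tokens": prompt tokens served from cache on that call
      "input_tokens":  total prompt tokens on that call, cached or not
    """
    records = list(usage_records)
    if not records:
        return 0.0, 0.0
    # A call counts as a hit if any of its prompt tokens came from cache.
    hits = sum(1 for u in records if u["cached_tokens"] > 0)
    cached = sum(u["cached_tokens"] for u in records)
    total = sum(u["input_tokens"] for u in records)
    call_hit_rate = hits / len(records)
    cached_token_share = cached / total if total else 0.0
    return call_hit_rate, cached_token_share


# Made-up sample: four calls, two of which reused a 2,000-token prefix.
sample = [
    {"cached_tokens": 0, "input_tokens": 2100},
    {"cached_tokens": 2000, "input_tokens": 2100},
    {"cached_tokens": 2000, "input_tokens": 2100},
    {"cached_tokens": 0, "input_tokens": 2100},
]
print(measured_hit_rate(sample))
```

The call-level rate is the one that corresponds to hit_rate in the formula; the token share is a useful cross-check on how much of each prompt is actually cacheable.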
When caching won’t help
- Your prefix changes every call. If the cacheable portion (system prompt, context) isn’t byte-identical across calls, there’s nothing to reuse. Order your prompt so the stable parts come first.
- Low, spiky traffic. Caches expire fast. If calls arrive further apart than the cache lifetime, most of them are cold writes, not hits, and on Anthropic you’re paying the write premium without earning it back.
- Short prompts. Caching a 200-token prompt saves almost nothing in absolute terms. The lever scales with how many tokens you can keep stable.
- Provider doesn’t support it. Mistral, Cohere, AWS Nova, Together, and Groq don’t expose caching — setting a hit rate does nothing for those models.
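The low-traffic case can be sanity-checked from request logs before committing to caching. The sketch below estimates the hit rate for a single shared prefix from request timestamps. It assumes a simplified cache model, not any provider's documented behavior: an entry lives for a fixed TTL (300 seconds here, matching the ~5 minutes mentioned earlier) and every use, whether a write or a hit, restarts the clock. The function name and parameters are illustrative.

```python
def estimate_hit_rate(timestamps, ttl_seconds=300.0):
    """Estimate the fraction of calls that would hit a warm cache.

    timestamps: request times in seconds, in any order
    ttl_seconds: assumed cache lifetime, restarted on every use
    """
    ordered = sorted(timestamps)
    if not ordered:
        return 0.0
    hits = 0
    last_use = None
    for t in ordered:
        # A call hits only if the previous use was within the TTL window.
        if last_use is not None and t - last_use <= ttl_seconds:
            hits += 1
        # Hit or cold write, the entry is now warm as of time t.
        last_use = t
    return hits / len(ordered)


steady = [0, 60, 120, 180]      # one call per minute
spiky = [0, 600, 1200, 1800]    # one call every ten minutes
print(estimate_hit_rate(steady), estimate_hit_rate(spiky))
```

Running this over a day of real timestamps gives a first-order hit rate to plug into the savings formula, though actual provider eviction behavior may differ from this model.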
The practical takeaway: caching is a top-two cost lever if you have a large, stable prefix and steady traffic — and close to a no-op otherwise. Don’t assume; model your real hit rate and prompt size and watch what the realistic column does.