Prompt Caching: How Much It Actually Saves

Prompt caching is one of the biggest levers on most LLM bills, and the most misunderstood. The pitch is simple: if part of your prompt repeats across calls (a system prompt, a retrieved document, conversation history), the provider can cache it and charge you a fraction of the input price on the repeats. The reality has a catch that decides whether it saves you 60% or nothing.

The discount is real — usually ~90%

Cached input tokens are billed far below the normal input rate. Across a sample of current models:

Provider / model        cached read price    vs. input
OpenAI    GPT-5.4       $0.25 / 1M           90% cheaper
Anthropic Sonnet 4.6    $0.30 / 1M           90% cheaper
Google    Gemini 2.5    $0.03 / 1M           90% cheaper
DeepSeek  V4 Flash      $0.0028 / 1M         98% cheaper

So a 2,000-token system prompt that normally costs full input price drops to a tenth of that on every cache hit. On a high-volume app that’s real money.

The catch: the write premium

Caching isn’t free to set up. The first time a prompt prefix is cached, some providers charge a write premium. On Anthropic it’s 1.25× the normal input price for that first write. OpenAI and Google charge no write surcharge. That single difference changes the math: on OpenAI/Google caching pays off almost immediately; on Anthropic you need enough cache hits to earn back the write premium first.
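How many hits is "enough"? A minimal sketch, assuming the simple model this article uses (every miss pays the write price, every hit pays the cached price): the break-even hit rate is the write premium divided by the premium plus the per-hit discount. The function name is illustrative, and the $3.00 and $2.50 input prices are inferred from the table above (cached price at 90% off), not quoted from a price sheet.

```python
def break_even_hit_rate(in_price: float, cached_price: float, write_price: float) -> float:
    """Hit rate above which caching saves money (prices in $ per 1M tokens).

    Assumes every miss is billed at write_price and every hit at cached_price.
    """
    premium = write_price - in_price      # extra cost of a cache write
    discount = in_price - cached_price    # saving on each cache hit
    if premium <= 0:
        return 0.0                        # no write surcharge: any hit is a net win
    return premium / (premium + discount)


# Assumed prices: Sonnet 4.6 input $3.00, cached $0.30, write 1.25 x $3.00.
print(round(break_even_hit_rate(3.00, 0.30, 3.75), 3))  # 0.217
# Assumed prices: GPT-5.4 input $2.50, cached $0.25, no write surcharge.
print(break_even_hit_rate(2.50, 0.25, 2.50))            # 0.0
```

Under these assumptions, caching on Anthropic turns positive at roughly a 22% hit rate; with no write premium it is positive from the first hit.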

The formula

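Here’s how the calculator’s cost engine models it, per request: savings on hits, minus the write overhead on misses. Prices are in dollars per 1M tokens, in_tokens is the size of the cacheable prefix, and every miss is assumed to pay the write price: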

savings_per_hit = (in_tokens / 1M) × (in_price − cached_price)
write_overhead  = (in_tokens / 1M) × (write_price − in_price) × (1 − hit_rate)
cache_savings   = savings_per_hit × hit_rate − write_overhead

The variable you control — and the one people guess wrong — is hit rate: what fraction of calls actually reuse the cached prefix. A stable system prompt on a busy app hits often. A prompt that changes per user, or an app with long idle gaps (caches expire), hits rarely.
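The formula above translates almost line for line into code. This is a sketch rather than the calculator's actual implementation: the function name and signature are illustrative, and the prices in the usage lines are the ones implied by the pricing table (cached price at 90% off, Anthropic write at 1.25× input).

```python
def cache_savings(in_tokens: int, in_price: float, cached_price: float,
                  write_price: float, hit_rate: float) -> float:
    """Expected caching savings per request, in dollars.

    Prices are in $ per 1M tokens; hit_rate is the fraction of calls (0..1)
    that reuse the cached prefix. Every miss is assumed to pay write_price.
    """
    millions = in_tokens / 1_000_000
    savings_per_hit = millions * (in_price - cached_price)
    write_overhead = millions * (write_price - in_price) * (1 - hit_rate)
    return savings_per_hit * hit_rate - write_overhead


# 2,000-token prefix, 30,000 requests/month, 40% hit rate.
requests_per_month = 30_000
claude = requests_per_month * cache_savings(2_000, 3.00, 0.30, 3.75, 0.40)
gpt = requests_per_month * cache_savings(2_000, 2.50, 0.25, 2.50, 0.40)
print(f"{claude:.2f}")  # 37.80
print(f"{gpt:.2f}")     # 54.00
```

A negative return value means the write premium outweighs the hits at that hit rate.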

What it actually saves

Take a support chatbot with a 2,000-token reusable system prompt, 400 output tokens, 1,000 requests/day (30,000/month). The system prompt alone comes to 60M input tokens a month: ~$180/mo on Claude Sonnet 4.6, ~$150/mo on GPT-5.4. Now apply caching:

Claude Sonnet 4.6 (1.25× write premium)
  40% hit rate  →  save ~$38/mo   (21% off input)
  80% hit rate  →  save ~$121/mo  (67% off input)

GPT-5.4 (no write premium)
  40% hit rate  →  save ~$54/mo   (36% off input)
  80% hit rate  →  save ~$108/mo  (72% off input)

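Two things jump out. First, hit rate dominates: on Claude, doubling it roughly triples the savings, because the write overhead stops eating into them (on GPT-5.4, with no write overhead, savings simply double). Second, at low hit rates OpenAI saves more than Anthropic despite similar prices, purely because there’s no write premium to earn back. At high hit rates the percentages converge (67% vs. 72%), and Claude’s higher input price means it saves slightly more in absolute dollars.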

Default to a conservative 40% hit rate when you model this. Teams routinely assume 80%+ and are disappointed — caches expire (typically 5 minutes unless refreshed), traffic is bursty, and any per-call variation in the prefix busts the cache.
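To see why expiry and bursty traffic pull the hit rate down, it can help to replay request timestamps against a TTL. The sketch below assumes a single shared prefix, a 5-minute TTL, and a cache that is written on every miss and refreshed on every hit; real providers may differ on all three, and the function name is illustrative.

```python
def simulated_hit_rate(arrival_times_s, ttl_s: float = 300.0) -> float:
    """Fraction of requests that arrive while the cache is still warm.

    arrival_times_s: request timestamps in seconds.
    Assumes a miss writes the cache and a hit refreshes its TTL.
    """
    if not arrival_times_s:
        return 0.0
    hits = 0
    expires_at = None
    for t in sorted(arrival_times_s):
        if expires_at is not None and t <= expires_at:
            hits += 1
        expires_at = t + ttl_s  # written on a miss, refreshed on a hit
    return hits / len(arrival_times_s)


steady = [i * 60 for i in range(10)]    # one request per minute
sparse = [i * 400 for i in range(10)]   # one request every 400 s (> TTL)
bursty = [b * 3600 + i for b in range(3) for i in range(3)]  # 3-request bursts, hourly

print(simulated_hit_rate(steady))            # 0.9
print(simulated_hit_rate(sparse))            # 0.0
print(f"{simulated_hit_rate(bursty):.2f}")   # 0.67
```

Feeding in a day of real request timestamps gives an upper bound on the hit rate before prefix variation is taken into account.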

Model this yourself

The Chatbot-with-history archetype, prefilled. Slide the cache-hit-rate control and watch the realistic column move.

Open in calculator →

How to actually turn it on

The economics above only land if your prompt is structured so the cache can fire. A few durable principles — these hold regardless of which provider you’re on:

Put the stable content first. Caching keys on the prompt prefix. Order it system prompt → retrieved context → history → the user’s message, so the reusable part comes before anything that changes.
Keep the prefix byte-identical. A timestamp, a per-user ID, or a reordered tool list in the cached region busts the cache on every call. Audit for variation that sneaks into the “stable” part.
Mind the TTL. Caches expire (commonly ~5 minutes unless refreshed). Steady traffic keeps them warm; bursty traffic mostly pays cold-write prices.
Know your provider’s model. OpenAI and Gemini cache automatically above a token threshold; Anthropic is explicit — you mark the cache breakpoint yourself (and pay the 1.25× write).
Measure the real hit rate. Providers report cached vs. uncached tokens in the usage response. Read your actual rate in production and feed it back into the model instead of guessing.
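For the last point above, here is one way to turn logged usage into a hit rate. It is a sketch under assumptions: `UsageRecord` and its field names are hypothetical, and you would populate them from whatever cached and uncached token counts your provider's usage response reports.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    # Hypothetical fields; map them from your provider's usage response.
    cached_tokens: int    # input tokens served from the cache
    uncached_tokens: int  # input tokens billed at the normal or write rate


def measured_hit_rates(records):
    """Return (request_hit_rate, token_hit_rate) for a list of UsageRecord.

    request_hit_rate: share of calls that read anything from the cache.
    token_hit_rate:   share of all input tokens that were cache reads.
    """
    if not records:
        return 0.0, 0.0
    hits = sum(1 for r in records if r.cached_tokens > 0)
    cached = sum(r.cached_tokens for r in records)
    total = cached + sum(r.uncached_tokens for r in records)
    token_rate = cached / total if total else 0.0
    return hits / len(records), token_rate


log = [
    UsageRecord(2000, 150),
    UsageRecord(0, 2150),
    UsageRecord(2000, 120),
    UsageRecord(0, 2130),
]
request_rate, token_rate = measured_hit_rates(log)
print(request_rate)           # 0.5
print(f"{token_rate:.3f}")    # 0.468
```

The request-level rate is the one that corresponds to hit_rate in the formula earlier in the article; the token-level rate is closer to what shows up on the bill.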

When caching won’t help

Your prefix changes every call. If the cacheable portion (system prompt, context) isn’t byte-identical across calls, there’s nothing to reuse. Order your prompt so the stable parts come first.
Low, spiky traffic. Caches expire fast. If calls arrive further apart than the TTL, most of them are cold writes, not hits, and on Anthropic you’re paying the write premium without earning it back.
Short prompts. Caching a 200-token prompt saves almost nothing in absolute terms. The lever scales with how many tokens you can keep stable.
Provider doesn’t support it. Mistral, Cohere, AWS Nova, Together, and Groq don’t expose caching — setting a hit rate does nothing for those models.
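The first failure mode in this list, a prefix that silently changes between calls, is easy to check for before you ship. A minimal sketch: fingerprint the exact prefix string you send and see how often the most common fingerprint appears. The helper names and the example prompts are illustrative, and hashing is only a local diagnostic, not how any provider keys its cache.

```python
import hashlib
from collections import Counter


def fingerprint(prefix: str) -> str:
    """Short, stable hash of the exact bytes of a prompt prefix."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


def dominant_prefix_share(prefixes) -> float:
    """Share of calls that used the single most common prefix (0..1)."""
    if not prefixes:
        return 0.0
    counts = Counter(fingerprint(p) for p in prefixes)
    return counts.most_common(1)[0][1] / len(prefixes)


stable = ["You are a support agent for Acme."] * 4
# A timestamp inside the "stable" region makes every prefix unique.
leaky = [f"You are a support agent for Acme. Now: 2025-01-01T00:0{i}:00Z"
         for i in range(4)]

print(dominant_prefix_share(stable))  # 1.0
print(dominant_prefix_share(leaky))   # 0.25
```

A share well below 1.0 on prefixes that are supposed to be identical points to per-call variation worth tracking down.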

The practical takeaway: caching is a top-two cost lever if you have a large, stable prefix and steady traffic — and close to a no-op otherwise. Don’t assume; model your real hit rate and prompt size and watch what the realistic column does.
