Field Notes · Deep-dive

Prompt Caching: How Much It Actually Saves

The discount is ~90% — but a write premium and your real hit rate decide whether caching cuts 60% off your input bill or nothing.

2026-06-18 · 7 min read

Prompt caching is the single biggest lever on most LLM bills — and the most misunderstood. The pitch is simple: if part of your prompt repeats across calls (a system prompt, a retrieved document, conversation history), the provider can cache it and charge you a fraction of the input price on the repeats. The reality has a catch that decides whether it saves you 60% or nothing.

The discount is real — usually ~90%

Cached input tokens are billed far below the normal input rate. Across current models from four major providers:

Provider / model        cached read price    vs. input
OpenAI    GPT-5.4       $0.25 / 1M           90% cheaper
Anthropic Sonnet 4.6    $0.30 / 1M           90% cheaper
Google    Gemini 2.5    $0.03 / 1M           90% cheaper
DeepSeek  V4 Flash      $0.0028 / 1M         98% cheaper

So a 2,000-token system prompt that normally costs full input price drops to a tenth of that on every cache hit. On a high-volume app that’s real money.

The catch: the write premium

Caching isn’t free to set up. The first time a prompt prefix is cached, some providers charge a write premium. On Anthropic it’s 1.25× the normal input price for that first write. OpenAI and Google charge no write surcharge. That single difference changes the math: on OpenAI/Google caching pays off almost immediately; on Anthropic you need enough cache hits to earn back the write premium first.

The formula

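Here’s exactly how the cost engine models it: savings on hits, minus the write overhead on misses. in_tokens is the cacheable prefix tokens sent over the billing period, all prices are per 1M tokens, and write_price is the cache-write rate (1.25× in_price on Anthropic, equal to in_price where there is no surcharge). Every miss is treated as a fresh cache write.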

savings_per_hit = (in_tokens / 1M) × (in_price − cached_price)
write_overhead  = (in_tokens / 1M) × (write_price − in_price) × (1 − hit_rate)
cache_savings   = savings_per_hit × hit_rate − write_overhead
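
The three lines above translate almost directly into code. What follows is a minimal Python sketch, not the cost engine's actual implementation: the function name, argument names, and range check are assumptions made for illustration. The example call infers Sonnet 4.6's input price as $3.00 from the $0.30 cached price and the "90% cheaper" column, and uses 1M tokens as an arbitrary volume.

```python
def cache_savings(in_tokens, in_price, cached_price, write_price, hit_rate):
    """Net caching savings in dollars, following the three-line formula.

    in_tokens: cacheable prefix tokens sent over the period
    in_price, cached_price, write_price: dollars per 1M tokens
    hit_rate: fraction of calls that reuse the cached prefix (0.0 to 1.0)
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    millions = in_tokens / 1_000_000
    savings_per_hit = millions * (in_price - cached_price)
    # Every miss is modeled as a fresh cache write at the premium rate.
    write_overhead = millions * (write_price - in_price) * (1 - hit_rate)
    return savings_per_hit * hit_rate - write_overhead


# Illustrative call: 1M prefix tokens, inferred $3.00 input price,
# $0.30 cached price, 1.25x write price, 50% hit rate.
print(round(cache_savings(1_000_000, 3.00, 0.30, 3.75, 0.5), 3))
```

Passing write_price equal to in_price models a provider with no write surcharge, which zeroes out the overhead term.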

The variable you control — and the one people guess wrong — is hit rate: what fraction of calls actually reuse the cached prefix. A stable system prompt on a busy app hits often. A prompt that changes per user, or an app with long idle gaps (caches expire), hits rarely.

What it actually saves

Take a support chatbot with a 2,000-token reusable system prompt, 400 output tokens, 1,000 requests/day (30,000/month). The input bill alone is ~$180/mo on Claude Sonnet 4.6. Now apply caching:

Claude Sonnet 4.6 (1.25× write premium)
  40% hit rate  →  save ~$38/mo   (21% off input)
  80% hit rate  →  save ~$121/mo  (67% off input)

GPT-5.4 (no write premium)
  40% hit rate  →  save ~$54/mo   (36% off input)
  80% hit rate  →  save ~$108/mo  (72% off input)
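
To check these figures, or to swap in your own traffic, here is a short Python sketch that recomputes them from the scenario. The input prices ($3.00 and $2.50 per 1M) are inferred from the cached prices and the "90% cheaper" column rather than quoted directly, and the dictionary layout and function name are hypothetical.

```python
MONTHLY_PREFIX_TOKENS = 2_000 * 30_000  # 2,000-token prompt x 30,000 requests/month

# Prices in dollars per 1M tokens. Input prices are inferred (assumption);
# write_mult is the cache-write price as a multiple of the input price.
MODELS = {
    "Claude Sonnet 4.6": {"in_price": 3.00, "cached_price": 0.30, "write_mult": 1.25},
    "GPT-5.4": {"in_price": 2.50, "cached_price": 0.25, "write_mult": 1.00},
}


def monthly_savings(model, hit_rate):
    """Return (dollars saved per month, fraction of the input bill saved)."""
    m = MODELS[model]
    millions = MONTHLY_PREFIX_TOKENS / 1_000_000
    input_bill = millions * m["in_price"]
    hit_savings = millions * (m["in_price"] - m["cached_price"]) * hit_rate
    write_overhead = millions * m["in_price"] * (m["write_mult"] - 1) * (1 - hit_rate)
    saved = hit_savings - write_overhead
    return saved, saved / input_bill


for model in MODELS:
    for hit_rate in (0.4, 0.8):
        saved, share = monthly_savings(model, hit_rate)
        print(f"{model:18s} {hit_rate:.0%} hit rate -> ${saved:6.2f}/mo ({share:.0%} off input)")
```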

Two things jump out. First, hit rate dominates: on Sonnet 4.6, doubling it roughly triples the savings, because fewer misses mean less write overhead eating into them. On GPT-5.4, with no write premium, savings scale one-for-one with hit rate. Second, at low hit rates OpenAI saves more than Anthropic despite similar prices, purely because there’s no write premium to earn back. At high hit rates the percentages converge (67% vs. 72% off input).
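
How many hits does it take to earn back a write premium? Setting cache_savings to zero in the formula gives a break-even hit rate of (write_price − in_price) / (write_price − cached_price). The sketch below computes it, assuming the same inferred input prices ($3.00 and $2.50 per 1M) and the formula's simplification that every miss pays the write price. The function name is hypothetical.

```python
def break_even_hit_rate(in_price, cached_price, write_price):
    """Hit rate at which hit savings exactly cancel write overhead.

    Solves  h * (in_price - cached_price) = (1 - h) * (write_price - in_price)
    for h. Assumes write_price > cached_price.
    """
    return (write_price - in_price) / (write_price - cached_price)


# Sonnet 4.6 with a 1.25x write price (3.00 * 1.25 = 3.75)
print(f"Sonnet 4.6: {break_even_hit_rate(3.00, 0.30, 3.75):.1%}")  # 21.7%
# GPT-5.4 with no write surcharge (write price equals input price)
print(f"GPT-5.4:    {break_even_hit_rate(2.50, 0.25, 2.50):.1%}")  # 0.0%
```

Under this model, a Sonnet 4.6 workload hitting the cache less than about 22% of the time would pay more with caching than without it.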

Default to a conservative 40% hit rate when you model this. Teams routinely assume 80%+ and are disappointed — caches expire (typically 5 minutes unless refreshed), traffic is bursty, and any per-call variation in the prefix busts the cache.
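
Rather than guessing, you can estimate a hit rate from your own request logs. The sketch below is a deliberately simplified model, not any provider's actual cache behavior: it assumes all requests share one identical prefix, a 5-minute TTL, and that every request (hit or miss) restarts the timer. The function name and sample timestamps are hypothetical.

```python
def estimate_hit_rate(request_times, ttl_seconds=300.0):
    """Fraction of requests that would find a warm cache.

    request_times: timestamps in seconds for calls sharing one identical prefix.
    A request counts as a hit if the previous request arrived within
    ttl_seconds of it; the first request is always a miss.
    """
    times = sorted(request_times)
    if not times:
        return 0.0
    hits = sum(1 for prev, cur in zip(times, times[1:]) if cur - prev <= ttl_seconds)
    return hits / len(times)


# Hypothetical bursty traffic: three quick calls, a 20-minute gap, then two more.
times = [0, 30, 90, 1290, 1320]
print(estimate_hit_rate(times))  # 0.6
```

Even this tight burst pattern lands at 60%, not 80%+, because the first call and the call after the idle gap both miss.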

Model this yourself

The Chatbot-with-history archetype, prefilled. Slide the cache-hit-rate control and watch the realistic column move.

Open in calculator →

How to actually turn it on

The economics above only land if your prompt is structured so the cache can fire. A few durable principles — these hold regardless of which provider you’re on:

When caching won’t help

The practical takeaway: caching is a top-two cost lever if you have a large, stable prefix and steady traffic — and close to a no-op otherwise. Don’t assume; model your real hit rate and prompt size and watch what the realistic column does.
