RAG Pipeline Cost: Why It's 3–5× a Chatbot

RAG looks like one feature: “answer questions using our docs.” Under the hood it’s three to five LLM calls per user question, each carrying a bigger prompt than a plain chatbot. That fan-out, compounded by the larger prompts, is why RAG bills routinely land around 6× a simple chatbot, not the ~1× people sketch on the whiteboard. Here’s where the money actually goes.

The call fan-out is the real driver

A simple chatbot is one call: user message in, answer out. A RAG pipeline typically chains several:

Query rewrite / expansion — turn the raw question into a better retrieval query (sometimes skipped, sometimes 2–3 variants).
Rerank — score retrieved chunks for relevance (an LLM call, or a dedicated reranker).
Generate — the answer call, now stuffed with retrieved context.
Optional extras — a grounding/citation check, a safety pass, a follow-up. Each is another billable call.

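Call it ~4 calls per request as a working default. None of them is a small call, either: the generate step carries a system prompt plus retrieved documents, roughly 2,000 input tokens against ~800 for a chatbot, and the cost math that follows applies that 2,000-token figure to every call in the chain.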
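To make the fan-out concrete, here is a minimal Python sketch that tallies calls and input tokens for one user request. The `Stage` class, the stage names, and the flat 2,000-token prompt per stage are illustrative assumptions built from the working defaults above, not measurements of a real pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step of the pipeline (hypothetical structure, for illustration)."""
    name: str
    calls: int         # LLM calls this stage makes per user request
    input_tokens: int  # prompt tokens sent on each of those calls


def per_request(stages):
    """Return (total LLM calls, total input tokens) for one user request."""
    total_calls = sum(s.calls for s in stages)
    total_input = sum(s.calls * s.input_tokens for s in stages)
    return total_calls, total_input


# Assumption: every RAG call carries the ~2,000-token working average.
rag = [
    Stage("rewrite", 1, 2000),
    Stage("rerank", 1, 2000),
    Stage("generate", 1, 2000),
    Stage("grounding_check", 1, 2000),
]
chatbot = [Stage("answer", 1, 800)]

print(per_request(rag))      # (4, 8000)
print(per_request(chatbot))  # (1, 800)

# Two rewrite variants instead of one pushes the chain to five calls.
rag_two_rewrites = [Stage("rewrite", 2, 2000)] + rag[1:]
print(per_request(rag_two_rewrites))  # (5, 10000)
```

Under these assumptions a RAG request sends ten times the input tokens of a chatbot request (8,000 vs. 800), which is why counting calls alone understates the bill.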

The math: ~6×, not ~4×

At 1,000 user requests/day (30,000/month), 4 calls each, 2,000 input / 500 output tokens per call, the naive monthly cost (before caching and retries) is:

Model               $/call     Monthly (× 4 calls × 30k req)
Claude Sonnet 4.6   $0.0135    $1,620/mo
GPT-5.4             $0.0125    $1,500/mo
GPT-5.4 mini        $0.00375   $450/mo
Gemini 2.5 Flash    $0.00185   $222/mo

Compare Sonnet’s RAG bill ($1,620) to the same model running a simple chatbot ($252/mo for one call per request). That’s 6.4×, not the 4× you’d guess from call count alone. The extra comes from the larger prompts: each RAG call costs about 1.6× a chatbot call ($0.0135 vs. $0.0084), and 4 calls × 1.6 ≈ 6.4. More calls and more tokens per call compound. People budget for one and forget the other.
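The arithmetic behind that comparison can be reproduced in a few lines. This is a sketch under stated assumptions: the $3 / $15 per-million-token prices are chosen because they reproduce the table's $0.0135 per call, and the chatbot's 400 output tokens are back-solved from the $252 figure (the text states only its ~800 input tokens). The function names are illustrative.

```python
def cost_per_call(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Dollar cost of one LLM call; prices are quoted per million tokens."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000


def monthly_cost(requests_per_month, calls_per_request, per_call):
    """Naive monthly bill: no caching, no retries."""
    return requests_per_month * calls_per_request * per_call


# Assumed prices in $ per million tokens (chosen to match the table's $0.0135).
PRICE_IN, PRICE_OUT = 3.00, 15.00
REQUESTS_PER_MONTH = 30_000

rag_call = cost_per_call(2000, 500, PRICE_IN, PRICE_OUT)
# Assumption: 400 output tokens, back-solved from the $252/mo chatbot figure.
chat_call = cost_per_call(800, 400, PRICE_IN, PRICE_OUT)

rag_month = monthly_cost(REQUESTS_PER_MONTH, 4, rag_call)
chat_month = monthly_cost(REQUESTS_PER_MONTH, 1, chat_call)

print(f"RAG ${rag_month:,.0f}/mo vs chatbot ${chat_month:,.0f}/mo")
print(f"call-count factor: 4x, per-call factor: {rag_call / chat_call:.2f}x")
print(f"combined: {rag_month / chat_month:.1f}x")
```

Splitting the ratio into a call-count factor and a per-call factor makes the compounding visible: neither factor alone explains the bill, but their product does.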

The headline number to internalize: a RAG feature on a premium model is a four-figure monthly line item at modest traffic. The cheapest-to-priciest spread here is ~7× ($222 vs $1,620), so model choice matters more on RAG than almost anywhere else.

Caching matters more on RAG than anywhere

Here’s the lever that changes RAG economics: a large chunk of every call’s input — the system prompt and (often) the retrieved context — repeats across the calls in a single request and across requests that hit the same documents. That’s exactly what prompt caching is for, and it’s why our recommendation engine weights cache savings 3× for the RAG archetype.

Say 1,500 of the 2,000 input tokens per call are a cacheable prefix, at a conservative 40% hit rate. On Claude Sonnet 4.6 that’s roughly $113/mo back — a meaningful dent in a $1,620 bill, and it scales up fast with a higher hit rate. Order your prompt so the stable parts (system instructions, retrieved docs) come first, or you leave this on the table.
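One way to arrive at a figure near $113 is to net the discount on cache hits against a surcharge on cache misses. The sketch below assumes a $3 per-million input price, cached reads billed at 0.10× that price, and cache writes billed at 1.25×; these multipliers, the function name, and the rule that every miss is billed as a write are assumptions for illustration, not the cost engine's documented formula.

```python
def monthly_cache_savings(calls_per_month, cacheable_tokens, hit_rate,
                          price_in_per_m, read_multiplier=0.10,
                          write_multiplier=1.25):
    """Net monthly savings from caching a prompt prefix.

    Simplification: every cache miss is billed as a cache write.
    """
    cacheable_millions = calls_per_month * cacheable_tokens / 1_000_000
    # Hits are billed at the discounted read rate instead of full price.
    saved_on_hits = (cacheable_millions * hit_rate
                     * price_in_per_m * (1 - read_multiplier))
    # Misses pay a surcharge to write the prefix into the cache.
    extra_on_misses = (cacheable_millions * (1 - hit_rate)
                       * price_in_per_m * (write_multiplier - 1))
    return saved_on_hits - extra_on_misses


CALLS_PER_MONTH = 30_000 * 4   # requests x calls per request
PRICE_IN = 3.00                # assumed $ per million input tokens

for hit_rate in (0.40, 0.60, 0.80):
    saved = monthly_cache_savings(CALLS_PER_MONTH, 1500, hit_rate, PRICE_IN)
    print(f"hit rate {hit_rate:.0%}: about ${saved:,.0f}/mo saved")
```

With these multipliers the savings turn negative below roughly a 22% hit rate, because write surcharges outweigh read discounts. That is a further reason to keep the prefix stable enough to be reused.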

Don’t just grab the cheapest model

The table above makes Gemini Flash at $222/mo look irresistible next to Sonnet at $1,620. Sometimes it is. But RAG is usually customer-facing — a wrong or ungrounded answer is a visible failure, not a backend hiccup. That’s why the recommendation engine applies a quality floor on RAG: if the cheapest pick is budget-tier, it promotes a mid- or premium-tier model whose cache-adjusted cost is within reach. The right question isn’t “what’s cheapest” — it’s “what’s the cheapest model that clears my quality bar,” then let caching close the gap.
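A quality floor of this kind could look something like the following. This is a hypothetical rule, not the recommendation engine's actual logic: the tier labels assigned to each model, the `within_reach` multiplier, and all names are assumptions, and the naive table figures stand in for cache-adjusted costs.

```python
from dataclasses import dataclass

TIER_RANK = {"budget": 0, "mid": 1, "premium": 2}


@dataclass(frozen=True)
class Candidate:
    name: str
    tier: str            # "budget", "mid", or "premium" (assumed labels)
    monthly_cost: float  # ideally cache-adjusted; naive table figures used here


def pick_with_quality_floor(candidates, floor="mid", within_reach=3.0):
    """Return the cheapest candidate, unless it falls below the quality floor
    and a floor-clearing model costs at most `within_reach` times as much."""
    cheapest = min(candidates, key=lambda c: c.monthly_cost)
    if TIER_RANK[cheapest.tier] >= TIER_RANK[floor]:
        return cheapest
    budget_cap = cheapest.monthly_cost * within_reach
    eligible = [
        c for c in candidates
        if TIER_RANK[c.tier] >= TIER_RANK[floor] and c.monthly_cost <= budget_cap
    ]
    # No floor-clearing model within reach: fall back to the cheapest pick.
    return min(eligible, key=lambda c: c.monthly_cost) if eligible else cheapest


# Tier labels below are assumptions for illustration only.
candidates = [
    Candidate("Claude Sonnet 4.6", "premium", 1620),
    Candidate("GPT-5.4", "premium", 1500),
    Candidate("GPT-5.4 mini", "mid", 450),
    Candidate("Gemini 2.5 Flash", "budget", 222),
]

print(pick_with_quality_floor(candidates).name)
```

With a 3× reach the sketch promotes the mid-tier option over the budget pick. Tightening `within_reach` to 1.5 leaves nothing eligible, so it falls back to the cheapest model. In practice the floor should come from your own accuracy evaluation rather than a tier label.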

And budget for the retries

RAG retries more than a chatbot — failures cascade across the chain (a bad retrieval feeds a bad rerank feeds a bad answer). The cost engine uses a 15% retry rate for RAG vs. 8% for simple archetypes, and the Worst Case column models cascading failures across the calls. At four-figure base cost, even a few points of retry is real money.
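As a rough illustration of what those retry rates do to the bill, the sketch below treats the retry rate as the fraction of calls that must be re-run. The function and its `calls_rerun_per_retry` parameter are hypothetical; the Worst Case column's actual formula is not described here, so the cascade line is only a crude upper-bound style estimate.

```python
def cost_with_retries(base_monthly, retry_rate, calls_rerun_per_retry=1):
    """Expected monthly cost when a fraction of calls must be re-run.

    calls_rerun_per_retry=1 means a retry repeats only the failed call;
    a larger value approximates a failure that forces more of the chain
    to run again.
    """
    return base_monthly * (1 + retry_rate * calls_rerun_per_retry)


rag = cost_with_retries(1620, 0.15)
chatbot = cost_with_retries(252, 0.08)
print(f"RAG: ${rag:,.0f}/mo (+${rag - 1620:,.0f})")
print(f"Chatbot: ${chatbot:,.0f}/mo (+${chatbot - 252:,.0f})")

# Hypothetical cascade: each failure re-runs the whole 4-call chain.
print(f"RAG, full-chain reruns: ${cost_with_retries(1620, 0.15, 4):,.0f}/mo")
```

Even the simple case adds about $243/mo to the Sonnet RAG bill, roughly the size of the entire chatbot bill.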

The RAG cost checklist

Count your real calls per request. Rewrite + rerank + generate + check is 4 only when each stage runs once; a second rewrite variant or a follow-up makes it 5. Each extra call is billed on every request.
Measure tokens per call, not per request. The generate step’s retrieved-context payload is where the input cost hides.
Maximize the cacheable prefix. Stable system prompt + retrieved docs first; per-user content last.
Pick on quality-adjusted cost. Cheapest model that clears your accuracy bar, then let caching do the rest.
Stress-test the retries. Model the Worst Case column — cascading failures are a RAG-specific tax.
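The "maximize the cacheable prefix" item can be sketched as a small prompt builder. This assumes a provider that caches on an exact-match prefix; the function name, the `(doc_id, text)` document shape, and the choice to sort documents by id are all illustrative assumptions.

```python
def build_prompt(system_prompt, retrieved_docs, user_question):
    """Split a prompt into a cache-friendly prefix and a per-request suffix.

    retrieved_docs is a list of (doc_id, text) pairs.
    """
    # Sort by id so the same set of documents always serializes identically,
    # whatever order the retriever or reranker returned them in.
    ordered = sorted(retrieved_docs, key=lambda doc: doc[0])
    doc_blocks = [f"[{doc_id}]\n{text}" for doc_id, text in ordered]
    stable_prefix = "\n\n".join([system_prompt, *doc_blocks])
    per_request_suffix = f"Question: {user_question}"
    return stable_prefix, per_request_suffix


docs_a = [("doc-2", "Refund policy ..."), ("doc-1", "Shipping policy ...")]
docs_b = list(reversed(docs_a))

prefix_a, _ = build_prompt("Answer only from the documents.", docs_a, "Can I get a refund?")
prefix_b, _ = build_prompt("Answer only from the documents.", docs_b, "How long is shipping?")
print(prefix_a == prefix_b)  # True: both requests share one cacheable prefix
```

Sorting by id trades away the reranker's relevance ordering in exchange for a stable prefix. Whether that trade is acceptable depends on how sensitive answer quality is to document order, so it is worth measuring before adopting.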
