Field Notes · Deep-dive

RAG Pipeline Cost: Why It's ~6× a Chatbot

RAG looks like one feature but it's 3–5 LLM calls per question — and with bigger prompts the bill lands ~6× a simple chatbot. Where the money goes, and how caching claws it back.

2026-06-25 · 7 min read

RAG looks like one feature — “answer questions using our docs.” Under the hood it’s three to five LLM calls per user question, each carrying a bigger prompt than a plain chatbot. That fan-out is why RAG bills routinely land at 6× a simple chatbot, not the ~1× people sketch on the whiteboard. Here’s where the money actually goes.

The call fan-out is the real driver

A simple chatbot is one call: user message in, answer out. A RAG pipeline typically chains several.

Call it ~4 calls per request as a working default. And each one isn’t a small call: the generate step carries a system prompt plus retrieved documents — ~2,000 input tokens vs. ~800 for a chatbot.

The math: ~6×, not ~4×

At 1,000 user requests/day (30,000/month) and 4 calls each, that’s 120,000 calls a month. With 2,000 input / 500 output tokens per call, the naive monthly cost is:

Model               $/call     Monthly (4 calls × 30k requests)
Claude Sonnet 4.6   $0.0135    $1,620
GPT-5.4             $0.0125    $1,500
GPT-5.4 mini        $0.00375   $450
Gemini 2.5 Flash    $0.00185   $222
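
The per-call figures follow from token counts and per-token prices. Here is a minimal sketch, assuming list prices of $3 per million input tokens and $15 per million output tokens for Claude Sonnet 4.6. Those prices are not stated in this article; they are an assumption that happens to reproduce the $0.0135 per-call figure, so check current pricing before relying on them. The function names are illustrative.

```python
def cost_per_call(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one LLM call, given prices per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def monthly_cost(per_call, calls_per_request=4, requests_per_month=30_000):
    """Naive monthly bill: no caching, no retries."""
    return per_call * calls_per_request * requests_per_month


# Assumed prices ($ per million tokens); not taken from the article.
SONNET_INPUT_PRICE = 3.00
SONNET_OUTPUT_PRICE = 15.00

per_call = cost_per_call(2_000, 500, SONNET_INPUT_PRICE, SONNET_OUTPUT_PRICE)
print(f"${per_call:.4f} per call")                  # $0.0135 per call
print(f"${monthly_cost(per_call):,.0f} per month")  # $1,620 per month
```

Changing `calls_per_request` or the token counts shows how quickly either one moves the monthly total.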

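Compare Sonnet’s RAG bill ($1,620) to the same model running a simple chatbot ($252/mo, or about $0.0084 per call at one call per request). That’s 6.4×, not the 4× you’d guess from call count alone. The extra comes from the larger prompts: each RAG call costs about 1.6× a chatbot call ($0.0135 vs. $0.0084), and 4 calls × 1.6 ≈ 6.4. More calls and more tokens per call compound. People budget for one and forget the other.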
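
The 6.4× figure splits into two multipliers, one for call count and one for cost per call. A small sketch using only the numbers quoted in this article (variable names are illustrative):

```python
REQUESTS_PER_MONTH = 30_000

rag_calls_per_request = 4
rag_cost_per_call = 0.0135   # Sonnet, 2,000 in / 500 out (from the table)
chatbot_monthly = 252.0      # Sonnet simple chatbot, one call per request

# Back out the chatbot's per-call cost from its monthly bill.
chatbot_cost_per_call = chatbot_monthly / REQUESTS_PER_MONTH

call_multiplier = rag_calls_per_request / 1   # 4 calls vs. 1
size_multiplier = rag_cost_per_call / chatbot_cost_per_call
total_multiplier = call_multiplier * size_multiplier

print(f"calls: {call_multiplier:.1f}x, per-call cost: {size_multiplier:.2f}x, "
      f"total: {total_multiplier:.1f}x")
# calls: 4.0x, per-call cost: 1.61x, total: 6.4x
```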

The headline number to internalize: a RAG feature on a premium model is a four-figure monthly line item at modest traffic. The cheapest-to-priciest spread here is ~7× ($222 vs $1,620), so model choice matters more on RAG than almost anywhere else.

Model this yourself

The RAG pipeline archetype, prefilled — 4 calls/request, 2,000/500 tokens. Adjust calls-per-request and cache hit rate and watch the spread across models.

Open in calculator →

Caching matters more on RAG than anywhere

Here’s the lever that changes RAG economics: a large chunk of every call’s input — the system prompt and (often) the retrieved context — repeats across the calls in a single request and across requests that hit the same documents. That’s exactly what prompt caching is for, and it’s why our recommendation engine weights cache savings for the RAG archetype.
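
A rough way to see what is cacheable: prompt caches generally match on a shared prefix, so only the leading portion that is identical between calls can be reused. The sketch below compares two prompt orderings by character-level common prefix. `build_prompt` and `shared_prefix_len` are hypothetical helpers written for this illustration; real providers match on tokens and have their own minimum lengths and cache-marking rules.

```python
def build_prompt(system, docs, question, stable_first=True):
    """Assemble a prompt; stable_first puts the repeated parts at the front."""
    context = "\n\n".join(docs)
    if stable_first:
        parts = [system, context, question]
    else:
        parts = [question, system, context]
    return "\n\n".join(parts)


def shared_prefix_len(a, b):
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


system = "Answer using only the provided documents."
docs = ["Doc A: refund policy ...", "Doc B: shipping policy ..."]

for stable_first in (True, False):
    p1 = build_prompt(system, docs, "How do refunds work?", stable_first)
    p2 = build_prompt(system, docs, "What does shipping cost?", stable_first)
    print(f"stable_first={stable_first}: "
          f"{shared_prefix_len(p1, p2)} of {len(p1)} characters shared")
```

With the question placed first, the two prompts diverge at the first character and nothing is shared; with the system prompt and documents first, everything up to the question is.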

Say 1,500 of the 2,000 input tokens per call are a cacheable prefix, at a conservative 40% hit rate. On Claude Sonnet 4.6 that’s roughly $113/mo back, about 7% of the $1,620 bill, and the savings grow quickly with a higher hit rate. Order your prompt so the stable parts (system instructions, retrieved docs) come first, or you leave this on the table.
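
One way to arrive at a figure near $113 is to net the discount on cache hits against the premium paid on cache writes. The sketch below assumes a $3 per million token base input price, cache reads billed at 10% of base, and cache writes billed at 125% of base. None of these rates are stated in this article; they are assumptions chosen because they land on the quoted number, and the cost engine's actual formula may differ.

```python
def cache_savings(calls_per_month, cacheable_tokens, hit_rate,
                  input_price_per_m=3.00,   # assumed base input price, $/M tokens
                  read_multiplier=0.10,     # assumed: hits billed at 10% of base
                  write_multiplier=1.25):   # assumed: misses pay a 25% write premium
    """Net monthly savings (dollars) from caching a prompt prefix."""
    total_tokens = calls_per_month * cacheable_tokens
    hit_tokens = total_tokens * hit_rate
    miss_tokens = total_tokens - hit_tokens
    saved_on_hits = hit_tokens * input_price_per_m * (1 - read_multiplier) / 1_000_000
    extra_on_misses = miss_tokens * input_price_per_m * (write_multiplier - 1) / 1_000_000
    return saved_on_hits - extra_on_misses


calls = 4 * 30_000  # 4 calls per request, 30,000 requests per month
for hit_rate in (0.4, 0.6, 0.8):
    print(f"{hit_rate:.0%} hit rate: ${cache_savings(calls, 1_500, hit_rate):,.0f}/mo")
```

Under these assumptions the 40% case comes out at about $113/mo, and each further step up in hit rate adds savings twice over: more tokens are read at the discounted rate and fewer pay the write premium.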

Don’t just grab the cheapest model

The table above makes Gemini Flash at $222/mo look irresistible next to Sonnet at $1,620. Sometimes it is. But RAG is usually customer-facing — a wrong or ungrounded answer is a visible failure, not a backend hiccup. That’s why the recommendation engine applies a quality floor on RAG: if the cheapest pick is budget-tier, it promotes a mid- or premium-tier model whose cache-adjusted cost is within reach. The right question isn’t “what’s cheapest” — it’s “what’s the cheapest model that clears my quality bar,” then let caching close the gap.
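
The quality-floor idea can be sketched as a small selection rule. Everything specific here is an assumption made for illustration: the tier assigned to each model, the `reach` threshold that stands in for "within reach", and the use of the naive table costs where the recommendation engine would use cache-adjusted ones. The engine's real logic is not described in this article.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    tier: str            # "budget", "mid", or "premium" (illustrative labels)
    monthly_cost: float  # ideally cache-adjusted; naive table figures used here


TIER_RANK = {"budget": 0, "mid": 1, "premium": 2}


def pick_model(candidates, floor="mid", reach=2.5):
    """Return the cheapest model, unless it sits below the floor and a model
    at or above the floor costs at most `reach` times as much."""
    cheapest = min(candidates, key=lambda c: c.monthly_cost)
    if TIER_RANK[cheapest.tier] >= TIER_RANK[floor]:
        return cheapest
    eligible = [c for c in candidates
                if TIER_RANK[c.tier] >= TIER_RANK[floor]
                and c.monthly_cost <= reach * cheapest.monthly_cost]
    return min(eligible, key=lambda c: c.monthly_cost) if eligible else cheapest


# Tier assignments below are assumptions, not taken from the article.
models = [
    Candidate("Claude Sonnet 4.6", "premium", 1620),
    Candidate("GPT-5.4", "premium", 1500),
    Candidate("GPT-5.4 mini", "mid", 450),
    Candidate("Gemini 2.5 Flash", "budget", 222),
]

print(pick_model(models).name)             # GPT-5.4 mini
print(pick_model(models, reach=1.5).name)  # Gemini 2.5 Flash
```

A tighter `reach` keeps the budget pick when nothing better is close in price; a looser one trades more money for quality.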

And budget for the retries

RAG retries more than a chatbot — failures cascade across the chain (a bad retrieval feeds a bad rerank feeds a bad answer). The cost engine uses a 15% retry rate for RAG vs. 8% for simple archetypes, and the Worst Case column models cascading failures across the calls. At four-figure base cost, even a few points of retry is real money.
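
To put a number on it, here is a minimal sketch that assumes a retry re-runs one call at full price, so expected cost scales by (1 + retry rate). That is a simplification chosen for this example; it does not model the cascading failures behind the Worst Case column, and the cost engine's own formula may differ.

```python
def with_retries(base_monthly, retry_rate):
    """Expected monthly cost if a fraction `retry_rate` of calls is re-run once."""
    return base_monthly * (1 + retry_rate)


rag = with_retries(1_620, 0.15)    # RAG archetype on Sonnet
chatbot = with_retries(252, 0.08)  # simple chatbot on Sonnet

print(f"RAG:     ${rag:,.0f}/mo (+${rag - 1_620:,.0f})")     # $1,863/mo (+$243)
print(f"Chatbot: ${chatbot:,.0f}/mo (+${chatbot - 252:,.0f})")  # $272/mo (+$20)
```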

The RAG cost checklist

Model your RAG pipeline — calls per request, token sizes, caching, and retries — across every model.

Open the RAG cost model →