LLM cost archetype

RAG pipeline — LLM cost calculator & pricing model

Your app searches a knowledge base (documents, database, website) and uses what it finds to answer the question. Under the hood, this requires several LLM calls — not just one.

Does this sound like your app?

□ Your app searches documents, a knowledge base, or a database before answering
□ You use a vector database (Pinecone, Weaviate, pgvector, etc.)
□ Each user question triggers multiple LLM calls behind the scenes
□ Answers are grounded in specific source documents

Real-world example

An internal company knowledge base. An employee asks 'what's our parental leave policy?' and the app rewrites the query, searches HR documents, reranks the results, then generates an answer citing the relevant policy. That adds up to 3–5 LLM calls per question.

Default cost profile

Calls per request: 4
Batch-eligible: no
Avg input tokens: 2000
Avg output tokens: 500

Assumes 3–5 LLM calls per user request (default 4): typically a query rewrite, one or more retrieval/reranking calls, and a final synthesis call. Each call averages ~2,000 input tokens, including retrieved context. Prompt caching is architecturally central because retrieval context and system prompts are re-sent on every call in the chain. The pipeline is not batch-eligible, since responses are typically user-facing and latency-sensitive.
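To make the default profile concrete, here is a minimal sketch of the token arithmetic it implies. It assumes the 500 output tokens are per call (like the input figure) and that each user sends one request per day. Neither assumption is stated above, and the names `CostProfile`, `tokens_per_request`, and `monthly_tokens` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostProfile:
    """Default RAG profile from this page (per-call averages are an assumption)."""
    calls_per_request: int = 4
    avg_input_tokens: int = 2000   # per call, including retrieved context
    avg_output_tokens: int = 500   # assumed to be per call as well


def tokens_per_request(profile: CostProfile) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) consumed by one user request."""
    return (
        profile.calls_per_request * profile.avg_input_tokens,
        profile.calls_per_request * profile.avg_output_tokens,
    )


def monthly_tokens(profile: CostProfile, requests_per_day: int, days: int = 30) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) per month at a given request volume."""
    inp, out = tokens_per_request(profile)
    return inp * requests_per_day * days, out * requests_per_day * days


if __name__ == "__main__":
    profile = CostProfile()
    print(tokens_per_request(profile))    # tokens for a single request
    print(monthly_tokens(profile, 100))   # 100 requests/day over 30 days
```

Under these assumptions one request consumes 8,000 input and 2,000 output tokens, so input volume is the larger of the two and is where caching applies.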

Rough cost

Roughly $30–500/mo at 100–1,000 users/day. Prompt caching is critical, so enable it.
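The effect of caching on a monthly bill can be sketched as follows. Every price here is a placeholder, not any provider's actual rate, and the example is not meant to reproduce the range quoted above. The function name `monthly_cost`, the per-million-token prices, the cached fraction, and the cached-read multiplier are all hypothetical. The sketch also ignores any cache-write surcharge a provider may apply, and it assumes one request per user per day.

```python
def monthly_cost(
    requests_per_day: int,
    calls_per_request: int = 4,
    input_tokens: int = 2000,             # per call
    output_tokens: int = 500,             # per call (assumption)
    input_price_per_mtok: float = 1.00,   # placeholder USD per 1M input tokens
    output_price_per_mtok: float = 5.00,  # placeholder USD per 1M output tokens
    cached_fraction: float = 0.0,         # share of input tokens served from cache
    cached_price_multiplier: float = 0.1, # placeholder discount on cached reads
    days: int = 30,
) -> float:
    """Estimate monthly spend in USD under the stated placeholder prices."""
    calls = requests_per_day * calls_per_request * days
    input_mtok = calls * input_tokens / 1_000_000
    output_mtok = calls * output_tokens / 1_000_000
    # Cached input tokens are billed at a reduced rate; the rest at full price.
    billable_input_mtok = input_mtok * (
        (1 - cached_fraction) + cached_fraction * cached_price_multiplier
    )
    return billable_input_mtok * input_price_per_mtok + output_mtok * output_price_per_mtok


if __name__ == "__main__":
    print(f"no caching:  ${monthly_cost(100):.2f}")
    print(f"50% cached:  ${monthly_cost(100, cached_fraction=0.5):.2f}")
```

Swap in your model's real input, output, and cached-read prices to get a usable estimate. Because the caching discount applies only to input tokens, the saving grows with the share of each prompt that is repeated across the chain.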

Red flag

If you're only doing one LLM call per question (no retrieval step), this is probably a Simple chatbot or Chatbot with history.
