LLM cost archetype
RAG pipeline — LLM cost calculator & pricing model
Your app searches a knowledge base (documents, database, website) and uses what it finds to answer the question. Under the hood, this requires several LLM calls — not just one.
Does this sound like your app?
- □ Your app searches documents, a knowledge base, or a database before answering
- □ You use a vector database (Pinecone, Weaviate, pgvector, etc.)
- □ Each user question triggers multiple LLM calls behind the scenes
- □ Answers are grounded in specific source documents
Real-world example
An internal company knowledge base. An employee asks 'What's our parental leave policy?' and the app rewrites the query, searches HR documents, reranks the results, then generates an answer citing the relevant policy. That adds up to 3–5 LLM calls per question.
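The flow in this example can be sketched as a short chain of calls. This is a minimal illustration, not a real implementation: `CallLog`, `search_hr_documents`, and the step names are hypothetical, the model call is a stub that returns a canned string, and the search is assumed to be a plain vector lookup that does not count as an LLM call.

```python
from dataclasses import dataclass, field


@dataclass
class CallLog:
    """Records each LLM call made while answering one question."""
    steps: list = field(default_factory=list)

    def llm(self, step: str, prompt: str) -> str:
        # Stub standing in for a real model call (hypothetical).
        self.steps.append(step)
        return f"[{step} output]"


def search_hr_documents(query: str) -> list:
    # Stand-in for a vector database lookup; assumed not to be an LLM call.
    return ["Parental leave policy", "Benefits overview", "Holiday calendar"]


def answer_question(question: str, log: CallLog) -> str:
    rewritten = log.llm("rewrite_query", question)      # LLM call 1
    candidates = search_hr_documents(rewritten)         # retrieval
    ranked = log.llm("rerank", "\n".join(candidates))   # LLM call 2
    prompt = f"{question}\n\nContext:\n{ranked}"
    return log.llm("synthesize", prompt)                # LLM call 3


log = CallLog()
answer = answer_question("What's our parental leave policy?", log)
print(len(log.steps), log.steps)
# prints: 3 ['rewrite_query', 'rerank', 'synthesize']
```

Adding a second reranking pass or a citation-checking step would push the count toward the upper end of the 3–5 range.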
Default cost profile
- Calls per request: 4
- Batch-eligible: no
- Avg input tokens: 2,000
- Avg output tokens: 500
Assumes 3–5 LLM calls per user request (default 4): typically a query rewrite, one or more retrieval/reranking calls, and a final synthesis call. Each call averages ~2,000 input tokens including retrieved context. Prompt caching is architecturally central — retrieval context and system prompts are re-sent on every call in the chain. Not batch-eligible since responses are typically user-facing and latency-sensitive.
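To see how the default profile turns into a monthly bill, the arithmetic can be sketched as below. Several things here are assumptions rather than part of the profile: the 500 output tokens are treated as a per-call average (like the input figure), each user is assumed to send one request per day, a month is taken as 30 days, and the per-million-token prices are placeholders to be replaced with your provider's actual rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostProfile:
    calls_per_request: int = 4
    avg_input_tokens: int = 2000   # per call, including retrieved context
    avg_output_tokens: int = 500   # assumed to be per call as well


def monthly_tokens(profile: CostProfile, requests_per_day: int, days: int = 30):
    """Return (input_tokens, output_tokens) for one month of traffic."""
    calls = profile.calls_per_request * requests_per_day * days
    return calls * profile.avg_input_tokens, calls * profile.avg_output_tokens


def monthly_cost(profile: CostProfile, requests_per_day: int,
                 input_price_per_mtok: float, output_price_per_mtok: float,
                 days: int = 30) -> float:
    """Monthly cost in dollars, given prices per million tokens."""
    inp, out = monthly_tokens(profile, requests_per_day, days)
    return inp / 1e6 * input_price_per_mtok + out / 1e6 * output_price_per_mtok


profile = CostProfile()
print(monthly_tokens(profile, requests_per_day=100))
# prints: (24000000, 6000000)

# Placeholder prices (hypothetical): $1.00 per 1M input, $4.00 per 1M output.
cost = monthly_cost(profile, 100, input_price_per_mtok=1.00, output_price_per_mtok=4.00)
print(f"${cost:.2f}")
# prints: $48.00
```

Because input tokens outnumber output tokens four to one in this profile, the input price and any caching discount on it tend to dominate the total.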
Rough cost
$30–500/mo at 100–1,000 users/day. Prompt caching is critical — enable it.
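The effect of prompt caching on the input side of the bill can be estimated roughly as follows. This is a sketch under stated assumptions: `cached_fraction` (the share of input tokens served from cache) and `cached_price_multiplier` (the discounted rate for cached tokens, relative to the normal input price) are hypothetical parameters, the price is a placeholder, and any surcharge a provider applies to cache writes is ignored.

```python
def effective_input_cost(input_tokens: int, price_per_mtok: float,
                         cached_fraction: float,
                         cached_price_multiplier: float) -> float:
    """Input cost in dollars when part of the prompt is read from cache."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    # Cached tokens are billed at a fraction of the normal input price.
    billable = uncached + cached * cached_price_multiplier
    return billable / 1e6 * price_per_mtok


# 4 calls x 2,000 input tokens x 100 requests/day x 30 days = 24M tokens.
tokens = 4 * 2000 * 100 * 30

# Placeholder price of $1.00 per 1M input tokens (hypothetical).
print(effective_input_cost(tokens, 1.00, cached_fraction=0.0, cached_price_multiplier=0.25))
# prints: 24.0
print(effective_input_cost(tokens, 1.00, cached_fraction=0.5, cached_price_multiplier=0.25))
# prints: 15.0
```

The actual hit rate depends on how much of each prompt is a stable prefix (system prompt, shared instructions) versus per-question retrieved context, so it is worth measuring rather than assuming.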
Red flag
If you're only doing one LLM call per question (no retrieval step), your app is probably a Simple chatbot or Chatbot with history instead.