Which archetype is your app?
Select the pattern that best matches how your app uses LLMs. Each archetype has different cost drivers — getting this right makes the estimate meaningful.
Simple chatbot
Your app takes a user message, sends it to an LLM with some instructions, and returns a response. No memory between sessions. Every conversation starts fresh.
Does this sound like your app?
- □ Users ask one-off questions and get answers
- □ Each conversation is independent: no history is carried over
- □ Your system prompt is fixed and doesn't change per user
- □ Response time matters: users are waiting in real time
Real-world example
A customer support widget on an e-commerce site. User types 'where is my order?' — the LLM responds using a fixed system prompt about the company's policies. Every session is identical in structure.
If you're including the last 5 messages in every prompt, this is actually a Chatbot with history.
Chatbot with history
Like a simple chatbot, but you include previous messages in every new prompt so the LLM 'remembers' the conversation. The more turns, the bigger (and more expensive) each call gets.
Does this sound like your app?
- □ Users have multi-turn conversations in which the AI refers back to earlier messages
- □ You pass conversation history into every API call
- □ Sessions can last 10+ messages
- □ Users expect the AI to remember what they said earlier in the same chat
Real-world example
An AI sales assistant that qualifies leads over a multi-turn conversation. By turn 8, the prompt includes the system prompt + 7 prior exchanges — easily 4,000–6,000 tokens per call.
If conversations are always 1–2 turns, use Simple chatbot instead.
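The growth in prompt size can be sketched with a few lines of arithmetic. This is a minimal illustration, not a measurement: the function name and all token sizes (a 500-token system prompt, 600 tokens per prior exchange, a 100-token new message) are assumptions chosen to land inside the range quoted in the example above.

```python
def input_tokens_at_turn(turn: int, system_tokens: int,
                         tokens_per_exchange: int, user_tokens: int) -> int:
    """Input tokens for the call at `turn` (1-indexed) when all prior exchanges are resent."""
    prior_exchanges = turn - 1
    return system_tokens + prior_exchanges * tokens_per_exchange + user_tokens


# Hypothetical sizes: 500-token system prompt, 600 tokens per prior
# user+assistant exchange, 100-token new user message.
per_turn = [input_tokens_at_turn(t, 500, 600, 100) for t in range(1, 9)]

print(per_turn[0])    # input tokens at turn 1
print(per_turn[-1])   # input tokens at turn 8
print(sum(per_turn))  # input tokens billed across the whole 8-turn session
```

Under these assumptions the session total grows roughly with the square of the number of turns, which is why long sessions dominate the bill for this archetype.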
RAG pipeline
Your app searches a knowledge base (documents, database, website) and uses what it finds to answer the question. Under the hood, this often takes several LLM calls (for example query rewriting, reranking, and answer generation), not just one.
Does this sound like your app?
- □ Your app searches documents, a knowledge base, or a database before answering
- □ You use a vector database (Pinecone, Weaviate, pgvector, etc.)
- □ Each user question triggers multiple LLM calls behind the scenes
- □ Answers are grounded in specific source documents
Real-world example
An internal company knowledge base. Employee asks 'what's our parental leave policy?' — the app rewrites the query, searches HR documents, reranks results, then generates an answer citing the relevant policy. 3–5 LLM calls per question.
If you're only doing one LLM call per question (no retrieval step), this is probably a Simple chatbot or Chatbot with history.
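As a rough sketch of why the per-question cost is a sum over several calls, the snippet below adds up three hypothetical LLM calls. The `Call` class, the call names, the token counts, and the prices ($1 per million input tokens, $4 per million output tokens) are all assumptions for illustration. Embedding and vector-search costs are not included.

```python
from dataclasses import dataclass


@dataclass
class Call:
    name: str
    input_tokens: int
    output_tokens: int


def pipeline_cost(calls, input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one question: the sum of every LLM call it triggers."""
    total = sum(
        c.input_tokens * input_price_per_m + c.output_tokens * output_price_per_m
        for c in calls
    )
    return total / 1_000_000


# Hypothetical call sizes for one question.
calls = [
    Call("rewrite_query", input_tokens=200, output_tokens=40),
    Call("rerank_results", input_tokens=1_500, output_tokens=20),
    Call("generate_answer", input_tokens=3_000, output_tokens=300),
]

cost = pipeline_cost(calls, input_price_per_m=1.0, output_price_per_m=4.0)
print(f"${cost:.5f} per question across {len(calls)} LLM calls")
```

Estimating from the final answer call alone would miss the rewrite and rerank steps, so it helps to list every call explicitly.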
Multi-model router
Your app uses a routing layer to decide which LLM to call based on the complexity or type of each request. Simple questions go to a cheap, fast model. Complex or sensitive tasks escalate to a frontier model. The cost depends heavily on how accurately the router classifies tasks.
Does this sound like your app?
- □ You use more than one LLM, and different requests go to different models
- □ You have a 'cheap path' (GPT-5.4 mini, Haiku, Gemini Flash) and an 'expensive path' (GPT-5.4, Claude Sonnet, Gemini Pro)
- □ A classifier, rules engine, or small LLM decides which model to use
- □ Routing accuracy directly affects your cost: misrouting is a budget risk
Real-world example
A B2B support platform. Simple queries like 'reset my password' route to GPT-5.4 mini (~0.07¢). Complex queries like 'explain why my enterprise integration is failing' escalate to GPT-5.4 (~0.7¢). Because the expensive path costs about 10× more per request, misrouting just 10% of simple queries to it raises their average cost from ~0.07¢ to ~0.13¢, nearly double.
If every request always goes to the same model regardless of complexity, this is not a multi-model router — use a simpler archetype.
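The effect of misrouting can be checked with a small expected-cost calculation. This sketch uses the per-request costs from the example above (~0.07¢ and ~0.7¢). The function name, the traffic mixes, and the assumption that only simple requests are ever misrouted are illustrative.

```python
def expected_cost_cents(share_complex: float, misroute_rate: float,
                        cheap_cents: float, expensive_cents: float) -> float:
    """Average cost per request when a fraction of simple requests
    is wrongly sent to the expensive model."""
    share_simple = 1 - share_complex
    to_expensive = share_complex + share_simple * misroute_rate
    to_cheap = share_simple * (1 - misroute_rate)
    return to_cheap * cheap_cents + to_expensive * expensive_cents


CHEAP, EXPENSIVE = 0.07, 0.7  # cents per request, from the example

# Case 1: all traffic is simple (hypothetical mix).
baseline = expected_cost_cents(0.0, 0.0, CHEAP, EXPENSIVE)
misrouted = expected_cost_cents(0.0, 0.10, CHEAP, EXPENSIVE)
print(f"all-simple traffic: {baseline:.3f}¢ -> {misrouted:.3f}¢ "
      f"({misrouted / baseline:.2f}x)")

# Case 2: 20% of traffic is genuinely complex (hypothetical mix).
baseline_mixed = expected_cost_cents(0.2, 0.0, CHEAP, EXPENSIVE)
misrouted_mixed = expected_cost_cents(0.2, 0.10, CHEAP, EXPENSIVE)
print(f"20% complex traffic: {baseline_mixed:.3f}¢ -> {misrouted_mixed:.3f}¢ "
      f"({misrouted_mixed / baseline_mixed:.2f}x)")
```

The multiplier depends on the traffic mix as well as the misroute rate, so both are worth measuring before relying on an estimate.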
Coding assistant
Your app helps developers write, review, explain, or debug code. Prompts are large (code files, diffs, instructions) and outputs are large (generated code, explanations). Token costs are higher than for a typical chatbot.
Does this sound like your app?
- □ Users paste code or file contents into the prompt
- □ The AI generates substantial code in response (not just a short answer)
- □ Input prompts are typically 2,000+ tokens
- □ Output responses are typically 500–2,000 tokens
Real-world example
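A PR review tool. Developer opens a pull request, and the app sends the entire diff (3,000 tokens) to an LLM and gets back a detailed code review (1,200 tokens). Output pricing matters a lot here: providers typically charge several times more per output token than per input token, so the 1,200-token review can cost more than the 3,000-token diff.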
If users are asking general programming questions without pasting code, this might be a Simple chatbot.
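To see how much of a review's cost comes from the output side, the sketch below splits one call into input and output cost using the token counts from the example (3,000 in, 1,200 out). The function name and the prices ($1 per million input tokens, $4 per million output tokens) are hypothetical placeholders, not any provider's actual rates.

```python
def review_cost(input_tokens: int, output_tokens: int,
                input_price_per_m: float, output_price_per_m: float):
    """Return (input_cost, output_cost) in dollars for one call."""
    input_cost = input_tokens * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost, output_cost


# Token counts from the PR review example; prices are assumed.
input_cost, output_cost = review_cost(3_000, 1_200, 1.0, 4.0)
output_share = output_cost / (input_cost + output_cost)

print(f"input:  ${input_cost:.4f}")
print(f"output: ${output_cost:.4f}")
print(f"output share of total: {output_share:.0%}")
```

With a 4:1 output-to-input price ratio, the shorter output already costs more than the longer input. A different ratio shifts that balance.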
Document processor
Your app processes documents in bulk — summarizing, extracting data, classifying, or translating them. This runs in the background, not in real time. Because it's async, you can use cheaper batch pricing.
Does this sound like your app?
- □ Documents are processed automatically, not triggered by a user waiting for a response
- □ You process many documents at a time (invoices, contracts, reports, emails)
- □ Results don't need to appear instantly; minutes or hours is fine
- □ The input is always a document, not a conversational message
Real-world example
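A legal tech tool that summarizes contracts overnight. 500 contracts are uploaded, and each goes through one LLM call to extract key clauses. No user is waiting, so the job can run through a provider's batch API, which typically gives about 50% off standard pricing (OpenAI and Anthropic both discount batch requests by 50%).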
If users are waiting in real time for the result, this isn't batch-eligible and the cost model changes significantly.
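The batch saving is simple to estimate. The sketch below compares a real-time run with a batch run for the 500 contracts in the example, using the 50% discount mentioned there. The function name, the per-document token counts, and the prices are assumptions for illustration only.

```python
def batch_job_cost(documents: int, input_tokens_per_doc: int, output_tokens_per_doc: int,
                   input_price_per_m: float, output_price_per_m: float,
                   batch_discount: float = 0.0) -> float:
    """Dollar cost of processing `documents` with one LLM call each."""
    per_doc = (
        input_tokens_per_doc * input_price_per_m
        + output_tokens_per_doc * output_price_per_m
    ) / 1_000_000
    return documents * per_doc * (1 - batch_discount)


# Hypothetical: 8,000 input and 500 output tokens per contract,
# $1 / $4 per million input / output tokens.
realtime = batch_job_cost(500, 8_000, 500, 1.0, 4.0)
batched = batch_job_cost(500, 8_000, 500, 1.0, 4.0, batch_discount=0.5)

print(f"real-time: ${realtime:.2f}")
print(f"batch:     ${batched:.2f}")
```

The discount applies only if the workload can tolerate the batch turnaround time, which is the eligibility test described above.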
Multi-step agent
Your app gives an AI a goal and lets it figure out the steps itself, using tools, making decisions, and iterating until it's done. Each step is a separate LLM call, so a task that takes 10 steps costs at least 10× a single-call task, and usually more, because each step typically re-sends the results of earlier steps as context.
Does this sound like your app?
- □ The AI decides what to do next based on previous results
- □ Your app uses tools such as web search, code execution, API calls, and file operations
- □ A single user task triggers 5 or more LLM calls
- □ The number of steps is unpredictable: some tasks take 3 steps, others take 15
Real-world example
An autonomous research agent. User says 'find me the 5 best competitors to my SaaS and summarize their pricing.' The agent searches the web, visits each site, extracts pricing, compares, and writes a summary — 8–12 LLM calls per task.
If the number of LLM calls per task is always exactly 1–2 and predetermined, this is probably a simpler archetype.
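Because the step count varies per task, it can help to estimate input tokens across a range of step counts. The sketch below assumes each step re-sends a base prompt plus everything earlier steps added. That is a common agent design but not the only one. The function name and token sizes (a 1,000-token base prompt, 500 tokens added per step) are hypothetical.

```python
def agent_input_tokens(steps: int, base_tokens: int, tokens_added_per_step: int) -> int:
    """Total input tokens for a task when step i re-sends the base prompt
    plus the results accumulated from the i earlier steps."""
    return sum(base_tokens + i * tokens_added_per_step for i in range(steps))


# Hypothetical sizes: 1,000-token base prompt, 500 tokens of tool results
# and reasoning appended to the context after each step.
single_call = agent_input_tokens(1, 1_000, 500)

for steps in (3, 10, 15):
    total = agent_input_tokens(steps, 1_000, 500)
    print(f"{steps:>2} steps: {total:>6} input tokens "
          f"({total / single_call:.1f}x a single call)")
```

If the context did not grow (`tokens_added_per_step=0`), 10 steps would cost exactly 10 times one call. Any accumulated context pushes the multiplier higher, so budgeting for the worst-case step count is the safer estimate.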