LLM Cost Modeling Methodology & Token Calculator

Quick start

beforeyouship is a pre-deployment cost model for LLM apps. You describe the architecture and usage pattern; the tool returns a monthly cost forecast across model tiers with honest multipliers for retries, caching, batching, and growth. It is a planning tool for the design phase — before you commit to a stack.

Pick an app archetype. Seven presets cover the common cost profiles (chatbot, RAG, multi-step agent, doc processor, etc.). Each one pre-fills sensible token defaults and retry behaviour.
Tune the usage pattern. Calls per day, LLM calls per request, average input and output token counts. Paste your prompt instead of guessing — the built-in tokenizer estimates the count locally.
Adjust cost multipliers. Retry rate, cache hit rate, batch eligibility, infrastructure overhead. Each one has a defined formula; see How costs are modeled below.
Read the cost table. Three columns: Naive (raw tokens × prices), Realistic (your multipliers applied), Worst Case (elevated retry rate and, for agent/RAG, cascading retries).
Read the lowest modeled cost. The tool ranks every model by realistic monthly cost with archetype-aware adjustments and shows the winner as {model} — $X/mo, with the numbers behind the ranking. See Model selection for the per-archetype rules.
Check the sensitivity tripwire. Pick any two models and see the retry rate at which the cheaper one stops being cheaper. Pro extends this to a full sweep across cache hit rate, input price, and volume.

Free tier covers six models across four providers. Pro unlocks the full catalog (eighteen models, all providers), per-input-layer token breakdown, cost-per-MAU, a global retry budget, exports, and integrations.

How costs are modeled

Every model in the catalog produces three monthly cost numbers. They all use the same input and output token counts; only the multipliers differ.

Base cost per call

base_cost = (in_tokens × in_price + out_tokens × out_price) / 1,000,000

The bare token math — no retries, no caching, no batch. This is what feeds the Naive $/mo column when multiplied by calls/day × 30 × calls/request.

Retry overhead

retry_cost_per_call = base_cost_per_call × retry_rate × 0.3

Default retry rate is 8% (15% for RAG and multi-step agents, which retry more often because failures compound across the chain). The 0.3 factor reflects that most retries are short — they regenerate only the failing call, not the full output.

Cache savings

savings_per_hit  = (in_tokens / 1M) × (in_price - cached_in_price)
write_overhead   = (in_tokens / 1M) × (cache_write_price - in_price) × (1 - hit_rate)
cache_savings    = savings_per_hit × hit_rate - write_overhead

Anthropic charges cache writes at 1.25× the base input price; OpenAI and Google have no write surcharge. Cache reads run 90% cheaper on OpenAI (gpt-5.x), Anthropic, and Gemini, and 98% cheaper on DeepSeek V4 Flash. The formula above captures both sides.

Batch savings

in_save  = (in_tokens / 1M) × (in_price - batch_in_price)
out_save = (out_tokens / 1M) × (out_price - batch_out_price)
batch_savings = (in_save + out_save) × batch_pct

The batch API is roughly 50% cheaper than real-time pricing on both input and output, at the cost of higher latency (minutes to hours). Only useful for offline workloads — the Document processor archetype enables it by default at 80% eligibility.

Infrastructure overhead

A flat 15% line item covers hosting, vector DB, orchestration, and observability tooling that scale with API spend. Adjust under settings; set to 0 if you self-host the infrastructure side.

Putting it together

effective_cost_per_call = base + retry - cache_savings - batch_savings
monthly_cost = effective_cost_per_call
             × calls_per_day × 30 × calls_per_request
             × (1 + infra_overhead_pct / 100)

The three output columns

Naive $/mo — base cost × call count. No retries, no caching, no batch. The number you’d compute on the back of a napkin. Intentionally muted in the UI; never shown in isolation.
Realistic $/mo — your multipliers applied as defined above. The primary number — this is the one to plan against.
Worst Case $/mo — uses an elevated retry rate (default 20%). For RAG and multi-step agent archetypes, retries compound across the chain via a cascading formula. Stress-tests how exposed you are to bad-traffic days.

Tokenization

Token counts drive the entire cost calculation, so accuracy matters. The challenge is that every provider uses a different tokenizer.

The baseline

Paste mode runs the OpenAI native tokenizer (o200k_base) locally in your browser via WebAssembly. Your prompt text never leaves the page.

Per-model adjustment

For non-OpenAI models, the cost engine multiplies the o200k_base count by a calibrated ratio. Approximate values for English text:

OpenAI, Azure OpenAI, Llama 3.x, Amazon Nova: 1.00 (baseline)
Gemini 2.5 Flash: ≈ 0.97 (SentencePiece, slightly more efficient on English)
Mistral Large 3 / Small 4: ≈ 1.03
Cohere Command R+/R, DeepSeek V4 Flash: ≈ 1.05
Claude Sonnet 4.6 / Haiku 4.5: ≈ 1.10 (verbose on English)

When to verify

Calibrated ratios are accurate to roughly ±5% for English. They’re less accurate for heavy code, multilingual content, or unusual formatting. If your workload is one of those, paste a representative sample through your provider’s own count_tokens endpoint and compare to the number shown in paste mode — divide the provider’s count by the o200k_base count to get your real ratio.

The cost engine doesn’t care about tokens above the price-per-million arithmetic. Even a 10% tokenizer error usually represents less variance than your retry-rate or cache-hit-rate guess.

Archetypes

Seven presets cover the meaningfully different LLM cost profiles. Each preset fills in default token counts, calls-per-request, retry rate, and (where appropriate) batch eligibility. All values stay editable after selection.

Archetype	Calls/req	In / Out tokens	Primary cost driver
Simple chatbot	1	800 / 400	Call volume
Chatbot with history	1	3000 / 600	Context growth
RAG pipeline	3–5	2000 / 500	Call count × caching
Multi-model router	2–4	1000 / 400	Routing logic + model mix
Coding assistant	1–2	4000 / 1500	Large input + output
Document processor	1	8000 / 800	Volume × batch savings
Multi-step agent	5–12	1500 / 600	Steps × cascading retries

Each archetype, in one line plus a typical example

Simple chatbot

DefinitionYour app takes a user message, sends it to an LLM with some instructions, and returns a response. No memory between sessions. Every conversation starts fresh.

Typical exampleA customer support widget on an e-commerce site. User types 'where is my order?' — the LLM responds using a fixed system prompt about the company's policies. Every session is identical in structure.

Full spec →

Chatbot with history

DefinitionLike a simple chatbot, but you include previous messages in every new prompt so the LLM 'remembers' the conversation. The more turns, the bigger (and more expensive) each call gets.

Typical exampleAn AI sales assistant that qualifies leads over a multi-turn conversation. By turn 8, the prompt includes the system prompt + 7 prior exchanges — easily 4,000–6,000 tokens per call.

Full spec →

RAG pipeline

DefinitionYour app searches a knowledge base (documents, database, website) and uses what it finds to answer the question. Under the hood, this requires several LLM calls — not just one.

Typical exampleAn internal company knowledge base. Employee asks 'what's our parental leave policy?' — the app rewrites the query, searches HR documents, reranks results, then generates an answer citing the relevant policy. 3–5 LLM calls per question.

Full spec →

Multi-model router

DefinitionYour app uses a routing layer to decide which LLM to call based on the complexity or type of each request. Simple questions go to a cheap, fast model. Complex or sensitive tasks escalate to a frontier model. The cost depends almost entirely on how accurately the router classifies tasks.

Typical exampleA B2B support platform. Simple queries like 'reset my password' route to GPT-5.4 mini (~0.07¢). Complex queries like 'explain why my enterprise integration is failing' escalate to GPT-5.4 (~0.7¢). A 10% misroute rate to the expensive model can 5× the expected cost.

Full spec →

Coding assistant

DefinitionYour app helps developers write, review, explain, or debug code. Prompts are large (code files, diffs, instructions) and outputs are large (generated code, explanations). Token costs are higher than a typical chatbot.

Typical exampleA PR review tool. Developer opens a pull request — the app sends the entire diff (3,000 tokens) to an LLM and gets back a detailed code review (1,200 tokens). Output pricing matters a lot here.

Full spec →

Document processor

DefinitionYour app processes documents in bulk — summarizing, extracting data, classifying, or translating them. This runs in the background, not in real time. Because it's async, you can use cheaper batch pricing.

Typical exampleA legal tech tool that summarizes contracts overnight. 500 contracts uploaded — each goes through one LLM call to extract key clauses. No user is waiting. Batch API gives 50% off.

Full spec →

Multi-step agent

DefinitionYour app gives an AI a goal and lets it figure out the steps itself — using tools, making decisions, and iterating until it's done. Each step is a separate LLM call. A task that takes 10 steps costs 10× a single-call task.

Typical exampleAn autonomous research agent. User says 'find me the 5 best competitors to my SaaS and summarize their pricing.' The agent searches the web, visits each site, extracts pricing, compares, and writes a summary — 8–12 LLM calls per task.

Full spec →

If your app doesn’t fit any preset cleanly, pick the closest one and overwrite the usage inputs. The cost engine doesn’t know which preset you picked — it just uses the numbers you supply. The archetype affects only the model-selection logic and a few UI hints.

Model selection — lowest modeled cost

The model card is analytical, not advisory. It shows the model with the lowest modeled cost for your workload as {model} — $X/mo, the actual numbers behind that ranking, and one alternative representing the opposite trade-off (a premium option if the winner is budget-tier, the cheapest credible option if the winner is premium) as a bare price comparison. No hedging, no “best fit” language — the numbers make the case, and the sensitivity tripwire below the card shows exactly when they stop holding.

The ranking is not a naive cheapest-first sort. Archetype-aware rules adjust it:

Step 1 — eliminate gross-outliers

Sort by realistic monthly cost ascending. Drop any model that costs more than 2× the cheapest, with one exception: a premium-tier model within 20% of the cheapest gets a pass on quality grounds.

Step 2 — archetype-specific scoring

RAG pipeline, Multi-step agent

Skip the 2× cull. Score = monthlyCost − 3 × cacheSavings. Cache economics dominate because system prompt + retrieval context are re-sent on every call. RAG has a quality floor: if the winner is budget-tier, promote a mid or premium model whose effective score is within 2× — RAG is customer-facing.

Document processor

Score = monthlyCost − batchSavings. Ties broken by larger context window. Biased toward batch-eligible, long-context models.

Simple chatbot, Chatbot with history

Exclude premium tier (overkill for chat). Pick the cheapest, but exclude DeepSeek unless it’s more than 50% cheaper than the next non-DeepSeek option (data-policy hedge for sensitive workloads).

Coding assistant, Multi-model router

Quality floor: exclude budget tier entirely. Coding @ 4000 / 1500 tokens and routing logic both need real reasoning capacity; 8B-class models don’t qualify. Re-run the 2× cull against the qualified mid/premium pool, then pick the cheapest.

Access-aware ranking

Free users see the lowest modeled cost within the free-tier pool (the model card on the main page). When models in the extended catalog rank below the free winner, a banner below the card states how many lower-cost models exist — a neutral count, no price and no model name. Pro users see the ranking computed over the full catalog — consistent across the main card and the model-catalog modal.

Analysis tools

Four analysis layers that sit on top of the base cost engine. None of them changes what shows up in the Naive / Realistic / Worst Case columns — they refine how you reason about the inputs. The sensitivity tripwire’s headline number is free; everything else here is Pro.

Sensitivity tripwire & full sweep

A single cost number is a point estimate; the tripwire tells you how fragile it is. Pick any two models (free tier: the six free models; Pro: the full catalog) and the headline shows the retry-rate break-even — the retry rate at which the cheaper model stops being cheaper. Retries re-bill raw, uncached tokens, so a model that wins on cache economics can lose its lead as failures climb. If the curves never cross in the 0–30% range, the headline says so — one model simply dominates.

The full sweep (Pro) extends the same analysis to every lever: cache hit rate (0–80%), input price (0.5×–2× current list price), and call volume. For each parameter it sweeps the value across its range, recomputes the realistic monthly cost at every step, and reports where your selected pair’s curves cross. All sweep math runs server-side.

Models with identical pricing across providers (e.g. GPT-5.4 on OpenAI vs Azure) tie at every point — there is no break-even to report. And note the Export Pass does not unlock the full sweep; it requires an active Pro subscription.

Input token breakdown

Splits the avg input tokens number into four named layers — system prompt, retrieval context (RAG), conversation history, and the user’s actual message. The total feeds the cost engine identically; the breakdown surfaces which layer is dominating the bill and which layer is the best caching candidate (system prompt and history are the most cache-friendly).

Cost per user / MAU

Divides the realistic monthly cost by your monthly active user count. Useful for pricing-page sanity checks (“we’re at $X / MAU at 1k users, $Y / MAU at 10k”) and board reports. Assumes calls per day scale linearly with users — adjust calls per day for your actual usage curve if it’s super-linear or sub-linear.

Global retry budget

Replaces the 8% / 15% retry-rate heuristic with a deterministic cap: max retries per call, with an optional max retries per hour. If your code has exponential backoff with a hard limit, this is more realistic than the rate-based model. The Worst Case column respects the cap when set.

MCP server (Pro)

beforeyouship runs a remote MCP (Model Context Protocol) server, so you can query cost models without leaving your editor — from Claude Code, Cursor, or any MCP client that supports the Streamable HTTP transport. Same cost engine, same pricing data, same Naive / Realistic / Worst Case output as the website. Also listed on Smithery.

Tools

list_archetypes — the seven app archetypes with their default usage parameters.
get_model_prices — current per-1M-token pricing for the full catalog, with staleness metadata.
estimate_cost — full cost model for an archetype at a given usage pattern: Naive / Realistic / Worst Case monthly cost per model, growth scenarios, and the lowest-modeled-cost pick with its alternative.

Setup

Generate an API key. Account menu → API keys → Generate API key. The key is shown once — store it in your secrets manager. Up to five active keys; revoke any time.
Add the server. In Claude Code:

claude mcp add --transport http beforeyouship \
  https://beforeyouship.dev/api/mcp \
  --header "Authorization: Bearer bys_..."

In Cursor or other MCP clients, add a remote server with URL https://beforeyouship.dev/api/mcp and the same Authorization header.

Try it. Paste this into your editor:

Estimate the monthly cost of a RAG pipeline at 10,000 requests/day

Demo mode

No key yet? Skip the --header flag and the server runs in demo mode — all three tools work against the six free-tier models. A Pro key unlocks the full catalog across all providers.

Full catalog access is a Pro feature. Keys outlive a lapsed subscription but fall back to demo mode (free-tier models only) until the subscription is active again — revoke unused keys from the account menu.

Models & pricing data

Prices come from the official rate cards of each provider. They’re stored in /data/prices.json and reviewed monthly. A weekly GitHub Action checks staleness and pings the maintainer if any entry hasn’t been re-verified in 30 days.

The "!" warning

Any model whose pricing hasn’t been re-verified within the last 30 days gets a small amber "!" next to its name. Hover the icon for the last-verified date and the source URL. It’s a heads-up, not a defect — prices rarely change between checks, but the warning prevents stale numbers from being trusted blindly.

Free vs Pro models

Free tier covers six models that span the four largest providers and the full quality spectrum: GPT-5.4, GPT-5.4 mini (OpenAI), Claude Sonnet 4.6, Claude Haiku 4.5 (Anthropic), Gemini 2.5 Flash (Google), DeepSeek V4 Flash. Pro adds twelve more across Mistral, Cohere, AWS Bedrock (Nova), Azure OpenAI, Together AI, and Groq.

Caching support

Not every provider exposes prompt caching at the API tier. OpenAI, Anthropic, Google, DeepSeek, and Azure OpenAI do; Mistral, Cohere, AWS Nova, Together, and Groq don’t. If you set a cache hit rate > 0, models without caching simply show zero cache savings — their realistic monthly cost is unaffected.

Provider deprecations

DeepSeek’s legacy endpoint names (deepseek-chat and deepseek-reasoner) retire on 2026-07-24 and already route to V4 Flash transparently. Use deepseek-v4-flash going forward — $0.14 / $0.28 per 1M with a 98% cache read discount.

What this tool does NOT model

These are real costs that you’ll incur in production but the tool deliberately leaves out. Each one is too project-specific to model honestly with usage inputs alone.

GPU rental and self-hosted inference. Running Llama or Mistral on your own GPUs replaces per-token billing with hourly GPU rental — a completely different cost structure that depends on QPS, batch size, and GPU class.
Embedding models. Embeddings are usually a separate budget line (vector indexing, semantic search). Worth modelling, but not in this tool.
Fine-tuning. Fine-tune training costs and the per-token premium on the resulting model are highly variable; not in scope.
Vector database. The 15% infrastructure overhead is a rough placeholder. Pinecone, Weaviate, or pgvector at your scale should be modelled separately if they’re a meaningful budget line.
Human-in-the-loop labeling and evals. Quality assurance, eval datasets, and human review time can dwarf the API bill in regulated domains. Out of scope here.
Observability tooling. LangSmith, Helicone, PromptLayer, etc. — modelled implicitly through the 15% overhead, but if your observability stack is significant, account for it separately.

FAQs

How accurate is the estimate?

The base token math is exact; the multipliers are estimates that you control. With well-chosen multipliers (your real retry rate, your real cache hit rate), expect ±15% accuracy at small scale and ±25% at 10× scale where rate-limit effects start to bite. Every formula is documented in the methodology section.

Can I save my model?

Pro tier includes a Saved Scenarios drawer (up to ten scenarios per account). Each saved scenario captures the archetype, usage inputs, and multipliers. Free tier doesn't persist scenarios.

Can I share a model with a teammate?

Yes — the Copy link button encodes your entire scenario into the URL. Free tier feature. Share is collaborative: the recipient can adjust the multipliers, but to keep their adjusted version they need to copy the new URL and share it back. No real-time sync.

Why does the lowest modeled cost differ between free and Pro?

The free-tier ranking is computed only from the six free models; the Pro ranking considers all eighteen. For most archetypes the same model wins both pools, but on chatbot, coding, and router archetypes the Pro pool often surfaces a significantly cheaper option. Free users see a banner with a neutral count of how many lower-cost models exist in the extended catalog — no price, no model name.

Why does Llama 8B never top the ranking for coding?

The model-selection engine enforces a quality floor on the coding-assistant and multi-model-router archetypes — budget-tier models (8B-class) are excluded outright. Llama 8B is fine for chat or simple summarisation but undersized for 4000-input / 1500-output coding work, regardless of how cheap it is.

What is the sensitivity tripwire?

Pick any two models and the tripwire shows the retry rate at which the cheaper one stops being cheaper — the break-even where the cost curves cross. Retries re-bill raw uncached tokens, so cache-driven savings erode as failures climb. The headline number is free; the full sweep (Pro) reports break-evens across cache hit rate, input price, and call volume for the same pair, computed server-side. If two models never cross in range, one simply dominates — the tripwire says so instead of inventing a number.

When should I turn on batch?

Batch saves roughly 50% on input and output but adds minutes-to-hours of latency. Use it for offline pipelines: nightly summarisation, bulk classification, document processing. Don't use it for user-facing flows. The Document processor archetype enables it by default at 80% eligibility.

What does Export Pass do?

$5 one-time unlock that gives you 24 hours of Pro access plus one included PDF or CSV export. Designed for users who need one board-ready cost artifact without committing to a monthly subscription.

What does the amber ! on a model name mean?

The model's pricing hasn't been re-verified within the last 30 days. Hover the icon for the last-verified date and the source URL. It's a freshness indicator, not a defect.

Does my prompt text leave my device when I paste it?

No. The tokenizer runs entirely client-side via WebAssembly. Your prompt text never leaves the page, never hits a server, and isn't logged. The tokenizer WASM bundle loads lazily when you first toggle paste mode.

How often is pricing data updated?

Manual review cycle runs monthly. A weekly GitHub Action checks the last_verified date on every model and alerts via Telegram if anything has gone stale beyond 30 days. Significant pricing changes are patched in as they're announced.

What if my architecture doesn't fit any preset?

Pick the closest archetype and overwrite the usage inputs (calls per day, calls per request, token counts, multipliers). The archetype only affects the model-selection logic and a few UI hints; the cost engine works off the numbers you supply.

Glossary

Naive estimate: Base token math only — no retries, caching, batch, or infra overhead.
Realistic estimate: Naive + your retry, cache, batch, and overhead multipliers applied.
Worst Case: Realistic with an elevated retry rate (default 20%) and, for RAG/agent archetypes, cascading retries across the chain.
Retry rate: Fraction of calls that fail and are retried once. Default 8% (15% for RAG/agent). Retries are billed at 30% of base call cost (regenerating only the failing call, not the full output).
Cache hit rate: Fraction of input tokens served from prompt cache rather than full input price. Default 40%; realistic range 30–60% on workloads with reusable system prompts.
Cache write: First-time tokens written to cache. Billed at 1.25× the base input price on Anthropic; free on OpenAI and Google.
Cached input: Tokens read from cache on subsequent calls. Billed at 0.1× of the base input price on OpenAI (gpt-5.x), Anthropic, and Gemini; 0.02× on DeepSeek V4 Flash.
Batch API: Asynchronous API tier ~50% cheaper than real-time on both input and output. Minutes-to-hours latency. Available on OpenAI, Anthropic, Google, AWS.
Infra overhead: Flat percentage added to monthly cost to cover hosting, vector DB, orchestration, and observability. Default 15%; adjustable under Settings.
Cascading retries: On agent/RAG archetypes, a single retry can compound across N chained calls. Worst Case uses this formula instead of a flat retry rate.
Tokenizer ratio: Per-model multiplier applied to the o200k_base count to approximate that model's native tokenizer. See Tokenization section.
Premium / mid / budget tier: Internal quality classification used by the model-selection engine. Roughly: frontier models, capable mid-range, and small/cheap models.
Sensitivity sweep: Recomputes the realistic monthly cost while one parameter (retry rate, cache hit rate, input price, or call volume) moves across its range, holding everything else fixed. Shows how fragile a cost ranking is.
Break-even (crossover): The parameter value where two models' cost curves cross — below it one model is cheaper, above it the other. The tripwire headline reports the retry-rate break-even for your chosen pair; the full sweep (Pro) reports break-evens for every swept parameter.

LLM cost modeling methodology

Quick start

How costs are modeled

Base cost per call

Retry overhead

Cache savings

Batch savings

Infrastructure overhead

Putting it together

The three output columns

Tokenization

The baseline

Per-model adjustment

When to verify

Archetypes

Each archetype, in one line plus a typical example

Model selection — lowest modeled cost

Step 1 — eliminate gross-outliers

Step 2 — archetype-specific scoring

Access-aware ranking

Analysis tools

Sensitivity tripwire & full sweep

Input token breakdown

Cost per user / MAU

Global retry budget

MCP server (Pro)

Tools

Setup

Demo mode

Models & pricing data

The "!" warning

Free vs Pro models

Caching support

Provider deprecations

What this tool does NOT model

FAQs

Glossary

Further reading