Help & docs

LLM cost modeling methodology

Formulas for the token cost calculator, retries, prompt caching, batch API, and growth — every default and assumption made explicit. beforeyouship earns trust through transparency, not by overclaiming precision.

Quick start

beforeyouship is a pre-deployment cost model for LLM apps. You describe the architecture and usage pattern; the tool returns a monthly cost forecast across model tiers with honest multipliers for retries, caching, batching, and growth. It is a planning tool for the design phase — before you commit to a stack.

  1. Pick an app archetype. Seven presets cover the common cost profiles (chatbot, RAG, multi-step agent, doc processor, etc.). Each one pre-fills sensible token defaults and retry behaviour.
  2. Tune the usage pattern. Calls per day, LLM calls per request, average input and output token counts. Paste your prompt instead of guessing — the built-in tokenizer estimates the count locally.
  3. Adjust cost multipliers. Retry rate, cache hit rate, batch eligibility, infrastructure overhead. Each one has a defined formula; see How costs are modeled below.
  4. Read the cost table. Three columns: Naive (raw tokens × prices), Realistic (your multipliers applied), Worst Case (elevated retry rate and, for agent/RAG, cascading retries).
  5. Read the recommendation. The engine picks the best-fit model for your archetype — not just the cheapest. See Recommendation engine for the per-archetype scoring rules.
Free tier covers six models across four providers. Pro unlocks the full catalog (eighteen models, all providers), per-input-layer token breakdown, cost-per-MAU, a global retry budget, exports, and integrations.

How costs are modeled

Every model in the catalog produces three monthly cost numbers. They all use the same input and output token counts; only the multipliers differ.

Base cost per call

base_cost = (in_tokens × in_price + out_tokens × out_price) / 1,000,000

The bare token math — no retries, no caching, no batch. This is what feeds the Naive $/mo column when multiplied by calls/day × 30 × calls/request.
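As a minimal sketch of the arithmetic above, the Python below computes the per-call base cost and the Naive monthly figure. The function names and the $1.00 / $4.00 per-1M prices are illustrative assumptions, not catalog values; the token counts are the Simple chatbot defaults.

```python
def base_cost_per_call(in_tokens, out_tokens, in_price, out_price):
    """Per-call cost in USD; prices are USD per 1M tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000


def naive_monthly(base_cost, calls_per_day, calls_per_request, days=30):
    """Naive $/mo: base cost times call count, with no multipliers."""
    return base_cost * calls_per_day * days * calls_per_request


# Hypothetical prices: $1.00 input, $4.00 output per 1M tokens.
base = base_cost_per_call(800, 400, in_price=1.00, out_price=4.00)
monthly = naive_monthly(base, calls_per_day=10_000, calls_per_request=1)
print(f"base per call: ${base:.4f}, naive monthly: ${monthly:.2f}")
```

At these assumed prices one call costs about a quarter of a cent, and 10,000 calls per day come to roughly $720 per month before any multiplier is applied.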

Retry overhead

retry_cost_per_call = base_cost × retry_rate × 0.3

Default retry rate is 8% (15% for RAG and multi-step agents, which retry more often because failures compound across the chain). The 0.3 factor reflects that most retries are short — they regenerate only the failing call, not the full output.
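To make the size of this term concrete, here is a small sketch with illustrative names. Because the retry cost is a fixed fraction of the base cost, the relative uplift depends only on the retry rate.

```python
RETRY_COST_FACTOR = 0.3  # the 0.3 factor from the formula above


def retry_cost_per_call(base_cost, retry_rate):
    """Expected extra spend per call caused by retries."""
    return base_cost * retry_rate * RETRY_COST_FACTOR


def retry_uplift(retry_rate):
    """Retry cost as a fraction of base cost."""
    return retry_rate * RETRY_COST_FACTOR


for label, rate in [("default", 0.08), ("RAG / agent", 0.15)]:
    print(f"{label}: +{retry_uplift(rate) * 100:.1f}% on base cost")
```

Under this formula the default 8% retry rate adds 2.4% to base cost, and the 15% rate used for RAG and multi-step agents adds 4.5%.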

Cache savings

savings_per_hit  = (in_tokens / 1M) × (in_price - cached_in_price)
write_overhead   = (in_tokens / 1M) × (cache_write_price - in_price) × (1 - hit_rate)
cache_savings    = savings_per_hit × hit_rate - write_overhead

Anthropic charges cache writes at 1.25× the base input price; OpenAI and Google have no write surcharge. Cache reads run 90% cheaper on OpenAI (gpt-5.x), Anthropic, and Gemini, and 98% cheaper on DeepSeek V4 Flash. The formula above captures both sides.
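The sketch below applies the three lines of the cache formula and also solves them for the hit rate at which savings reach zero. The function names and the $3.00 input price are assumptions for illustration; the 0.1× read price and 1.25× write price follow the multipliers quoted above.

```python
def cache_savings_per_call(in_tokens, in_price, cached_in_price,
                           cache_write_price, hit_rate):
    """Net cache savings per call in USD; prices are USD per 1M tokens."""
    millions = in_tokens / 1_000_000
    savings_per_hit = millions * (in_price - cached_in_price)
    write_overhead = millions * (cache_write_price - in_price) * (1 - hit_rate)
    return savings_per_hit * hit_rate - write_overhead


def break_even_hit_rate(in_price, cached_in_price, cache_write_price):
    """Hit rate where savings are zero: d*h = s*(1-h), so h = s / (d + s)."""
    read_discount = in_price - cached_in_price
    write_surcharge = cache_write_price - in_price
    return write_surcharge / (read_discount + write_surcharge)


IN_PRICE = 3.00  # hypothetical input price, USD per 1M tokens
with_surcharge = cache_savings_per_call(2000, IN_PRICE, 0.1 * IN_PRICE,
                                        1.25 * IN_PRICE, hit_rate=0.40)
no_surcharge = cache_savings_per_call(2000, IN_PRICE, 0.1 * IN_PRICE,
                                      IN_PRICE, hit_rate=0.40)
threshold = break_even_hit_rate(IN_PRICE, 0.1 * IN_PRICE, 1.25 * IN_PRICE)
print(f"{with_surcharge:.5f} {no_surcharge:.5f} {threshold:.3f}")
```

With a 1.25× write surcharge and a 90% read discount, the formula turns negative below a hit rate of roughly 22%, so caching only pays off above that point. Without a write surcharge any positive hit rate saves money.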

Batch savings

in_save  = (in_tokens / 1M) × (in_price - batch_in_price)
out_save = (out_tokens / 1M) × (out_price - batch_out_price)
batch_savings = (in_save + out_save) × batch_pct

The batch API is roughly 50% cheaper than real-time pricing on both input and output, at the cost of higher latency (minutes to hours). Only useful for offline workloads — the Document processor archetype enables it by default at 80% eligibility.
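A short sketch of the batch formula, using the Document processor defaults (8000 / 800 tokens, 80% eligibility). The prices are hypothetical, with batch prices set to half of real-time to match the "roughly 50%" figure.

```python
def batch_savings_per_call(in_tokens, out_tokens, in_price, out_price,
                           batch_in_price, batch_out_price, batch_pct):
    """Per-call batch savings in USD; batch_pct is the eligible fraction (0 to 1)."""
    in_save = (in_tokens / 1_000_000) * (in_price - batch_in_price)
    out_save = (out_tokens / 1_000_000) * (out_price - batch_out_price)
    return (in_save + out_save) * batch_pct


# Hypothetical real-time prices of $1.00 / $4.00 per 1M tokens.
savings = batch_savings_per_call(8000, 800, 1.00, 4.00, 0.50, 2.00, batch_pct=0.80)
base = (8000 * 1.00 + 800 * 4.00) / 1_000_000
print(f"savings per call: ${savings:.5f} ({savings / base:.0%} of base)")
```

With a 50% batch discount and 80% eligibility, the saving is 40% of base cost whatever the absolute prices are.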

Infrastructure overhead

A flat 15% line item covers hosting, vector DB, orchestration, and observability tooling that scale with API spend. Adjust under settings; set to 0 if you self-host the infrastructure side.

Putting it together

effective_cost_per_call = base_cost + retry_cost_per_call - cache_savings - batch_savings
monthly_cost = effective_cost_per_call
             × calls_per_day × 30 × calls_per_request
             × (1 + infra_overhead_pct / 100)
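The combined formula can be sketched as a single function. The per-call inputs below are illustrative numbers, not output from the real engine, and the function name is an assumption.

```python
def monthly_cost(base_cost, retry_cost, cache_savings, batch_savings,
                 calls_per_day, calls_per_request,
                 infra_overhead_pct=15.0, days=30):
    """Realistic $/mo from per-call components (all in USD per call)."""
    effective = base_cost + retry_cost - cache_savings - batch_savings
    calls = calls_per_day * days * calls_per_request
    return effective * calls * (1 + infra_overhead_pct / 100)


realistic = monthly_cost(
    base_cost=0.0024,       # illustrative per-call base cost
    retry_cost=0.0000576,   # 8% retry rate x 0.3 x base
    cache_savings=0.0003,   # illustrative
    batch_savings=0.0,      # real-time workload, no batch
    calls_per_day=10_000,
    calls_per_request=1,
)
print(f"realistic monthly: ${realistic:.2f}")
```

Setting `infra_overhead_pct=0` reproduces the self-hosted infrastructure case mentioned above.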

The three output columns

  • Naive $/mo — base cost × call count. No retries, no caching, no batch. The number you’d compute on the back of a napkin. Intentionally muted in the UI; never shown in isolation.
  • Realistic $/mo — your multipliers applied as defined above. The primary number — this is the one to plan against.
  • Worst Case $/mo — uses an elevated retry rate (default 20%). For RAG and multi-step agent archetypes, retries compound across the chain via a cascading formula. Stress-tests how exposed you are to bad-traffic days.

Tokenization

Token counts drive the entire cost calculation, so accuracy matters. The challenge is that every provider uses a different tokenizer.

The baseline

Paste mode runs the OpenAI native tokenizer (o200k_base) locally in your browser via WebAssembly. Your prompt text never leaves the page.

Per-model adjustment

For non-OpenAI models, the cost engine multiplies the o200k_base count by a calibrated ratio. Approximate values for English text:

  • OpenAI, Azure OpenAI, Llama 3.x, Amazon Nova: 1.00 (baseline)
  • Gemini 2.5 Flash: ≈ 0.97 (SentencePiece, slightly more efficient on English)
  • Mistral Large 3 / Small 4: ≈ 1.03
  • Cohere Command R+/R, DeepSeek V4 Flash: ≈ 1.05
  • Claude Sonnet 4.6 / Haiku 4.5: ≈ 1.10 (verbose on English)

When to verify

Calibrated ratios are accurate to roughly ±5% for English. They’re less accurate for heavy code, multilingual content, or unusual formatting. If your workload is one of those, paste a representative sample through your provider’s own count_tokens endpoint and compare to the number shown in paste mode — divide the provider’s count by the o200k_base count to get your real ratio.
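The ratio check described above is a single division. A minimal sketch, assuming you already have both counts for the same sample; the sample counts are made up:

```python
def calibrated_ratio(provider_count, o200k_count):
    """Provider token count divided by the o200k_base count for the same text."""
    if o200k_count <= 0:
        raise ValueError("o200k_count must be positive")
    return provider_count / o200k_count


def adjusted_tokens(o200k_count, ratio):
    """Apply a per-model ratio to an o200k_base token count."""
    return round(o200k_count * ratio)


# Hypothetical sample: paste mode shows 1,000 tokens, the provider reports 1,180.
ratio = calibrated_ratio(1180, 1000)
print(ratio, adjusted_tokens(3000, ratio))
```

If the measured ratio differs from the listed one by more than the ±5% band, scale your input and output token counts by the measured value.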

The cost engine uses token counts only in the price-per-million arithmetic, so a tokenizer error carries through to the cost estimate proportionally and nowhere else. Even a 10% tokenizer error usually matters less than the uncertainty in your retry-rate or cache-hit-rate guess.

Archetypes

Seven presets cover the meaningfully different LLM cost profiles. Each preset fills in default token counts, calls-per-request, retry rate, and (where appropriate) batch eligibility. All values stay editable after selection.

Archetype             Calls/req  In / Out tokens  Primary cost driver
Simple chatbot        1          800 / 400        Call volume
Chatbot with history  1          3000 / 600       Context growth
RAG pipeline          3–5        2000 / 500       Call count × caching
Multi-model router    2–4        1000 / 400       Routing logic + model mix
Coding assistant      1–2        4000 / 1500      Large input + output
Document processor    1          8000 / 800       Volume × batch savings
Multi-step agent      5–12       1500 / 600       Steps × cascading retries

Each archetype, in one line plus a typical example

Simple chatbot

Definition: Your app takes a user message, sends it to an LLM with some instructions, and returns a response. No memory between sessions. Every conversation starts fresh.

Typical example: A customer support widget on an e-commerce site. User types 'where is my order?' and the LLM responds using a fixed system prompt about the company's policies. Every session is identical in structure.

Full spec →

Chatbot with history

Definition: Like a simple chatbot, but you include previous messages in every new prompt so the LLM 'remembers' the conversation. The more turns, the bigger (and more expensive) each call gets.

Typical example: An AI sales assistant that qualifies leads over a multi-turn conversation. By turn 8, the prompt includes the system prompt + 7 prior exchanges, which is easily 4,000–6,000 tokens per call.

Full spec →

RAG pipeline

Definition: Your app searches a knowledge base (documents, database, website) and uses what it finds to answer the question. Under the hood, this requires several LLM calls, not just one.

Typical example: An internal company knowledge base. Employee asks 'what's our parental leave policy?' and the app rewrites the query, searches HR documents, reranks results, then generates an answer citing the relevant policy. That comes to 3–5 LLM calls per question.

Full spec →

Multi-model router

Definition: Your app uses a routing layer to decide which LLM to call based on the complexity or type of each request. Simple questions go to a cheap, fast model. Complex or sensitive tasks escalate to a frontier model. The cost depends almost entirely on how accurately the router classifies tasks.

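Typical example: A B2B support platform. Simple queries like 'reset my password' route to GPT-5.4 mini (~0.07¢). Complex queries like 'explain why my enterprise integration is failing' escalate to GPT-5.4 (~0.7¢). On traffic that should all go to the cheap model, a 10% misroute rate to the expensive model nearly doubles the expected per-call cost (0.9 × 0.07¢ + 0.1 × 0.7¢ ≈ 0.13¢ versus 0.07¢), and higher misroute rates scale it further.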

Full spec →

Coding assistant

Definition: Your app helps developers write, review, explain, or debug code. Prompts are large (code files, diffs, instructions) and outputs are large (generated code, explanations). Token costs are higher than a typical chatbot.

Typical example: A PR review tool. Developer opens a pull request, and the app sends the entire diff (3,000 tokens) to an LLM and gets back a detailed code review (1,200 tokens). Output pricing matters a lot here.

Full spec →

Document processor

Definition: Your app processes documents in bulk: summarizing, extracting data, classifying, or translating them. This runs in the background, not in real time. Because it's async, you can use cheaper batch pricing.

Typical example: A legal tech tool that summarizes contracts overnight. 500 contracts are uploaded, and each goes through one LLM call to extract key clauses. No user is waiting. Batch API gives 50% off.

Full spec →

Multi-step agent

Definition: Your app gives an AI a goal and lets it figure out the steps itself: using tools, making decisions, and iterating until it's done. Each step is a separate LLM call. A task that takes 10 steps costs 10× a single-call task.

Typical example: An autonomous research agent. User says 'find me the 5 best competitors to my SaaS and summarize their pricing.' The agent searches the web, visits each site, extracts pricing, compares, and writes a summary, for a total of 8–12 LLM calls per task.

Full spec →

If your app doesn’t fit any preset cleanly, pick the closest one and overwrite the usage inputs. The cost engine doesn’t know which preset you picked — it just uses the numbers you supply. The archetype affects only the recommendation logic and a few UI hints.

Recommendation engine

The engine picks one model as the strongest fit — not necessarily the cheapest. The scoring rule depends on the archetype.

Step 1 — eliminate gross outliers

Sort by realistic monthly cost ascending. Drop any model that costs more than 2× the cheapest, with one exception: a premium-tier model within 20% of the cheapest gets a pass on quality grounds.

Step 2 — archetype-specific scoring

RAG pipeline, Multi-step agent

Skip the 2× cull. Score = monthlyCost − 3 × cacheSavings. Cache economics dominate because system prompt + retrieval context are re-sent on every call. RAG has a quality floor: if the winner is budget-tier, promote a mid or premium model whose effective score is within 2× — RAG is customer-facing.
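One way to read this rule as code is sketched below. Several details are assumptions rather than documented behaviour: that cacheSavings is a monthly figure, that "within 2×" means a score no more than twice the winner's (with positive scores), and that the best-scoring qualifying model is the one promoted. The model entries are made up.

```python
def score(model):
    """Lower is better: monthly cost minus three times cache savings."""
    return model["monthly_cost"] - 3 * model["cache_savings"]


def pick_cache_heavy(models, quality_floor):
    """quality_floor=True models the RAG rule; False models multi-step agents."""
    ranked = sorted(models, key=score)
    winner = ranked[0]
    if quality_floor and winner["tier"] == "budget":
        limit = 2 * score(winner)  # assumed reading of "within 2x"
        for candidate in ranked[1:]:
            if candidate["tier"] in ("mid", "premium") and score(candidate) <= limit:
                return candidate
    return winner


# Hypothetical catalog entries (USD per month).
models = [
    {"name": "budget-a", "tier": "budget", "monthly_cost": 100.0, "cache_savings": 10.0},
    {"name": "mid-b", "tier": "mid", "monthly_cost": 150.0, "cache_savings": 20.0},
    {"name": "premium-c", "tier": "premium", "monthly_cost": 400.0, "cache_savings": 60.0},
]
print(pick_cache_heavy(models, quality_floor=True)["name"])
print(pick_cache_heavy(models, quality_floor=False)["name"])
```

In this example the budget model has the best score (70), so the RAG path promotes the mid-tier model (score 90, inside the 140 limit) while the agent path keeps the budget winner.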

Document processor

Score = monthlyCost − batchSavings. Ties broken by larger context window. Biased toward batch-eligible, long-context models.
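A compact sketch of this rule, using a tuple sort key so the context window only matters on a tie. Field names and entries are hypothetical, and batchSavings is assumed to be a monthly figure.

```python
def pick_document_model(models):
    """Lowest (monthly cost - batch savings); ties go to the larger context window."""
    return min(
        models,
        key=lambda m: (m["monthly_cost"] - m["batch_savings"], -m["context_window"]),
    )


# Hypothetical catalog entries (USD per month, context window in tokens).
models = [
    {"name": "short-ctx-a", "monthly_cost": 200.0, "batch_savings": 80.0, "context_window": 128_000},
    {"name": "long-ctx-b", "monthly_cost": 220.0, "batch_savings": 100.0, "context_window": 1_000_000},
    {"name": "no-batch-c", "monthly_cost": 150.0, "batch_savings": 0.0, "context_window": 128_000},
]
print(pick_document_model(models)["name"])
```

The first two entries tie at a score of 120, so the larger context window decides; the third has the lowest sticker cost but no batch savings to subtract.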

Simple chatbot, Chatbot with history

Exclude premium tier (overkill for chat). Pick the cheapest, but exclude DeepSeek unless it’s more than 50% cheaper than the next non-DeepSeek option (data-policy hedge for sensitive workloads).
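The chat rule can be sketched as follows, reading "more than 50% cheaper" as costing less than half of the cheapest non-DeepSeek option. Field names and entries are illustrative, and the Step 1 cull is left out to keep the example focused on this rule.

```python
def pick_chat_model(models):
    """Cheapest non-premium model, with the DeepSeek hedge applied."""
    pool = [m for m in models if m["tier"] != "premium"]
    others = [m for m in pool if m["provider"] != "DeepSeek"]
    deepseek = [m for m in pool if m["provider"] == "DeepSeek"]
    if not others:
        # Only DeepSeek models (or nothing) left in the pool.
        return min(deepseek, key=lambda m: m["monthly_cost"], default=None)
    best_other = min(others, key=lambda m: m["monthly_cost"])
    if deepseek:
        best_ds = min(deepseek, key=lambda m: m["monthly_cost"])
        if best_ds["monthly_cost"] < 0.5 * best_other["monthly_cost"]:
            return best_ds
    return best_other


def catalog(deepseek_cost):
    # Hypothetical entries (USD per month).
    return [
        {"name": "ds-flash", "provider": "DeepSeek", "tier": "budget", "monthly_cost": deepseek_cost},
        {"name": "mid-x", "provider": "OtherCo", "tier": "mid", "monthly_cost": 100.0},
        {"name": "premium-y", "provider": "OtherCo", "tier": "premium", "monthly_cost": 90.0},
    ]


print(pick_chat_model(catalog(40.0))["name"])
print(pick_chat_model(catalog(60.0))["name"])
```

At $40 the DeepSeek entry is more than 50% cheaper than the $100 alternative and wins; at $60 it is cheaper but not by enough, so the non-DeepSeek model is picked. The premium entry is never considered even though it undercuts the mid-tier one.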

Coding assistant, Multi-model router

Quality floor: exclude budget tier entirely. Coding @ 4000 / 1500 tokens and routing logic both need real reasoning capacity; 8B-class models don’t qualify. Re-run the 2× cull against the qualified mid/premium pool, then pick the cheapest.

Access-aware recommendation

Free users see a recommendation drawn only from the free-tier pool (the model card on the main page). When a cheaper alternative exists in the extended catalog, a FOMO banner below the recommendation surfaces the alternative’s price (without naming the model) and an upgrade CTA. Pro users see a recommendation drawn from the full catalog — consistent across the main card and the model-catalog modal.

Analysis tools (Pro)

Three Pro-only refinements that sit on top of the base cost engine. None of them changes what shows up in the Naive / Realistic / Worst Case columns — they refine how you reason about the inputs.

Input token breakdown

Splits the avg input tokens number into four named layers — system prompt, retrieval context (RAG), conversation history, and the user’s actual message. The total feeds the cost engine identically; the breakdown surfaces which layer is dominating the bill and which layer is the best caching candidate (system prompt and history are the most cache-friendly).

Cost per user / MAU

Divides the realistic monthly cost by your monthly active user count. Useful for pricing-page sanity checks (“we’re at $X / MAU at 1k users, $Y / MAU at 10k”) and board reports. Assumes calls per day scale linearly with users — adjust calls per day for your actual usage curve if it’s super-linear or sub-linear.
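A small sketch of the division and of why the linear-scaling assumption matters. All dollar figures and user counts are hypothetical.

```python
def cost_per_mau(realistic_monthly_cost, monthly_active_users):
    """Realistic $/mo divided by monthly active users."""
    if monthly_active_users <= 0:
        raise ValueError("monthly_active_users must be positive")
    return realistic_monthly_cost / monthly_active_users


at_1k = cost_per_mau(750.0, 1_000)
# Linear scaling: 10x users means 10x calls, so cost per MAU is unchanged.
at_10k_linear = cost_per_mau(7_500.0, 10_000)
# Sub-linear usage: 10x users but only 6x calls lowers cost per MAU.
at_10k_sublinear = cost_per_mau(4_500.0, 10_000)
print(at_1k, at_10k_linear, at_10k_sublinear)
```

If cost per MAU is identical at every user count in your scenarios, that is the linear assumption showing through rather than a finding about your product.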

Global retry budget

Replaces the 8% / 15% retry-rate heuristic with a deterministic cap: max retries per call, with an optional max retries per hour. If your code has exponential backoff with a hard limit, this is more realistic than the rate-based model. The Worst Case column respects the cap when set.

MCP server (Pro)

beforeyouship runs a remote MCP (Model Context Protocol) server, so you can query cost models without leaving your editor — from Claude Code, Cursor, or any MCP client that supports the Streamable HTTP transport. Same cost engine, same pricing data, same Naive / Realistic / Worst Case output as the website. Also listed on Smithery.

Tools

  • list_archetypes — the seven app archetypes with their default usage parameters.
  • get_model_prices — current per-1M-token pricing for the full catalog, with staleness metadata.
  • estimate_cost — full cost model for an archetype at a given usage pattern: Naive / Realistic / Worst Case monthly cost per model, growth scenarios, and the recommendation.

Setup

  1. Generate an API key. Account menu → API keys → Generate API key. The key is shown once, so store it in your secrets manager. Up to five active keys; revoke any time.
  2. Add the server. In Claude Code:
claude mcp add --transport http beforeyouship \
  https://beforeyouship.dev/api/mcp \
  --header "Authorization: Bearer bys_..."

In Cursor or other MCP clients, add a remote server with URL https://beforeyouship.dev/api/mcp and the same Authorization header.

  3. Try it. Paste this into your editor:
Estimate the monthly cost of a RAG pipeline at 10,000 requests/day

Demo mode

No key yet? Skip the --header flag and the server runs in demo mode — all three tools work against the six free-tier models. A Pro key unlocks the full catalog across all providers.

Full catalog access is a Pro feature. Keys outlive a lapsed subscription but fall back to demo mode (free-tier models only) until the subscription is active again — revoke unused keys from the account menu.

Models & pricing data

Prices come from the official rate cards of each provider. They’re stored in /data/prices.json and reviewed monthly. A weekly GitHub Action checks staleness and pings the maintainer if any entry hasn’t been re-verified in 30 days.

The "!" warning

Any model whose pricing hasn’t been re-verified within the last 30 days gets a small amber "!" next to its name. Hover the icon for the last-verified date and the source URL. It’s a heads-up, not a defect — prices rarely change between checks, but the warning prevents stale numbers from being trusted blindly.
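The 30-day rule can be sketched as a date comparison. The entry layout below is an assumption for illustration; the actual schema of /data/prices.json is not documented here beyond a last-verified date per model.

```python
from datetime import date


def is_stale(last_verified, today, max_age_days=30):
    """True when the entry was last verified more than max_age_days ago."""
    return (today - last_verified).days > max_age_days


# Hypothetical entries; only the idea of a last-verified date comes from the docs.
entries = [
    {"model": "model-a", "last_verified": "2026-01-02"},
    {"model": "model-b", "last_verified": "2026-01-25"},
    {"model": "model-c", "last_verified": "2026-01-11"},
]
today = date(2026, 2, 10)
stale = [
    e["model"]
    for e in entries
    if is_stale(date.fromisoformat(e["last_verified"]), today)
]
print(stale)
```

Here model-a is 39 days old and gets flagged, while model-c at exactly 30 days does not, because the sketch treats "within the last 30 days" as inclusive.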

Free vs Pro models

Free tier covers six models that span the four largest providers and the full quality spectrum: GPT-5.4, GPT-5.4 mini (OpenAI), Claude Sonnet 4.6, Claude Haiku 4.5 (Anthropic), Gemini 2.5 Flash (Google), DeepSeek V4 Flash. Pro adds twelve more across Mistral, Cohere, AWS Bedrock (Nova), Azure OpenAI, Together AI, and Groq.

Caching support

Not every provider exposes prompt caching at the API tier. OpenAI, Anthropic, Google, DeepSeek, and Azure OpenAI do; Mistral, Cohere, AWS Nova, Together, and Groq don’t. If you set a cache hit rate > 0, models without caching simply show zero cache savings — their realistic monthly cost is unaffected.

Provider deprecations

DeepSeek’s legacy endpoint names (deepseek-chat and deepseek-reasoner) retire on 2026-07-24 and already route to V4 Flash transparently. Use deepseek-v4-flash going forward — $0.14 / $0.28 per 1M with a 98% cache read discount.

What this tool does NOT model

These are real costs that you’ll incur in production but the tool deliberately leaves out. Each one is too project-specific to model honestly with usage inputs alone.

  • GPU rental and self-hosted inference. Running Llama or Mistral on your own GPUs replaces per-token billing with hourly GPU rental — a completely different cost structure that depends on QPS, batch size, and GPU class.
  • Embedding models. Embeddings are usually a separate budget line (vector indexing, semantic search). Worth modelling, but not in this tool.
  • Fine-tuning. Fine-tune training costs and the per-token premium on the resulting model are highly variable; not in scope.
  • Vector database. The 15% infrastructure overhead is a rough placeholder. Pinecone, Weaviate, or pgvector at your scale should be modelled separately if they’re a meaningful budget line.
  • Human-in-the-loop labeling and evals. Quality assurance, eval datasets, and human review time can dwarf the API bill in regulated domains. Out of scope here.
  • Observability tooling. LangSmith, Helicone, PromptLayer, etc. — modelled implicitly through the 15% overhead, but if your observability stack is significant, account for it separately.

FAQs

How accurate is the estimate?

The base token math is exact; the multipliers are estimates that you control. With well-chosen multipliers (your real retry rate, your real cache hit rate), expect ±15% accuracy at small scale and ±25% at 10× scale where rate-limit effects start to bite. Every formula is documented in the methodology section.

Can I save my model?

Pro tier includes a Saved Scenarios drawer (up to ten scenarios per account). Each saved scenario captures the archetype, usage inputs, and multipliers. Free tier doesn't persist scenarios.

Can I share a model with a teammate?

Yes — the Copy link button encodes your entire scenario into the URL. Free tier feature. Share is collaborative: the recipient can adjust the multipliers, but to keep their adjusted version they need to copy the new URL and share it back. No real-time sync.

Why does the recommendation differ between free and Pro?

The free-tier recommendation is computed only from the six free models; the Pro recommendation considers all eighteen. For most archetypes the same model wins both pools, but on chatbot, coding, and router archetypes the Pro pool often surfaces a significantly cheaper option. Free users see a banner with the cheaper alternative's price.

Why is Llama 8B never recommended for coding?

The recommendation engine enforces a quality floor on the coding-assistant and multi-model-router archetypes — budget-tier models (8B-class) are excluded outright. Llama 8B is fine for chat or simple summarisation but undersized for 4000-input / 1500-output coding work, regardless of how cheap it is.

When should I turn on batch?

Batch saves roughly 50% on input and output but adds minutes-to-hours of latency. Use it for offline pipelines: nightly summarisation, bulk classification, document processing. Don't use it for user-facing flows. The Document processor archetype enables it by default at 80% eligibility.

What does Export Pass do?

$5 one-time unlock that gives you 24 hours of Pro access plus one included PDF or CSV export. Designed for users who need one board-ready cost artifact without committing to a monthly subscription.

What does the amber ! on a model name mean?

The model's pricing hasn't been re-verified within the last 30 days. Hover the icon for the last-verified date and the source URL. It's a freshness indicator, not a defect.

Does my prompt text leave my device when I paste it?

No. The tokenizer runs entirely client-side via WebAssembly. Your prompt text never leaves the page, never hits a server, and isn't logged. The tokenizer WASM bundle loads lazily when you first toggle paste mode.

How often is pricing data updated?

Manual review cycle runs monthly. A weekly GitHub Action checks the last_verified date on every model and alerts via Telegram if anything has gone stale beyond 30 days. Significant pricing changes are patched in as they're announced.

What if my architecture doesn't fit any preset?

Pick the closest archetype and overwrite the usage inputs (calls per day, calls per request, token counts, multipliers). The archetype only affects the recommendation logic and a few UI hints; the cost engine works off the numbers you supply.

Glossary

Naive estimate
Base token math only — no retries, caching, batch, or infra overhead.
Realistic estimate
Naive + your retry, cache, batch, and overhead multipliers applied.
Worst Case
Realistic with an elevated retry rate (default 20%) and, for RAG/agent archetypes, cascading retries across the chain.
Retry rate
Fraction of calls that fail and are retried once. Default 8% (15% for RAG/agent). Retries are billed at 30% of base call cost (regenerating only the failing call, not the full output).
Cache hit rate
Fraction of input tokens served from prompt cache rather than full input price. Default 40%; realistic range 30–60% on workloads with reusable system prompts.
Cache write
First-time tokens written to cache. Billed at 1.25× the base input price on Anthropic; free on OpenAI and Google.
Cached input
Tokens read from cache on subsequent calls. Billed at 0.1× of the base input price on OpenAI (gpt-5.x), Anthropic, and Gemini; 0.02× on DeepSeek V4 Flash.
Batch API
Asynchronous API tier ~50% cheaper than real-time on both input and output. Minutes-to-hours latency. Available on OpenAI, Anthropic, Google, AWS.
Infra overhead
Flat percentage added to monthly cost to cover hosting, vector DB, orchestration, and observability. Default 15%; adjustable under Settings.
Cascading retries
On agent/RAG archetypes, a single retry can compound across N chained calls. Worst Case uses this formula instead of a flat retry rate.
Tokenizer ratio
Per-model multiplier applied to the o200k_base count to approximate that model's native tokenizer. See Tokenization section.
Premium / mid / budget tier
Internal quality classification used by the recommendation engine. Roughly: frontier models, capable mid-range, and small/cheap models.

Further reading

Worked examples and cost breakdowns live in Field Notes; start with "How Much Does an LLM App Actually Cost?"