Query LLM Costs From Claude Code (Without Leaving the PR)

The cost question never arrives when you have a calculator tab open. It arrives mid-PR-review (“wait, this adds two LLM calls per request — what does that do to the bill?”), or while an agent is scaffolding a retrieval pipeline and you want to know if the architecture survives 10× growth before it’s merged. So we put the cost model where those questions happen: beforeyouship now runs a remote MCP server, and any MCP client — Claude Code, Cursor, your own agent — can query it directly.

Three tools, same engine

The server exposes the exact engine and pricing data behind the calculator — not a summary of it. Three tools:

list_archetypes — the seven architecture presets (chatbot, RAG, multi-step agent, document processor…) with their default call counts and token shapes. The starting vocabulary for any estimate.
get_model_prices — per-1M-token pricing for the full 18-model catalog, with context windows, cache/batch support, and a last_verified date on every row.
estimate_cost — the full model: pick an archetype, give it calls/day, optionally override tokens and multipliers. Back comes the same Naive / Realistic / Worst Case table the website renders, plus growth scenarios and the opinionated recommendation.

Two-minute setup

Generate an API key (account menu → API keys — it’s shown once, store it), then register the server:

claude mcp add --transport http beforeyouship \
  https://beforeyouship.dev/api/mcp \
  --header "Authorization: Bearer bys_..."

Cursor and other clients take the same URL and header in their remote-server config. That’s the whole setup — no local process, no npm install, nothing to keep updated. The server deploys with the site, so pricing changes reach your editor the same minute they reach the website.

What it looks like in practice

Ask your agent something like “what does this RAG service cost at 1,000 requests a day?” and it calls estimate_cost(archetype: "rag-pipeline", calls_per_day: 1000). The response (real output, abridged):

## RAG pipeline — 1,000 requests/day

| Model              | Naive $/mo | Realistic $/mo | Worst Case $/mo |
|--------------------|-----------|----------------|-----------------|
| Llama 3.1 8B (Groq)|    $16.80 |         $20.19 |          $30.19 |
| DeepSeek V4 Flash  |    $52.92 |         $47.69 |          $70.24 |
| Gemini 2.5 Flash   |      $215 |           $230 |            $341 |
| ...                |           |                |                 |

Recommendation: DeepSeek V4 Flash
- Prompt caching saves ~$16/mo (25% of pre-cache cost). RAG pipelines
  re-send retrieval context on every call — caching directly offsets it.
⚠️ Review DeepSeek's data residency policy before committing.

Assumptions: retry 15% (worst case 25%, cascading), cache hit 40%,
infra overhead 15%, 4 calls/request × 2000 in / 500 out tokens.

Two things worth noticing. The DeepSeek row’s realistic cost is lower than its naive cost — its cache discount is deep enough that a 40% hit rate outweighs retries and infra overhead combined. And the worst-case column for RAG uses the cascading retry model, because in a 4-call chain a failure doesn’t retry one call — it can re-trigger the calls behind it. Both are details an agent eyeballing a pricing page would never surface.

Model this yourself

The same estimate, in the browser — prefilled with the RAG archetype. Adjust the multipliers and compare against the MCP output.

Open in calculator →

Why a tool, and not just a prompt

You could paste a pricing table into your agent’s context and ask it to do the math. The problem isn’t arithmetic — it’s provenance. Models confidently recall outdated prices from training data, and a hand-pasted table starts rotting the day you paste it. The MCP server’s numbers come from the same version-controlled prices.json the website serves, re-verified against provider pages and shipped with a last_verified date per model. If the data goes stale, the response says so — get_model_prices carries explicit staleness metadata instead of letting old numbers pass as current.

The multiplier logic matters just as much. Retry cost (failed calls still burn tokens), cache-write premiums, batch eligibility, cascading agent retries — that’s encoded in the engine, not re-derived by the LLM each time. The agent contributes what it’s good at: reading your code to pick the right archetype and parameters. The engine contributes what it’s good at: getting the same answer twice.

The MCP server is a Pro feature. Keys are scoped to your account, capped at five, and revocable from the account menu. If a subscription lapses, existing keys stay valid but requests return a clear “Pro required” error rather than silently degrading — your agent will tell you exactly what happened.

Where this goes

The obvious next step is wiring it into CI — a GitHub Action that comments a cost delta on PRs that touch prompts or model config. That’s on the roadmap. For now, the loop we wanted is closed: the question “what does this cost?” gets answered in the same window where the architecture is being written, with numbers that have a paper trail.

Setup details, tool reference, and key management — two minutes from key to first estimate.

Read the MCP docs →