LLM Model Routing: Cost Optimization Without the Quality Hit

Model routing is the most reliable LLM cost optimization that doesn’t touch quality: a small routing layer decides, per request, whether a cheap model can handle the request or whether it should be escalated to a frontier model. In the worked example below, easy questions go to a $0.003 call and hard ones get the $0.009 call. Done right, routing cuts the bill roughly in half. Done carelessly, the savings quietly evaporate, and the failure modes are all measurable up front.

The baseline math: ~52% off

Take 1,000 requests/day (30,000/month) at 1,000 input / 400 output tokens per call. Sending everything straight to Claude Sonnet 4.6 costs $0.009 per call, or about $270/mo. Now add a router:

Routed setup, per request:
  1× classifier call     Gemini 2.5 Flash (300/20 tokens)  ≈ $0.0001
  80% answered by        Claude Haiku 4.5                   $0.0030/call
  20% escalated to       Claude Sonnet 4.6                  $0.0090/call

Monthly:  classifier $4  +  easy $72  +  hard $54  =  $130/mo
vs. all-Sonnet baseline:                               $270/mo   (−52%)

Half the bill, and the 20% of traffic that actually needs frontier reasoning still gets it. That’s the pitch — and unlike most “optimization” advice, the quality floor is structural: hard requests are escalated by design, not downgraded.
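
The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch, not a pricing reference: the per-million-token rates in PRICES are assumptions back-derived from the per-call figures above, and the function names are illustrative. Check them against current price sheets before relying on the output.

```python
# Assumed list prices in USD per million tokens (input, output).
# Back-derived from the article's per-call figures; they may be out of date.
PRICES = {
    "classifier": (0.30, 2.50),   # assumed rate for Gemini 2.5 Flash
    "cheap": (1.00, 5.00),        # assumed rate for Claude Haiku 4.5
    "premium": (3.00, 15.00),     # assumed rate for Claude Sonnet 4.6
}

def call_cost(tier, input_tokens, output_tokens):
    """Cost in USD of one call on the given tier."""
    price_in, price_out = PRICES[tier]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

def monthly_costs(requests=30_000, escalation_rate=0.20):
    """Return (baseline, routed) monthly cost in USD."""
    answer_cheap = call_cost("cheap", 1_000, 400)
    answer_premium = call_cost("premium", 1_000, 400)
    classify = call_cost("classifier", 300, 20)   # paid on every request
    baseline = requests * answer_premium
    routed = requests * (
        classify
        + (1 - escalation_rate) * answer_cheap
        + escalation_rate * answer_premium
    )
    return baseline, routed

baseline, routed = monthly_costs()
print(f"baseline ${baseline:.0f}/mo, routed ${routed:.0f}/mo, "
      f"saving {1 - routed / baseline:.0%}")
```

Under these assumed rates the classifier comes to about $0.00014 per call, which is where the roughly $4/mo overhead in the table comes from.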

The number that decides everything: escalation rate

The 52% figure assumed 20% of requests escalate. That assumption is doing all the work. Re-run it at a 50% escalation rate — a router that’s cautious, or a workload that’s genuinely harder than you thought:

20% escalate  →  $130/mo   (−52% vs all-premium)
50% escalate  →  $184/mo   (−32%)
~100% escalate →  ≈ baseline + router overhead  (worse than no router)

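In this model savings decay linearly with escalation rate: every extra percentage point of escalated traffic adds about $1.80/mo, and at roughly 98% escalation the routed setup costs as much as the all-premium baseline. Past that crossover the router is pure overhead. Before you build one, estimate your real easy/hard split from a traffic sample; don’t assume 80/20 because a blog post (including this one) used it as the example.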
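
To see where the crossover sits for your own numbers, a sweep like the following may help. It is a sketch under the example's assumptions: the per-call costs come from the table above, while the classifier cost of $0.00014 per call is an assumed value consistent with the roughly $4/mo figure.

```python
REQUESTS = 30_000          # requests per month, from the example above
CHEAP = 0.003              # USD per answer on the cheap model
PREMIUM = 0.009            # USD per answer on the premium model
CLASSIFIER = 0.00014       # USD per routing call (assumed; about $4/mo)

def routed_monthly(escalation_rate):
    """Monthly cost in USD of the routed setup at a given escalation rate."""
    per_request = (
        CLASSIFIER
        + (1 - escalation_rate) * CHEAP
        + escalation_rate * PREMIUM
    )
    return REQUESTS * per_request

def break_even_rate():
    """Escalation rate at which routed cost equals the all-premium baseline."""
    # Solve CLASSIFIER + CHEAP + e * (PREMIUM - CHEAP) = PREMIUM for e.
    return 1 - CLASSIFIER / (PREMIUM - CHEAP)

baseline = REQUESTS * PREMIUM
for rate in (0.2, 0.5, 0.8, 1.0):
    cost = routed_monthly(rate)
    print(f"{rate:>4.0%} escalate  ${cost:6.1f}/mo  ({cost / baseline - 1:+.0%})")
print(f"break-even at {break_even_rate():.1%} escalation")
```

The break-even only says when routing stops being cheaper than the baseline. The saving may stop covering the router's engineering and maintenance cost well before that point.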

This is exactly the kind of assumption the sensitivity tripwire is built to stress-test: pick your cheap and premium models as the pair, and the break-even headline tells you where the ranking flips as conditions drift. If your escalation estimate sits near a crossover, the router’s ROI is fragile — know that before you write the code.

Model this yourself

The multi-model router archetype, prefilled — 2–4 calls/request with a routing mix. Adjust the inputs to your traffic and compare against a single-model setup.

Three ways routing quietly stops paying

1 · The classifier runs on the wrong model

The routing call itself should be nearly free — a short prompt, a tiny output, on the cheapest capable model. On Gemini 2.5 Flash the classifier above costs ~$4/mo. Run that same call on Sonnet and it’s ~$36/mo — nine times the overhead, skimmed straight off your savings before a single request is answered. Small models classify intent well; don’t spend frontier tokens deciding who answers.
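
One way to keep the classifier on the cheap tier is to make it an explicit dependency of the router rather than reusing whichever client is at hand. The sketch below is illustrative only: ModelCall, ROUTING_PROMPT, and make_router are hypothetical names, and the lambdas are stubs standing in for real API clients.

```python
from typing import Callable

# A model call is any function from prompt text to response text
# (hypothetical interface; real clients would wrap an API SDK).
ModelCall = Callable[[str], str]

ROUTING_PROMPT = (
    "Classify the request as EASY or HARD. Reply with one word.\n\n"
    "Request: {request}"
)

def make_router(classify: ModelCall, cheap: ModelCall, premium: ModelCall):
    """Build a router whose classifier is pinned to its own (cheap) model."""
    def route(request: str) -> tuple[str, str]:
        label = classify(ROUTING_PROMPT.format(request=request)).strip().upper()
        # Anything other than a clear EASY escalates. This protects the
        # quality floor at the price of a higher escalation rate.
        if label == "EASY":
            return "cheap", cheap(request)
        return "premium", premium(request)
    return route

# Stub callables stand in for real model clients.
router = make_router(
    classify=lambda prompt: "hard" if "prove" in prompt.lower() else "easy",
    cheap=lambda request: f"[cheap] {request}",
    premium=lambda request: f"[premium] {request}",
)
print(router("What is your refund policy?"))
print(router("Prove this invariant holds"))
```

Escalating on any unrecognized label is a design choice, not a requirement. It trades a little cost for safety, and it is one of the places where escalation creep can start.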

2 · Escalation creep

Routers drift toward caution: a few bad cheap-model answers, someone widens the escalation criteria, and six months later 60% of traffic is hitting the premium model. The fix is treating escalation rate as a monitored metric with a budget — the same way you’d watch a retry rate — not a set-and-forget config value.
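
Monitoring escalation rate against a budget can be as simple as a rolling window. The class below is a minimal sketch: EscalationMonitor is a hypothetical name, the 25% budget and 100-request window are arbitrary illustration values, and the traffic is synthetic.

```python
from collections import deque

class EscalationMonitor:
    """Tracks the escalation rate over the most recent `window` requests."""

    def __init__(self, budget: float, window: int = 1000):
        self.budget = budget                 # maximum acceptable escalation rate
        self.events = deque(maxlen=window)   # True = escalated, False = answered cheaply

    def record(self, escalated: bool) -> None:
        self.events.append(escalated)

    @property
    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def over_budget(self) -> bool:
        # Only judge once the window is full, so a handful of early
        # requests cannot trip the alert.
        return len(self.events) == self.events.maxlen and self.rate > self.budget

monitor = EscalationMonitor(budget=0.25, window=100)
for i in range(100):
    monitor.record(i % 3 == 0)   # synthetic traffic: about one request in three escalates
print(f"escalation rate {monitor.rate:.0%}, over budget: {monitor.over_budget()}")
```

In practice the over-budget signal would feed whatever alerting already watches retry and error rates, so that widening the escalation criteria becomes a visible decision instead of a silent one.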

3 · Retries hit the escalated calls hardest

Failures tend to concentrate on the hard requests: the ones already on the expensive model, often mid-chain in a multi-step flow. That means your retry spend is disproportionately premium spend. The engine’s Worst Case column models this; if your realistic and worst-case numbers are far apart on the router archetype, escalated-call retries are usually why.
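
A rough way to size this effect is to multiply each tier's spend by its expected number of attempts. The sketch below assumes independent failures and a fixed retry cap. The failure rates (2% on easy requests, 15% on escalated ones) are hypothetical and chosen only to illustrate the asymmetry, and classifier retries are ignored.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected calls per request when a failed attempt is retried up to max_retries times."""
    # Attempt k+1 happens only if the first k attempts all failed.
    return sum(failure_rate ** k for k in range(max_retries + 1))

def monthly_with_retries(requests, escalation_rate, cheap_fail, premium_fail,
                         cheap_cost=0.003, premium_cost=0.009, max_retries=2):
    """Return (cheap_spend, premium_spend) in USD per month, retries included."""
    cheap = (1 - escalation_rate) * cheap_cost * expected_attempts(cheap_fail, max_retries)
    premium = escalation_rate * premium_cost * expected_attempts(premium_fail, max_retries)
    return requests * cheap, requests * premium

# No-retry spend per tier, then the same traffic with hypothetical failure rates.
base_cheap, base_premium = monthly_with_retries(30_000, 0.20, 0.0, 0.0)
cheap_spend, premium_spend = monthly_with_retries(30_000, 0.20, 0.02, 0.15)

retry_cheap = cheap_spend - base_cheap
retry_premium = premium_spend - base_premium
print(f"retry spend: cheap ${retry_cheap:.2f}/mo, premium ${retry_premium:.2f}/mo")
```

With these illustrative rates, most of the retry spend lands on the premium tier even though it serves only 20% of the traffic.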

When routing is worth building

Your traffic is genuinely mixed. If 90% of requests need frontier reasoning, there’s nothing to route — pick the right single model instead.
Volume is high enough to matter. A 52% saving on $50/mo doesn’t pay for the routing layer’s engineering time. On $3,000/mo it does, quickly.
You can measure quality per tier. Routing without evals is how escalation creep starts — you need a signal for “the cheap model handled this fine.”
The break-even isn’t on your doorstep. Model the crossover before building. If modest drift in escalation rate or pricing wipes the saving, spend the effort on caching instead — it’s cheaper to implement.
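
The volume criterion above can be turned into a quick payback estimate. This is a back-of-the-envelope sketch: payback_months is a hypothetical helper, and the $4,000 build cost is a placeholder to replace with your own estimate of the engineering time involved.

```python
def payback_months(monthly_bill, saving_rate, build_cost, monthly_upkeep=0.0):
    """Months until cumulative savings cover the build cost; None if they never do."""
    net = monthly_bill * saving_rate - monthly_upkeep
    return build_cost / net if net > 0 else None

BUILD_COST = 4_000   # hypothetical: cost in USD to build and evaluate the router

for bill in (50, 3_000):
    months = payback_months(bill, 0.52, BUILD_COST)
    print(f"${bill}/mo bill: payback in {months:.1f} months")
```

The saving_rate argument is the input to stress first: rerun the estimate with the saving you would get at a higher escalation rate, not only with the 52% headline.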

Model your routing mix — escalation rate, per-tier models, retries — and see the real monthly numbers before you build the router.
