Model routing is the most reliable LLM cost optimization that doesn’t touch quality: a small routing layer decides, per request, whether a cheap model can handle it or whether it escalates to a frontier model. Easy questions go to a $0.003 call; hard ones get the $0.009 call. Done right, it cuts the bill roughly in half. Done carelessly, the savings quietly evaporate — and the failure modes are all measurable up front.
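As a rough illustration of the routing layer described above, here is a minimal sketch. The `classify`, `cheap_model`, and `premium_model` callables are hypothetical stand-ins for real API calls, and the "easy"/"hard" labels are an assumed classifier contract rather than any provider's API.

```python
from typing import Callable

Model = Callable[[str], str]


def route(prompt: str, classify: Callable[[str], str],
          cheap_model: Model, premium_model: Model) -> tuple[str, str]:
    """Return (tier, answer). Anything not labelled "easy" escalates."""
    label = classify(prompt).strip().lower()
    # Fail toward quality: an unexpected or malformed label goes premium.
    if label == "easy":
        return "cheap", cheap_model(prompt)
    return "premium", premium_model(prompt)


# Stub callables so the sketch runs without network access (all hypothetical).
def toy_classifier(prompt: str) -> str:
    return "easy" if len(prompt.split()) < 12 else "hard"


def toy_cheap(prompt: str) -> str:
    return f"[cheap] {prompt}"


def toy_premium(prompt: str) -> str:
    return f"[premium] {prompt}"


tier, answer = route("What is your refund policy?",
                     toy_classifier, toy_cheap, toy_premium)
print(tier, answer)
```

Defaulting to the premium tier on any label other than "easy" is one way to keep the quality floor structural; it also means a noisy classifier raises cost rather than lowering answer quality.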
The baseline math: ~52% off
Take 1,000 requests/day (30,000/month), 1,000 input / 400 output tokens per call. Sending everything straight to Claude Sonnet 4.6 costs about $270/mo. Now add a router:
Routed setup, per request:
- 1× classifier call on Gemini 2.5 Flash (300/20 tokens): ≈ $0.0001
- 80% answered by Claude Haiku 4.5: $0.0030/call
- 20% escalated to Claude Sonnet 4.6: $0.0090/call

Monthly: classifier $4 + easy $72 + hard $54 = $130/mo, vs. the all-Sonnet baseline of $270/mo (−52%).
Half the bill, and the 20% of traffic that actually needs frontier reasoning still gets it. That’s the pitch — and unlike most “optimization” advice, the quality floor is structural: hard requests are escalated by design, not downgraded.
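The arithmetic behind those figures can be sketched in a few lines. The per-call costs and volume are the article's; the unrounded classifier cost of $0.00014 is an assumption chosen to match the "≈ $0.0001" per call and ~$4/mo figures.

```python
REQUESTS_PER_MONTH = 30_000
CHEAP_CALL = 0.0030         # Claude Haiku 4.5, 1,000 in / 400 out
PREMIUM_CALL = 0.0090       # Claude Sonnet 4.6, 1,000 in / 400 out
CLASSIFIER_CALL = 0.00014   # assumed unrounded value behind "≈ $0.0001"


def monthly_cost(escalation_rate: float,
                 requests: int = REQUESTS_PER_MONTH) -> float:
    """Routed monthly cost: every request pays the classifier, then one tier."""
    per_request = (CLASSIFIER_CALL
                   + (1 - escalation_rate) * CHEAP_CALL
                   + escalation_rate * PREMIUM_CALL)
    return requests * per_request


baseline = REQUESTS_PER_MONTH * PREMIUM_CALL   # everything on the premium model
routed = monthly_cost(0.20)
print(f"baseline ${baseline:.0f}/mo, routed ${routed:.0f}/mo, "
      f"saving {1 - routed / baseline:.0%}")
```

Swapping in your own volume and per-call costs is the quickest way to see whether the headline percentage survives contact with your traffic.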
The number that decides everything: escalation rate
The 52% figure assumed 20% of requests escalate. That assumption is doing all the work. Re-run it at a 50% escalation rate — a router that’s cautious, or a workload that’s genuinely harder than you thought:
- 20% escalate → $130/mo (−52% vs. all-premium)
- 50% escalate → $184/mo (−32%)
- ~100% escalate → ≈ baseline + router overhead (worse than no router)
Under this cost model, savings fall linearly with escalation rate, and past a crossover point (about 98% escalation at these prices) the router is pure overhead. Before you build one, estimate your real easy/hard split from a traffic sample; don’t assume 80/20 because a blog post (including this one) used it as the example.
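To make the decay and the crossover concrete, here is a small sketch that sweeps the escalation rate and solves for the break-even point. It reuses the article's per-call costs; the $0.00014 classifier cost is an assumed unrounded value consistent with ~$4/mo.

```python
CHEAP_CALL = 0.0030
PREMIUM_CALL = 0.0090
CLASSIFIER_CALL = 0.00014   # assumption: unrounded classifier cost per request


def saving_fraction(escalation_rate: float) -> float:
    """Fraction saved vs. sending every request to the premium model."""
    routed = (CLASSIFIER_CALL
              + (1 - escalation_rate) * CHEAP_CALL
              + escalation_rate * PREMIUM_CALL)
    return 1 - routed / PREMIUM_CALL


def crossover_rate() -> float:
    """Escalation rate at which routed cost equals the all-premium baseline.

    Setting classifier + cheap + e * (premium - cheap) = premium and
    solving for e gives 1 - classifier / (premium - cheap).
    """
    return 1 - CLASSIFIER_CALL / (PREMIUM_CALL - CHEAP_CALL)


for rate in (0.20, 0.50, 0.80, 1.00):
    print(f"{rate:4.0%} escalate -> saving {saving_fraction(rate):+.1%}")
print(f"crossover at {crossover_rate():.1%} escalation")
```

The crossover depends only on the ratio of classifier cost to the price gap between tiers, so a pricier classifier or a narrower gap pulls it in quickly.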
Model this yourself
The multi-model router archetype, prefilled — 2–4 calls/request with a routing mix. Adjust the inputs to your traffic and compare against a single-model setup.
Open in calculator →
Three ways routing quietly stops paying
1 · The classifier runs on the wrong model
The routing call itself should be nearly free — a short prompt, a tiny output, on the cheapest capable model. On Gemini 2.5 Flash the classifier above costs ~$4/mo. Run that same call on Sonnet and it’s ~$36/mo — nine times the overhead, skimmed straight off your savings before a single request is answered. Small models classify intent well; don’t spend frontier tokens deciding who answers.
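A quick sketch of that overhead comparison follows. The per-million-token prices are assumptions, picked because they reproduce the article's per-call and monthly figures; check the providers' current price sheets before relying on them.

```python
REQUESTS_PER_MONTH = 30_000
CLASSIFIER_TOKENS = (300, 20)   # input, output tokens per routing call

# Assumed USD prices per million tokens (input, output); not from the article.
PRICES = {
    "gemini-2.5-flash": (0.30, 2.50),
    "claude-sonnet-4.6": (3.00, 15.00),
}


def classifier_monthly(model: str) -> float:
    """Monthly cost of running the routing call on the given model."""
    tokens_in, tokens_out = CLASSIFIER_TOKENS
    price_in, price_out = PRICES[model]
    per_call = (tokens_in * price_in + tokens_out * price_out) / 1_000_000
    return REQUESTS_PER_MONTH * per_call


# Gross saving at 20% escalation, before paying for the classifier:
# $270 baseline minus ($72 easy + $54 hard), all figures from the article.
gross_saving = 270 - (72 + 54)

for model in PRICES:
    cost = classifier_monthly(model)
    print(f"{model}: ${cost:.2f}/mo, "
          f"{cost / gross_saving:.0%} of the gross saving")
```

Framing the classifier as a share of the gross saving, rather than as an absolute dollar figure, makes it easier to compare routing setups at different volumes.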
2 · Escalation creep
Routers drift toward caution: a few bad cheap-model answers, someone widens the escalation criteria, and six months later 60% of traffic is hitting the premium model. The fix is treating escalation rate as a monitored metric with a budget — the same way you’d watch a retry rate — not a set-and-forget config value.
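One way to treat escalation rate as a budgeted metric is a rolling-window monitor. This is a minimal sketch: the class name, the 25% budget, and the window size are all hypothetical choices, and a real deployment would more likely emit the rate to an existing metrics system.

```python
from collections import deque


class EscalationMonitor:
    """Tracks the escalation rate over the most recent routing decisions."""

    def __init__(self, budget: float = 0.25, window: int = 1_000) -> None:
        self.budget = budget
        self.decisions: deque[bool] = deque(maxlen=window)

    def record(self, escalated: bool) -> None:
        self.decisions.append(escalated)

    @property
    def rate(self) -> float:
        if not self.decisions:
            return 0.0
        return sum(self.decisions) / len(self.decisions)

    def over_budget(self) -> bool:
        # Only alert on a full window, so a few early escalations do not page.
        window_full = len(self.decisions) == self.decisions.maxlen
        return window_full and self.rate > self.budget


monitor = EscalationMonitor(budget=0.25, window=10)
for escalated in [True, True] + [False] * 8:
    monitor.record(escalated)
print(monitor.rate, monitor.over_budget())

for _ in range(5):          # simulate drift toward caution
    monitor.record(True)
print(monitor.rate, monitor.over_budget())
```

Setting the budget a little above the rate the cost model was built on gives early warning of creep while leaving room for normal variation.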
3 · Retries hit the escalated calls hardest
Failures concentrate on the hard requests — the ones already on the expensive model, often mid-chain in a multi-step flow. That means your retry spend is disproportionately premium spend. The engine’s Worst Case column models this; if your realistic and worst-case numbers are far apart on the router archetype, escalated-call retries are usually why.
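The skew toward premium retry spend can be illustrated with a simple expected-value sketch. The failure rates and the three-attempt cap are hypothetical placeholders, each retry is assumed to cost the same as the original call, and this is not the calculator's Worst Case model.

```python
REQUESTS_PER_MONTH = 30_000
ESCALATION_RATE = 0.20
CHEAP_CALL, PREMIUM_CALL = 0.0030, 0.0090   # per-call costs from the article

# Hypothetical failure rates and retry cap; replace with measured values.
CHEAP_FAIL, PREMIUM_FAIL = 0.02, 0.10
MAX_ATTEMPTS = 3


def expected_attempts(fail_rate: float, max_attempts: int) -> float:
    """Mean attempts per call when each failure is retried up to the cap.

    Attempt k + 1 only happens if the first k attempts all failed,
    so the mean is 1 + p + p^2 + ... over max_attempts terms.
    """
    return sum(fail_rate ** k for k in range(max_attempts))


def retry_spend(calls: float, cost: float, fail_rate: float) -> float:
    """Monthly cost of the attempts beyond the first one."""
    return calls * cost * (expected_attempts(fail_rate, MAX_ATTEMPTS) - 1)


cheap_retries = retry_spend(
    REQUESTS_PER_MONTH * (1 - ESCALATION_RATE), CHEAP_CALL, CHEAP_FAIL)
premium_retries = retry_spend(
    REQUESTS_PER_MONTH * ESCALATION_RATE, PREMIUM_CALL, PREMIUM_FAIL)
premium_share = premium_retries / (cheap_retries + premium_retries)

print(f"cheap retries ${cheap_retries:.2f}/mo, "
      f"premium retries ${premium_retries:.2f}/mo, "
      f"premium share {premium_share:.0%}")
```

Under these placeholder rates, the premium tier handles a fifth of the traffic but carries most of the retry spend, which is the pattern the paragraph above describes.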
When routing is worth building
- Your traffic is genuinely mixed. If 90% of requests need frontier reasoning, there’s nothing to route — pick the right single model instead.
- Volume is high enough to matter. A 52% saving on $50/mo doesn’t pay for the routing layer’s engineering time. On $3,000/mo it does, quickly.
- You can measure quality per tier. Routing without evals is how escalation creep starts — you need a signal for “the cheap model handled this fine.”
- The break-even isn’t on your doorstep. Model the crossover before building. If modest drift in escalation rate or pricing wipes the saving, spend the effort on caching instead — it’s cheaper to implement.
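Two of the checks above, whether traffic is genuinely mixed and how close the break-even sits, can be run on a hand-labelled traffic sample before any router exists. The sketch below puts a Wilson score interval around the sampled hard-request share and converts both ends into a saving. The sample counts are hypothetical, and the $0.00014 classifier cost is an assumed unrounded value.

```python
import math

CHEAP_CALL, PREMIUM_CALL = 0.0030, 0.0090   # per-call costs from the article
CLASSIFIER_CALL = 0.00014                   # assumption: ~$4/mo at 30,000 calls


def wilson_interval(hard: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for the share of hard requests."""
    p = hard / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def saving_fraction(escalation_rate: float) -> float:
    """Fraction saved vs. sending every request to the premium model."""
    routed = (CLASSIFIER_CALL
              + (1 - escalation_rate) * CHEAP_CALL
              + escalation_rate * PREMIUM_CALL)
    return 1 - routed / PREMIUM_CALL


# Hypothetical sample: 200 requests labelled by hand, 46 judged "hard".
low, high = wilson_interval(hard=46, n=200)
print(f"escalation rate between {low:.1%} and {high:.1%}")
print(f"saving between {saving_fraction(high):.1%} "
      f"and {saving_fraction(low):.1%}")
```

If the pessimistic end of the interval still leaves a saving that covers the routing layer's upkeep, the build decision is robust to sampling noise; if not, a larger sample is cheaper than a router.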
Model your routing mix — escalation rate, per-tier models, retries — and see the real monthly numbers before you build the router.
Open the router cost model →