The Haiku→Sonnet escalation mirrors a pattern I settled on for nightly batch
processing of government filings: the cheaper model handles initial parsing
and classification, and tasks escalate to Sonnet when they involve
cross-standard normalization or narrative output.
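Roughly, the routing looks like this (a minimal sketch with the Python SDK; the task-type predicate, the specific task labels, and the model aliases are placeholders for whatever your classification pass already produces):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical predicate: in my pipeline this keys off the task label
# produced by the initial classification pass.
def needs_sonnet(task_type: str) -> bool:
    return task_type in {"cross_standard_normalization", "narrative_output"}

def run_task(task_type: str, prompt: str) -> str:
    # Default to the cheap model; escalate only for the hard task classes.
    model = "claude-3-5-sonnet-latest" if needs_sonnet(task_type) else "claude-3-5-haiku-latest"
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```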
For async workloads where latency doesn't matter, it's worth adding
Anthropic's batch API to your cost hierarchy: 50% off real-time pricing,
with results returned within 24 hours. I use it for AI summary generation
on filings where immediate results aren't needed.
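With the Python SDK it's a few lines; a sketch, assuming the filings arrive as (id, text) pairs and with a stand-in summary prompt:

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical queue of (filing_id, full_text) pairs from the nightly job.
filings = [
    ("filing-001", "Full text of the first filing..."),
    ("filing-002", "Full text of the second filing..."),
]

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": filing_id,
            "params": {
                "model": "claude-3-5-sonnet-latest",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": f"Summarize this filing:\n\n{text}"}
                ],
            },
        }
        for filing_id, text in filings
    ]
)

# Poll until processing ends, then fetch per-request results:
#   client.messages.batches.retrieve(batch.id)
#   client.messages.batches.results(batch.id)
print(batch.id)
```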
One failure mode I've hit with escalation logic: when the smaller model is
confidently wrong. It routes a complex task to itself, produces a
plausible-looking bad answer, and there's no signal that a better answer
existed. I've partially addressed this by treating low confidence scores from
the smaller model as an escalation trigger—but it's an imperfect heuristic.
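In code the trigger looks roughly like this (a sketch; the self-reported JSON confidence score is just one way to get a signal, and a poorly calibrated one, which is exactly the imperfection above):

```python
import json
import anthropic

client = anthropic.Anthropic()

CONFIDENCE_FLOOR = 0.7  # placeholder; tune against your own misrouting rate

def answer_with_escalation(prompt: str) -> str:
    # Ask the small model to self-report a confidence score with its answer.
    draft = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": prompt
            + '\n\nRespond as JSON: {"answer": "...", "confidence": <0.0-1.0>}',
        }],
    )
    try:
        parsed = json.loads(draft.content[0].text)
        if float(parsed["confidence"]) >= CONFIDENCE_FLOOR:
            return parsed["answer"]
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        pass  # unparseable output is itself treated as an escalation signal
    # Low or missing confidence: escalate to the larger model.
    final = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return final.content[0].text
```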
Do you surface the escalation decision to users at all, or just tune the
threshold and accept some misrouting as a background error rate?