ML-based LLM request classifier for cost-optimized routing (~2ms inference)

Source: DEV Community
I built a request classifier that decides which LLM tier a prompt needs before it's sent to a provider. The goal is cost optimization: route simple requests to cheap models, keep complex ones on premium.

## Architecture

- **Feature extraction:** token count, estimated complexity, conversation depth, presence of code/math/reasoning markers, language detection
- **Model:** MLP trained on ~50K labeled samples (rule-based scorer as teacher), exported to ONNX for fast inference
- **Inference:** <2ms per classification, runs inline with the request
- **Three output tiers:** economy (e.g. Gemini Flash), standard (e.g. GPT-4o-mini), premium (e.g. GPT-4o / Claude Sonnet)
- **Semantic cache:** Qdrant-based layer that catches near-duplicate prompts (cosine similarity threshold 0.95)

## Training pipeline

The rule-based scorer acts as a teacher model to generate labels, which are then distilled into the MLP. Retraining happens via outcome signals from downstream quality checks.

## Try it

The routing engine is open source: https://github.com/andbe
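To make the feature extraction step concrete, here is a minimal sketch of what such a feature vector might look like. Every function name, regex, and heuristic below is my assumption for illustration, not the author's actual code:

```python
import re

# Hypothetical feature extractor: features and heuristics are assumptions
# based on the post's description (token count, conversation depth,
# code/math/reasoning markers), not the author's implementation.
def extract_features(prompt: str, history: list[str]) -> dict:
    tokens = prompt.split()  # crude whitespace tokenization as a stand-in
    return {
        "token_count": len(tokens),
        "conversation_depth": len(history),
        "has_code": bool(re.search(r"```|\bdef\b|\bclass\b|[{};]", prompt)),
        "has_math": bool(re.search(r"[=+*/^]|\b(solve|integral|equation)\b", prompt)),
        "has_reasoning": bool(
            re.search(r"\b(why|explain|step by step|prove)\b", prompt, re.I)
        ),
    }

features = extract_features(
    "Explain step by step why x^2 = 4 has two solutions", history=[]
)
```

A real extractor would use the model provider's tokenizer and a trained language detector rather than these regex stand-ins.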
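The teacher labeling in the training pipeline could look roughly like this: a hand-written scorer maps features to one of the three tiers, and those labels train the MLP. The thresholds and weights here are invented for illustration:

```python
# Hypothetical rule-based teacher scorer: the rules and thresholds are
# illustrative assumptions, not the author's actual scorer.
def teacher_label(features: dict) -> str:
    score = 0
    score += 2 if features["has_code"] else 0
    score += 2 if features["has_math"] else 0
    score += 1 if features["has_reasoning"] else 0
    score += 1 if features["token_count"] > 500 else 0
    if score >= 4:
        return "premium"    # e.g. GPT-4o / Claude Sonnet
    if score >= 2:
        return "standard"   # e.g. GPT-4o-mini
    return "economy"        # e.g. Gemini Flash
```

Distilling a scorer like this into an MLP keeps inference fast and lets retraining on downstream quality signals smooth over the hard rule boundaries.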
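The semantic cache decision reduces to a cosine-similarity check against stored prompt embeddings; a minimal sketch of that check under the post's 0.95 threshold (the embedding model and Qdrant plumbing are omitted, and the function names are mine):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# A cached response is served only when similarity clears the threshold;
# in the real system Qdrant performs this nearest-neighbor search.
def cache_hit(query_vec: list[float], cached_vec: list[float],
              threshold: float = 0.95) -> bool:
    return cosine_similarity(query_vec, cached_vec) >= threshold
```

A threshold as high as 0.95 keeps the cache conservative: only near-duplicate prompts reuse a stored answer, which matters when a wrong cache hit would return a stale or off-topic response.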