Back to Blog

ModelRouter: Intelligent Multi-LLM Routing to Stop Burning Claude Pro Tokens

How I built a local proxy that classifies prompts and routes them to the right model — Gemma for trivial tasks, Gemini for features, Codex for debugging, Claude only for hard architecture problems.

5 min read

ModelRouter: Intelligent Multi-LLM Routing to Stop Burning Claude Pro Tokens

I noticed a pattern about three months ago. I was hitting my Claude Pro rate limits mid-afternoon — not because I was doing anything particularly complex, but because I was using Claude for everything. Fixing a typo in a docstring. Reformatting a JSON blob. Renaming a variable.

These are not Claude-level problems. But when Claude is the tool in your terminal, Claude is what you use.

The waste bothered me architecturally. So I built ModelRouter — a local proxy that sits between your terminal and Anthropic, classifies each prompt, and routes it to the most cost-effective model that can actually handle it.

The Core Problem: One Hammer, All Nails

Most AI-assisted development workflows treat LLMs as interchangeable. They're not. There's a rough capability-cost hierarchy:

  • Trivial tasks (typo fixes, formatting, boilerplate): a local 27B model handles these fine
  • Feature implementation: a capable mid-tier model with a large context window is ideal
  • Testing and debugging: specialized code models with strong reasoning
  • Complex architecture: frontier models where nuanced judgment actually matters

When you route everything through Claude Pro, you're paying frontier model prices for formatting work. Worse, you're burning rate-limit budget on tasks that have no business touching a frontier model.

The Architecture

ModelRouter is a local proxy daemon that intercepts traffic destined for Anthropic's API and applies a classification layer before forwarding.

┌─────────────────────────────────────────────────┐
│                  claude-mix CLI                  │
└─────────────────────┬───────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────┐
│              ModelRouter Proxy                   │
│   ┌──────────────────────────────────────────┐  │
│   │         Prompt Classifier (YAML rules)   │  │
│   └──────┬───────────┬──────────┬────────────┘  │
│          │           │          │                │
│   trivial│  feature  │  debug   │  architecture  │
│          │           │          │                │
│   ┌──────▼─┐  ┌──────▼─┐  ┌───▼──┐  ┌─────────▼┐│
│   │ Gemma  │  │Gemini  │  │Codex │  │ Claude  ││
│   │ 3 27B  │  │2.5 Pro │  │ CLI  │  │   Pro   ││
│   │(Ollama)│  │(Free)  │  │      │  │         ││
│   └────────┘  └────────┘  └──────┘  └─────────┘│
└─────────────────────────────────────────────────┘

The classifier reads a YAML rules file. Here's a simplified version of what the routing rules look like:

routes:
  - name: trivial
    patterns:
      - "fix typo"
      - "format this"
      - "rename"
      - "add docstring"
      - "correct grammar"
    backend: ollama
    model: gemma3:27b
 
  - name: feature
    patterns:
      - "implement"
      - "create a function"
      - "add feature"
      - "build"
    token_threshold: 2000
    backend: google_ai_studio
    model: gemini-2.5-pro
 
  - name: debug
    patterns:
      - "why is this failing"
      - "debug"
      - "test this"
      - "write unit tests"
    backend: codex_cli
    model: codex
 
  - name: architecture
    patterns:
      - "design a system"
      - "architect"
      - "tradeoffs"
      - "production"
    backend: anthropic
    model: claude-opus-4-5

The classifier scores each prompt against these patterns and falls back to Claude for anything it can't confidently classify below the architecture tier.

Backend Integrations

Each backend uses a different integration pattern:

Ollama (Gemma 3 27B): Standard local HTTP API at localhost:11434. Zero external calls, zero cost, sub-second response for trivial prompts on a machine with reasonable RAM.

Google AI Studio (Gemini 2.5 Pro): Direct REST API calls. The free tier is generous enough for development workloads. No billing configured — it just works within quota.

Codex CLI: Subprocess pipe. ModelRouter shells out to the Codex CLI, passes the prompt, captures output. Slightly inelegant but it means you get Codex's native behavior without reimplementing its interface.

Anthropic (Claude Pro): Transparent HTTPS proxy — the request is forwarded as-is with your existing API key. No behavior change.

Context Continuity Across Model Switches

The trickiest part was handling conversation history. If you ask a trivial question (routed to Gemma), then ask a follow-up that needs context (routed to Claude), the history from the Gemma exchange needs to be available.

ModelRouter maintains a session transcript in a neutral format and adapts it for each backend's message schema on each call. Gemma and Claude use different conversation formats, so there's a translation layer that normalizes role names, handles system prompt placement, and trims history to fit each model's context window.

This is imperfect — a Gemma response included in Claude's context carries some quality noise — but in practice the architecture questions I send to Claude don't depend heavily on the trivial-task history anyway.

Running It

pip install modelrouter
modelrouter start  # starts the proxy daemon
claude-mix "fix the typo in this docstring"  # routes to Gemma
claude-mix "design a retry mechanism for this service"  # routes to Claude

The claude-mix command replaces direct claude invocations. The proxy runs on localhost and intercepts based on routing rules.

Honest Trade-offs

Latency from classification: Adding a classification step adds ~50-200ms per request. For interactive development this is imperceptible. For automated pipelines it may matter.

Pattern matching is blunt: YAML keyword rules are not semantic classifiers. A prompt like "architect a fix for this typo" might hit the wrong tier. The classifier needs tuning for your specific workflow. I'm planning a lightweight ML classifier as an option.

Context quality degrades at boundaries: When a follow-up question routes to a different model than the original question, coherence can suffer. This is a known limitation.

Local model quality ceiling: Gemma 3 27B is genuinely good at trivial tasks but occasionally produces code that needs correction. That's acceptable for formatting work but would be a problem for anything in the feature tier.

Contributing

ModelRouter is open source: github.com/girishsahu008/ModelRouter

The most useful contributions right now are additional routing backends, better classifier logic, and language-specific routing rules for non-Python stacks. The backend adapter interface is simple — adding a new model provider is about 50 lines of code.

If you're building multi-model workflows and want to talk through the routing architecture — or if you're thinking about applying this pattern inside a product rather than a developer tool — get in touch.

Discussion

Loading…