Multi-LLM Orchestration System
@ Anaqua — Senior Backend Engineer & AI Backend Lead
Intelligent routing between OpenAI, Anthropic, and Google Gemini — optimizing for cost, latency, and quality
$ cat PROBLEM.md
Single-Provider LLM Dependency Was a Business Risk
Our AI-powered platform relied entirely on OpenAI. This created cost unpredictability, availability risk during outages, and inability to leverage newer models from Anthropic or Google. We needed a multi-provider architecture without sacrificing reliability.
Key Challenges:
- OpenAI costs spiked unpredictably — one busy month was 3x the previous
- API outages meant complete AI feature downtime for our users
- Claude and Gemini offered better performance for some tasks, but we couldn't use them
- No visibility into which requests were expensive vs. cheap
$ cat SOLUTION.md
Intelligent Router with Automatic Failover
We built a routing layer that classifies requests by complexity and routes to the optimal provider. Circuit breakers handle failures automatically, and prompt caching reduces redundant API calls.
Technical Approach:
Task Complexity Classification
Lightweight classifier categorizes requests: simple extraction vs. nuanced generation vs. complex reasoning. Each category maps to optimal model tiers.
Provider Health Monitoring
Continuous health checks and latency tracking for each provider. Automatic traffic shifting when issues detected.
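In sketch form, per-provider health tracking can be as simple as a circuit breaker that records failures and latency. The class below is a minimal illustration with made-up thresholds, not the production implementation:

```python
import time

# Illustrative circuit breaker for one provider. Names and thresholds are
# assumptions for this sketch, not the production values.
class ProviderCircuitBreaker:
    def __init__(self, name: str, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.name = name
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_failures = 0
        self.opened_at: float | None = None
        self.latency_ema_ms = 0.0  # exponentially weighted latency, used for traffic shifting

    def is_available(self) -> bool:
        # Open circuit: skip this provider until the cooldown has elapsed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return False
            self.opened_at = None  # half-open: allow a trial request
        return True

    def record_success(self, latency_ms: float) -> None:
        self.consecutive_failures = 0
        self.latency_ema_ms = 0.8 * self.latency_ema_ms + 0.2 * latency_ms

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip the breaker
```

The router checks `is_available()` before sending traffic to a provider and skips any provider whose breaker is open.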
Semantic Prompt Caching
Redis-based caching using embedding similarity. Similar prompts return cached responses, reducing API calls by 30%.
Cost Attribution
Token-level tracking per feature and user. Dashboards show exactly where costs originate.
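As an illustration of what token-level attribution involves, here is a minimal per-request usage record. The field names and per-1K prices are placeholders for this sketch, since real pricing varies by model and date:

```python
from dataclasses import dataclass, asdict

# Illustrative prices in USD per 1K tokens (input, output); real prices vary by model and date.
PRICE_PER_1K = {"gpt-4": (0.03, 0.06), "claude": (0.008, 0.024), "gemini": (0.0005, 0.0015)}

@dataclass
class UsageRecord:
    feature: str        # e.g. "document_analysis"
    user_id: str
    model: str
    prompt_tokens: int
    completion_tokens: int

    @property
    def cost_usd(self) -> float:
        in_price, out_price = PRICE_PER_1K[self.model]
        return (self.prompt_tokens / 1000) * in_price + (self.completion_tokens / 1000) * out_price

def emit_usage(record: UsageRecord) -> None:
    # In production this would feed a metrics pipeline; printing keeps the sketch self-contained.
    print({**asdict(record), "cost_usd": round(record.cost_usd, 6)})

emit_usage(UsageRecord("document_analysis", "user-42", "claude", prompt_tokens=12000, completion_tokens=800))
```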
$ cat tech-stack.json
🚀 Core Technologies
OpenAI GPT-4
Complex reasoning and nuanced generation
Why: Best quality for challenging tasks requiring world knowledge
Anthropic Claude
Long-context tasks and analysis
Why: 100K context window, excellent for document analysis
Google Gemini
Fast, cost-effective general tasks
Why: Good balance of speed and capability for common requests
🔧 Supporting Technologies
Python, FastAPI, LangChain, and Redis (semantic cache)
$ man implementation-details
The Routing Algorithm
Our router makes decisions in three stages:
1. Task Classification (< 20ms)
A lightweight classifier labels each request as EXTRACTION, GENERATION, or REASONING.
2. Provider Selection
- EXTRACTION tasks → Gemini (fast, cheap)
- GENERATION tasks → Claude or GPT-4 (quality-focused)
- REASONING tasks → GPT-4 (best complex reasoning)
3. Fallback Chain
If the primary provider fails or responds too slowly, the request automatically falls through to the next provider in the chain.
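Putting the three stages together, the core routing decision fits in a few lines. The sketch below assumes a `classify_task()` helper (like the classifier described later) and an `is_healthy()` check backed by the provider health monitor; the chain ordering is illustrative rather than the exact production table:

```python
# Primary provider followed by its fallbacks, per task category (ordering is illustrative).
ROUTING_TABLE: dict[str, list[str]] = {
    "EXTRACTION": ["gemini", "claude", "gpt-4"],
    "GENERATION": ["claude", "gpt-4", "gemini"],
    "REASONING":  ["gpt-4", "claude", "gemini"],
}

def build_provider_chain(prompt: str) -> list[str]:
    category = classify_task(prompt)              # stage 1: < 20ms classification (assumed helper)
    chain = ROUTING_TABLE[category]               # stage 2: provider selection by category
    return [p for p in chain if is_healthy(p)]    # stage 3: keep only healthy providers (assumed helper)
```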
Semantic Caching Implementation
How it works:
- Hash prompt + relevant context into cache key
- Also embed prompt for semantic similarity search
- Check exact match first (fastest)
- Check semantic similarity (cosine > 0.92)
- Return cached response if found
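A condensed sketch of that lookup path. It assumes a Redis client for exact matches and a simple in-memory list of prompt embeddings for the similarity check (the document's Redis-based cache presumably stores embeddings in Redis as well; the in-memory list just keeps the sketch short, and helper names are illustrative):

```python
import hashlib

import numpy as np
import redis

r = redis.Redis()
SIMILARITY_THRESHOLD = 0.92
CACHE_TTL_S = 24 * 3600  # 24-hour default TTL

def exact_key(prompt: str, context: str) -> str:
    return "llm-cache:" + hashlib.sha256((prompt + "\x00" + context).encode()).hexdigest()

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cache_lookup(prompt: str, context: str, embedding: np.ndarray,
                 index: list[tuple[np.ndarray, str]]) -> str | None:
    # 1. Exact match first (fastest path).
    hit = r.get(exact_key(prompt, context))
    if hit is not None:
        return hit.decode()
    # 2. Semantic match: best cached prompt above the cosine threshold.
    if index:
        sims = [(cosine(embedding, emb), response) for emb, response in index]
        best_sim, best_response = max(sims, key=lambda s: s[0])
        if best_sim >= SIMILARITY_THRESHOLD:
            return best_response
    return None

def cache_store(prompt: str, context: str, embedding: np.ndarray,
                response: str, index: list[tuple[np.ndarray, str]]) -> None:
    r.set(exact_key(prompt, context), response, ex=CACHE_TTL_S)  # time-based TTL
    index.append((embedding, response))
```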
Cache invalidation:
- Time-based TTL (24 hours default)
- Context-aware (user settings change → invalidate)
- Quality feedback (negative feedback → remove from cache)
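The two event-driven paths boil down to deleting keys. A short sketch continuing the example above (the per-user key prefix is an assumption of this sketch, not the documented key scheme):

```python
import redis

r = redis.Redis()  # same client as in the lookup sketch

def invalidate_user_cache(user_id: str) -> None:
    # Context-aware invalidation: when a user's settings change, drop every cached
    # entry scoped to them (assumes cache keys carry a per-user prefix).
    for key in r.scan_iter(f"llm-cache:{user_id}:*"):
        r.delete(key)

def record_feedback(cache_key: str, positive: bool) -> None:
    # Quality feedback: negative feedback evicts the cached response immediately.
    if not positive:
        r.delete(cache_key)
```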
Results:
- 30% cache hit rate overall
- 50%+ for structured extraction tasks
- 10-15% for creative generation
$ echo $RESULTS
40% Cost Reduction with Better Reliability
Additional Outcomes:
- Zero downtime during multiple OpenAI outages in 2024
- Product teams gained visibility into AI costs per feature
- Enabled experimentation with new models without infrastructure changes
$ cat LESSONS_LEARNED.md
Classification Doesn't Need to Be Perfect
An 80% accurate classifier with fast inference beats a 95% accurate one with 500ms latency. Route aggressively, refine iteratively.
Cache Hits Compound Savings
Every cached response saves tokens AND latency. The ROI on semantic caching infrastructure exceeded our projections.
Observability Drives Optimization
Once we could see cost per feature, product owners started optimizing prompts. Visibility changed behavior.
$ cat README.md
The Problem with Single-Provider AI
When our AI platform relied 100% on OpenAI, we experienced every problem you’d expect:
Cost Volatility: One month’s bill was $15K. The next was $45K. Same features, just more usage. CFO was not happy.
Availability Risk: During OpenAI’s December 2023 outages, our AI features went completely dark. Users were frustrated.
Missed Opportunities: Claude’s 100K context window would have been perfect for our document analysis feature. But we couldn’t use it.
Designing the Multi-Provider System
The Router Architecture
The key insight: not all LLM requests are equal. Some need GPT-4’s reasoning. Some just need fast, cheap extraction. The router’s job is matching requests to optimal providers.
Task Classification
We trained a lightweight classifier on labeled examples:
EXTRACTION (simple, routine)
- Parsing structured data from documents
- Entity extraction
- Format conversion
GENERATION (medium complexity)
- Email drafts
- Summaries
- Documentation
REASONING (complex)
- Legal analysis
- Multi-step problem solving
- Nuanced interpretation
Classification happens in <20ms using a distilled model. The accuracy is ~85% — good enough for cost savings, with quality maintained through fallbacks.
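The exact model is an internal detail (a distilled neural classifier in production). Purely as an illustration of the interface, here is a toy stand-in trained on a few invented example prompts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled examples purely for illustration; the real training set is far larger.
PROMPTS = [
    "Extract the filing date and applicant name from this document",
    "Convert this table to JSON",
    "Draft a follow-up email to the client about the renewal",
    "Summarize the attached meeting notes",
    "Analyze whether these two claims conflict and explain the reasoning",
    "Walk through the legal implications of this clause step by step",
]
LABELS = ["EXTRACTION", "EXTRACTION", "GENERATION", "GENERATION", "REASONING", "REASONING"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(PROMPTS, LABELS)

def classify_task(prompt: str) -> str:
    # Inference on a model this small is sub-millisecond, well inside the 20ms budget.
    return classifier.predict([prompt])[0]

print(classify_task("Pull the inventor names out of this patent filing"))
```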
Provider Characteristics
Each provider has strengths:
| Provider | Best For | Latency | Cost |
|---|---|---|---|
| GPT-4 | Complex reasoning | Slow | High |
| Claude | Long documents | Medium | Medium |
| Gemini | Fast extraction | Fast | Low |
The router leverages these differences. Quick extraction goes to Gemini. Document analysis goes to Claude. Only genuinely complex reasoning hits GPT-4.
Fallback Chains
Every request has a fallback plan.
If the primary fails (timeout, error, rate limit), we automatically try the next option. Users rarely notice — they just get a response.
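In sketch form, that retry-through-the-chain behavior is a short loop. It assumes each provider is wrapped in an object exposing `complete(prompt, timeout)` that raises on timeouts, API errors, and rate limits; the wrapper interface is an assumption of this sketch, not the actual client code:

```python
class AllProvidersFailed(Exception):
    pass

def complete_with_fallback(prompt: str, providers: list, timeout_s: float = 10.0) -> str:
    # Try each provider in priority order; any timeout, API error, or rate limit
    # moves us on to the next entry in the chain.
    errors = []
    for provider in providers:
        try:
            return provider.complete(prompt, timeout=timeout_s)
        except Exception as exc:  # deliberately broad here; the real system catches a narrower set
            errors.append((getattr(provider, "name", repr(provider)), exc))
    raise AllProvidersFailed(f"every provider in the chain failed: {errors}")
```

The chain passed in here is the output of the provider-selection step, with unhealthy providers already filtered out.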
Semantic Caching
Beyond provider routing, we cache aggressively:
Exact Match Cache: Same prompt → same response. Simple but effective for repetitive operations.
Semantic Cache: Similar prompts → similar responses. Using embedding similarity (cosine > 0.92), we can often reuse responses for paraphrased requests.
The cache hit rate surprised us: 30% overall, 50%+ for extraction tasks. That’s 30% of API calls we never make.
Results and Learnings
Quantitative Impact
- 40% cost reduction vs. OpenAI-only (routing + caching combined)
- 99.9% availability including through multiple provider outages
- <100ms overhead for classification and routing
- 30% cache hit rate across all requests
Qualitative Impact
Zero downtime during outages: When OpenAI had issues in Q1 2024, our system automatically shifted traffic to Claude and Gemini. Users didn’t notice.
Product team empowerment: Cost attribution dashboards let product owners see which features were expensive. Several teams optimized their prompts without engineering involvement.
Experimentation velocity: Testing a new model became a config change instead of a rewrite. We evaluated GPT-4 Turbo within hours of release.
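For illustration, the kind of config-driven registry that makes this possible might look like the snippet below; the structure and entries are invented for this sketch, not the actual schema:

```python
# Illustrative registry entries; real model identifiers and limits live in config.
PROVIDERS = {
    "gpt-4":       {"vendor": "openai",    "use_for": "complex reasoning"},
    "claude":      {"vendor": "anthropic", "use_for": "long-context analysis"},
    "gemini":      {"vendor": "google",    "use_for": "fast, cheap extraction"},
    # Evaluating a new model is one added entry, e.g.:
    "gpt-4-turbo": {"vendor": "openai",    "use_for": "candidate for the reasoning tier"},
}
```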
Key Takeaways
Start with classification, not perfection: An 80% accurate classifier that ships beats a 95% accurate one still in development.
Caching ROI exceeds expectations: We estimated 15% cache hits and achieved 30%. The infrastructure paid for itself in weeks.
Visibility changes behavior: Once teams saw their costs, they started caring about prompt efficiency.
Fallbacks are insurance: Build them before you need them. During an outage, there is no time to implement them.
Building a multi-provider LLM system? Let’s discuss architecture.
Related
Experience: Senior Backend Engineer & AI Lead at Anaqua
Technologies: LangChain, FastAPI, Python, OpenAI, Anthropic Claude, Redis
Related Case Studies: Enterprise RAG for Legal Documents | LLM Email Assistant
Need Multi-LLM Architecture?
Let's discuss how I can help solve your engineering challenges.