multi-llm-orchestration-system@anaqua:~/case-study
Enterprise AI · 6 months · 2024

Multi-LLM Orchestration System

@ Anaqua — Senior Backend Engineer & AI Backend Lead

Intelligent routing between OpenAI, Anthropic, and Google Gemini — optimizing for cost, latency, and quality

40% Cost Reduction
99.9% Availability
3+ LLM Providers

$ cat PROBLEM.md

Single-Provider LLM Dependency Was a Business Risk

Our AI-powered platform relied entirely on OpenAI. This created cost unpredictability, availability risk during outages, and inability to leverage newer models from Anthropic or Google. We needed a multi-provider architecture without sacrificing reliability.

Key Challenges:

  • 🔴 OpenAI costs spiked unpredictably — one busy month was 3x the previous
  • 🔴 API outages meant complete AI feature downtime for our users
  • 🔴 Claude and Gemini offered better performance for some tasks, but we couldn't use them
  • 🔴 No visibility into which requests were expensive vs. cheap

$ cat SOLUTION.md

Intelligent Router with Automatic Failover

We built a routing layer that classifies requests by complexity and routes to the optimal provider. Circuit breakers handle failures automatically, and prompt caching reduces redundant API calls.

Technical Approach:

1. Task Complexity Classification

A lightweight classifier categorizes each request as simple extraction, nuanced generation, or complex reasoning. Each category maps to an optimal model tier.

2. Provider Health Monitoring

Continuous health checks and latency tracking for each provider. Traffic shifts automatically when issues are detected.
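As a rough illustration of the circuit-breaker idea behind this, a per-provider health tracker might look like the sketch below; the class name and thresholds are hypothetical, not the production code:

import time
from dataclasses import dataclass

@dataclass
class ProviderHealth:
    # Illustrative thresholds -- real values are tuned per provider
    failure_threshold: int = 5        # consecutive failures before the circuit opens
    cooldown_seconds: float = 30.0    # how long traffic is shifted away
    failures: int = 0
    opened_at: float | None = None

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()   # open the circuit: stop routing here

    def is_available(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let a probe request through (half-open state)
        return time.monotonic() - self.opened_at >= self.cooldown_seconds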

3. Semantic Prompt Caching

Redis-based caching using embedding similarity. Similar prompts return cached responses, reducing API calls by 30%.

4. Cost Attribution

Token-level tracking per feature and user. Dashboards show exactly where costs originate.
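To give a feel for the token-level accounting, here is a minimal sketch; the helper name and the per-1K-token prices are placeholders, not real provider pricing:

# Placeholder rates for illustration only -- not actual provider pricing
PRICE_PER_1K_TOKENS = {
    "gpt-4":  {"input": 0.03,  "output": 0.06},
    "claude": {"input": 0.008, "output": 0.024},
    "gemini": {"input": 0.001, "output": 0.002},
}

def record_usage(cost_counter, *, feature: str, user_id: str, provider: str,
                 input_tokens: int, output_tokens: int) -> float:
    """Compute the request cost and emit it labeled by feature, user, and provider."""
    rates = PRICE_PER_1K_TOKENS[provider]
    cost = (input_tokens / 1000) * rates["input"] + (output_tokens / 1000) * rates["output"]
    # cost_counter is e.g. a Prometheus Counter; its labels feed the per-feature dashboards
    cost_counter.labels(feature=feature, user=user_id, provider=provider).inc(cost)
    return cost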

$ cat tech-stack.json

🚀 Core Technologies

OpenAI GPT-4

Complex reasoning and nuanced generation

Why: Best quality for challenging tasks requiring world knowledge

Anthropic Claude

Long-context tasks and analysis

Why: 100K context window, excellent for document analysis

Google Gemini

Fast, cost-effective general tasks

Why: Good balance of speed and capability for common requests

🔧 Supporting Technologies

Python / FastAPI · Redis · LangChain · LangSmith

☁️ Infrastructure

Kubernetes · Prometheus/Grafana

$ man implementation-details

The Routing Algorithm

Our router makes decisions in three stages:

1. Task Classification (< 20ms)

class TaskClassifier:
    def classify(self, prompt: str, context: dict) -> TaskType:
        features = self.extract_features(prompt, context)
        # Lightweight model trained on labeled examples
        return self.classifier.predict(features)

2. Provider Selection

  • EXTRACTION tasks → Gemini (fast, cheap)
  • GENERATION tasks → Claude or GPT-4 (quality-focused)
  • REASONING tasks → GPT-4 (best complex reasoning)

3. Fallback Chain

If the primary provider fails or is slow:

GPT-4 → Claude → Gemini → Cached Response → Graceful Error
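Putting the three stages together, the routing entry point can be sketched roughly as below; the injected classifier, health map, fallback chains, and execute callable are stand-ins for the real components:

def route(prompt: str, context: dict, *, classifier, health, fallback_chains, execute) -> str:
    """Three stages: classify the task, pick the preferred healthy provider, execute."""
    task_type = classifier.classify(prompt, context)      # stage 1: classification (< 20ms)
    for provider in fallback_chains[task_type]:           # stage 2: ordered provider preference
        if health[provider].is_available():               # stage 3: skip unhealthy providers
            return execute(provider, prompt, context)
    # Every provider unhealthy: fall back to the cache or return a graceful error
    raise RuntimeError("no healthy provider available")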

Semantic Caching Implementation

How it works (see the sketch after these steps):

  1. Hash prompt + relevant context into cache key
  2. Also embed prompt for semantic similarity search
  3. Check exact match first (fastest)
  4. Check semantic similarity (cosine > 0.92)
  5. Return cached response if found
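A rough sketch of that two-tier lookup, assuming an embed() helper and an iter_cached_embeddings() index scan (both hypothetical names), plus a standard Redis client:

import hashlib
import numpy as np

def cache_lookup(redis_client, prompt: str, context_key: str):
    # Steps 1-3: exact-match check on a hash of prompt + relevant context (fastest path)
    exact_key = f"llm:{context_key}:{hashlib.sha256(prompt.encode()).hexdigest()}"
    cached = redis_client.get(exact_key)
    if cached is not None:
        return cached

    # Step 4: semantic check, comparing the prompt embedding against cached embeddings
    query = np.asarray(embed(prompt), dtype=float)
    for _key, vec, response in iter_cached_embeddings(context_key):
        vec = np.asarray(vec, dtype=float)
        cosine = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        if cosine > 0.92:
            return response          # step 5: close enough, reuse the cached response

    return None                       # miss: call the provider, then cache the result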

Cache invalidation (see the sketch after this list):

  • Time-based TTL (24 hours default)
  • Context-aware (user settings change → invalidate)
  • Quality feedback (negative feedback → remove from cache)
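And the corresponding invalidation hooks, again with an assumed key-naming scheme and a plain Redis client:

DEFAULT_TTL_SECONDS = 24 * 60 * 60          # time-based expiry: 24 hours by default

def cache_store(redis_client, key: str, response: str) -> None:
    redis_client.set(key, response, ex=DEFAULT_TTL_SECONDS)

def on_user_settings_changed(redis_client, user_id: str) -> None:
    # Context-aware invalidation: drop every entry scoped to this user's context
    for key in redis_client.scan_iter(match=f"llm:{user_id}:*"):
        redis_client.delete(key)

def on_negative_feedback(redis_client, key: str) -> None:
    # Quality feedback: negative feedback removes the cached response immediately
    redis_client.delete(key)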

Results:

  • 30% cache hit rate overall
  • 50%+ for structured extraction tasks
  • 10-15% for creative generation

$ echo $RESULTS

40% Cost Reduction with Better Reliability

  • 40% cost reduction compared to OpenAI-only
  • 30% cache hit rate from semantic prompt caching
  • 99.9% availability with automatic failover
  • <100ms routing overhead for classification + routing

Additional Outcomes:

  • Zero downtime during multiple OpenAI outages in 2024
  • Product teams gained visibility into AI costs per feature
  • Enabled experimentation with new models without infrastructure changes

$ cat LESSONS_LEARNED.md

Classification Doesn't Need to Be Perfect

An 80% accurate classifier with fast inference beats a 95% accurate one with 500ms latency. Route aggressively, refine iteratively.

Cache Hits Compound Savings

Every cached response saves tokens AND latency. The ROI on semantic caching infrastructure exceeded our projections.

Observability Drives Optimization

Once we could see cost per feature, product owners started optimizing prompts. Visibility changed behavior.

$ cat README.md

The Problem with Single-Provider AI

When our AI platform relied 100% on OpenAI, we experienced every problem you’d expect:

Cost Volatility: One month’s bill was $15K. The next was $45K. Same features, just more usage. CFO was not happy.

Availability Risk: During OpenAI’s December 2023 outages, our AI features went completely dark. Users were frustrated.

Missed Opportunities: Claude’s 100K context window would have been perfect for our document analysis feature. But we couldn’t use it.

Designing the Multi-Provider System

The Router Architecture

The key insight: not all LLM requests are equal. Some need GPT-4’s reasoning. Some just need fast, cheap extraction. The router’s job is matching requests to optimal providers.

Request → Classification → Provider Selection → Execution → Response
              ┌────────────────┼────────────────┐
              ↓                ↓                ↓
           Simple          Medium           Complex
              ↓                ↓                ↓
           Gemini          Claude            GPT-4

Task Classification

We trained a lightweight classifier on labeled examples:

EXTRACTION (simple, routine)

  • Parsing structured data from documents
  • Entity extraction
  • Format conversion

GENERATION (medium complexity)

  • Email drafts
  • Summaries
  • Documentation

REASONING (complex)

  • Legal analysis
  • Multi-step problem solving
  • Nuanced interpretation

Classification happens in <20ms using a distilled model. The accuracy is ~85% — good enough for cost savings, with quality maintained through fallbacks.

Provider Characteristics

Each provider has strengths:

Provider    Best For             Latency    Cost
GPT-4       Complex reasoning    Slow       High
Claude      Long documents       Medium     Medium
Gemini      Fast extraction      Fast       Low

The router leverages these differences. Quick extraction goes to Gemini. Document analysis goes to Claude. Only genuinely complex reasoning hits GPT-4.
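One way to make those trade-offs explicit to the router is a small profile table it can consult; the tiers below just restate the table above, and the structure itself is illustrative rather than the production schema:

from dataclasses import dataclass

@dataclass(frozen=True)
class ProviderProfile:
    name: str
    best_for: str
    latency_tier: str    # "fast" | "medium" | "slow"
    cost_tier: str       # "low" | "medium" | "high"

PROVIDER_PROFILES = {
    "gpt-4":  ProviderProfile("gpt-4", "complex reasoning", "slow", "high"),
    "claude": ProviderProfile("claude", "long documents", "medium", "medium"),
    "gemini": ProviderProfile("gemini", "fast extraction", "fast", "low"),
}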

Fallback Chains

Every request has a fallback plan:

FALLBACK_CHAINS = {
    TaskType.EXTRACTION: ["gemini", "claude", "gpt-4"],
    TaskType.GENERATION: ["claude", "gpt-4", "gemini"],
    TaskType.REASONING: ["gpt-4", "claude", "gemini"],
}

If the primary fails (timeout, error, rate limit), we automatically try the next option. Users rarely notice — they just get a response.
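A stripped-down version of that retry behavior is sketched below; call_provider, ProviderError, and lookup_cached_response are placeholders for the real client wrappers and the cache layer, not the actual implementation:

class ProviderError(Exception):
    """Stand-in for timeouts, rate-limit (429), and provider-side (5xx) errors."""

def execute_with_fallback(task_type, prompt: str, context: dict) -> str:
    last_error: Exception | None = None
    for provider in FALLBACK_CHAINS[task_type]:
        try:
            return call_provider(provider, prompt, context, timeout=10)
        except ProviderError as exc:          # failed or too slow: try the next provider
            last_error = exc
    # Chain exhausted: a cached response beats a hard failure
    cached = lookup_cached_response(prompt, context)
    if cached is not None:
        return cached
    raise RuntimeError("all providers unavailable") from last_error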

Semantic Caching

Beyond provider routing, we cache aggressively:

Exact Match Cache: Same prompt → same response. Simple but effective for repetitive operations.

Semantic Cache: Similar prompts → similar responses. Using embedding similarity (cosine > 0.92), we can often reuse responses for paraphrased requests.

The cache hit rate surprised us: 30% overall, 50%+ for extraction tasks. That’s 30% of API calls we never make.

Results and Learnings

Quantitative Impact

  • 40% cost reduction vs. OpenAI-only (routing + caching combined)
  • 99.9% availability including through multiple provider outages
  • <100ms overhead for classification and routing
  • 30% cache hit rate across all requests

Qualitative Impact

Zero downtime during outages: When OpenAI had issues in Q1 2024, our system automatically shifted traffic to Claude and Gemini. Users didn’t notice.

Product team empowerment: Cost attribution dashboards let product owners see which features were expensive. Several teams optimized their prompts without engineering involvement.

Experimentation velocity: Testing a new model became a config change instead of a rewrite. We evaluated GPT-4 Turbo within hours of release.

Key Takeaways

  1. Start with classification, not perfection: An 80% accurate classifier that ships beats a 95% one in development

  2. Caching ROI exceeds expectations: We estimated 15% cache hits. Achieved 30%. The infrastructure paid for itself in weeks.

  3. Visibility changes behavior: Once teams saw their costs, they started caring about prompt efficiency

  4. Fallbacks are insurance: Build them before you need them. During an outage there's no time to implement them.


Building a multi-provider LLM system? Let’s discuss architecture.


Experience: Senior Backend Engineer & AI Lead at Anaqua

Technologies: LangChain, FastAPI, Python, OpenAI, Anthropic Claude, Redis

Related Case Studies: Enterprise RAG for Legal Documents | LLM Email Assistant

Need Multi-LLM Architecture?

Let's discuss how I can help solve your engineering challenges.