multi-llm-orchestration-system@anaqua:~/case-study
Enterprise AI · 6 months · 2024

Multi-LLM Orchestration System

@ Anaqua — Senior Backend Engineer & AI Backend Lead

Intelligent routing between OpenAI, Anthropic, and Google Gemini — optimizing for cost, latency, and quality

40% Cost Reduction
99.9% Availability
3+ LLM Providers

$ cat PROBLEM.md

Single-Provider LLM Dependency Was a Business Risk

Our AI-powered platform relied entirely on OpenAI. This created cost unpredictability, availability risk during outages, and inability to leverage newer models from Anthropic or Google. We needed a multi-provider architecture without sacrificing reliability.

Key Challenges:

  • 🔴 OpenAI costs spiked unpredictably — one busy month was 3x the previous
  • 🔴 API outages meant complete AI feature downtime for our users
  • 🔴 Claude and Gemini offered better performance for some tasks, but we couldn't use them
  • 🔴 No visibility into which requests were expensive vs. cheap

$ cat SOLUTION.md

Intelligent Router with Automatic Failover

We built a routing layer that classifies requests by complexity and routes to the optimal provider. Circuit breakers handle failures automatically, and prompt caching reduces redundant API calls.

Technical Approach:

1. Task Complexity Classification

A lightweight classifier categorizes each request as simple extraction, nuanced generation, or complex reasoning. Each category maps to an optimal model tier.

2. Provider Health Monitoring

Continuous health checks and latency tracking for each provider. Traffic shifts automatically when issues are detected.
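As a rough illustration of the circuit-breaker idea behind this, a per-provider health tracker might look like the sketch below; the class name and thresholds are hypothetical, not the production code:

import time
from dataclasses import dataclass

@dataclass
class ProviderHealth:
    # Illustrative thresholds -- real values are tuned per provider
    failure_threshold: int = 5        # consecutive failures before the circuit opens
    cooldown_seconds: float = 30.0    # how long traffic is shifted away
    failures: int = 0
    opened_at: float | None = None

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()   # open the circuit: stop routing here

    def is_available(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let a probe request through (half-open state)
        return time.monotonic() - self.opened_at >= self.cooldown_seconds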

3. Semantic Prompt Caching

Redis-based caching using embedding similarity. Similar prompts return cached responses, reducing API calls by 30%.

4. Cost Attribution

Token-level tracking per feature and user. Dashboards show exactly where costs originate.
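To give a feel for the token-level accounting, here is a minimal sketch; the helper name and the per-1K-token prices are placeholders, not real provider pricing:

# Placeholder rates for illustration only -- not actual provider pricing
PRICE_PER_1K_TOKENS = {
    "gpt-4":  {"input": 0.03,  "output": 0.06},
    "claude": {"input": 0.008, "output": 0.024},
    "gemini": {"input": 0.001, "output": 0.002},
}

def record_usage(cost_counter, *, feature: str, user_id: str, provider: str,
                 input_tokens: int, output_tokens: int) -> float:
    """Compute the request cost and emit it labeled by feature, user, and provider."""
    rates = PRICE_PER_1K_TOKENS[provider]
    cost = (input_tokens / 1000) * rates["input"] + (output_tokens / 1000) * rates["output"]
    # cost_counter is e.g. a Prometheus Counter; its labels feed the per-feature dashboards
    cost_counter.labels(feature=feature, user=user_id, provider=provider).inc(cost)
    return cost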

$ cat tech-stack.json

🚀 Core Technologies

OpenAI GPT-4

Complex reasoning and nuanced generation

Why: Best quality for challenging tasks requiring world knowledge

Anthropic Claude

Long-context tasks and analysis

Why: 100K context window, excellent for document analysis

Google Gemini

Fast, cost-effective general tasks

Why: Good balance of speed and capability for common requests

🔧 Supporting Technologies

Python / FastAPI · Redis · LangChain · LangSmith

☁️ Infrastructure

Kubernetes · Prometheus/Grafana

$ man implementation-details

The Routing Algorithm

Our router makes decisions in three stages:

1. Task Classification (< 20ms)

class TaskClassifier:
    def classify(self, prompt: str, context: dict) -> TaskType:
        features = self.extract_features(prompt, context)
        # Lightweight model trained on labeled examples
        return self.classifier.predict(features)

2. Provider Selection

  • EXTRACTION tasks → Gemini (fast, cheap)
  • GENERATION tasks → Claude or GPT-4 (quality-focused)
  • REASONING tasks → GPT-4 (best complex reasoning)

3. Fallback Chain

If the primary provider fails or is slow:

GPT-4 → Claude → Gemini → Cached Response → Graceful Error
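Putting the three stages together, the routing entry point can be sketched roughly as below; the injected classifier, health map, fallback chains, and execute callable are stand-ins for the real components:

def route(prompt: str, context: dict, *, classifier, health, fallback_chains, execute) -> str:
    """Three stages: classify the task, pick the preferred healthy provider, execute."""
    task_type = classifier.classify(prompt, context)      # stage 1: classification (< 20ms)
    for provider in fallback_chains[task_type]:           # stage 2: ordered provider preference
        if health[provider].is_available():               # stage 3: skip unhealthy providers
            return execute(provider, prompt, context)
    # Every provider unhealthy: fall back to the cache or return a graceful error
    raise RuntimeError("no healthy provider available")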

Semantic Caching Implementation

How it works (see the sketch after these steps):

  1. Hash prompt + relevant context into cache key
  2. Also embed prompt for semantic similarity search
  3. Check exact match first (fastest)
  4. Check semantic similarity (cosine > 0.92)
  5. Return cached response if found
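A rough sketch of that two-tier lookup, assuming an embed() helper and an iter_cached_embeddings() index scan (both hypothetical names), plus a standard Redis client:

import hashlib
import numpy as np

def cache_lookup(redis_client, prompt: str, context_key: str):
    # Steps 1-3: exact-match check on a hash of prompt + relevant context (fastest path)
    exact_key = f"llm:{context_key}:{hashlib.sha256(prompt.encode()).hexdigest()}"
    cached = redis_client.get(exact_key)
    if cached is not None:
        return cached

    # Step 4: semantic check, comparing the prompt embedding against cached embeddings
    query = np.asarray(embed(prompt), dtype=float)
    for _key, vec, response in iter_cached_embeddings(context_key):
        vec = np.asarray(vec, dtype=float)
        cosine = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        if cosine > 0.92:
            return response          # step 5: close enough, reuse the cached response

    return None                       # miss: call the provider, then cache the result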

Cache invalidation (see the sketch after this list):

  • Time-based TTL (24 hours default)
  • Context-aware (user settings change → invalidate)
  • Quality feedback (negative feedback → remove from cache)
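And the corresponding invalidation hooks, again with an assumed key-naming scheme and a plain Redis client:

DEFAULT_TTL_SECONDS = 24 * 60 * 60          # time-based expiry: 24 hours by default

def cache_store(redis_client, key: str, response: str) -> None:
    redis_client.set(key, response, ex=DEFAULT_TTL_SECONDS)

def on_user_settings_changed(redis_client, user_id: str) -> None:
    # Context-aware invalidation: drop every entry scoped to this user's context
    for key in redis_client.scan_iter(match=f"llm:{user_id}:*"):
        redis_client.delete(key)

def on_negative_feedback(redis_client, key: str) -> None:
    # Quality feedback: negative feedback removes the cached response immediately
    redis_client.delete(key)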

Results:

  • 30% cache hit rate overall
  • 50%+ for structured extraction tasks
  • 10-15% for creative generation

$ echo $RESULTS

40% Cost Reduction with Better Reliability

  • 40% cost reduction compared to OpenAI-only
  • 30% cache hit rate from semantic prompt caching
  • 99.9% availability with automatic failover
  • <100ms routing overhead for classification + routing

Additional Outcomes:

  • Zero downtime during multiple OpenAI outages in 2024
  • Product teams gained visibility into AI costs per feature
  • Enabled experimentation with new models without infrastructure changes

$ cat LESSONS_LEARNED.md

Classification Doesn't Need to Be Perfect

An 80% accurate classifier with fast inference beats a 95% accurate one with 500ms latency. Route aggressively, refine iteratively.

Cache Hits Compound Savings

Every cached response saves tokens AND latency. The ROI on semantic caching infrastructure exceeded our projections.

Observability Drives Optimization

Once we could see cost per feature, product owners started optimizing prompts. Visibility changed behavior.

$ cat README.md

The Problem with Single-Provider AI

When our AI platform relied 100% on OpenAI, we experienced every problem you’d expect:

Cost Volatility: One month’s bill was $15K. The next was $45K. Same features, just more usage. CFO was not happy.

Availability Risk: During OpenAI’s December 2023 outages, our AI features went completely dark. Users were frustrated.

Missed Opportunities: Claude’s 100K context window would have been perfect for our document analysis feature. But we couldn’t use it.

Designing the Multi-Provider System

The Router Architecture

The key insight: not all LLM requests are equal. Some need GPT-4’s reasoning. Some just need fast, cheap extraction. The router’s job is matching requests to optimal providers.

Request → Classification → Provider Selection → Execution → Response
              ┌────────────────┼────────────────┐
              ↓                ↓                ↓
           Simple          Medium           Complex
              ↓                ↓                ↓
           Gemini          Claude            GPT-4

Task Classification

We trained a lightweight classifier on labeled examples:

EXTRACTION (simple, routine)

  • Parsing structured data from documents
  • Entity extraction
  • Format conversion

GENERATION (medium complexity)

  • Email drafts
  • Summaries
  • Documentation

REASONING (complex)

  • Legal analysis
  • Multi-step problem solving
  • Nuanced interpretation

Classification happens in <20ms using a distilled model. The accuracy is ~85% — good enough for cost savings, with quality maintained through fallbacks.

Provider Characteristics

Each provider has strengths:

Provider    Best For             Latency    Cost
GPT-4       Complex reasoning    Slow       High
Claude      Long documents       Medium     Medium
Gemini      Fast extraction      Fast       Low

The router leverages these differences. Quick extraction goes to Gemini. Document analysis goes to Claude. Only genuinely complex reasoning hits GPT-4.
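One way to make those trade-offs explicit to the router is a small profile table it can consult; the tiers below just restate the table above, and the structure itself is illustrative rather than the production schema:

from dataclasses import dataclass

@dataclass(frozen=True)
class ProviderProfile:
    name: str
    best_for: str
    latency_tier: str    # "fast" | "medium" | "slow"
    cost_tier: str       # "low" | "medium" | "high"

PROVIDER_PROFILES = {
    "gpt-4":  ProviderProfile("gpt-4", "complex reasoning", "slow", "high"),
    "claude": ProviderProfile("claude", "long documents", "medium", "medium"),
    "gemini": ProviderProfile("gemini", "fast extraction", "fast", "low"),
}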

Fallback Chains

Every request has a fallback plan:

FALLBACK_CHAINS = {
    TaskType.EXTRACTION: ["gemini", "claude", "gpt-4"],
    TaskType.GENERATION: ["claude", "gpt-4", "gemini"],
    TaskType.REASONING: ["gpt-4", "claude", "gemini"],
}

If the primary fails (timeout, error, rate limit), we automatically try the next option. Users rarely notice — they just get a response.
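A stripped-down version of that retry behavior is sketched below; call_provider, ProviderError, and lookup_cached_response are placeholders for the real client wrappers and the cache layer, not the actual implementation:

class ProviderError(Exception):
    """Stand-in for timeouts, rate-limit (429), and provider-side (5xx) errors."""

def execute_with_fallback(task_type, prompt: str, context: dict) -> str:
    last_error: Exception | None = None
    for provider in FALLBACK_CHAINS[task_type]:
        try:
            return call_provider(provider, prompt, context, timeout=10)
        except ProviderError as exc:          # failed or too slow: try the next provider
            last_error = exc
    # Chain exhausted: a cached response beats a hard failure
    cached = lookup_cached_response(prompt, context)
    if cached is not None:
        return cached
    raise RuntimeError("all providers unavailable") from last_error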

Semantic Caching

Beyond provider routing, we cache aggressively:

Exact Match Cache: Same prompt → same response. Simple but effective for repetitive operations.

Semantic Cache: Similar prompts → similar responses. Using embedding similarity (cosine > 0.92), we can often reuse responses for paraphrased requests.

The cache hit rate surprised us: 30% overall, 50%+ for extraction tasks. That’s 30% of API calls we never make.

Results and Learnings

Quantitative Impact

  • 40% cost reduction vs. OpenAI-only (routing + caching combined)
  • 99.9% availability including through multiple provider outages
  • <100ms overhead for classification and routing
  • 30% cache hit rate across all requests

Qualitative Impact

Zero downtime during outages: When OpenAI had issues in Q1 2024, our system automatically shifted traffic to Claude and Gemini. Users didn’t notice.

Product team empowerment: Cost attribution dashboards let product owners see which features were expensive. Several teams optimized their prompts without engineering involvement.

Experimentation velocity: Testing a new model became a config change instead of a rewrite. We evaluated GPT-4 Turbo within hours of release.

Key Takeaways

  1. Start with classification, not perfection: An 80% accurate classifier that ships beats a 95% one in development

  2. Caching ROI exceeds expectations: We estimated 15% cache hits. Achieved 30%. The infrastructure paid for itself in weeks.

  3. Visibility changes behavior: Once teams saw their costs, they started caring about prompt efficiency

  4. Fallbacks are insurance: Build them before you need them. During an outage there's no time to implement them.


Building a multi-provider LLM system? Let’s discuss architecture.


Experience: Senior Backend Engineer & AI Lead at Anaqua

Technologies: LangChain, FastAPI, Python, OpenAI, Anthropic Claude, Redis

Related Case Studies: Enterprise RAG for Legal Documents | LLM Email Assistant

Need Multi-LLM Architecture?

Let's discuss how I can help solve your engineering challenges.