llm-email-assistant-at-scale@flowrite:~/case-study
Productivity / AI SaaS | 14 months | 2022-2023

LLM Email Assistant at Scale

@ Flowrite — Senior Software Engineer

Scaling Europe's first LLM-powered email assistant during the explosive growth of generative AI

10x User Growth
40-50% Cost Reduction
99.9% Uptime

$ cat PROBLEM.md

Scaling LLMs Before the Playbook Existed

In mid-2022, we were among a handful of companies globally shipping production LLM products. There was no established playbook for cost management, latency optimization, or observability. Meanwhile, our user base was growing 10x and expectations were high.

Key Challenges:

  • 🔴 LLM inference costs were unpredictable — a viral mention could spike costs 10x overnight
  • 🔴 Users expected instant email suggestions, but GPT-3 took 2-5 seconds per request
  • 🔴 No standard tools existed for monitoring LLM output quality in production
  • 🔴 Single provider dependency (OpenAI) was a business risk during frequent API outages

$ cat SOLUTION.md

Multi-Provider LLM Architecture with Intelligent Routing

We built a resilient, cost-optimized AI backend that could handle 10x growth while actually reducing per-user costs through intelligent provider routing, aggressive caching, and streaming delivery.

Technical Approach:

1. Multi-LLM Provider Strategy

Integrated OpenAI and Cohere with an intelligent router that classified requests by complexity. Simple completions went to cheaper, faster models; nuanced emails used GPT-3.5.
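
In TypeScript, the core of that routing decision can be sketched roughly like this (names and the length threshold are illustrative, not the production code):

// Route long or formal requests to the stronger model, everything else to the cheaper one.
type Provider = "openai" | "cohere";

interface EmailRequest {
  context: string; // email thread plus the user's instruction
  formal: boolean; // formality hint from the extension
}

function classify(req: EmailRequest): Provider {
  const longContext = req.context.length > 1500; // threshold is an assumption
  return longContext || req.formal ? "openai" : "cohere";
}

async function generate(req: EmailRequest): Promise<string> {
  return callProvider(classify(req), req.context);
}

// Stand-in for the real OpenAI / Cohere client wrappers.
async function callProvider(provider: Provider, prompt: string): Promise<string> {
  return `[${provider}] draft for: ${prompt.slice(0, 40)}`;
}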

2. Streaming Response Delivery

Implemented Server-Sent Events (SSE) to stream responses character-by-character. Users saw text generating in real-time, dramatically improving perceived latency.
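
A stripped-down sketch of the server side of that pattern, using Node's built-in http module (the token source below is a stand-in, not our actual gateway):

import http from "node:http";

// Minimal SSE endpoint: flush each token to the client as soon as it arrives.
http
  .createServer(async (_req, res) => {
    res.writeHead(200, {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    });

    for await (const token of generateTokens("Write a polite follow-up email")) {
      res.write(`data: ${JSON.stringify({ token })}\n\n`);
    }
    res.write("data: [DONE]\n\n");
    res.end();
  })
  .listen(3000);

// Fake token stream so the sketch runs standalone; the real source was the LLM's streaming API.
async function* generateTokens(prompt: string): AsyncGenerator<string> {
  for (const word of `Draft based on: ${prompt}`.split(" ")) {
    yield word + " ";
  }
}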

3. Aggressive Prompt Caching

Built a semantic caching layer that recognized similar email contexts. Cache hit rates of 40%+ meant significant cost and latency savings.

4. AI-Specific Observability

Integrated Gantry for LLM monitoring — tracking output quality, detecting prompt injection, and measuring model drift over time.

$ cat tech-stack.json

🚀 Core Technologies

OpenAI GPT-3/3.5

Primary LLM for complex email generation

Why: Best quality for nuanced, professional email writing

Cohere

Secondary LLM for simpler completions

Why: Lower latency, lower cost for straightforward suggestions

TypeScript / Node.js

Backend API services

Why: Type safety for complex LLM response handling

🔧 Supporting Technologies

Redis · RabbitMQ · Gantry · BigQuery

☁️ Infrastructure

AWS · Nomad · NixOS · Terraform

$ cat ARCHITECTURE.md

The system routes requests through complexity classification before hitting LLM providers:

Chrome Extension → API Gateway → Complexity Classifier
                ┌───────────────────────┴───────────────────────┐
                ↓                                               ↓
          Simple Requests                              Complex Requests
                ↓                                               ↓
          Check Cache                                     Check Cache
                ↓                                               ↓
           Cohere API ←─── fallback ───→ OpenAI API ←─── fallback ───→ Cohere
                ↓                                               ↓
          Stream Response                              Stream Response

System Components:

Request Classifier

Lightweight model that categorizes email complexity based on context length, formality requirements, and task type

Semantic Cache

Redis-based cache using embedding similarity to identify reusable responses

Provider Router

Manages multiple LLM providers with health checks, circuit breakers, and fallback chains

Streaming Gateway

SSE endpoint that buffers and delivers tokens to the Chrome extension

$ man implementation-details

Multi-Provider LLM Routing

Our router made real-time decisions about which provider to use:

Classification Criteria:

  • Context length (longer = OpenAI)
  • Formality level (formal = OpenAI)
  • Previous user feedback on quality
  • Current provider latency/availability

Fallback Chain:

OpenAI GPT-3.5 → Cohere Command → OpenAI Davinci → Graceful Degradation

Each hop had a timeout of 1 second. If all providers failed, we showed a similar cached response with a disclaimer.
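
A simplified sketch of that chain in TypeScript (the real router also consulted provider health, which is omitted here):

type ProviderCall = () => Promise<string>;

// Race a provider call against a timeout so one slow hop can't stall the chain.
function withTimeout(call: ProviderCall, ms: number): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("provider timeout")), ms);
    call()
      .then(resolve, reject)
      .finally(() => clearTimeout(timer));
  });
}

// Try each provider in order; if every hop fails, fall back to a cached response.
async function generateWithFallback(
  chain: ProviderCall[],
  cachedFallback: string
): Promise<string> {
  for (const call of chain) {
    try {
      return await withTimeout(call, 1_000); // 1 second per hop, as described above
    } catch {
      // record the failure and move on to the next provider in the chain
    }
  }
  return `${cachedFallback}\n\n(Shown from a similar cached request.)`;
}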

Cost Impact:

  • 60% of requests went to Cohere (cheaper)
  • 40% went to OpenAI (necessary for quality)
  • Net savings: 40-50% vs. an OpenAI-only architecture (rough arithmetic below)
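
As a rough illustration of how the split produces that figure, assume, purely for the sake of arithmetic, that a Cohere call cost about a fifth of an OpenAI call:

0.40 (OpenAI share) × 1.00 + 0.60 (Cohere share) × 0.20 = 0.52 of the OpenAI-only cost per request,
i.e. roughly 48% savings, before the ~40% cache hit rate removes a further share of paid calls entirely.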

Semantic Caching Strategy

Traditional caching didn’t work for LLMs — every prompt is slightly different. We built semantic caching:

  1. Embed the prompt context using a lightweight embedding model
  2. Search Redis for similar embeddings (cosine similarity > 0.95)
  3. Return cached response if found, with minor personalization

What we cached:

  • Email reply suggestions for common patterns (meeting requests, follow-ups)
  • Template-based emails with variable substitution
  • Frequently requested tones (professional, friendly, brief)

What we didn’t cache:

  • Unique context that wouldn’t generalize
  • User-specific information
  • Low-similarity matches (better to generate fresh)

This achieved 40%+ cache hit rates without quality degradation.
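
A toy version of the lookup in steps 1-3, with the similarity search done in application code (the production cache kept entries in Redis; here they are an in-memory array and the embedding call is assumed to have already happened):

interface CacheEntry {
  embedding: number[];
  response: string;
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return a cached response only if some stored prompt is semantically close enough.
function lookup(
  promptEmbedding: number[],
  entries: CacheEntry[],
  threshold = 0.95
): string | null {
  let best: CacheEntry | null = null;
  let bestScore = -1;
  for (const entry of entries) {
    const score = cosine(promptEmbedding, entry.embedding);
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best && bestScore >= threshold ? best.response : null;
}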

$ echo $RESULTS

10x Growth with Lower Per-User Costs

  • User growth: 10,000 → 100,000 in 14 months
  • Cost reduction: 40-50% per-user infrastructure cost
  • Cache hit rate: 40% from semantic prompt caching
  • Uptime: 99.9% during 10x growth

Additional Outcomes:

  • Multi-provider architecture meant zero downtime during major OpenAI outages
  • Streaming responses improved user satisfaction scores by 25%
  • AI observability caught a prompt injection vulnerability before exploitation
  • Technical foundation contributed to MailMerge acquisition in 2024

$ cat LESSONS_LEARNED.md

Streaming Changes Everything

A 3-second response that streams feels faster than a 2-second response that appears all at once. Perceived latency matters more than actual latency.

Multi-Provider is Insurance, Not Optimization

We implemented Cohere for cost savings, but it paid for itself in reliability. During OpenAI's December 2022 outages, we had zero downtime.

Cache Semantically, Not Literally

Exact-match caching had 5% hit rates. Semantic similarity caching (same email intent, different words) achieved 40%+.

$ cat README.md

Building Before the Playbook Existed

When I joined Flowrite in June 2022, the generative AI landscape was unrecognizable from today. ChatGPT hadn’t launched. There were no LangChain tutorials, no best practices for LLM cost management, no established patterns for production AI systems.

We were figuring it out as we went — and our users were growing 10x while we did it.

The Early LLM Production Challenge

Flowrite was Europe’s first LLM-powered email assistant. Users installed a Chrome extension, and as they composed emails, AI would suggest completions. Simple concept, complex execution.

Why This Was Hard in 2022

No Established Patterns: Today, you can Google “LLM production best practices” and get dozens of articles. In 2022, there were maybe 5 companies globally shipping production LLM products, and none were sharing their architecture.

Cost Unpredictability: LLM costs scale with tokens. A viral Product Hunt mention could spike our OpenAI bill 10x overnight. We needed dynamic cost controls.

Latency Expectations: Users expected instant suggestions. GPT-3 took 2-5 seconds per request. That gap had to be bridged.

Quality Monitoring: Traditional APM tools tell you if your service is up. They don’t tell you if your AI is generating garbage emails.

Our Technical Solutions

Multi-Provider Architecture

We couldn’t depend on OpenAI alone. Beyond cost, there was reliability — OpenAI had several major outages in late 2022 that would have killed our product.

Our solution: intelligent routing between OpenAI and Cohere.

A lightweight classifier (simple ML model) categorized each request:

  • Simple: “Reply to confirm meeting” → Cohere (faster, cheaper)
  • Complex: “Decline politely while leaving door open” → OpenAI (better nuance)

The router also handled fallbacks. If OpenAI was down or slow, requests automatically shifted to Cohere with graceful quality degradation.
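
One common way to implement that automatic shift is a per-provider circuit breaker; a minimal sketch, with thresholds chosen purely for illustration:

// Minimal per-provider circuit breaker: stop calling a provider after repeated
// failures, then let traffic through again once a cool-down period has passed.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly maxFailures = 3,    // threshold chosen for illustration
    private readonly cooldownMs = 30_000 // cool-down chosen for illustration
  ) {}

  isOpen(): boolean {
    if (this.failures < this.maxFailures) return false;
    if (Date.now() - this.openedAt > this.cooldownMs) {
      this.failures = 0; // half-open: let the next call probe the provider
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures === this.maxFailures) this.openedAt = Date.now();
  }
}

// Usage: skip any provider whose breaker is open and fall through to the next one.
const breakers = { openai: new CircuitBreaker(), cohere: new CircuitBreaker() };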

Streaming: Perceived vs. Actual Latency

Here’s a counterintuitive finding: a response that takes 3 seconds but streams progressively feels faster than a 2-second response that appears all at once.

We implemented Server-Sent Events (SSE) to stream tokens as they generated. The Chrome extension rendered text character-by-character, giving users the experience of watching the AI “think.”
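
On the extension side, consuming such a stream can be as simple as an EventSource listener that appends tokens as they arrive (a sketch, not the extension's actual code):

// Browser-side sketch: append streamed tokens into the suggestion element as they arrive.
const source = new EventSource("/api/suggest?draftId=123"); // endpoint name is illustrative
const target = document.getElementById("suggestion")!;

source.onmessage = (event) => {
  if (event.data === "[DONE]") {
    source.close();
    return;
  }
  const { token } = JSON.parse(event.data);
  target.textContent = (target.textContent ?? "") + token; // text appears as the model "thinks"
};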

User satisfaction scores improved 25% — not because we made the AI faster, but because we made the wait feel productive.

Semantic Caching That Actually Works

Traditional caching keys on exact input matches. For LLMs, this gives terrible hit rates because every prompt is slightly different.

We built semantic caching:

  1. Embed the prompt context using a fast embedding model
  2. Search Redis for similar embeddings (cosine similarity > 0.95)
  3. Return cached response with minor personalization

What this looked like in practice:

  • “Can you reply to confirm the meeting tomorrow at 3pm?” → Cache hit
  • “Reply confirming meeting, tomorrow 3pm” → Cache hit (same semantic meaning)
  • “Decline the meeting because I’m busy” → Cache miss (different intent)

This achieved 40%+ cache hit rates while maintaining quality.

AI Observability with Gantry

Traditional monitoring answers: “Is the service up? What’s the latency?”

For LLMs, you need to answer: “Is the AI output actually good?”

We integrated Gantry for AI-specific observability:

  • Quality metrics: Track user edits to AI suggestions (more edits = lower quality)
  • Prompt injection detection: Flag suspicious inputs trying to manipulate the AI
  • Drift monitoring: Detect when model behavior changes unexpectedly
  • A/B testing: Compare prompt variations scientifically

This observability caught a prompt injection vulnerability before it was exploited — the system flagged unusual patterns in user inputs that were attempting to extract system prompts.
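
The user-edit signal above can be approximated as a normalized edit distance between the suggested email and what the user actually sent; a minimal sketch (how the metric was shipped to Gantry is omitted):

// Levenshtein distance, used to quantify how much a user edited a suggestion.
function editDistance(a: string, b: string): number {
  // dp[i][j] = edits needed to turn the first i chars of a into the first j chars of b.
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// 0 = suggestion sent verbatim, 1 = completely rewritten.
function editRatio(suggested: string, sent: string): number {
  const longest = Math.max(suggested.length, sent.length) || 1;
  return editDistance(suggested, sent) / longest;
}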

Results and Acquisition

The technical foundation we built enabled:

  • 10x user growth without proportional infrastructure scaling
  • 40-50% cost reduction per user through intelligent routing and caching
  • 99.9% uptime including during major OpenAI outages
  • Zero security incidents thanks to proactive observability

In 2024, Flowrite was acquired by MailMerge. The scalable AI infrastructure we built was a key asset in the acquisition.

Lessons for LLM Practitioners

1. Build Multi-Provider From Day One

Even if you don’t need cost optimization, you need reliability. Single-provider dependency is a business risk.

2. Streaming Is Not Optional

For any user-facing LLM application, streaming responses should be the default. The UX improvement is dramatic.

3. Invest in AI-Specific Observability Early

You can’t improve what you can’t measure. LLM output quality requires specialized monitoring beyond traditional APM.

4. Semantic Caching Is Underrated

With proper embedding similarity thresholds, you can achieve significant cache hit rates without quality degradation.

5. Cost Management Is a Feature

For AI startups, LLM costs are a major expense. Building intelligent cost controls is as important as building features.



Experience: Senior Software Engineer at Flowrite

Technologies: TypeScript, Node.js, FastAPI, GraphQL, OpenAI, Redis, RabbitMQ, Prompt Engineering

Related Case Studies: Multi-LLM Orchestration | Enterprise RAG System

Scaling Your LLM Application?

Let's discuss how I can help solve your engineering challenges.