LLM Email Assistant at Scale
@ Flowrite — Senior Software Engineer
Scaling Europe's first LLM-powered email assistant during the explosive growth of generative AI
$ cat PROBLEM.md
Scaling LLMs Before the Playbook Existed
In mid-2022, we were among a handful of companies globally shipping production LLM products. There was no established playbook for cost management, latency optimization, or observability. Meanwhile, our user base was growing 10x and expectations were high.
Key Challenges:
- LLM inference costs were unpredictable — a viral mention could spike costs 10x overnight
- Users expected instant email suggestions, but GPT-3 took 2-5 seconds per request
- No standard tools existed for monitoring LLM output quality in production
- Single provider dependency (OpenAI) was a business risk during frequent API outages
$ cat SOLUTION.md
Multi-Provider LLM Architecture with Intelligent Routing
We built a resilient, cost-optimized AI backend that could handle 10x growth while actually reducing per-user costs through intelligent provider routing, aggressive caching, and streaming delivery.
Technical Approach:
Multi-LLM Provider Strategy
Integrated OpenAI and Cohere with an intelligent router that classified requests by complexity. Simple completions went to cheaper, faster models; nuanced emails used GPT-3.5.
Streaming Response Delivery
Implemented Server-Sent Events (SSE) to stream responses character-by-character. Users saw text generating in real-time, dramatically improving perceived latency.
Aggressive Prompt Caching
Built a semantic caching layer that recognized similar email contexts. Cache hit rates of 40%+ meant significant cost and latency savings.
AI-Specific Observability
Integrated Gantry for LLM monitoring — tracking output quality, detecting prompt injection, and measuring model drift over time.
$ cat tech-stack.json
🚀 Core Technologies
OpenAI GPT-3/3.5
Primary LLM for complex email generation
Why: Best quality for nuanced, professional email writing
Cohere
Secondary LLM for simpler completions
Why: Lower latency, lower cost for straightforward suggestions
TypeScript / Node.js
Backend API services
Why: Type safety for complex LLM response handling
$ cat ARCHITECTURE.md
The system routes requests through complexity classification before hitting LLM providers:
Chrome extension → Request Classifier → Semantic Cache (hit: return cached) → Provider Router → OpenAI / Cohere → Streaming Gateway → Chrome extension
System Components:
Request Classifier
Lightweight model that categorizes email complexity based on context length, formality requirements, and task type
Semantic Cache
Redis-based cache using embedding similarity to identify reusable responses
Provider Router
Manages multiple LLM providers with health checks, circuit breakers, and fallback chains
Streaming Gateway
SSE endpoint that buffers and delivers tokens to the Chrome extension
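In TypeScript terms, the pipeline can be sketched roughly like this (interface and type names below are illustrative, not taken from the actual Flowrite codebase):

```typescript
// Illustrative types only -- names are assumptions, not the production code.
type Provider = "openai" | "cohere";

interface EmailRequest {
  context: string;        // thread / instruction text from the extension
  tone?: "professional" | "friendly" | "brief";
}

interface RequestClassifier {
  classify(req: EmailRequest): Promise<{ complexity: "simple" | "complex" }>;
}

interface SemanticCache {
  lookup(req: EmailRequest): Promise<string | null>;   // similar cached response, if any
  store(req: EmailRequest, response: string): Promise<void>;
}

interface ProviderRouter {
  // Streams tokens from the chosen provider, falling back on failure.
  complete(req: EmailRequest, complexity: "simple" | "complex"): AsyncIterable<string>;
}

interface StreamingGateway {
  // Forwards tokens to the Chrome extension over Server-Sent Events.
  stream(tokens: AsyncIterable<string>): Promise<void>;
}
```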
$ man implementation-details
Multi-Provider LLM Routing
Our router made real-time decisions about which provider to use:
Classification Criteria:
- Context length (longer = OpenAI)
- Formality level (formal = OpenAI)
- Previous user feedback on quality
- Current provider latency/availability
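As a minimal sketch, the routing decision can be thought of as a weighted score over these signals (the production classifier was a lightweight ML model; the features, weights, and thresholds below are assumptions for illustration):

```typescript
// Heuristic stand-in for the lightweight classification model -- illustrative only.
interface ClassifierInput {
  contextTokens: number;           // length of the email thread / instruction
  formal: boolean;                 // formality requirement inferred from the request
  recentQualityComplaints: number; // prior user feedback on this request type
  openAiLatencyMs: number;         // current provider health
}

function chooseProvider(input: ClassifierInput): "openai" | "cohere" {
  let score = 0;
  if (input.contextTokens > 400) score += 2;            // longer context -> OpenAI
  if (input.formal) score += 2;                         // formal tone -> OpenAI
  score += Math.min(input.recentQualityComplaints, 3);  // quality feedback pushes upward
  if (input.openAiLatencyMs > 4000) score -= 2;         // degraded OpenAI -> prefer Cohere

  return score >= 3 ? "openai" : "cohere";
}
```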
Fallback Chain:
Preferred provider → alternate provider → cached similar response
Each hop had a timeout of 1 second. If all providers failed, we showed a cached similar response with a disclaimer.
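A simplified sketch of that fallback loop, assuming per-provider call functions and a cache lookup helper (the production router also tracked circuit-breaker state, which is omitted here):

```typescript
// Simplified fallback chain -- provider clients and the cache lookup are assumed helpers.
async function completeWithFallback(
  prompt: string,
  providers: Array<(prompt: string, signal: AbortSignal) => Promise<string>>,
  cacheLookup: (prompt: string) => Promise<string | null>,
): Promise<{ text: string; degraded: boolean }> {
  for (const callProvider of providers) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), 1000); // 1-second timeout per hop
    try {
      const text = await callProvider(prompt, controller.signal);
      return { text, degraded: false };
    } catch {
      // Timeout or provider error: fall through to the next hop.
    } finally {
      clearTimeout(timer);
    }
  }
  // All providers failed: serve a similar cached response with a disclaimer.
  const cached = await cacheLookup(prompt);
  return { text: cached ?? "", degraded: true };
}
```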
Cost Impact:
- 60% of requests went to Cohere (cheaper)
- 40% went to OpenAI (necessary for quality)
- Net savings: 40-50% vs. OpenAI-only architecture
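For intuition, using an assumed price ratio rather than actual 2022 list prices: if a Cohere completion cost roughly a quarter of the equivalent OpenAI call, the blended per-request cost was about 0.4 × 1.0 + 0.6 × 0.25 ≈ 0.55 of an OpenAI-only baseline, i.e. roughly 45% savings from routing alone, in line with the 40-50% figure above.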
Semantic Caching Strategy
Traditional caching didn’t work for LLMs — every prompt is slightly different. We built semantic caching:
- Embed the prompt context using a lightweight embedding model
- Search Redis for similar embeddings (cosine similarity > 0.95)
- Return cached response if found, with minor personalization
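A condensed sketch of the lookup path, where `embed()` and `listCandidates()` stand in for the embedding model and the Redis-backed store (the production system used a vector index rather than the linear scan shown here):

```typescript
// Semantic cache lookup -- embed() and listCandidates() are assumed helpers over the
// Redis-backed store; a real deployment would use a vector index, not a scan.
interface CacheEntry {
  embedding: number[];
  response: string;
}

const SIMILARITY_THRESHOLD = 0.95;

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function cachedResponse(
  promptContext: string,
  embed: (text: string) => Promise<number[]>,
  listCandidates: () => Promise<CacheEntry[]>,
): Promise<string | null> {
  const queryEmbedding = await embed(promptContext);
  let best: { score: number; response: string } | null = null;
  for (const entry of await listCandidates()) {
    const score = cosineSimilarity(queryEmbedding, entry.embedding);
    if (score >= SIMILARITY_THRESHOLD && (!best || score > best.score)) {
      best = { score, response: entry.response };
    }
  }
  return best ? best.response : null; // caller applies minor personalization
}
```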
What we cached:
- Email reply suggestions for common patterns (meeting requests, follow-ups)
- Template-based emails with variable substitution
- Frequently requested tones (professional, friendly, brief)
What we didn’t cache:
- Unique context that wouldn’t generalize
- User-specific information
- Low-similarity matches (better to generate fresh)
This achieved 40%+ cache hit rates without quality degradation.
$ echo $RESULTS
10x Growth with Lower Per-User Costs
Additional Outcomes:
- Multi-provider architecture meant zero downtime during major OpenAI outages
- Streaming responses improved user satisfaction scores by 25%
- AI observability caught a prompt injection vulnerability before exploitation
- Technical foundation contributed to MailMerge acquisition in 2024
$ cat LESSONS_LEARNED.md
Streaming Changes Everything
A 3-second response that streams feels faster than a 2-second response that appears all at once. Perceived latency matters more than actual latency.
Multi-Provider is Insurance, Not Optimization
We implemented Cohere for cost savings, but it paid for itself in reliability. During OpenAI's December 2022 outages, we had zero downtime.
Cache Semantically, Not Literally
Exact-match caching had 5% hit rates. Semantic similarity caching (same email intent, different words) achieved 40%+.
$ cat README.md
Building Before the Playbook Existed
When I joined Flowrite in June 2022, the generative AI landscape was unrecognizable from today. ChatGPT hadn’t launched. There were no LangChain tutorials, no best practices for LLM cost management, no established patterns for production AI systems.
We were figuring it out as we went — and our users were growing 10x while we did it.
The Early LLM Production Challenge
Flowrite was Europe’s first LLM-powered email assistant. Users installed a Chrome extension, and as they composed emails, AI would suggest completions. Simple concept, complex execution.
Why This Was Hard in 2022
No Established Patterns: Today, you can Google “LLM production best practices” and get dozens of articles. In 2022, there were maybe 5 companies globally shipping production LLM products, and none were sharing their architecture.
Cost Unpredictability: LLM costs scale with tokens. A viral Product Hunt mention could spike our OpenAI bill 10x overnight. We needed dynamic cost controls.
Latency Expectations: Users expected instant suggestions. GPT-3 took 2-5 seconds per request. That gap had to be bridged.
Quality Monitoring: Traditional APM tools tell you if your service is up. They don’t tell you if your AI is generating garbage emails.
Our Technical Solutions
Multi-Provider Architecture
We couldn’t depend on OpenAI alone. Beyond cost, there was reliability — OpenAI had several major outages in late 2022 that would have killed our product.
Our solution: intelligent routing between OpenAI and Cohere.
A lightweight classifier (simple ML model) categorized each request:
- Simple: “Reply to confirm meeting” → Cohere (faster, cheaper)
- Complex: “Decline politely while leaving door open” → OpenAI (better nuance)
The router also handled fallbacks. If OpenAI was down or slow, requests automatically shifted to Cohere with graceful quality degradation.
Streaming: Perceived vs. Actual Latency
Here’s a counterintuitive finding: a response that takes 3 seconds but streams progressively feels faster than a 2-second response that appears all at once.
We implemented Server-Sent Events (SSE) to stream tokens as they generated. The Chrome extension rendered text character-by-character, giving users the experience of watching the AI “think.”
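A minimal sketch of the gateway end of this, assuming an Express app and a `generateTokens()` async iterator wrapping the provider's streaming API (both stand-ins, not the actual Flowrite code):

```typescript
import express from "express";

// generateTokens() is a stand-in for the provider streaming call used in production.
declare function generateTokens(prompt: string): AsyncIterable<string>;

const app = express();

app.get("/suggest", async (req, res) => {
  // Standard SSE headers: keep the connection open and flush each token as an event.
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  try {
    for await (const token of generateTokens(String(req.query.prompt ?? ""))) {
      res.write(`data: ${JSON.stringify({ token })}\n\n`);
    }
    res.write("data: [DONE]\n\n");
  } finally {
    res.end();
  }
});

app.listen(3000);
```

On the client side, the extension can consume this with a standard EventSource and append each token to the draft as it arrives.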
User satisfaction scores improved 25% — not because we made the AI faster, but because we made the wait feel productive.
Semantic Caching That Actually Works
Traditional caching keys on exact input matches. For LLMs, this gives terrible hit rates because every prompt is slightly different.
We built semantic caching:
- Embed the prompt context using a fast embedding model
- Search Redis for similar embeddings (cosine similarity > 0.95)
- Return cached response with minor personalization
What this looked like in practice:
- “Can you reply to confirm the meeting tomorrow at 3pm?” → Cache hit
- “Reply confirming meeting, tomorrow 3pm” → Cache hit (same semantic meaning)
- “Decline the meeting because I’m busy” → Cache miss (different intent)
This achieved 40%+ cache hit rates while maintaining quality.
AI Observability with Gantry
Traditional monitoring answers: “Is the service up? What’s the latency?”
For LLMs, you need to answer: “Is the AI output actually good?”
We integrated Gantry for AI-specific observability:
- Quality metrics: Track user edits to AI suggestions (more edits = lower quality)
- Prompt injection detection: Flag suspicious inputs trying to manipulate the AI
- Drift monitoring: Detect when model behavior changes unexpectedly
- A/B testing: Compare prompt variations scientifically
This observability caught a prompt injection vulnerability before it was exploited — the system flagged unusual patterns in user inputs that were attempting to extract system prompts.
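One of the simplest quality signals was how much users edited a suggestion before sending it. A sketch of that metric, with the observability client abstracted behind a hypothetical `logQualitySignal()` (Gantry's actual API is not reproduced here):

```typescript
// Edit-distance-based quality proxy: more editing by the user = lower-quality suggestion.
// logQualitySignal() is a hypothetical wrapper around the observability client.
declare function logQualitySignal(event: { requestId: string; editRatio: number }): void;

function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

function recordSuggestionQuality(requestId: string, suggested: string, sent: string): void {
  const distance = levenshtein(suggested, sent);
  const editRatio = distance / Math.max(suggested.length, sent.length, 1);
  logQualitySignal({ requestId, editRatio }); // 0 = accepted as-is, 1 = fully rewritten
}
```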
Results and Acquisition
The technical foundation we built enabled:
- 10x user growth without proportional infrastructure scaling
- 40-50% cost reduction per user through intelligent routing and caching
- 99.9% uptime including during major OpenAI outages
- Zero security incidents thanks to proactive observability
In 2024, Flowrite was acquired by MailMerge. The scalable AI infrastructure we built was a key asset in the acquisition.
Lessons for LLM Practitioners
1. Build Multi-Provider From Day One
Even if you don’t need cost optimization, you need reliability. Single-provider dependency is a business risk.
2. Streaming Is Not Optional
For any user-facing LLM application, streaming responses should be default. The UX improvement is dramatic.
3. Invest in AI-Specific Observability Early
You can’t improve what you can’t measure. LLM output quality requires specialized monitoring beyond traditional APM.
4. Semantic Caching Is Underrated
With proper embedding similarity thresholds, you can achieve significant cache hit rates without quality degradation.
5. Cost Management Is a Feature
For AI startups, LLM costs are a major expense. Building intelligent cost controls is as important as building features.
Scaling your LLM application? Let’s discuss architecture strategies.
Related
Experience: Senior Software Engineer at Flowrite
Technologies: TypeScript, Node.js, FastAPI, GraphQL, OpenAI, Redis, RabbitMQ, Prompt Engineering
Related Case Studies: Multi-LLM Orchestration | Enterprise RAG System
Scaling Your LLM Application?
Let's discuss how I can help solve your engineering challenges.