LLM Observability Architecture
```
┌───────────────────────────────────────────────────────────────┐
│                      Your AI Application                      │
│             (Chains, Agents, RAG, Chatbots, etc.)             │
└───────────────────────────────┬───────────────────────────────┘
                                │
                         Instrumentation
                                │
┌───────────────────────────────┴───────────────────────────────┐
│                       Trace Collection                        │
│       (Spans, events, metadata, inputs/outputs, tokens)       │
└───────────────────────────────┬───────────────────────────────┘
                                │
          ┌─────────────────────┼─────────────────────┐
          │                     │                     │
          ▼                     ▼                     ▼
  ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
  │   LangSmith   │     │    Langfuse   │     │     Custom    │
  │   (Managed)   │     │  (Self-host)  │     │   (Your DB)   │
  └───────┬───────┘     └───────┬───────┘     └───────┬───────┘
          │                     │                     │
          └─────────────────────┼─────────────────────┘
                                │
          ┌─────────────────────┼─────────────────────┐
          │                     │                     │
          ▼                     ▼                     ▼
  ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
  │   Debugging   │     │    Metrics    │     │  Evaluation   │
  │  & Analysis   │     │ & Dashboards  │     │  & Feedback   │
  └───────────────┘     └───────────────┘     └───────────────┘
```
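The "Custom (Your DB)" collection path in the diagram can be sketched as a minimal span recorder. This is an illustrative design, not a specific product's schema: the `Span` fields and the SQLite table layout here are assumptions chosen to mirror the "spans, metadata, inputs/outputs" layer above.

```python
import json
import sqlite3
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    """One unit of work in a trace: an LLM call, a retrieval step, etc."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    start: float = field(default_factory=time.time)
    end: float = 0.0
    metadata: dict = field(default_factory=dict)

class TraceStore:
    """Minimal self-hosted trace backend (SQLite keeps the sketch dependency-free)."""
    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS spans "
            "(span_id TEXT PRIMARY KEY, trace_id TEXT, name TEXT, "
            "start REAL, end REAL, metadata TEXT)"
        )

    def record(self, span: Span) -> None:
        self.conn.execute(
            "INSERT INTO spans VALUES (?, ?, ?, ?, ?, ?)",
            (span.span_id, span.trace_id, span.name,
             span.start, span.end, json.dumps(span.metadata)),
        )
        self.conn.commit()

    def spans_for(self, trace_id: str) -> list[tuple]:
        """All spans belonging to one trace, for debugging and analysis."""
        return self.conn.execute(
            "SELECT name, metadata FROM spans WHERE trace_id = ?",
            (trace_id,),
        ).fetchall()
```

A real backend would add indexes on `trace_id` and `start`, but the shape is the same: spans keyed by trace, metadata as JSON.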
LangSmith Integration
```python
from langsmith import traceable, trace
from langchain.callbacks import LangChainTracer

# Automatic tracing for LangChain runs
tracer = LangChainTracer(project_name="production")

# Custom function tracing
@traceable(name="process_document")
def process_document(doc: Document) -> ProcessedDoc:
    # Step 1: Classification
    with trace("classify") as span:
        doc_type = classifier.classify(doc)
        span.add_metadata({"doc_type": doc_type})

    # Step 2: Extraction
    with trace("extract") as span:
        data = extractor.extract(doc, doc_type)
        span.end(outputs={"data": data})

    # Step 3: LLM processing
    with trace("llm_process") as span:
        result = llm.process(data)
        span.add_metadata({
            "model": llm.model_name,
            "tokens": result.usage.total_tokens,
            "cost": calculate_cost(result),
        })
    return result
```
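The `calculate_cost` helper used above is not defined in the snippet. A minimal sketch, assuming the response carries OpenAI-style `usage` token counts; the per-1K-token prices are illustrative and should be checked against the provider's current pricing:

```python
# Illustrative USD prices per 1K tokens (input, output); verify before use.
PRICES = {
    "gpt-4-turbo": (0.01, 0.03),
    "gpt-3.5-turbo": (0.0005, 0.0015),
}

def calculate_cost(result, model: str = "gpt-4-turbo") -> float:
    """Estimate USD cost from an OpenAI-style usage object."""
    input_price, output_price = PRICES[model]
    return ((result.usage.prompt_tokens / 1000) * input_price
            + (result.usage.completion_tokens / 1000) * output_price)
```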
Cost Tracking Implementation
```python
from datetime import datetime, timezone

class LLMCostTracker:
    # Token prices per model, USD per 1K tokens
    # (example values; verify against current provider pricing)
    COSTS = {
        "gpt-4-turbo": {"input": 0.01, "output": 0.03},
        "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
        "claude-3-opus": {"input": 0.015, "output": 0.075},
        "claude-3-sonnet": {"input": 0.003, "output": 0.015},
    }

    def track(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int,
        metadata: dict | None = None,
    ) -> CostRecord:
        costs = self.COSTS[model]
        input_cost = (input_tokens / 1000) * costs["input"]
        output_cost = (output_tokens / 1000) * costs["output"]
        record = CostRecord(
            timestamp=datetime.now(timezone.utc),
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost=input_cost + output_cost,
            metadata=metadata or {},
        )
        # Store for analysis
        self.db.insert(record)
        # Check budget alerts
        if self.over_budget(record.metadata.get("project")):
            self.send_alert(record)
        return record
```
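A tracker like this only pays off once records are queryable. A minimal sketch of per-project aggregation over in-memory records; the `CostRecord` shape here is an assumption matching the fields used above, and the `"project"` metadata key is illustrative:

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class CostRecord:
    timestamp: datetime
    model: str
    input_tokens: int
    output_tokens: int
    cost: float
    metadata: dict = field(default_factory=dict)

def cost_by_project(records: list[CostRecord]) -> dict[str, float]:
    """Sum spend per project tag, so budget checks have a number to compare."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r.metadata.get("project", "untagged")] += r.cost
    return dict(totals)
```

In production the same aggregation would run as a SQL `GROUP BY` over the stored records rather than in application memory.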
Observability Dashboards
| Metric | What It Tells You | Action |
|---|---|---|
| Latency P95 | User experience | Optimize slow chains |
| Error Rate | Reliability | Fix error patterns |
| Token Usage | Cost drivers | Optimize prompts |
| Success Rate | Task completion | Improve prompts/logic |
| Cost per User | Unit economics | Model selection |
| Quality Score | Output value | Prompt engineering |
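The latency and error-rate rows can be computed directly from collected span records. A minimal sketch using the nearest-rank percentile definition (dashboard tools like Grafana or Prometheus do this for you; this only shows what the numbers mean):

```python
import math

def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile: the value 95% of requests stay under."""
    ranked = sorted(latencies_ms)
    idx = math.ceil(0.95 * len(ranked)) - 1
    return ranked[idx]

def error_rate(outcomes: list[bool]) -> float:
    """Fraction of requests that failed (True = error)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```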
Evaluation Pipeline
```python
from langsmith.evaluation import evaluate

class AIEvaluator:
    def evaluate_model(
        self,
        dataset: Dataset,
        model: str
    ) -> EvaluationReport:
        results = evaluate(
            model,
            data=dataset,
            evaluators=[
                self.accuracy_evaluator,
                self.format_evaluator,
                self.safety_evaluator,
                self.relevance_evaluator,
            ],
        )
        return EvaluationReport(
            accuracy=results.aggregate("accuracy"),
            format_compliance=results.aggregate("format"),
            safety_score=results.aggregate("safety"),
            relevance=results.aggregate("relevance"),
            regression=self.detect_regression(results),
        )

    def detect_regression(self, results) -> list[Regression]:
        """Compare to baseline and flag regressions."""
        baseline = self.load_baseline()
        regressions = []
        for metric, value in results.metrics.items():
            if value < baseline[metric] * 0.95:  # 5% threshold
                regressions.append(Regression(
                    metric=metric,
                    baseline=baseline[metric],
                    current=value,
                    severity="high" if value < baseline[metric] * 0.9 else "medium",
                ))
        return regressions
```
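The `load_baseline` call above implies aggregate metrics persisted from a previously accepted run. A minimal JSON-file sketch; the file path and the idea of a flat metric-to-score mapping are assumptions for illustration:

```python
import json
from pathlib import Path

BASELINE_PATH = Path("eval_baseline.json")  # assumed location

def save_baseline(metrics: dict[str, float], path: Path = BASELINE_PATH) -> None:
    """Persist aggregate metrics after an evaluation run is accepted."""
    path.write_text(json.dumps(metrics, indent=2))

def load_baseline(path: Path = BASELINE_PATH) -> dict[str, float]:
    """Load the last accepted run's metrics for regression comparison."""
    return json.loads(path.read_text())
```

Committing the baseline file to version control makes regressions reviewable alongside the prompt or model change that caused them.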
Technologies for LLM Observability
- Tracing: LangSmith, Langfuse, Arize Phoenix
- Metrics: Prometheus, Grafana, DataDog
- Logging: Structured logging, OpenTelemetry
- Evaluation: LangSmith Eval, custom frameworks
- Alerting: PagerDuty, Slack integrations
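Structured logging, listed above, is the cheapest entry point: emit one JSON object per LLM call so any log pipeline can filter and index it. A minimal stdlib sketch; the event field names are illustrative, not a standard schema:

```python
import json
import logging
import sys

logger = logging.getLogger("llm")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def llm_event(model: str, latency_ms: float, tokens: int, ok: bool) -> str:
    """Serialize one LLM call as a single JSON log line."""
    return json.dumps({
        "event": "llm_call",
        "model": model,
        "latency_ms": latency_ms,
        "tokens": tokens,
        "ok": ok,
    })

def log_llm_call(model: str, latency_ms: float, tokens: int, ok: bool) -> None:
    # One JSON line per call: greppable locally, indexable in any log store.
    logger.info(llm_event(model, latency_ms, tokens, ok))
```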
Frequently Asked Questions
What is LLM observability?
LLM observability is the practice of monitoring, tracing, and analyzing LLM application behavior. This includes tracking request/response logs, latency, cost, token usage, quality metrics, error rates, and user feedback. It’s essential for production LLM applications.
How much does LLM observability setup cost?
LLM observability typically costs $90-140 per hour. A basic logging and dashboard setup starts around $5,000-10,000, while thorough observability with evaluation pipelines and alerting ranges from $15,000-40,000+.
Which observability tools do you work with?
I work with LangSmith, Langfuse, Helicone, PromptLayer, and custom solutions. The choice depends on LangChain usage, privacy requirements, and specific needs. I also build custom dashboards with Grafana for unified observability.
How do you evaluate LLM output quality?
I implement automated evaluation with LLM-as-judge, reference-based metrics (BLEU, ROUGE for summarization), human evaluation workflows, A/B testing frameworks, and user feedback collection. Quality monitoring catches regressions before users notice.
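LLM-as-judge can be as simple as asking a second model to grade an answer against a rubric. A minimal sketch, assuming an OpenAI-style chat client; the judge prompt, the 1-5 scale, and the model choice are all illustrative assumptions:

```python
JUDGE_PROMPT = """Rate the ANSWER to the QUESTION on a 1-5 scale for accuracy
and relevance. Reply with only the integer.

QUESTION: {question}
ANSWER: {answer}"""

def judge(client, question: str, answer: str, model: str = "gpt-4-turbo") -> int:
    """Score an answer with a second model; returns 1-5, or 0 if unparseable."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    text = response.choices[0].message.content.strip()
    return int(text) if text.isdigit() else 0
```

Returning 0 for unparseable replies keeps bad judge outputs visible in aggregates instead of silently dropping them.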
What should I monitor in production LLM apps?
Essential metrics: latency (P50, P95, P99), error rates, token usage and cost, cache hit rates, quality scores, and user feedback. I also monitor prompt template versions, model performance by query type, and cost per user segment.
Related Technologies: LangChain, AI Agents, Python, OpenAI, Claude