Enterprise RAG for Legal Documents
@ Anaqua — Senior Backend Engineer & AI Backend Lead
Transforming how legal professionals search millions of patents and IP documents using AI-powered semantic search
$ cat PROBLEM.md
Generic RAG Failed for Legal Documents
Legal and patent documents contain highly specialized terminology, complex document structures (claims, citations, legal provisions), and reference networks that generic RAG approaches couldn't handle effectively. Standard chunking strategies broke apart critical context, and general-purpose embeddings performed poorly on IP-specific vocabulary.
Key Challenges:
- Legal terminology confused standard embedding models — 'claim' in a patent means something entirely different from its everyday usage
- Document structure matters — splitting patents arbitrarily destroyed the relationship between claims and their dependent claims
- Citation networks are critical — a relevant document often references 20+ related patents that users also need
- Enterprise users expected search results in under 2 seconds with sub-100ms reranking
$ cat SOLUTION.md
Domain-Specific RAG with Structure-Aware Processing
We built a custom RAG pipeline specifically designed for legal and IP documents, respecting document structure and domain terminology while maintaining enterprise-grade performance.
Technical Approach:
Structure-Aware Chunking
Developed a document parser that understands patent and legal document formats. Chunks respect claim boundaries, keep citations intact, and maintain parent-child relationships between document sections.
Domain-Specific Embeddings
Fine-tuned embedding models on a corpus of 500K+ legal documents. The resulting embeddings correctly capture that 'prior art' is related to 'novelty' even though they share no words.
Citation-Aware Retrieval
Built a graph layer on top of vector search that follows citation chains. When retrieving a relevant patent, we also surface the most important documents it references.
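To make the idea concrete, here is a minimal sketch of citation-chain expansion as a recursive SQL query. The `patent_citations(citing_id, cited_id)` table, the depth limit, and the ranking are illustrative assumptions rather than the production schema:

```python
# Sketch: expand citation chains from an initial vector-search result set.
# Table and column names are hypothetical; requires psycopg (pip install "psycopg[binary]").
from typing import Iterable

import psycopg

CITATION_EXPANSION_SQL = """
WITH RECURSIVE chain AS (
    SELECT cited_id, 1 AS depth
    FROM patent_citations
    WHERE citing_id = ANY(%(seed_ids)s)
  UNION
    SELECT pc.cited_id, chain.depth + 1
    FROM patent_citations pc
    JOIN chain ON pc.citing_id = chain.cited_id
    WHERE chain.depth < %(max_depth)s
)
SELECT cited_id, MIN(depth) AS depth, COUNT(*) AS in_chain_citations
FROM chain
GROUP BY cited_id
ORDER BY depth ASC, in_chain_citations DESC
LIMIT %(limit)s;
"""

def expand_citations(conn: psycopg.Connection,
                     seed_ids: Iterable[str],
                     max_depth: int = 2,
                     limit: int = 20) -> list[tuple[str, int, int]]:
    """Follow citation edges from the retrieved patents and return referenced
    documents, ranked by proximity first and citation frequency second."""
    with conn.cursor() as cur:
        cur.execute(CITATION_EXPANSION_SQL,
                    {"seed_ids": list(seed_ids), "max_depth": max_depth, "limit": limit})
        return cur.fetchall()
```

Ranking by depth before in-chain citation count keeps directly cited documents ahead of second-hop discoveries.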
Hybrid Search Architecture
Combined semantic vector search with BM25 keyword matching. Legal professionals often search for exact patent numbers or legal terms that benefit from keyword precision.
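A simplified sketch of what such a hybrid query can look like with pgvector and PostgreSQL full-text search; the `patent_chunks` table, its column names, and the 0.7/0.3 weighting are illustrative assumptions:

```python
# Sketch of a hybrid query: pgvector cosine distance fused with Postgres
# full-text rank. %(query_embedding)s is a pgvector literal such as
# '[0.12, -0.03, ...]'; ts_rank_cd is Postgres's built-in ranking, standing
# in for true BM25 here. In practice both scores should be normalized before
# weighting.
HYBRID_SEARCH_SQL = """
WITH semantic AS (
    SELECT id, 1 - (embedding <=> %(query_embedding)s::vector) AS sim
    FROM patent_chunks
    ORDER BY embedding <=> %(query_embedding)s::vector
    LIMIT 50
),
keyword AS (
    SELECT id, ts_rank_cd(content_tsv, plainto_tsquery('english', %(query_text)s)) AS rank
    FROM patent_chunks
    WHERE content_tsv @@ plainto_tsquery('english', %(query_text)s)
    ORDER BY rank DESC
    LIMIT 50
)
SELECT id,
       COALESCE(s.sim, 0) * 0.7 + COALESCE(k.rank, 0) * 0.3 AS score
FROM semantic s
FULL OUTER JOIN keyword k USING (id)
ORDER BY score DESC
LIMIT 20;
"""
```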
$ cat tech-stack.json
🚀 Core Technologies
PGVector
Vector storage and similarity search
Why: Integrates with existing PostgreSQL infrastructure, supports hybrid queries, production-proven at scale
LangChain
LLM orchestration and retrieval pipelines
Why: Flexible abstractions for building complex RAG flows with multiple retrievers
Python / FastAPI
Backend API and processing pipelines
Why: Async support for high-throughput AI workloads, excellent ML ecosystem
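For illustration, a minimal async search endpoint in the shape this stack suggests; the route, models, and `hybrid_search` stub are hypothetical, not the production API:

```python
# Illustrative only: an async FastAPI search endpoint.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SearchRequest(BaseModel):
    query: str
    top_k: int = 20

class SearchHit(BaseModel):
    document_id: str
    score: float

async def hybrid_search(query: str, top_k: int) -> list[dict]:
    """Stub standing in for the real retrieval pipeline (vector + keyword + rerank)."""
    return []

@app.post("/search", response_model=list[SearchHit])
async def search(req: SearchRequest) -> list[SearchHit]:
    # Awaiting the I/O-bound retrieval work keeps the event loop free to
    # serve other requests, which is the point of the async stack.
    hits = await hybrid_search(req.query, req.top_k)
    return [SearchHit(document_id=h["id"], score=h["score"]) for h in hits]
```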
$ cat ARCHITECTURE.md
The system follows a three-stage retrieval architecture:
[Architecture diagram: query understanding → hybrid retrieval → reranking]
System Components:
Document Ingestion Pipeline
Processes new documents through structure parsing, chunking, embedding generation, and citation extraction
Hybrid Retriever
Combines PGVector semantic search with PostgreSQL full-text search (BM25-style keyword ranking)
Citation Graph Service
Manages document relationships and performs graph traversal for related document discovery
Reranking Service
Cross-encoder model that reorders candidates for maximum relevance
$ man implementation-details
Structure-Aware Chunking Strategy
Patent documents have a specific structure that must be respected:
- Abstract: High-level summary, good for initial retrieval
- Claims: The legally binding scope (independent + dependent claims)
- Description: Detailed explanation with drawings references
- Prior Art: Citations to related patents and publications
Our chunking strategy:
- Never split claims — each claim becomes its own chunk with metadata linking to parent claims
- Preserve section context — every chunk includes section type metadata for filtering
- Overlapping windows for description — 512 tokens with 128 token overlap
- Citation extraction — all references extracted and stored in graph database
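A minimal sketch of these rules, under the simplifying assumptions that claims arrive pre-split and that whitespace tokens stand in for the real tokenizer; the dataclass, regexes, and function names are illustrative:

```python
# Sketch of the chunking rules: claims stay whole with a parent link,
# descriptions get overlapping windows, citations are extracted per chunk.
import re
from dataclasses import dataclass, field

@dataclass
class Chunk:
    section: str                      # "claims" or "description"
    text: str
    parent_claim: int | None = None   # set for dependent claims
    citations: list[str] = field(default_factory=list)

PATENT_REF = re.compile(r"US\s?\d{1,2},?\d{3},?\d{3}")   # e.g. "US 9,123,456"
DEPENDS_ON = re.compile(r"claim\s+(\d+)", re.IGNORECASE)

def chunk_claims(claims: list[str]) -> list[Chunk]:
    """Each claim becomes its own chunk; dependent claims keep a link to the
    claim they reference so context can be reassembled at query time."""
    chunks = []
    for claim in claims:
        dep = DEPENDS_ON.search(claim)
        chunks.append(Chunk(
            section="claims",
            text=claim,
            parent_claim=int(dep.group(1)) if dep else None,
            citations=PATENT_REF.findall(claim),
        ))
    return chunks

def chunk_description(text: str, window: int = 512, overlap: int = 128) -> list[Chunk]:
    """Overlapping windows over the description (word-level proxy for the
    512/128-token scheme described above)."""
    words = text.split()
    step = window - overlap
    return [Chunk(section="description", text=" ".join(words[i:i + window]))
            for i in range(0, max(len(words), 1), step)]
```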
Custom Embedding Model Training
Standard embedding models (OpenAI, Cohere) struggled with legal terminology. We fine-tuned a model using contrastive learning on patent pairs.
Training Data:
- 500K patent documents
- 50K manually labeled similar/dissimilar pairs from patent examiners
- Synthetic pairs from citation networks (cited patents are similar)
Key Improvements:
- Domain vocabulary: “anticipation” and “novelty rejection” are related
- Technical precision: Chemical formulas and patent numbers embedded correctly
- Citation context: “See US 9,123,456” correctly links to the referenced patent
The fine-tuned model improved retrieval accuracy from 67% to 85% on our test set.
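A sketch of the contrastive training setup using the sentence-transformers API; the base model, hyperparameters, example pairs, and output path are illustrative assumptions, not the production configuration:

```python
# Sketch of contrastive fine-tuning on similar/dissimilar patent pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder base model

# label 1 = similar (examiner-labeled or citation-linked), 0 = dissimilar
train_examples = [
    InputExample(texts=["...claim anticipated by the prior art...",
                        "...novelty rejection under 35 U.S.C. 102..."], label=1),
    InputExample(texts=["...hydraulic braking system...",
                        "...method of brewing coffee..."], label=0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.ContrastiveLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="patent-embeddings-v1",   # hypothetical output name
)
```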
$ echo $RESULTS
50% Faster Search with Higher Relevance
Additional Outcomes:
- Legal professionals reported finding relevant prior art in a single search that previously required multiple attempts
- The citation graph feature was adopted as a core workflow by patent analysts
- System became a key factor in the RightHub → Anaqua acquisition
$ cat LESSONS_LEARNED.md
Domain Expertise Beats Generic Models
Investing time to understand legal document structure paid dividends. The 2 weeks spent interviewing patent attorneys about their search patterns directly informed our chunking strategy.
Hybrid Search is Non-Negotiable for Enterprise
Pure semantic search shines in demos but frustrates power users who know exactly what they're looking for. Hybrid search satisfies both exploratory and precision use cases.
Evaluation Requires Domain Experts
Standard IR metrics like NDCG weren't enough. We created a custom evaluation set with legal experts rating relevance, which revealed issues that automated metrics missed.
$ cat README.md
The Challenge: AI That Understands Legal Language
When I joined RightHub (later acquired by Anaqua), the company had a clear vision: bring AI-powered search to intellectual property management. The challenge? Legal documents are unlike any other content.
A patent attorney searching for prior art needs:
- Semantic understanding — finding relevant patents even when different terminology is used
- Structural precision — understanding that Claim 1 defines the core invention
- Citation awareness — knowing that Patent A citing Patent B means they’re related
- Speed — results in seconds, not minutes
Generic RAG solutions failed on all counts.
Our Approach: Domain-First Design
Rather than forcing legal documents into a generic RAG framework, we designed the system around how patent professionals actually work.
Understanding the Domain
I spent the first two weeks interviewing patent attorneys, watching them search, and understanding their mental models. Key insights:
- Claims are everything — The legal scope of a patent is defined entirely by its claims
- Citation networks are gold — Experienced searchers follow citation chains to find related art
- Exact matching still matters — When you know the patent number, you want exact results
- Context is critical — A dependent claim only makes sense in the context of the claim it depends on
Technical Deep-Dive: The Retrieval Pipeline
Our final architecture processes queries through multiple stages:
Stage 1: Query Understanding
- Entity extraction for patent numbers, company names, technical terms
- Query expansion using domain synonyms (e.g., “mobile device” → “smartphone”, “cellular phone”, “handheld device”)
- Intent classification (prior art search vs. freedom-to-operate vs. general research)
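A minimal sketch of this stage, with an intentionally tiny synonym table (the production list was curated with patent attorneys) and an illustrative patent-number pattern:

```python
# Sketch of Stage 1: extract patent numbers and expand domain synonyms
# before retrieval. All names and patterns here are illustrative.
import re

PATENT_NUMBER = re.compile(r"\b[A-Z]{2}\s?\d{1,2},?\d{3},?\d{3}\b")

DOMAIN_SYNONYMS = {
    "mobile device": ["smartphone", "cellular phone", "handheld device"],
    "prior art": ["anticipation", "novelty"],
}

def understand_query(query: str) -> dict:
    patent_numbers = PATENT_NUMBER.findall(query)
    expansions = [syn for term, syns in DOMAIN_SYNONYMS.items()
                  if term in query.lower() for syn in syns]
    return {
        "original": query,
        "patent_numbers": patent_numbers,   # routed to exact keyword match
        "expanded_terms": expansions,       # appended to the semantic query
    }

print(understand_query("prior art for mobile device charging, see US 9,123,456"))
# patent_numbers -> ['US 9,123,456']
# expanded_terms -> ['smartphone', 'cellular phone', 'handheld device', 'anticipation', 'novelty']
```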
Stage 2: Hybrid Retrieval
- Vector search using PGVector with custom legal embeddings
- BM25 keyword search for exact matching
- Citation graph traversal for related documents
- Score fusion to combine results
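One reasonable way to implement the fusion step is reciprocal rank fusion (RRF); whether this matches the exact production formula is an assumption:

```python
# Sketch: merge the vector, BM25, and citation-graph rankings with RRF.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Combine several ranked result lists into one; k dampens the influence
    of any single list's top ranks."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([
    ["US9123456", "US8000001", "US7654321"],   # vector search order
    ["US8000001", "US9123456"],                # keyword search order
    ["US7654321", "US6111111"],                # citation-graph order
])
```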
Stage 3: Reranking
- Cross-encoder model for precise relevance scoring
- Diversity injection to avoid redundant results
- Explanation generation for why each result matched
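A sketch of the cross-encoder pass, using a public MS MARCO reranker as a stand-in for the production model; diversity injection and explanation generation are omitted here:

```python
# Sketch of Stage 3: score each (query, chunk) pair jointly and keep the best.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder model

def rerank(query: str, candidates: list[dict], top_k: int = 10) -> list[dict]:
    """Rerank retrieval candidates (dicts with a 'text' field) by cross-encoder score."""
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [dict(c, rerank_score=float(s)) for c, s in ranked[:top_k]]
```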
Results That Mattered
The system’s success was measured not just in technical metrics, but in user adoption:
- Power users reported finding relevant prior art in a single search that previously required 3-4 attempts
- Time savings translated to real cost reduction — patent searches that took hours now took minutes
- Confidence — attorneys trusted the AI results enough to cite them in legal filings
This success was a key factor in Anaqua’s decision to acquire RightHub.
Key Takeaways for RAG Practitioners
Invest in domain understanding before writing code — The chunking strategy that emerged from user interviews was nothing like what I’d have designed in isolation
Hybrid search is essential for enterprise — Don’t get seduced by pure semantic search. Power users need exact matching.
Build evaluation datasets with domain experts — Standard benchmarks won’t tell you if your legal RAG is working
Citation/reference networks are underutilized — If your domain has documents that reference each other, leverage those relationships
Fine-tuned embeddings are worth the effort — The jump from 67% to 85% accuracy justified the investment in custom training
Want to discuss building a RAG system for your domain? Let’s talk.
Related
Experience: Senior Backend Engineer & AI Lead at Anaqua
Technologies: LangChain, RAG Systems, PGVector, FastAPI, Python, PostgreSQL
Related Case Studies: Multi-LLM Orchestration | Agentic AI Knowledge Systems
Building an Enterprise RAG System?
Let's discuss how I can help solve your engineering challenges.