enterprise-rag-for-legal-documents@anaqua:~/case-study
Legal Tech / IP Management · 8 months · 2023–2024

Enterprise RAG for Legal Documents

@ Anaqua — Senior Backend Engineer & AI Backend Lead

Transforming how legal professionals search millions of patents and IP documents using AI-powered semantic search

50% Faster Search
10K+ Daily AI Queries
99.9% System Uptime

$ cat PROBLEM.md

Generic RAG Failed for Legal Documents

Legal and patent documents contain highly specialized terminology, complex document structures (claims, citations, legal provisions), and reference networks that generic RAG approaches couldn't handle effectively. Standard chunking strategies broke apart critical context, and general-purpose embeddings performed poorly on IP-specific vocabulary.

Key Challenges:

  • 🔴 Legal terminology confused standard embedding models — 'claim' in a patent means something entirely different from its everyday usage
  • 🔴 Document structure matters — splitting patents arbitrarily destroyed the relationship between claims and their dependent claims
  • 🔴 Citation networks are critical — a relevant document often references 20+ related patents that users also need
  • 🔴 Enterprise users expected search results in under 2 seconds with sub-100ms reranking

$ cat SOLUTION.md

Domain-Specific RAG with Structure-Aware Processing

We built a custom RAG pipeline specifically designed for legal and IP documents, respecting document structure and domain terminology while maintaining enterprise-grade performance.

Technical Approach:

1. Structure-Aware Chunking

Developed a document parser that understands patent and legal document formats. Chunks respect claim boundaries, keep citations intact, and maintain parent-child relationships between document sections.

2. Domain-Specific Embeddings

Fine-tuned embedding models on a corpus of 500K+ legal documents. The resulting embeddings correctly capture that 'prior art' is related to 'novelty' even though they share no words.

3. Citation-Aware Retrieval

Built a graph layer on top of vector search that follows citation chains. When retrieving a relevant patent, we also surface the most important documents it references.
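
A toy sketch of the traversal idea, with hypothetical names standing in for the internal graph service; scores decay per hop so that directly cited patents outrank distant ones:

from collections import deque

class CitationGraph:
    """Hypothetical stand-in for the internal citation graph service."""

    def __init__(self, edges: dict[str, list[str]]):
        self.edges = edges  # patent_id -> patents it cites

    def expand(self, seed_id: str, max_depth: int = 2,
               decay: float = 0.5) -> dict[str, float]:
        # Breadth-first traversal; each hop multiplies the score by `decay`,
        # and a node keeps the best score among all paths that reach it
        scores = {seed_id: 1.0}
        queue = deque([(seed_id, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_depth:
                continue
            for cited in self.edges.get(node, []):
                hop_score = scores[node] * decay
                if hop_score > scores.get(cited, 0.0):
                    scores[cited] = hop_score
                    queue.append((cited, depth + 1))
        return scores

When a vector-search hit comes back, expand() supplies its citation-weighted neighborhood, which is merged into the candidate pool before reranking.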

4. Hybrid Search Architecture

Combined semantic vector search with BM25 keyword matching. Legal professionals often search for exact patent numbers or legal terms that benefit from keyword precision.
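
A minimal sketch of what such a hybrid query can look like with PGVector, assuming a hypothetical patent_chunks table with an embedding vector column and a tsv full-text column; ts_rank stands in for the BM25-style keyword signal, and the 0.7/0.3 blend is purely illustrative:

HYBRID_QUERY = """
WITH semantic AS (
    SELECT id, 1 - (embedding <=> $1) AS vec_score  -- cosine similarity
    FROM patent_chunks
    ORDER BY embedding <=> $1
    LIMIT 50
),
keyword AS (
    SELECT id, ts_rank(tsv, plainto_tsquery('english', $2)) AS kw_score
    FROM patent_chunks
    WHERE tsv @@ plainto_tsquery('english', $2)
    ORDER BY kw_score DESC
    LIMIT 50
)
SELECT id,
       COALESCE(s.vec_score, 0) * 0.7 + COALESCE(k.kw_score, 0) * 0.3 AS score
FROM semantic s
FULL OUTER JOIN keyword k USING (id)
ORDER BY score DESC
LIMIT 20;
"""

Cosine similarity and ts_rank live on different scales, so in practice the two signals are normalized or merged with rank-based fusion; a reciprocal-rank-fusion sketch appears in the retrieval-pipeline section below.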

$ cat tech-stack.json

🚀 Core Technologies

PGVector

Vector storage and similarity search

Why: Integrates with existing PostgreSQL infrastructure, supports hybrid queries, production-proven at scale

LangChain

LLM orchestration and retrieval pipelines

Why: Flexible abstractions for building complex RAG flows with multiple retrievers

Python / FastAPI

Backend API and processing pipelines

Why: Async support for high-throughput AI workloads, excellent ML ecosystem

🔧 Supporting Technologies

PostgreSQL · Redis · HuggingFace Transformers

☁️ Infrastructure

Google Cloud Platform · Docker / Kubernetes · GitLab CI/CD

$ cat ARCHITECTURE.md

The system follows a three-stage retrieval architecture:

Query → Query Understanding → Hybrid Retrieval → Reranking → Response
          ↓                      ↓                 ↓
     Entity extraction    Vector + BM25     Cross-encoder
     Query expansion      Citation graph    Score fusion

System Components:

Document Ingestion Pipeline

Processes new documents through structure parsing, chunking, embedding generation, and citation extraction (a rough sketch follows the component list)

Hybrid Retriever

Combines PGVector semantic search with PostgreSQL full-text search for BM25-style keyword ranking

Citation Graph Service

Manages document relationships and performs graph traversal for related document discovery

Reranking Service

Cross-encoder model that reorders candidates for maximum relevance
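
Tying the components together, a rough sketch of the ingestion path; PatentParser, embed_batch, vector_store, and citation_graph are hypothetical stand-ins for the internal services:

def ingest_document(raw_pdf: bytes) -> None:
    patent = PatentParser().parse(raw_pdf)           # structure parsing
    chunks = PatentChunker().chunk_patent(patent)    # claim-safe chunking
    vectors = embed_batch([c.text for c in chunks])  # fine-tuned embeddings
    vector_store.upsert(list(zip(chunks, vectors)))  # PGVector rows
    citation_graph.add_edges(patent.id, patent.cited_patent_ids)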

$ man implementation-details

Structure-Aware Chunking Strategy

Patent documents have a specific structure that must be respected:

  • Abstract: High-level summary, good for initial retrieval
  • Claims: The legally binding scope (independent + dependent claims)
  • Description: Detailed explanation with references to drawings
  • Prior Art: Citations to related patents and publications

Our chunking strategy:

  1. Never split claims — each claim becomes its own chunk with metadata linking to parent claims
  2. Preserve section context — every chunk includes section type metadata for filtering
  3. Overlapping windows for description — 512 tokens with 128-token overlap
  4. Citation extraction — all references extracted and stored in graph database

A simplified version of the chunker:

from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    text: str
    metadata: dict

class PatentChunker:
    def chunk_patent(self, document: "Patent") -> List[Chunk]:
        # "Patent" is the parsed document model from the structure parser
        chunks: List[Chunk] = []
        # Claims are sacred - never split
        for claim in document.claims:
            chunks.append(Chunk(
                text=claim.text,
                metadata={
                    "section": "claim",
                    "claim_type": claim.type,  # independent/dependent
                    "parent_claim": claim.depends_on,
                },
            ))
        # Description uses a sliding window (512 tokens, 128-token overlap)
        chunks.extend(self.sliding_window_chunk(
            document.description,
            window_size=512,
            overlap=128,
        ))
        return chunks
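
For completeness, a minimal version of the sliding_window_chunk helper called above, using whitespace tokens to stay dependency-free; the production version counted tokens with the embedding model's own tokenizer:

class PatentChunker:  # continued from above
    def sliding_window_chunk(self, text: str, window_size: int = 512,
                             overlap: int = 128) -> List[Chunk]:
        tokens = text.split()
        if not tokens:
            return []
        step = window_size - overlap  # stride between window starts
        chunks = []
        for start in range(0, max(len(tokens) - overlap, 1), step):
            chunks.append(Chunk(
                text=" ".join(tokens[start:start + window_size]),
                metadata={"section": "description", "token_start": start},
            ))
        return chunks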

Custom Embedding Model Training

Standard embedding models (OpenAI, Cohere) struggled with legal terminology. We fine-tuned a model using contrastive learning on patent pairs.

Training Data:

  • 500K patent documents
  • 50K manually labeled similar/dissimilar pairs from patent examiners
  • Synthetic pairs from citation networks (cited patents are similar)
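
A sketch of the contrastive setup using sentence-transformers; the base checkpoint, batch size, and the similar_pairs source below are placeholders rather than the production values:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# similar_pairs: (anchor, positive) tuples from examiner labels and
# citation networks -- a placeholder for the real training corpus
train_examples = [InputExample(texts=[a, p]) for a, p in similar_pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch negatives: every other example in the batch acts as a negative
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=1000)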

Key Improvements:

  • Domain vocabulary: “anticipation” and “novelty rejection” are related
  • Technical precision: Chemical formulas and patent numbers embedded correctly
  • Citation context: “See US 9,123,456” correctly links to the referenced patent

The fine-tuned model improved retrieval accuracy from 67% to 85% on our test set.

$ echo $RESULTS

50% Faster Search with Higher Relevance

  • 50% reduction in search time: from minutes to seconds for complex queries
  • 85% retrieval accuracy: measured on held-out annotations from legal experts
  • 10,000+ daily queries processed: an enterprise-scale workload
  • 99.9% system uptime: with zero data loss incidents

Additional Outcomes:

  • Legal professionals reported finding relevant prior art in a single search that previously required multiple attempts
  • The citation graph feature was adopted as a core workflow by patent analysts
  • System became a key factor in the RightHub → Anaqua acquisition

$ cat LESSONS_LEARNED.md

Domain Expertise Beats Generic Models

Investing time to understand legal document structure paid dividends. The 2 weeks spent interviewing patent attorneys about their search patterns directly informed our chunking strategy.

Hybrid Search is Non-Negotiable for Enterprise

Pure semantic search shines in demos but frustrates power users who know exactly what they're looking for. Hybrid search satisfies both exploratory and precision use cases.

Evaluation Requires Domain Experts

Standard IR metrics like NDCG weren't enough. We created a custom evaluation set with legal experts rating relevance, which revealed issues that automated metrics missed.

$ cat README.md

When I joined RightHub (later acquired by Anaqua), the company had a clear vision: bring AI-powered search to intellectual property management. The challenge? Legal documents are unlike any other content.

A patent attorney searching for prior art needs:

  • Semantic understanding — finding relevant patents even when different terminology is used
  • Structural precision — understanding that Claim 1 defines the core invention
  • Citation awareness — knowing that Patent A citing Patent B means they’re related
  • Speed — results in seconds, not minutes

Generic RAG solutions failed on all counts.

Our Approach: Domain-First Design

Rather than forcing legal documents into a generic RAG framework, we designed the system around how patent professionals actually work.

Understanding the Domain

I spent the first two weeks interviewing patent attorneys, watching them search, and understanding their mental models. Key insights:

  1. Claims are everything — The legal scope of a patent is defined entirely by its claims
  2. Citation networks are gold — Experienced searchers follow citation chains to find related art
  3. Exact matching still matters — When you know the patent number, you want exact results
  4. Context is critical — A claim only makes sense in the context of its dependent claims

Technical Deep-Dive: The Retrieval Pipeline

Our final architecture processes queries through multiple stages:

Stage 1: Query Understanding

  • Entity extraction for patent numbers, company names, technical terms
  • Query expansion using domain synonyms (e.g., “mobile device” → “smartphone”, “cellular phone”, “handheld device”)
  • Intent classification (prior art search vs. freedom-to-operate vs. general research)
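
A deliberately small illustration of this stage; a real service needs proper entity recognition and a curated synonym resource, so this only shows the shape of the output:

import re

# Captures forms like "US 9,123,456" or "US9123456"
PATENT_NO = re.compile(r"\bUS[\s-]?(\d{1,2},?\d{3},?\d{3})\b", re.IGNORECASE)

# Hypothetical hand-rolled synonym table
SYNONYMS = {
    "mobile device": ["smartphone", "cellular phone", "handheld device"],
}

def understand_query(query: str) -> dict:
    return {
        "raw": query,
        # Exact identifiers are routed to keyword search for exact matching
        "patent_numbers": PATENT_NO.findall(query),
        # Expansions are appended to the semantic query for recall
        "expanded_terms": [
            syn
            for term, syns in SYNONYMS.items()
            if term in query.lower()
            for syn in syns
        ],
    }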

Stage 2: Hybrid Retrieval

  • Vector search using PGVector with custom legal embeddings
  • BM25 keyword search for exact matching
  • Citation graph traversal for related documents
  • Score fusion to combine results
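
The fusion step has to merge ranked lists whose raw scores are not comparable (cosine similarity, keyword rank, citation decay). Reciprocal Rank Fusion is one common, scale-free way to do that; a minimal sketch:

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each list contributes 1 / (k + rank); documents found by several
    # retrievers accumulate score and rise to the top
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = rrf_fuse([vector_hits, keyword_hits, citation_hits])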

Stage 3: Reranking

  • Cross-encoder model for precise relevance scoring
  • Diversity injection to avoid redundant results
  • Explanation generation for why each result matched
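
A minimal reranking pass; the checkpoint name is a public placeholder, not the domain-tuned cross-encoder described above:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 20) -> list[str]:
    # The cross-encoder scores each (query, document) pair jointly, which
    # is slower than bi-encoder retrieval but much more precise
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]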

Results That Mattered

The system’s success was measured not just in technical metrics, but in user adoption:

  • Power users reported finding relevant prior art in a single search that previously required 3-4 attempts
  • Time savings translated to real cost reduction — patent searches that took hours now took minutes
  • Confidence — attorneys trusted the AI results enough to cite them in legal filings

This success was a key factor in Anaqua’s decision to acquire RightHub.

Key Takeaways for RAG Practitioners

  1. Invest in domain understanding before writing code — The chunking strategy that emerged from user interviews was nothing like what I’d have designed in isolation

  2. Hybrid search is essential for enterprise — Don’t get seduced by pure semantic search. Power users need exact matching.

  3. Build evaluation datasets with domain experts — Standard benchmarks won’t tell you if your legal RAG is working

  4. Citation/reference networks are underutilized — If your domain has documents that reference each other, leverage those relationships

  5. Fine-tuned embeddings are worth the effort — The jump from 67% to 85% accuracy justified the investment in custom training


Want to discuss building a RAG system for your domain? Let’s talk.


Experience: Senior Backend Engineer & AI Lead at Anaqua

Technologies: LangChain, RAG Systems, PGVector, FastAPI, Python, PostgreSQL

Related Case Studies: Multi-LLM Orchestration | Agentic AI Knowledge Systems

Building an Enterprise RAG System?

Let's discuss how I can help solve your engineering challenges.