Microservices Decomposition — Enterprise-Scale RAG (Chapter 3)

Monolithic RAG implementations fail at the seams where components have conflicting scaling needs. Your embedding service needs many CPU cores during business hours when documents upload in bulk, but your inference service needs GPU access whenever user queries arrive. Co-locating them means overprovisioning both.

Decompose into these services: Ingestion Service (handles document upload, parsing, chunking), Embedding Service (converts chunks to vectors), Index Service (manages vector database writes and index updates), Retrieval Service (handles search queries, access control filtering), Context Service (assembles LLM prompts with retrieved chunks).

Each service owns its data store. The Ingestion Service writes parsed documents to object storage. The Index Service writes vectors to the vector database. The Retrieval Service reads from both, but only through the interfaces their owners expose, and it never writes.

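# Service boundary violation example - DON'T DO THIS
class RetrievalService:
    def __init__(self, vector_index, document_store):
        self.vector_index = vector_index      # read access is expected
        self.document_store = document_store  # owned by another service

    def search(self, query: str, user_id: str):
        hits = self.vector_index.search(query, user_id=user_id)
        # Anti-pattern: retrieval service directly accessing
        # a document store it doesn't own.
        # Should only read through the owning service's interface.
        return [self.document_store.get(hit.doc_id) for hit in hits]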
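
For contrast, here is a minimal sketch of the same lookup routed through an interface that the owning service exposes. The names DocumentClient, InMemoryDocumentClient, fetch, and load are hypothetical and chosen only for illustration; a real client would make a network call rather than read from a dictionary.

```python
from typing import Protocol


class DocumentClient(Protocol):
    """Hypothetical read interface published by the service that owns documents."""

    def fetch(self, doc_id: str) -> dict: ...


class InMemoryDocumentClient:
    """Stand-in for a real network client; used here only for illustration."""

    def __init__(self, docs: dict[str, dict]) -> None:
        self._docs = docs

    def fetch(self, doc_id: str) -> dict:
        return self._docs[doc_id]


class RetrievalService:
    def __init__(self, documents: DocumentClient) -> None:
        # The service holds an interface, never a handle to the raw store.
        self._documents = documents

    def load(self, doc_ids: list[str]) -> list[dict]:
        return [self._documents.fetch(doc_id) for doc_id in doc_ids]
```

Because RetrievalService depends only on the interface, the owning service can change its storage layout without breaking retrieval.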

The async boundary is critical. Direct synchronous calls between services create cascading timeouts. When the embedding service takes 2 seconds to embed a chunk (under resource contention, for example), the ingestion service blocks for those 2 seconds before it can hand off the next chunk. With 1,000 concurrent uploads, work arrives faster than it completes, and the backlog spirals.
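
The backlog growth can be estimated with simple arithmetic. The sketch below assumes a steady arrival rate and a fixed pool of blocking workers. The 50 chunks per second and the 20 workers are illustrative assumptions; only the 2-second latency comes from the scenario above.

```python
def backlog_after(seconds: float, arrival_rate: float,
                  workers: int, latency_s: float) -> float:
    """Chunks still waiting after `seconds` when callers block on each embed.

    Each worker blocks for `latency_s` per chunk, so the pool completes
    workers / latency_s chunks per second.
    """
    service_rate = workers / latency_s
    return max(0.0, (arrival_rate - service_rate) * seconds)


# Assumed numbers: 50 chunks/s arriving, 20 blocking workers, 2 s per embed.
# Service rate is 20 / 2 = 10 chunks/s, so the queue grows by 40 chunks/s.
print(backlog_after(60, arrival_rate=50, workers=20, latency_s=2.0))  # 2400.0
```

After one minute the pool is already 2,400 chunks behind, and the gap keeps widening for as long as arrivals outpace the service rate.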

Use async task queues between services. Ingestion publishes chunk events; Embedding subscribes and publishes vector events; Index subscribes and writes. Each service operates at its own pace.
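
The publish/subscribe chain can be sketched with in-process queues. This is only an illustration: asyncio.Queue stands in for a real message broker, the sleep stands in for the embedding call, and the worker names and the one-dimensional "vector" are invented for the example.

```python
import asyncio


async def embedding_worker(chunks: asyncio.Queue, vectors: asyncio.Queue) -> None:
    """Consume chunk events and publish vector events."""
    while True:
        chunk = await chunks.get()
        if chunk is None:                 # sentinel: upstream is finished
            await vectors.put(None)
            return
        await asyncio.sleep(0.01)         # stands in for the slow embedding call
        await vectors.put((chunk, [float(len(chunk))]))  # fake 1-d vector


async def index_worker(vectors: asyncio.Queue, index: dict) -> None:
    """Consume vector events and write them to an in-memory index."""
    while True:
        event = await vectors.get()
        if event is None:
            return
        chunk, vector = event
        index[chunk] = vector


async def run_pipeline(chunk_texts: list[str]) -> dict:
    chunks, vectors, index = asyncio.Queue(), asyncio.Queue(), {}
    workers = [
        asyncio.create_task(embedding_worker(chunks, vectors)),
        asyncio.create_task(index_worker(vectors, index)),
    ]
    # Ingestion publishes and moves on; it never waits for an embedding.
    for text in chunk_texts:
        chunks.put_nowait(text)
    chunks.put_nowait(None)
    await asyncio.gather(*workers)
    return index


index = asyncio.run(run_pipeline(["alpha", "beta"]))
```

The publishing loop finishes before any embedding completes, which is the decoupling the queue provides: a slow consumer lengthens the queue instead of stalling the producer.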

Service discovery needs careful design. Hardcoding service URLs is the naive approach, and it breaks because embedding service pods in Kubernetes are routinely replaced and come back with new IP addresses. A service mesh (Istio, Linkerd) handles this, but adds operational complexity.
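
Short of a full mesh, one modest step is to address peers by a stable name rather than a pod IP. The sketch below is a hypothetical helper, not part of any library: it assumes a convention where an environment variable such as EMBEDDING_SERVICE_URL overrides a default built from the service name, and the port 8080 is likewise an assumption.

```python
import os


def service_url(name: str, default_port: int = 8080) -> str:
    """Resolve a peer's address from the environment, else a stable DNS name.

    Assumed convention: "embedding-service" maps to EMBEDDING_SERVICE_URL.
    """
    env_key = name.upper().replace("-", "_") + "_URL"
    return os.environ.get(env_key, f"http://{name}:{default_port}")
```

In Kubernetes, a Service object named embedding-service is resolvable by that name from pods in the same namespace, so the default keeps working when individual pods are replaced.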

The hardest problem is distributed transactions. When a document updates, you need to: delete old vectors, insert new vectors, update document metadata, invalidate cache entries, and publish a "document updated" event—all atomically. You cannot achieve true atomicity across services. Implement compensating transactions and idempotency keys instead.
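
A minimal sketch of the compensating-transaction idea, assuming each step is paired with an undo action and the whole update carries an idempotency key. UpdateSaga, Step, and the in-memory key set are hypothetical; a production version would persist the keys and would also have to handle an undo that itself fails.

```python
from typing import Callable

# (step name, action, compensating action)
Step = tuple[str, Callable[[], None], Callable[[], None]]


class UpdateSaga:
    def __init__(self) -> None:
        self._done_keys: set[str] = set()  # assumption: durable in production

    def run(self, idempotency_key: str, steps: list[Step]) -> bool:
        """Run steps in order; on failure, undo completed steps in reverse."""
        if idempotency_key in self._done_keys:
            return True                    # duplicate delivery: nothing to do
        completed: list[Step] = []
        for step in steps:
            _, action, _ = step
            try:
                action()
            except Exception:
                for _, _, undo in reversed(completed):
                    undo()
                return False
            completed.append(step)
        self._done_keys.add(idempotency_key)
        return True
```

A caller might key the saga on document ID plus version (for example "doc-1:v2"), so a redelivered "document updated" event becomes a no-op instead of a second round of deletes and inserts.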

Failure modes include orphaned vectors (the document was deleted but its vectors remain in the index), phantom documents (the document store has the document but vector search can't find it), and infinite retry loops, where a failing task is requeued indefinitely because nothing caps the retries or moves it aside.
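
These failure modes can be detected and bounded with small checks. The sketch below treats an orphaned vector as an index entry whose source document no longer exists and a phantom document as a stored document with no index entry. The function names, the "dead_letter" label, and the default cap of five attempts are assumptions made for illustration.

```python
def reconcile(doc_ids: set[str], vector_doc_ids: set[str]) -> dict[str, set[str]]:
    """Compare document IDs in the document store with those in the index."""
    return {
        "orphaned_vectors": vector_doc_ids - doc_ids,    # indexed, no document
        "phantom_documents": doc_ids - vector_doc_ids,   # stored, not indexed
    }


def next_action(attempts: int, max_attempts: int = 5) -> str:
    """Cap retries so a permanently failing task cannot loop forever."""
    return "retry" if attempts < max_attempts else "dead_letter"
```

Running a reconciliation like this on a schedule turns silent drift into a list of IDs that can be re-indexed or cleaned up.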