Event Queue with Kafka — Enterprise-Scale RAG (Chapter 4)

Kafka is the backbone of production RAG ingestion pipelines. It provides durability (messages survive broker restarts), replay (consumers can re-read old messages), and fan-out (multiple consumers process the same event).

Design topics around document lifecycle stages: documents.parsed, chunks.embedded, vectors.indexed, documents.failed. Each topic represents a state transition.
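One way to keep the stage-to-topic mapping explicit is a small lookup table shared by producers. The sketch below is illustrative only: the Stage enum, the NEXT_STAGE table, and the topic_for helper are hypothetical names, and only the four topic names come from the text above.

```python
from enum import Enum


class Stage(Enum):
    """Lifecycle stages; each value is one of the topic names listed above."""
    PARSED = "documents.parsed"
    EMBEDDED = "chunks.embedded"
    INDEXED = "vectors.indexed"
    FAILED = "documents.failed"


# Hypothetical transition table: the stage that follows each non-terminal stage.
NEXT_STAGE = {
    Stage.PARSED: Stage.EMBEDDED,
    Stage.EMBEDDED: Stage.INDEXED,
}


def topic_for(current: Stage, succeeded: bool) -> str:
    """Return the topic to publish to once work on a `current` event finishes."""
    if not succeeded:
        return Stage.FAILED.value
    nxt = NEXT_STAGE.get(current)
    if nxt is None:
        raise ValueError(f"{current.name} is a terminal stage")
    return nxt.value
```

For example, a worker that consumes documents.parsed and embeds successfully would publish to topic_for(Stage.PARSED, True), which is chunks.embedded.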

# Create the topic with 7-day retention (retention.ms=604800000) so consumers can replay
kafka-topics.sh --create \
  --topic documents.parsed \
  --bootstrap-server kafka:9092 \
  --partitions 12 \
  --replication-factor 3 \
  --config retention.ms=604800000 \
  --config cleanup.policy=delete

Partitioning strategy determines throughput and ordering guarantees. Kafka preserves order only within a partition, so the message key decides which events stay ordered relative to each other. Partition by document ID for in-order processing of the same document. Partition by tenant for multi-tenant isolation. Partition by priority for differentiated SLAs (legal documents processed before marketing materials).
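The effect of the key choice can be sketched in a few lines. This is a simplified model, not the client's real partitioner: the event fields (document_id, tenant_id, priority) are assumed names, and CRC32 stands in for the murmur2 hash that Kafka's Java client uses by default. Only the "same key, same partition" property is illustrated.

```python
import zlib


def partition_key(event: dict, strategy: str) -> str:
    """Pick the message key for an event; the field names are hypothetical."""
    if strategy == "document":
        return event["document_id"]
    if strategy == "tenant":
        return event["tenant_id"]
    if strategy == "priority":
        return event["priority"]
    raise ValueError(f"unknown strategy: {strategy}")


def partition_for(key: str, num_partitions: int) -> int:
    """Stand-in partitioner: stable hash of the key modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

Because the hash is stable, every event keyed by the same document ID lands in the same partition and is therefore consumed in order. A low-cardinality key such as priority maps to only a handful of partitions, which is one way hot partitions arise.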

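Consumer groups enable horizontal scaling. Add consumers to a group and Kafka automatically distributes partitions among them, up to one consumer per partition; consumers beyond the partition count sit idle. But beware: each partition is assigned to exactly one consumer in the group, so a consumer that processes slowly stalls every partition it owns and becomes a bottleneck.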

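# Slow consumer problem: one blocking embedding call per message
class EmbeddingConsumer:
    def embed_single_chunk(self, text):
        # Placeholder for the real embedding call (e.g. a GPU-backed model)
        raise NotImplementedError

    def consume(self, messages):
        results = []
        for msg in messages:
            # Each call blocks; no other message from this consumer's
            # partitions is processed until it returns
            results.append(self.embed_single_chunk(msg.value))
            # If the GPU is contended, a call can take 5+ seconds. When the
            # gap between polls exceeds max.poll.interval.ms, Kafka evicts
            # the consumer from the group and triggers a rebalance
        return results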
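
One common mitigation for the slow consumer above is to embed chunks in batches and to cap the time spent between polls, so the consumer gets back to polling before max.poll.interval.ms elapses. The following is a minimal sketch under stated assumptions: embed_batch is a hypothetical callable that embeds a list of chunks in one call, and the injectable clock exists only to make the behavior easy to test.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of up to `size` items, preserving order."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def embed_within_budget(
    chunks: Iterable[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int,
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[List[float]], int]:
    """Embed chunks batch by batch and stop once the time budget is spent.

    Returns (vectors, processed_count) so the caller can commit only the
    processed prefix and poll again before the group considers it dead.
    """
    deadline = clock() + budget_s
    vectors: List[List[float]] = []
    processed = 0
    for batch in batched(chunks, batch_size):
        if clock() >= deadline:
            break  # leave the remaining chunks for the next poll cycle
        vectors.extend(embed_batch(batch))
        processed += len(batch)
    return vectors, processed
```

The budget should sit well below max.poll.interval.ms. Chunks left unprocessed are not lost as long as their offsets are not committed.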

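Offset management is tricky. Auto-commit (enable.auto.commit=true) commits offsets periodically, every auto.commit.interval.ms, regardless of whether the polled messages have finished processing. It can therefore lose messages if an offset is committed before its message is fully processed (for example, when work is handed off to another thread) and the consumer then crashes. Manual offset management (commit after successful processing) provides at-least-once delivery but requires idempotent processing, because a crash between processing and commit causes the same messages to be redelivered.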
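
The commit-after-processing pattern can be sketched without a real client. Everything here is an assumption made for illustration: Message is a stand-in for a consumed record, handle and commit are hypothetical callables, and the in-memory seen set would need to be a durable store (a database table or the vector index itself) in production.

```python
from typing import Callable, Iterable, NamedTuple, Set


class Message(NamedTuple):
    """Hypothetical stand-in for a consumed record."""
    offset: int
    key: str    # doubles as the idempotency key, e.g. a chunk ID
    value: str


def process_at_least_once(
    messages: Iterable[Message],
    handle: Callable[[Message], None],
    commit: Callable[[int], None],
    seen: Set[str],
) -> None:
    """Process each message, then commit; skip keys that were already handled."""
    for msg in messages:
        if msg.key not in seen:
            handle(msg)         # may raise; the offset then stays uncommitted
            seen.add(msg.key)
        # Kafka convention: the committed offset is the next offset to read.
        commit(msg.offset + 1)
```

Committing after every message keeps the sketch simple but is expensive against a real broker; committing once per batch is the usual trade-off, at the cost of a larger redelivery window.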

Dead letter queues catch poison messages. When a document fails parsing 3 times, send it to documents.dlq for manual review. Otherwise, a single corrupted PDF can be retried indefinitely, blocking every message behind it in the same partition and eventually stalling the ingestion pipeline.
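
The retry-then-park rule might look like the sketch below. The parse and produce callables are hypothetical; produce(topic, payload) stands in for whatever producer the pipeline uses, and the payload layout is an assumption.

```python
from typing import Any, Callable, Dict


def handle_with_dlq(
    document: str,
    parse: Callable[[str], Any],
    produce: Callable[[str, Dict[str, Any]], None],
    max_attempts: int = 3,
    dlq_topic: str = "documents.dlq",
) -> bool:
    """Try to parse up to `max_attempts` times, then route to the DLQ.

    Returns True if parsing succeeded, False if the document was parked.
    """
    last_error: Exception = RuntimeError("no attempts were made")
    for _ in range(max_attempts):
        try:
            parse(document)
            return True
        except Exception as exc:  # a real pipeline would catch narrower errors
            last_error = exc
    produce(
        dlq_topic,
        {"document": document, "error": repr(last_error), "attempts": max_attempts},
    )
    return False
```

Returning instead of raising lets the caller commit the offset and move on, so the partition keeps flowing while the failed document waits for review.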

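Schema evolution needs governance. If you change the ChunkEvent schema to add a language field, old consumers with strict deserialization can fail on new messages, and new consumers can fail on old messages that lack the field. Use Avro or Protobuf, add new fields as optional or with defaults so that both directions keep working, and check compatibility (for example, with a schema registry) before deploying the new producer.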
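
The compatibility rule can be illustrated with a tolerant reader. This sketch uses JSON and a dataclass rather than Avro or Protobuf, purely to stay dependency-free; the chunk_id and text fields are assumed, and "und" (the ISO 639 code for an undetermined language) is an arbitrary default chosen for the example.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class ChunkEventV2:
    chunk_id: str
    text: str
    language: str = "und"  # new field; the default keeps old messages readable


def decode(raw: bytes) -> ChunkEventV2:
    """Decode a message, ignoring unknown fields and defaulting missing ones."""
    data = json.loads(raw)
    known = {f.name for f in fields(ChunkEventV2)}
    return ChunkEventV2(**{k: v for k, v in data.items() if k in known})
```

Defaults let new readers handle old messages, and ignoring unknown fields lets a reader survive messages written with a later schema. Avro schema resolution and Protobuf unknown-field handling provide the same two properties with enforcement.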

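Failure modes include consumer lag accumulation (ingestion outpaces processing), duplicate processing (a consumer restarts and re-reads messages whose offsets were never committed), partition imbalance (hot partitions receive 10x more messages), and broker disk exhaustion (retention is governed by retention.ms and retention.bytes, not by consumer commits, so a high-volume topic with long retention can fill the disk).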
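
Two of these failure modes reduce to simple arithmetic that is worth alerting on. The helpers below are a sketch with hypothetical names: they take plain dicts keyed by partition number, which a monitoring job would populate from the broker. Treating a missing committed offset as 0 is a simplifying assumption.

```python
from typing import Dict


def consumer_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}


def imbalance_ratio(message_counts: Dict[int, int]) -> float:
    """Busiest partition's message count divided by the mean across partitions."""
    counts = list(message_counts.values())
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0
```

Lag that grows steadily across all partitions points to undersized consumers, while lag concentrated in one partition together with a high imbalance ratio points to a hot key.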