RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /Enterprise-Scale RAG
  6. /Ch. 4
Enterprise-Scale RAG

04. Event Queue with Kafka

Chapter 4 of 24 · 15 min
KEY INSIGHT

Kafka provides durability and fan-out at the cost of operational complexity. Consumer lag monitoring, schema registry governance, and dead letter queue processing are not optional—they are load-bearing infrastructure.

Kafka is the backbone of production RAG ingestion pipelines. It provides durability (messages survive broker restarts), replay (consumers can re-read old messages), and fan-out (multiple consumers process the same event).

Design topics around document lifecycle stages: documents.parsed, chunks.embedded, vectors.indexed, documents.failed. Each topic represents a state transition.

# Kafka topic creation with retention for replay
kafka-topics.sh --create \
  --topic documents.parsed \
  --bootstrap-server kafka:9092 \
  --partitions 12 \
  --replication-factor 3 \
  --config retention.ms=604800000 \
  --config cleanup.policy=delete

Partitioning strategy determines throughput and ordering guarantees. Partition by document ID for in-order processing of the same document. Partition by tenant for multi-tenant isolation. Partition by priority for differentiated SLAs (legal documents processed before marketing materials).

Consumer groups enable horizontal scaling. Add consumers to a group and Kafka automatically distributes partitions. But beware: if a consumer holds locks on partitions while processing slowly, you create processing bottlenecks.

# Slow consumer problem
class EmbeddingConsumer:
    def consume(self, messages):
        for msg in messages:
            # This blocks the partition
            result = self.embed_single_chunk(msg.value)
            # If GPU is contended, this takes 5+ seconds
            # Kafka thinks this consumer is slow

Offset management is tricky. Auto-commit commits offsets after session.timeout.ms, which can lose messages if the consumer crashes between processing and commit. Manual offset management (commit after successful processing) provides at-least-once delivery but requires idempotent processing.

Dead letter queues catch poison messages. When a document fails parsing 3 times, send it to documents.dlq for manual review. Otherwise, a corrupted PDF can halt your entire ingestion pipeline.

Schema evolution needs governance. If you change the ChunkEvent schema to add a language field, old consumers fail deserializing new messages. Use Avro or Protobuf with backward-compatible schema changes.

Failure modes include consumer lag accumulation (ingestion outpaces processing), duplicate processing (consumer restarts and re-reads uncommitted offsets), partition imbalance (hot partitions receive 10x more messages), and broker disk exhaustion from uncommitted consumer groups.

EXERCISE

Design a Kafka consumer for the chunks.embedded topic that processes 10,000 embeddings per minute while maintaining ordering per-document and handling GPU failures gracefully.

← Chapter 3
Microservices Decomposition
Chapter 5 →
Document Ingestion Pipeline