Resource Limits — Production Local AI Deployment (Chapter 6)

Container resource limits prevent individual services from monopolizing shared infrastructure. Kubernetes and Docker let you declare limits per container, and the Linux kernel enforces them on the host through cgroups. Understanding how these limits interact helps avoid common failure modes such as OOM kills and CPU throttling.

A memory limit is a hard ceiling: a container that exceeds its allocated memory is terminated by the kernel's OOM killer. When the node itself runs short of memory, containers without memory requests or limits are the first candidates, because Kubernetes classifies them as BestEffort and assigns them the highest OOM score. Setting limits therefore improves predictability.

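Memory requests drive scheduling. The scheduler places a pod only on a node whose allocatable memory, minus the requests of pods already running there, covers the pod's request. Because the comparison uses requests rather than actual usage, a request does not by itself guarantee that the memory is free at runtime. Requests and limits need not match: the request expresses expected usage, while the limit sets maximum consumption.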

CPU limits govern throttling. A container that uses up its CPU quota within a scheduling period is paused until the next period begins. Single-threaded inference services gain little from a CPU limit above one core, but a limit set too low causes throttling during batch processing. CPU units matter: 100m (100 millicores) represents 0.1 of a CPU core, and 1 (or 1000m) represents a full core.

resources:
  requests:
    memory: "4Gi"
    cpu: "500m"
  limits:
    memory: "8Gi"
    cpu: "2000m"
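
To make the unit arithmetic concrete, the fragment below annotates a CPU limit with the quota it translates to. It is a sketch that assumes the default 100 ms CFS period and a hypothetical single-threaded container; the values are illustrative, not recommendations.

```yaml
# Hypothetical resources fragment for a single-threaded inference container.
# Assumes the default CFS period of 100 ms: quota = limit x period.
resources:
  requests:
    memory: "4Gi"
    cpu: "500m"     # 0.5 core reserved for scheduling purposes
  limits:
    memory: "8Gi"
    cpu: "1000m"    # 1.0 core -> 100 ms of CPU time per 100 ms period
                    # (a 500m limit would allow only 50 ms per period)
```

With a limit of one full core, a single thread can run continuously without being throttled. A higher limit only helps if the process actually runs more than one thread.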

Ephemeral storage requests and limits cover node-local temporary storage: scratch space, container logs, and emptyDir volumes. The kubelet reports the DiskPressure node condition when disk usage on the node crosses its eviction thresholds. Separately, a pod that exceeds its ephemeral storage limit is evicted.
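
As an illustration, ephemeral storage is declared alongside CPU and memory under the `ephemeral-storage` resource name. The pod name, mount path, and sizes below are assumptions chosen for the example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-scratch            # hypothetical name
spec:
  containers:
    - name: inference
      image: inference/model-server:v1.4.2
      resources:
        requests:
          ephemeral-storage: "2Gi"   # used by the scheduler for placement
        limits:
          ephemeral-storage: "4Gi"   # exceeding this makes the pod an eviction candidate
      volumeMounts:
        - name: scratch
          mountPath: /scratch        # hypothetical mount path
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: "1Gi"             # cap on this volume alone
```

Disk-backed emptyDir usage counts toward the pod's ephemeral storage, so the volume's `sizeLimit` and the container limit work together.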

GPU memory limits require platform-specific handling. Kubernetes device plugins advertise GPUs as extended resources counted in whole devices (for example, nvidia.com/gpu), not as an amount of GPU memory. A container that requests a GPU receives the device, but it must manage GPU memory itself through CUDA memory management or framework-level pooling.
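
A minimal sketch of a GPU request follows. It assumes the NVIDIA device plugin is installed on the cluster, which exposes the `nvidia.com/gpu` resource name; other vendors use different names. The pod name is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-gpu                # hypothetical name
spec:
  containers:
    - name: inference
      image: inference/model-server:v1.4.2
      resources:
        limits:
          nvidia.com/gpu: 1          # whole devices only; no GPU-memory quantity here
          memory: "8Gi"              # host memory, unrelated to GPU memory
```

Nothing in this spec caps how much GPU memory the process uses. That bound has to come from the application or framework running inside the container.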

Resource quotas limit aggregate consumption per namespace. Limits ensure fairness across teams, while requests guarantee service availability. A quota tracks requests and limits separately, so different policies can apply to scheduling (requests) and to runtime enforcement (limits).
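
The separate accounting can be seen in a ResourceQuota object, where requests and limits each have their own entries. The namespace, object name, and amounts below are assumptions for illustration.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: inference-team-quota   # hypothetical name
  namespace: inference         # hypothetical namespace
spec:
  hard:
    requests.cpu: "16"         # sum of CPU requests across the namespace
    requests.memory: "64Gi"    # sum of memory requests
    limits.cpu: "32"           # sum of CPU limits
    limits.memory: "128Gi"     # sum of memory limits
```

Once a quota like this is active, pods in the namespace that omit the corresponding requests or limits are rejected unless a LimitRange supplies defaults.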

The oom_score_adj kernel parameter influences which processes the OOM killer selects under memory pressure. Processes with lower scores are less likely to be killed during memory exhaustion. Kubernetes sets the value automatically from the pod's QoS class, which it derives from the pod's resource requests and limits.
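
One way to obtain the most protective setting is to make requests equal to limits for every container, which places the pod in the Guaranteed QoS class. The sketch below uses a hypothetical pod name. The score values in the comments reflect current Kubernetes behavior and may differ across versions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-guaranteed   # hypothetical name
spec:
  containers:
    - name: inference
      image: inference/model-server:v1.4.2
      resources:
        requests:
          memory: "6Gi"
          cpu: "2000m"
        limits:
          memory: "6Gi"        # equal to the request
          cpu: "2000m"         # equal to the request
# Resulting QoS class: Guaranteed (oom_score_adj of -997).
# A pod with no requests or limits is BestEffort (oom_score_adj of 1000).
```

The trade-off is lower packing density: the full limit is reserved on the node even when the service is idle.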

# Deployment with memory-proportional limits
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server
spec:
  selector:                    # required by apps/v1; the label is an assumption
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server  # must match the selector above
    spec:
      containers:
        - name: inference
          image: inference/model-server:v1.4.2
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "6Gi"    # Accommodates longer inputs with padding
              cpu: "2000m"
          env:
            - name: MAX_SEQUENCE_LENGTH
              value: "2048"
            - name: MAX_BATCH_SIZE
              value: "16"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 5