
Kubernetes Cluster Setup for AI Workloads

A practical cluster setup guide for running AI services alongside traditional backends with reliable data, messaging, and observability.

By Victor Robin

When I first deployed an AI workload to Kubernetes, I watched it get OOMKilled within seconds because the embedding model consumed 4GB of memory while the pod had a 512Mi limit. That crash course in resource management for AI workloads — debugging why the pod kept restarting, learning about requests vs limits, and finally getting a stable deployment — shaped everything in this guide.

Running AI workloads on Kubernetes isn’t just about deploying a model behind an API. You need GPU scheduling, large model storage, health probes for inference services, and namespace isolation between AI and transactional workloads. This guide covers the practical decisions for setting up a cluster that handles both.

[Managing Resources for Containers] — Kubernetes Documentation

Namespace Topology

Organize workloads by concern, not by team:

data-layer          PostgreSQL, MinIO, Qdrant, FalkorDB
messaging           NATS JetStream
ai                  Ollama, Docling, spaCy
apps-staging        API, Workers, Web (staging)
apps-production     API, Workers, Web (production)
monitoring          SigNoz, Grafana, Prometheus
flux-system         Flux CD controllers
[Kubernetes Patterns: Reusable Elements for Designing Cloud-Native Applications] — Bilgin Ibryam & Roland Huss
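
Namespaces in this layout can carry labels that network policies select on later in this guide. A minimal sketch; the `purpose` label values are an assumption of this setup, not a Kubernetes convention:

```yaml
# namespaces.yaml — one Namespace object per concern; labels are illustrative
apiVersion: v1
kind: Namespace
metadata:
  name: ai
  labels:
    purpose: ai
---
apiVersion: v1
kind: Namespace
metadata:
  name: apps-production
  labels:
    purpose: apps   # selected by NetworkPolicies that admit app traffic
```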

AI Service Requirements

Ollama (LLM Inference)

Ollama needs persistent storage for model files (7-30 GB per model):

ollama-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  namespace: ai
spec:
  serviceName: ollama        # required for StatefulSets; pairs with a headless Service
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama          # must match the selector above or the apply is rejected
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
            limits:
              memory: 16Gi
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 30
  volumeClaimTemplates:
    - metadata:
        name: models
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
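
A StatefulSet like the one above is typically paired with a Service so other namespaces can reach it. A sketch, assuming the `app: ollama` pod label:

```yaml
# ollama-service.yaml — exposes the inference port inside the cluster
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
```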
[NVIDIA GPU Operator Documentation] — NVIDIA
[Schedule GPUs] — Kubernetes Documentation
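
If the node pool exposes GPUs through the NVIDIA device plugin, the Ollama container can request one. A sketch, assuming the GPU Operator is installed on the cluster:

```yaml
resources:
  requests:
    cpu: "2"
    memory: 8Gi
  limits:
    memory: 16Gi
    nvidia.com/gpu: 1   # extended resources go in limits; requests default to match
```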

Health Probes for AI Services

AI services have longer startup times and can become unresponsive under load. Configure probes accordingly:

Service     Ready Probe       Live Probe        Initial Delay
Ollama      GET /api/tags     GET /api/tags     30s
Docling     GET /health       GET /health       60s
spaCy       GET /health       GET /health       20s
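
The table translates into probe blocks like the following. A sketch for Docling, assuming it serves /health on port 8080 (the same port the NetworkPolicy later in this guide admits):

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 30
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 60
  failureThreshold: 3   # restart only after ~3 minutes of consecutive failures
```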

Data Layer

Vector Store (Qdrant)

Qdrant stores embeddings and needs persistent storage. Size the storage based on your collection count and vector dimensions:

Storage = vectors x dimensions x 4 bytes x 1.5 (overhead)

For 100K documents with 8 embedding models averaging 800 dimensions:

100,000 x 800 x 4 x 1.5 x 8 = 3.84 GB ≈ 3.8 GB
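
The arithmetic is easy to sanity-check with shell integer math; the 1.5 overhead factor is written as 3/2 to stay in integers:

```shell
# vectors * dims * 4 bytes * models, then * 3 / 2 for the 1.5x overhead
echo $((100000 * 800 * 4 * 8 * 3 / 2))   # bytes, ~3.8 GB
```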
[Device Plugins] — Kubernetes Documentation

Graph Database (FalkorDB)

FalkorDB runs as a Redis-compatible server with graph extensions. It needs minimal resources for moderate graph sizes (under 1M nodes):

resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    memory: 2Gi
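
Because FalkorDB speaks the Redis protocol, a plain TCP readiness probe on the default Redis port is usually enough. A sketch, assuming the standard port 6379:

```yaml
readinessProbe:
  tcpSocket:
    port: 6379      # default Redis port; FalkorDB accepts connections here
  initialDelaySeconds: 10
  periodSeconds: 15
```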

GitOps with Flux

Source of Truth

The infra repository contains all Kubernetes manifests. Flux watches for changes and reconciles:

clusters/
  production/
    flux-system/
      gotk-components.yaml
      gotk-sync.yaml
    apps.yaml           # Kustomization pointing to apps/
    infrastructure.yaml # Kustomization pointing to platform/
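
The apps.yaml entry point might look like the following Flux Kustomization; the path and interval here are assumptions about this repository's layout:

```yaml
# apps.yaml — reconciles everything under apps/ on a fixed interval
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps
  prune: true          # delete cluster objects removed from Git
  sourceRef:
    kind: GitRepository
    name: flux-system
```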

Force Reconciliation

When you need a change applied immediately:

flux reconcile source git flux-system --timeout=2m
flux reconcile kustomization apps --timeout=5m

Service Discovery

Always use fully qualified domain names (FQDNs) for cross-namespace communication:

postgres-rw.data-layer.svc.cluster.local:5432
ollama.ai.svc.cluster.local:11434
qdrant.data-layer.svc.cluster.local:6334
nats.messaging.svc.cluster.local:4222
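
In practice these FQDNs end up in Deployment environment variables rather than hard-coded in application code. A sketch with hypothetical variable names:

```yaml
env:
  - name: DATABASE_HOST    # variable names are illustrative, not from this setup
    value: postgres-rw.data-layer.svc.cluster.local:5432
  - name: OLLAMA_HOST
    value: ollama.ai.svc.cluster.local:11434
```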

Network Policies

Restrict traffic to the minimum necessary:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-apps-to-ai
  namespace: ai
spec:
  podSelector: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              purpose: apps
      ports:
        - port: 11434  # Ollama
        - port: 8080   # Docling
[Network Policies] — Kubernetes Documentation
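
Allow rules like the one above only take effect as a restriction once a baseline deny exists in the namespace; without one, all traffic is permitted by default. A common default-deny sketch:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: ai
spec:
  podSelector: {}        # selects every pod in the namespace
  policyTypes:
    - Ingress            # no ingress rules listed, so all ingress is denied
```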

Conclusion

Setting up Kubernetes for AI workloads taught me that the gap between “it works on my machine” and “it runs reliably in a cluster” is wider than I expected. Every section of this guide came from a real outage or a late-night debugging session — from OOMKilled pods to GPU contention to cross-namespace DNS failures. The investment in proper namespace topology, resource tuning, and GitOps pays off every time you need to add a new model or scale an existing service.

Key Takeaways

  • Organize namespaces by concern (data, messaging, ai, apps) not by team
  • AI services need persistent storage, generous health probe timeouts, and memory limits
  • Always use FQDNs for cross-namespace service discovery
  • GitOps keeps your cluster state reproducible and auditable
  • Size vector store storage based on vectors x dimensions x 4 bytes x 1.5

Further Reading

[Flux CD Documentation] — Flux CD Project
[Kubernetes Scheduling Framework] — Kubernetes Documentation