
Kubernetes Cluster Setup for AI Workloads

A practical cluster setup guide for running AI services alongside traditional backends with reliable data, messaging, and observability.

By Victor Robin

When I first deployed an AI workload to Kubernetes, I watched it get OOMKilled within seconds because the embedding model consumed 4GB of memory while the pod had a 512Mi limit. That crash course in resource management for AI workloads — debugging why the pod kept restarting, learning about requests vs limits, and finally getting a stable deployment — shaped everything in this guide.

Running AI workloads on Kubernetes isn’t just about deploying a model behind an API. You need GPU scheduling, large model storage, health probes for inference services, and namespace isolation between AI and transactional workloads. This guide covers the practical decisions for setting up a cluster that handles both.

[Managing Resources for Containers] — Kubernetes Documentation

Namespace Topology

Organize workloads by concern, not by team:

data-layer          PostgreSQL, MinIO, Qdrant, FalkorDB
messaging           NATS JetStream
ai                  Ollama, Docling, spaCy
apps-staging        API, Workers, Web (staging)
apps-production     API, Workers, Web (production)
monitoring          SigNoz, Grafana, Prometheus
flux-system         Flux CD controllers
[Kubernetes Patterns: Reusable Elements for Designing Cloud-Native Applications] — Bilgin Ibryam & Roland Huss
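
Namespaces in this layout can carry labels that network policies select on later in this guide. A minimal sketch; the `purpose` label values are an assumption of this setup, not a Kubernetes convention:

```yaml
# namespaces.yaml — one Namespace object per concern; labels are illustrative
apiVersion: v1
kind: Namespace
metadata:
  name: ai
  labels:
    purpose: ai
---
apiVersion: v1
kind: Namespace
metadata:
  name: apps-production
  labels:
    purpose: apps   # selected by NetworkPolicies that admit app traffic
```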

AI Service Requirements

Ollama (LLM Inference)

Ollama needs persistent storage for model files (7-30 GB per model):

ollama-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  namespace: ai
spec:
  serviceName: ollama        # required for StatefulSets; pairs with a headless Service
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama          # must match the selector above or the apply is rejected
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
            limits:
              memory: 16Gi
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 30
  volumeClaimTemplates:
    - metadata:
        name: models
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
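
A StatefulSet like the one above is typically paired with a Service so other namespaces can reach it. A sketch, assuming the `app: ollama` pod label:

```yaml
# ollama-service.yaml — exposes the inference port inside the cluster
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
```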
[NVIDIA GPU Operator Documentation] — NVIDIA
[Schedule GPUs] — Kubernetes Documentation
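
If the node pool exposes GPUs through the NVIDIA device plugin, the Ollama container can request one. A sketch, assuming the GPU Operator is installed on the cluster:

```yaml
resources:
  requests:
    cpu: "2"
    memory: 8Gi
  limits:
    memory: 16Gi
    nvidia.com/gpu: 1   # extended resources go in limits; requests default to match
```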

Health Probes for AI Services

AI services have longer startup times and can become unresponsive under load. Configure probes accordingly:

Service     Ready Probe       Live Probe        Initial Delay
Ollama      GET /api/tags     GET /api/tags     30s
Docling     GET /health       GET /health       60s
spaCy       GET /health       GET /health       20s
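
The table translates into probe blocks like the following. A sketch for Docling, assuming it serves /health on port 8080 (the same port the NetworkPolicy later in this guide admits):

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 30
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 60
  failureThreshold: 3   # restart only after ~3 minutes of consecutive failures
```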

Data Layer

Vector Store (Qdrant)

Qdrant stores embeddings and needs persistent storage. Size the storage based on your collection count and vector dimensions:

Storage = vectors x dimensions x 4 bytes x 1.5 (overhead)

For 100K documents with 8 embedding models averaging 800 dimensions:

100,000 x 800 x 4 x 1.5 x 8 = 3.84 GB ≈ 3.8 GB
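
The arithmetic is easy to sanity-check with shell integer math; the 1.5 overhead factor is written as 3/2 to stay in integers:

```shell
# vectors * dims * 4 bytes * models, then * 3 / 2 for the 1.5x overhead
echo $((100000 * 800 * 4 * 8 * 3 / 2))   # bytes, ~3.8 GB
```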
[Device Plugins] — Kubernetes Documentation

Graph Database (FalkorDB)

FalkorDB runs as a Redis-compatible server with graph extensions. It needs minimal resources for moderate graph sizes (under 1M nodes):

resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    memory: 2Gi
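
Because FalkorDB speaks the Redis protocol, a plain TCP readiness probe on the default Redis port is usually enough. A sketch, assuming the standard port 6379:

```yaml
readinessProbe:
  tcpSocket:
    port: 6379      # default Redis port; FalkorDB accepts connections here
  initialDelaySeconds: 10
  periodSeconds: 15
```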

GitOps with Flux

Source of Truth

The infra repository contains all Kubernetes manifests. Flux watches for changes and reconciles:

clusters/
  production/
    flux-system/
      gotk-components.yaml
      gotk-sync.yaml
    apps.yaml           # Kustomization pointing to apps/
    infrastructure.yaml # Kustomization pointing to platform/
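
The apps.yaml entry point might look like the following Flux Kustomization; the path and interval here are assumptions about this repository's layout:

```yaml
# apps.yaml — reconciles everything under apps/ on a fixed interval
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps
  prune: true          # delete cluster objects removed from Git
  sourceRef:
    kind: GitRepository
    name: flux-system
```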

Force Reconciliation

When you need a change applied immediately:

flux reconcile source git flux-system --timeout=2m
flux reconcile kustomization apps --timeout=5m

Service Discovery

Always use fully qualified domain names (FQDNs) for cross-namespace communication:

postgres-rw.data-layer.svc.cluster.local:5432
ollama.ai.svc.cluster.local:11434
qdrant.data-layer.svc.cluster.local:6334
nats.messaging.svc.cluster.local:4222
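
In practice these FQDNs end up in Deployment environment variables rather than hard-coded in application code. A sketch with hypothetical variable names:

```yaml
env:
  - name: DATABASE_HOST    # variable names are illustrative, not from this setup
    value: postgres-rw.data-layer.svc.cluster.local:5432
  - name: OLLAMA_HOST
    value: ollama.ai.svc.cluster.local:11434
```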

Network Policies

Restrict traffic to the minimum necessary:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-apps-to-ai
  namespace: ai
spec:
  podSelector: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              purpose: apps
      ports:
        - port: 11434  # Ollama
        - port: 8080   # Docling
[Network Policies] — Kubernetes Documentation
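
Allow rules like the one above only take effect as a restriction once a baseline deny exists in the namespace; without one, all traffic is permitted by default. A common default-deny sketch:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: ai
spec:
  podSelector: {}        # selects every pod in the namespace
  policyTypes:
    - Ingress            # no ingress rules listed, so all ingress is denied
```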

Conclusion

Setting up Kubernetes for AI workloads taught me that the gap between “it works on my machine” and “it runs reliably in a cluster” is wider than I expected. Every section of this guide came from a real outage or a late-night debugging session — from OOMKilled pods to GPU contention to cross-namespace DNS failures. The investment in proper namespace topology, resource tuning, and GitOps pays off every time you need to add a new model or scale an existing service.

Key Takeaways

  • Organize namespaces by concern (data, messaging, ai, apps) not by team
  • AI services need persistent storage, generous health probe timeouts, and memory limits
  • Always use FQDNs for cross-namespace service discovery
  • GitOps keeps your cluster state reproducible and auditable
  • Size vector store storage based on vectors x dimensions x 4 bytes x 1.5

Further Reading

[Flux CD Documentation] — Flux CD Project
[Kubernetes Scheduling Framework] — Kubernetes Documentation