Kubernetes Cluster Setup for AI Workloads
A practical cluster setup guide for running AI services alongside traditional backends with reliable data, messaging, and observability.
When I first deployed an AI workload to Kubernetes, I watched it get OOMKilled within seconds because the embedding model consumed 4GB of memory while the pod had a 512Mi limit. That crash course in resource management for AI workloads — debugging why the pod kept restarting, learning about requests vs limits, and finally getting a stable deployment — shaped everything in this guide.
Running AI workloads on Kubernetes isn’t just about deploying a model behind an API. You need GPU scheduling, large model storage, health probes for inference services, and namespace isolation between AI and transactional workloads. This guide covers the practical decisions for setting up a cluster that handles both.
[Managing Resources for Containers] — Kubernetes Documentation

Namespace Topology
Organize workloads by concern, not by team:
| Namespace | Workloads |
|---|---|
| `data-layer` | PostgreSQL, MinIO, Qdrant, FalkorDB |
| `messaging` | NATS JetStream |
| `ai` | Ollama, Docling, spaCy |
| `apps-staging` | API, workers, web (staging) |
| `apps-production` | API, workers, web (production) |
| `monitoring` | SigNoz, Grafana, Prometheus |
| `flux-system` | Flux CD controllers |
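Each namespace is a plain manifest, and labels are what make the boundaries enforceable. A minimal sketch, assuming a `purpose` label convention (the label name is an assumption of this guide; the NetworkPolicy example later selects on it):

```yaml
# Sketch: namespaces labeled by concern. The "purpose" label is a
# convention assumed here; the NetworkPolicy example later selects on it.
apiVersion: v1
kind: Namespace
metadata:
  name: ai
  labels:
    purpose: ai
---
apiVersion: v1
kind: Namespace
metadata:
  name: apps-production
  labels:
    purpose: apps
---
apiVersion: v1
kind: Namespace
metadata:
  name: apps-staging
  labels:
    purpose: apps
```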
[Kubernetes Patterns: Reusable Elements for Designing Cloud-Native Applications] — Bilgin Ibryam & Roland Huss
AI Service Requirements
Ollama (LLM Inference)
Ollama needs persistent storage for model files (7-30 GB per model):
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  namespace: ai
spec:
  serviceName: ollama  # requires a matching headless Service
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest  # pin a version tag in production
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: models
              mountPath: /root/.ollama  # model files persist here
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
            limits:
              memory: 16Gi  # no CPU limit, so inference isn't throttled
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 30
  volumeClaimTemplates:
    - metadata:
        name: models
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
```
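Models live on the PVC, not in the image, so they survive pod restarts. One way to seed them is to exec into the running pod; a sketch, assuming the default pod name `ollama-0` and an illustrative model tag:

```bash
# Pull a model into the PVC-backed /root/.ollama (model tag is illustrative)
kubectl exec -n ai ollama-0 -- ollama pull llama3.1:8b

# Confirm the model is available; the readiness probe's /api/tags reflects the same list
kubectl exec -n ai ollama-0 -- ollama list
```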
Health Probes for AI Services
AI services have longer startup times and can become unresponsive under load. Configure probes accordingly:
| Service | Readiness Probe | Liveness Probe | Initial Delay |
|---|---|---|---|
| Ollama | GET /api/tags | GET /api/tags | 30s |
| Docling | GET /health | GET /health | 60s |
| spaCy | GET /health | GET /health | 20s |
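Translated into manifest form, the Docling row might look like the sketch below. The `/health` path and 60s delay come from the table; port 8080 matches the NetworkPolicy example later in this guide, and the period and threshold values are assumptions to tune:

```yaml
# Probes for Docling per the table above; periods and thresholds are
# starting points, not measured values.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 15
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 30
  failureThreshold: 3
```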
Data Layer
Vector Store (Qdrant)
Qdrant stores embeddings and needs persistent storage. Size the storage based on your collection count and vector dimensions:
Storage ≈ vectors × dimensions × 4 bytes × 1.5 (index overhead)

For 100K documents with 8 embedding models averaging 800 dimensions:

100,000 × 800 × 4 × 1.5 × 8 ≈ 3.8 GB
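That estimate fits comfortably in a modest volume. A sketch of the claim, with 10Gi as an illustrative size that leaves headroom above the 3.8 GB estimate:

```yaml
# PVC sized from the estimate above; 10Gi is illustrative headroom,
# not a hard requirement.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qdrant-storage
  namespace: data-layer
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```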
[Device Plugins] — Kubernetes Documentation
Graph Database (FalkorDB)
FalkorDB runs as a Redis-compatible server with graph extensions. It needs minimal resources for moderate graph sizes (under 1M nodes):
```yaml
resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    memory: 2Gi
```
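Because FalkorDB speaks the Redis protocol, a plain TCP check on the Redis port is enough to catch a wedged process. A minimal sketch, assuming the default port 6379:

```yaml
# Minimal liveness check; 6379 is the Redis default, adjust if remapped
livenessProbe:
  tcpSocket:
    port: 6379
  initialDelaySeconds: 10
  periodSeconds: 20
```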
GitOps with Flux
Source of Truth
The infra repository contains all Kubernetes manifests. Flux watches for changes and reconciles:
```
clusters/
  production/
    flux-system/
      gotk-components.yaml
      gotk-sync.yaml
      apps.yaml            # Kustomization pointing to apps/
      infrastructure.yaml  # Kustomization pointing to platform/
```
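For reference, `apps.yaml` is a Flux Kustomization object that points back at the same Git source the cluster was bootstrapped from. A plausible sketch (the interval and path are assumptions for this layout):

```yaml
# Flux Kustomization reconciling apps/ from the bootstrap GitRepository
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps
  prune: true  # delete cluster objects removed from Git
  sourceRef:
    kind: GitRepository
    name: flux-system
```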
Force Reconciliation
When you need a change applied immediately:
```bash
flux reconcile source git flux-system --timeout=2m
flux reconcile kustomization apps --timeout=5m
```
Service Discovery
Always use fully qualified domain names (FQDNs) for cross-namespace communication:
```
postgres-rw.data-layer.svc.cluster.local:5432
ollama.ai.svc.cluster.local:11434
qdrant.data-layer.svc.cluster.local:6334
nats.messaging.svc.cluster.local:4222
```
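In practice these FQDNs are injected as environment variables rather than hard-coded in application source. A sketch, where `OLLAMA_URL` is a made-up variable name your application would read:

```yaml
# Container env snippet; OLLAMA_URL is an illustrative name, not a
# variable any of these services define.
env:
  - name: OLLAMA_URL
    value: "http://ollama.ai.svc.cluster.local:11434"
```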
Network Policies
Restrict traffic to the minimum necessary:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-apps-to-ai
  namespace: ai
spec:
  podSelector: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              purpose: apps
      ports:
        - port: 11434 # Ollama
        - port: 8080  # Docling
```
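An allow-list like this only bites if everything else is denied, so it is usually paired with a default-deny ingress policy in the same namespace:

```yaml
# Default-deny ingress for the ai namespace; allow-apps-to-ai above
# then re-opens only the Ollama and Docling ports.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: ai
spec:
  podSelector: {}
  policyTypes:
    - Ingress
```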
[Network Policies] — Kubernetes Documentation
Conclusion
Setting up Kubernetes for AI workloads taught me that the gap between “it works on my machine” and “it runs reliably in a cluster” is wider than I expected. Every section of this guide came from a real outage or a late-night debugging session — from OOMKilled pods to GPU contention to cross-namespace DNS failures. The investment in proper namespace topology, resource tuning, and GitOps pays off every time you need to add a new model or scale an existing service.
Key Takeaways
- Organize namespaces by concern (data, messaging, ai, apps), not by team
- AI services need persistent storage, generous health probe timeouts, and memory limits
- Always use FQDNs for cross-namespace service discovery
- GitOps keeps your cluster state reproducible and auditable
- Size vector store storage as vectors × dimensions × 4 bytes × 1.5
Next Steps
- Shipping a New AI Service with GitOps — How to deploy a new service into this cluster using Kustomize overlays and ExternalSecrets
- Integrating Docling OCR in a .NET Document Pipeline — Building the OCR service that runs in the `ai` namespace