Kubernetes, GitOps, and Slurm on Kubernetes
How to use Kubernetes, Argo CD, and Slurm-on-Kubernetes patterns for rollout safety, drift control, GPU workload scheduling, and platform evidence without hiding failure domains.
Context
This resource is a public field note, not a private runbook. It explains the operating model behind Kubernetes, GitOps, Slurm, SUNK-style Slurm on Kubernetes, SR-IOV, and RDMA without exposing customer-specific manifests, topology diagrams, repo paths, or support-only recovery procedures.
The useful question is not simply "Kubernetes or Slurm?" It is what kind of workload is running, who owns scheduling and accounting, what network path carries the hot traffic, how storage is attached, how rollback works, and whether the evidence proves that the control planes agree with the physical hardware.
Kubernetes and GitOps Are the Operating Contract
Kubernetes is most valuable when it becomes the shared contract for platform operations. It gives teams a consistent way to describe workloads, attach storage, expose services, apply policy, observe health, and roll changes forward or back. Argo CD adds the source-of-truth loop: the cluster should be explainable from Git, not from a collection of manual repairs that disappear during the next sync.
That contract matters because AI infrastructure has many moving parts. Device plugins, drivers, container images, storage mounts, service routing, certificates, auth, probes, and resource policy can each be the reason a workload fails. A useful GitOps design makes those layers visible enough that operators can classify the problem before changing state.
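The source-of-truth loop described above can be pictured as a comparison between what Git declares and what the cluster reports. The sketch below is only an illustration of that idea, not how Argo CD computes its diff; the function name `find_drift`, the manifest fragments, and the image reference are hypothetical.

```python
from typing import Any, Dict, List


def find_drift(desired: Dict[str, Any], live: Dict[str, Any], path: str = "") -> List[str]:
    """Return dotted paths where the live object departs from the Git-declared one.

    Only fields declared in `desired` are compared, so server-populated
    fields such as `status` do not count as drift in this sketch.
    """
    drift: List[str] = []
    for key, want in desired.items():
        here = f"{path}.{key}" if path else key
        if key not in live:
            drift.append(f"{here}: missing in live state")
        elif isinstance(want, dict) and isinstance(live[key], dict):
            # Recurse into nested mappings and keep the dotted path for the report.
            drift.extend(find_drift(want, live[key], here))
        elif live[key] != want:
            drift.append(f"{here}: want {want!r}, got {live[key]!r}")
    return drift


# Hypothetical example: someone scaled the workload by hand and never committed it.
desired = {"spec": {"replicas": 3, "template": {"image": "registry.example/app:1.4.2"}}}
live = {
    "spec": {"replicas": 5, "template": {"image": "registry.example/app:1.4.2"}},
    "status": {"readyReplicas": 5},
}
drift_report = find_drift(desired, live)
print(drift_report)
```

A manual repair shows up here as a reported path. Until the change returns to Git, the next sync would restore the declared value.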
Slurm Still Matters for Training Workloads
Slurm is not old-fashioned just because Kubernetes exists. It is built around jobs, queues, accounting, reservations, partitions, topology-aware placement, and multi-node allocation. Those are first-class needs for HPC and large training workloads where users expect login nodes, `sbatch`, `srun`, `squeue`, and a scheduler that can explain why a job waited and where it landed.
The mistake is treating Slurm and Kubernetes as mutually exclusive religions. Many AI platforms need Kubernetes for lifecycle, observability, storage integration, policy, and services, while still needing Slurm semantics for training. Slurm on Kubernetes exists to make that boundary explicit instead of leaving two disconnected estates to fight over GPUs, logs, identity, storage, and node lifecycle.
The Data Plane Decides Whether the Architecture Is Real
The SR-IOV and RDMA diagrams are here because scheduled Pods are not the same thing as a working high-performance data path. Multi-node GPU jobs and storage-heavy pipelines may depend on specific NICs, GIDs, MTU, PFC or ECN behavior, network attachments, and GPU-to-NIC locality. If those details are wrong, the control plane can look green while the workload behaves like the infrastructure is broken.
This is especially important for Slurm-on-Kubernetes designs. Slurm may account for GPUs and topology, while Kubernetes exposes devices and network attachments through a different set of objects. The platform has to prove those views describe the same hardware reality, or users inherit the worst outcome: jobs that schedule cleanly but run on the wrong path.
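One way to picture "proving the views agree" is a per-node comparison of the GPU count each control plane reports. This is a minimal sketch under stated assumptions: the two input mappings are assumed to have been collected elsewhere (for example from Slurm node records and from Kubernetes Node allocatable resources), and the function name and node names are hypothetical.

```python
from typing import Dict


def reconcile_gpu_views(slurm_gres: Dict[str, int], k8s_allocatable: Dict[str, int]) -> Dict[str, str]:
    """Return a finding for every node where the two views disagree.

    Keys are node names; values are GPU counts as each control plane reports them.
    An empty result means the counts agree. It does not prove the data path works.
    """
    findings: Dict[str, str] = {}
    for node in sorted(set(slurm_gres) | set(k8s_allocatable)):
        slurm_count = slurm_gres.get(node)
        k8s_count = k8s_allocatable.get(node)
        if slurm_count is None:
            findings[node] = f"known to Kubernetes ({k8s_count} GPUs) but not to Slurm"
        elif k8s_count is None:
            findings[node] = f"known to Slurm ({slurm_count} GPUs) but not to Kubernetes"
        elif slurm_count != k8s_count:
            findings[node] = f"Slurm reports {slurm_count} GPUs, Kubernetes reports {k8s_count}"
    return findings


# Hypothetical inventory snapshots.
slurm_view = {"gpu-node-01": 8, "gpu-node-02": 8, "gpu-node-03": 8}
k8s_view = {"gpu-node-01": 8, "gpu-node-02": 7, "gpu-node-04": 8}
gpu_findings = reconcile_gpu_views(slurm_view, k8s_view)
for node, finding in gpu_findings.items():
    print(node, "->", finding)
```

Matching counts are necessary evidence, not sufficient evidence. NIC-to-GPU locality and the network path still need workload-level tests.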
Public Guidance, Protected Implementation
The public material should teach how to reason about the system: which control plane owns scheduling, how rollout and rollback work, where hot traffic flows, and what evidence proves the design. That is useful to readers without exposing reserved implementation detail.
The protected layer is where sensitive artifacts belong: private app-of-apps layouts, real cluster manifests, topology diagrams, customer-specific validation data, incident timelines, and support-only recovery steps. Public pages should build trust by showing the method. Protected pages can later show the exact implementation artifacts for approved readers.
Comparison
Slurm and SUNK Strengths
Slurm remains strong for model training and HPC-style work because it treats jobs, queues, accounting, topology, and multi-node allocation as first-class concerns. CoreWeave SUNK shows the next step: run Slurm inside Kubernetes so researchers keep Slurm job semantics while operators keep Kubernetes lifecycle, storage, observability, scaling, and GitOps controls.
| Dimension | Traditional Slurm | SUNK / Slurm on Kubernetes |
|---|---|---|
| Primary job | Schedule finite HPC and AI training jobs with queues, partitions, QoS, accounting, reservations, and topology-aware placement. | Expose the same Slurm job model while Kubernetes manages the lifecycle around Slurm control-plane, login, and compute Pods. |
| Best fit | Multi-node training, MPI-style work, batch pipelines, shared research clusters, and teams already fluent in `sbatch`, `srun`, and `squeue`. | AI platforms that need Slurm training semantics and Kubernetes-native operations, GitOps deployment, shared GPU pools, storage integration, and elastic platform workflows. |
| User experience | Users submit jobs through Slurm login nodes and reason about queues, allocations, partitions, accounts, and job state. | Users still get the Slurm experience, but the platform can run login and compute nodes as Kubernetes-managed components. |
| Operator strength | Deep control over job placement, accounting, reservations, and HPC scheduling policy. | Kubernetes lifecycle management, observability, storage attachment, GitOps, scaling hooks, and shared operations around Slurm-managed work. |
| Hardware and topology | Strong fit when GPU, NIC, CPU, memory, and fabric locality need to be part of scheduling policy. | Useful when Slurm topology expectations must be reconciled with Kubernetes device plugins, node labels, Pods, network attachments, and physical Nodes. |
| Main risk | Can become a separate estate from Kubernetes services, inference paths, platform policy, and cloud-native operations. | Can hide integration boundaries if teams forget that Slurm nodes, Kubernetes Pods, and physical Nodes are different objects with different failure modes. |
| Why it matters | Slurm explains why a job waited, where it ran, what resources it consumed, and how queue policy shaped the result. | SUNK-style designs let teams keep that Slurm value while using Kubernetes as the operational substrate for repeatable platform management. |
What to Understand
- Start with context safety: cluster, namespace, owning Application, workload owner, blast radius, and rollback path.
- Classify the failure before changing state: manifest generation, comparison, apply, health, permission, runtime, or dependency.
- Choose the scheduler by workload class: service, batch job, multi-node training run, interactive research workflow, or topology-sensitive GPU job.
- Verify the data plane, not only the control plane: GPU device exposure, network attachments, RDMA or SR-IOV path, storage, and observability.
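Choosing the scheduler by workload class can be written down as an explicit policy rather than left to habit. The sketch below is a hypothetical policy, not a recommendation from this note: the two boolean inputs and the decision order are assumptions chosen for illustration, and a real platform would record its own rules.

```python
from enum import Enum


class Scheduler(Enum):
    KUBERNETES = "Kubernetes"
    SLURM = "Slurm"
    SLURM_ON_KUBERNETES = "Slurm on Kubernetes"


# Workload classes taken from the list above.
WORKLOAD_CLASSES = {
    "service",
    "batch job",
    "multi-node training run",
    "interactive research workflow",
    "topology-sensitive GPU job",
}


def choose_scheduler(workload_class: str,
                     needs_slurm_semantics: bool,
                     shares_kubernetes_platform: bool) -> Scheduler:
    """Hypothetical placement policy: classify first, then pick an owner."""
    if workload_class not in WORKLOAD_CLASSES:
        # Refuse to place a workload nobody has classified.
        raise ValueError(f"unclassified workload: {workload_class!r}")
    if workload_class == "service" or not needs_slurm_semantics:
        return Scheduler.KUBERNETES
    if shares_kubernetes_platform:
        return Scheduler.SLURM_ON_KUBERNETES
    return Scheduler.SLURM


print(choose_scheduler("service", False, True).value)
print(choose_scheduler("multi-node training run", True, True).value)
```

Raising on an unknown class is deliberate: the placement decision is only useful as evidence if every workload has been classified first.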
Common Failure Modes
- Rollouts rely on manual fixes that never return to Git, so the next sync reintroduces the incident.
- Training workloads are forced into vanilla Kubernetes Jobs even though the team needs Slurm-style accounting, multi-node allocation, topology-aware placement, partitions, QoS, reservations, or familiar HPC workflows.
- Slurm is bolted beside Kubernetes with no contract for GPU ownership, node drains, logs, storage mounts, identity, queue priority, accounting, or who wins when both systems think they own capacity.
- Pods schedule successfully, but the SR-IOV or RDMA path is not actually wired, so the workload falls back to the wrong network, loses locality, or bypasses the fabric assumptions the job was designed around.
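The last failure mode can be caught early by checking whether a scheduled Pod actually received the secondary network attachments it was meant to have. The sketch below assumes a Multus-style setup that records attachments as a JSON list in the `k8s.v1.cni.cncf.io/network-status` Pod annotation, with a `name` field per entry. The annotation key can differ between Multus versions, and the attachment names shown are hypothetical.

```python
import json
from typing import Dict, Set

# Assumed annotation key; older Multus releases used a slightly different key.
NETWORK_STATUS_KEY = "k8s.v1.cni.cncf.io/network-status"


def missing_attachments(pod_annotations: Dict[str, str], expected: Set[str]) -> Set[str]:
    """Return expected network attachments that the Pod does not report."""
    raw = pod_annotations.get(NETWORK_STATUS_KEY, "[]")
    attached = {entry.get("name") for entry in json.loads(raw)}
    return expected - attached


# Hypothetical Pod that was scheduled but only received the default network.
annotations = {
    NETWORK_STATUS_KEY: json.dumps(
        [{"name": "cluster-default", "interface": "eth0", "default": True}]
    )
}
missing = missing_attachments(annotations, {"training/rdma-net"})
print(missing)
```

A reported attachment shows only that an interface was wired into the Pod. It does not show that the workload's traffic uses it, so NCCL, MPI, or storage test output is still needed.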
What Good Looks Like
- Every workload has an owning source, expected namespace, rollout strategy, health signal, and rollback path.
- Git, Kubernetes events, logs, metrics, and release notes tell the same story during an incident review.
- The platform states when to use Kubernetes directly, when to use Slurm, and when to use Slurm on Kubernetes: inference and services, finite training jobs, bursty batch work, shared GPU pools, or topology-sensitive multi-node runs.
- Slurm-on-Kubernetes deployments define the contract between Slurm nodes, Kubernetes Pods, physical Nodes, storage, identity, accounting, topology, and observability before users depend on it.
- The data plane is explicit: pod network, SR-IOV or RDMA interfaces, device-plugin resources, NIC-to-GPU locality, MTU, fabric policy, and workload-level NCCL, MPI, or storage test evidence are reviewed together.
Field Notes
Public Checks and Protected Preview
These public snippets show the operating questions and evidence I look for. The protected area will add source-code context, diagrams, templates, and implementation examples when ready.
Quick Diagnostic
- Before changing anything, do you know the cluster, namespace, owning Application, target revision, and rollback path?
- Is the failure manifest generation, comparison, Kubernetes apply, health assessment, permission, runtime, or dependency behavior?
- Is this workload a long-running service, finite batch job, multi-node training run, interactive research workflow, or bursty GPU job that needs queue policy?
- Does the team need Slurm features such as partitions, QoS, accounting, reservations, topology-aware placement, MPI-style steps, or familiar login-node workflows?
- If Slurm runs on Kubernetes, who owns GPU accounting, node drains, storage mounts, identity, job logs, and physical-node lifecycle?
- Does the workload use normal pod networking, SR-IOV, RDMA, Multus, hostNetwork, or a service mesh, and does that match the expected GPU, NIC, and storage path?
- Are Slurm GRES, Kubernetes device plugins, network attachments, and physical NIC-to-GPU locality describing the same hardware reality?
- Can Git, events, logs, metrics, and release notes explain the same incident?
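The first two questions in the list above can be turned into a small pre-change gate: no state change until the context is known and the failure has been classified. This is a minimal sketch; the type names, field names, and example values are hypothetical, and the failure stages mirror the ones listed above.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import List, Optional


class FailureStage(Enum):
    MANIFEST_GENERATION = "manifest generation"
    COMPARISON = "comparison"
    APPLY = "apply"
    HEALTH = "health"
    PERMISSION = "permission"
    RUNTIME = "runtime"
    DEPENDENCY = "dependency"


@dataclass
class ChangeContext:
    cluster: str = ""
    namespace: str = ""
    owning_application: str = ""
    target_revision: str = ""
    rollback_path: str = ""
    failure_stage: Optional[FailureStage] = None


def blockers(ctx: ChangeContext) -> List[str]:
    """Return the names of the questions that are still unanswered."""
    return [f.name for f in fields(ctx) if not getattr(ctx, f.name)]


# Hypothetical incident: the operator knows where they are, but not much else.
ctx = ChangeContext(cluster="staging-a", namespace="training")
open_questions = blockers(ctx)
print(open_questions)
```

An empty list is the signal to proceed. Anything else is a prompt to keep inspecting rather than changing state.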
Evidence to Look For
- Application source, path or chart, target revision, destination, values wiring, sync policy, waves, hooks, and drift rules.
- Workload status, recent events, targeted logs, probes, resources, service discovery, ingress, certificates, and auth evidence.
- Workload placement decision showing whether Kubernetes, Slurm, or Slurm on Kubernetes owns scheduling for each job class.
- Slurm-on-Kubernetes contract covering Slurm nodes as Pods, physical Kubernetes Nodes, login nodes, compute nodes, storage, accounting, topology, and observability.
- NetworkAttachmentDefinitions, device-plugin advertised resources, SR-IOV or RDMA interface mapping, NIC-to-GPU locality, MTU, GID, PFC or ECN notes, and workload-level NCCL, MPI, or storage test output.
- Queue-policy and utilization evidence explaining why a job waited, where it ran, what GPUs it consumed, what topology it used, and whether Kubernetes state agreed with Slurm state.
- Post-incident record that shows what changed, who owned it, how it was recovered, and what returned to Git.
Protected Preview
- App-of-apps and ApplicationSet layout examples.
- SUNK and Slurm-on-Kubernetes architecture-review templates.
- SR-IOV, RDMA, and Slurm-on-Kubernetes network validation templates.
- Training-platform workload-placement decision records.
- GitOps incident timelines and rollback notes.
- Cluster-operation checklists for safe inspection and recovery.
Further Resources
Protected Resources
Cluster-specific manifests, SUNK or Slurm-on-Kubernetes layouts, app-of-apps designs, incident timelines, and private GitOps repositories stay in the protected area.