Kubernetes & GitOps

Kubernetes, GitOps, and Slurm on Kubernetes

How to use Kubernetes, Argo CD, and Slurm-on-Kubernetes patterns for rollout safety, drift control, GPU workload scheduling, and platform evidence without hiding failure domains.


GitOps workflow illustration for declarative platform operations

Context

This resource is a public field note, not a private runbook. It explains the operating model behind Kubernetes, GitOps, Slurm, SUNK-style Slurm on Kubernetes, SR-IOV, and RDMA without exposing customer-specific manifests, topology diagrams, repo paths, or support-only recovery procedures.

The useful question is not simply "Kubernetes or Slurm?" It is what kind of workload is running, who owns scheduling and accounting, which network path carries the hot traffic, how storage is attached, how rollback works, and whether the evidence proves that the control planes agree with the physical hardware.

Decision Guide

Frame the decision before choosing the architecture.

Decision

Which control plane should own rollout, scheduling, drift, and workload state for the platform?

Who It Helps

Platform engineers, AI infrastructure teams, and technical leaders aligning Kubernetes, GitOps, Slurm, and operator workflows.

Proof to Look For

Source-of-truth clarity, sync evidence, health signals, scheduling ownership, rollback behavior, and incident-review artifacts.

Kubernetes and GitOps Are the Operating Contract

Kubernetes is most valuable when it becomes the shared contract for platform operations. It gives teams a consistent way to describe workloads, attach storage, expose services, apply policy, observe health, and roll changes forward or back. Argo CD adds the source-of-truth loop: the cluster should be explainable from Git, not from a collection of manual repairs that disappear during the next sync.

That contract matters because AI infrastructure has many moving parts. Device plugins, drivers, container images, storage mounts, service routing, certificates, auth, probes, and resource policy can each be the reason a workload fails. A useful GitOps design makes those layers visible enough that operators can classify the problem before changing state.
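As a minimal sketch of "classify before changing state", the function below reads an Argo CD Application-shaped dictionary and returns a coarse failure class. The status fields it reads (`conditions`, `operationState.phase`, `sync.status`, `health.status`) follow the Application status shape as commonly documented, so verify them against your Argo CD version. The function name and the returned labels are hypothetical, chosen only for illustration. Permission, runtime, and dependency failures usually need events and logs, which this sketch does not inspect.

```python
from typing import Any


def classify_failure(app: dict[str, Any]) -> str:
    """Return a coarse failure class for an Argo CD Application-shaped dict.

    The labels are illustrative, not an Argo CD concept.
    """
    status = app.get("status", {})
    condition_types = {c.get("type", "") for c in status.get("conditions", [])}

    # Rendering or diffing failed, so nothing was compared or applied.
    if "ComparisonError" in condition_types:
        return "generation-or-comparison"

    # The sync operation itself failed: treat this as an apply-stage problem.
    phase = status.get("operationState", {}).get("phase")
    if phase in ("Failed", "Error"):
        return "apply"

    sync = status.get("sync", {}).get("status")
    health = status.get("health", {}).get("status")

    # Live state differs from Git: someone or something changed the cluster.
    if sync == "OutOfSync":
        return "drift"
    # Applied cleanly, but the workload is not healthy.
    if health in ("Degraded", "Missing"):
        return "health"
    return "none"
```

A caller would typically load the Application as JSON from the cluster, pass the parsed dictionary in, and use the label to decide which evidence to gather next instead of resyncing blindly.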

Slurm Still Matters for Training Workloads

Slurm is not old-fashioned just because Kubernetes exists. It is built around jobs, queues, accounting, reservations, partitions, topology-aware placement, and multi-node allocation. Those are first-class needs for supercomputing and large training workloads where users expect login nodes, `sbatch`, `srun`, `squeue`, and a scheduler that can explain why a job waited and where it landed.

The mistake is treating Slurm and Kubernetes as mutually exclusive religions. Many AI platforms need Kubernetes for lifecycle, observability, storage integration, policy, and services, while still needing Slurm semantics for training. Slurm on Kubernetes exists to make that boundary explicit instead of leaving two disconnected estates to fight over GPUs, logs, identity, storage, and node lifecycle.

The Data Plane Decides Whether the Architecture Is Real

The SR-IOV and RDMA diagrams are here because scheduled Pods are not the same thing as a working high-performance data path. Multi-node GPU jobs and storage-heavy pipelines may depend on specific NICs, GIDs, MTU, PFC or ECN behavior, network attachments, and GPU-to-NIC locality. If those details are wrong, the control plane can look green while the workload behaves like the infrastructure is broken.

This is especially important for Slurm-on-Kubernetes designs. Slurm may account for GPUs and topology, while Kubernetes exposes devices and network attachments through a different set of objects. The platform has to prove those views describe the same hardware reality, or users inherit the worst outcome: jobs that schedule cleanly but run on the wrong path.
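One way to make "the same hardware reality" checkable is to compare per-node GPU counts as each control plane reports them. The sketch below assumes the two inputs have already been collected into plain dictionaries, for example from the GRES reported by `scontrol show node` on the Slurm side and from each Node's allocatable device-plugin resources on the Kubernetes side. The function name and the data layout are hypothetical. It also assumes both dictionaries are keyed by the same physical node name, which is itself part of the contract a Slurm-on-Kubernetes design has to define.

```python
from typing import Optional


def find_gpu_mismatches(
    slurm_gpus: dict[str, int],
    k8s_gpus: dict[str, int],
) -> dict[str, tuple[Optional[int], Optional[int]]]:
    """Return nodes whose GPU count differs between the Slurm and Kubernetes views.

    Each value is (slurm_count, kubernetes_count); None means that control
    plane does not know the node at all.
    """
    mismatches: dict[str, tuple[Optional[int], Optional[int]]] = {}
    for node in sorted(set(slurm_gpus) | set(k8s_gpus)):
        slurm_count = slurm_gpus.get(node)
        k8s_count = k8s_gpus.get(node)
        if slurm_count != k8s_count:
            mismatches[node] = (slurm_count, k8s_count)
    return mismatches


if __name__ == "__main__":
    # Hypothetical inventory: node-b lost GPUs in the Kubernetes view,
    # and node-c is unknown to Slurm.
    slurm_view = {"node-a": 8, "node-b": 8}
    k8s_view = {"node-a": 8, "node-b": 4, "node-c": 8}
    for node, (s, k) in find_gpu_mismatches(slurm_view, k8s_view).items():
        print(f"{node}: slurm={s} kubernetes={k}")
```

An empty result only shows that the counts agree. It says nothing about NIC locality, MTU, or fabric policy, which need their own workload-level tests.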

Public Guidance, Protected Implementation

The public material should teach how to reason about the system: which control plane owns scheduling, how rollout and rollback work, where hot traffic flows, and what evidence proves the design. That is useful to readers without exposing protected implementation detail.

The protected layer is where sensitive artifacts belong: private app-of-apps layouts, real cluster manifests, topology diagrams, customer-specific validation data, incident timelines, and support-only recovery steps. Public pages should build trust by showing the method. Protected pages can later show the exact implementation artifacts for approved readers.

Comparison

Slurm and SUNK Strengths

Slurm remains strong for model training and supercomputing-style work because it treats jobs, queues, accounting, topology, and multi-node allocation as first-class concerns. CoreWeave SUNK shows the next step: run Slurm inside Kubernetes so researchers keep Slurm job semantics while operators keep Kubernetes lifecycle, storage, observability, scaling, and GitOps controls.

Dimension	Traditional Slurm	SUNK / Slurm on Kubernetes
Primary job	Schedule finite supercomputing and AI training jobs with queues, partitions, QoS, accounting, reservations, and topology-aware placement.	Expose the same Slurm job model while Kubernetes manages the lifecycle around Slurm control-plane, login, and compute Pods.
Best fit	Multi-node training, MPI-style work, batch pipelines, shared research clusters, and teams already fluent in `sbatch`, `srun`, and `squeue`.	AI platforms that need Slurm training semantics and Kubernetes-native operations, GitOps deployment, shared GPU pools, storage integration, and elastic platform workflows.
User experience	Users submit jobs through Slurm login nodes and reason about queues, allocations, partitions, accounts, and job state.	Users still get the Slurm experience, but the platform can run login and compute nodes as Kubernetes-managed components.
Operator strength	Deep control over job placement, accounting, reservations, and supercomputing scheduling policy.	Kubernetes lifecycle management, observability, storage attachment, GitOps, scaling hooks, and shared operations around Slurm-managed work.
Hardware and topology	Strong fit when GPU, NIC, CPU, memory, and fabric locality need to be part of scheduling policy.	Useful when Slurm topology expectations must be reconciled with Kubernetes device plugins, node labels, Pods, network attachments, and physical Nodes.
Main risk	Can become a separate estate from Kubernetes services, inference paths, platform policy, and cloud-native operations.	Can hide integration boundaries if teams forget that Slurm nodes, Kubernetes Pods, and physical Nodes are different objects with different failure modes.
Why it matters	Slurm explains why a job waited, where it ran, what resources it consumed, and how queue policy shaped the result.	SUNK-style designs let teams keep that Slurm value while using Kubernetes as the operational substrate for repeatable platform management.

What to Understand

Start with context safety: cluster, namespace, owning Application, workload owner, blast radius, and rollback path.
Classify the failure before changing state: manifest generation, comparison, apply, health, permission, runtime, or dependency.
Choose the scheduler by workload class: service, batch job, multi-node training run, interactive research workflow, or topology-sensitive GPU job.
Verify the data plane, not only the control plane: GPU device exposure, network attachments, RDMA or SR-IOV path, storage, and observability.
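The scheduler-by-workload-class step above can be written down as an explicit rule so that the placement decision is reviewable. The sketch below is one possible encoding, not a recommendation from this resource: the mapping of classes to schedulers, the `SLURM_LEANING` set, and the two boolean inputs are all assumptions made for illustration, and a real platform would replace them with its own placement decision record.

```python
from enum import Enum


class WorkloadClass(Enum):
    SERVICE = "service"
    BATCH_JOB = "batch job"
    MULTI_NODE_TRAINING = "multi-node training run"
    INTERACTIVE_RESEARCH = "interactive research workflow"
    TOPOLOGY_SENSITIVE_GPU = "topology-sensitive GPU job"


# Assumption: these classes usually want Slurm semantics (queues, accounting,
# multi-node allocation, topology-aware placement).
SLURM_LEANING = {
    WorkloadClass.MULTI_NODE_TRAINING,
    WorkloadClass.INTERACTIVE_RESEARCH,
    WorkloadClass.TOPOLOGY_SENSITIVE_GPU,
}


def choose_scheduler(
    workload: WorkloadClass,
    needs_slurm_features: bool,
    kubernetes_is_substrate: bool,
) -> str:
    """Return 'kubernetes', 'slurm', or 'slurm-on-kubernetes' (illustrative rule)."""
    # Long-running services stay on Kubernetes in this sketch.
    if workload is WorkloadClass.SERVICE:
        return "kubernetes"

    wants_slurm = needs_slurm_features or workload in SLURM_LEANING
    if not wants_slurm:
        return "kubernetes"

    # Slurm semantics are wanted; the remaining question is who runs the lifecycle.
    return "slurm-on-kubernetes" if kubernetes_is_substrate else "slurm"
```

The value of a rule like this is less the code than the argument it forces: each branch is a place where the team has to say who owns scheduling for that class of work.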

Argo CD overview diagram for GitOps synchronization

Common Failure Modes

Rollouts rely on manual fixes that never return to Git, so the next sync reintroduces the incident.
Training workloads are forced into vanilla Kubernetes Jobs even though the team needs Slurm-style accounting, multi-node allocation, topology-aware placement, partitions, QoS, reservations, or familiar supercomputing workflows.
Slurm is bolted beside Kubernetes with no contract for GPU ownership, node drains, logs, storage mounts, identity, queue priority, accounting, or who wins when both systems think they own capacity.
Pods schedule successfully, but the SR-IOV or RDMA path is not actually wired, so the workload falls back to the wrong network, loses locality, or bypasses the fabric assumptions the job was designed around.

Training on Slurm and serving on Kubernetes fragmentation diagram

What Good Looks Like

Every workload has an owning source, expected namespace, rollout strategy, health signal, and rollback path.
Git, Kubernetes events, logs, metrics, and release notes tell the same story during an incident review.
The platform states, for each workload class (inference and services, finite training jobs, bursty batch work, shared GPU pools, and topology-sensitive multi-node runs), when to use Kubernetes directly, when to use Slurm, and when to use Slurm on Kubernetes.
Slurm-on-Kubernetes deployments define the contract between Slurm nodes, Kubernetes Pods, physical Nodes, storage, identity, accounting, topology, and observability before users depend on it.
The data plane is explicit: pod network, SR-IOV or RDMA interfaces, device-plugin resources, NIC-to-GPU locality, MTU, fabric policy, and workload-level NCCL, MPI, or storage test evidence are reviewed together.
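The first item in the list above (every workload has an owning source, expected namespace, rollout strategy, health signal, and rollback path) lends itself to a simple completeness check. The sketch below assumes a hypothetical `WorkloadRecord` structure kept by the platform team; the type, its field names, and the example values are illustrative and do not correspond to any Kubernetes or Argo CD object.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class WorkloadRecord:
    """Hypothetical ownership record for one workload."""
    name: str
    owning_source: Optional[str] = None      # e.g. the Git repo and path that defines it
    namespace: Optional[str] = None
    rollout_strategy: Optional[str] = None
    health_signal: Optional[str] = None
    rollback_path: Optional[str] = None


def missing_fields(record: WorkloadRecord) -> list[str]:
    """Return the names of ownership fields that are unset or empty."""
    return [
        f.name
        for f in fields(record)
        if f.name != "name" and not getattr(record, f.name)
    ]


if __name__ == "__main__":
    # Placeholder values: this record has no rollout, health, or rollback entry.
    record = WorkloadRecord(
        name="trainer",
        owning_source="example-repo/apps/trainer",
        namespace="ml",
    )
    print(missing_fields(record))
```

A check like this can run in review or CI so that a workload with no stated rollback path is caught before an incident rather than during one.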

SUNK architecture diagram for Slurm on Kubernetes

Quick Diagnostic

Before changing anything, do you know the cluster, namespace, owning Application, target revision, and rollback path?
Is the failure in manifest generation, comparison, Kubernetes apply, health assessment, permissions, runtime, or dependency behavior?
Is this workload a long-running service, finite batch job, multi-node training run, interactive research workflow, or bursty GPU job that needs queue policy?
Does the team need Slurm features such as partitions, QoS, accounting, reservations, topology-aware placement, MPI-style steps, or familiar login-node workflows?

4 more in private context

Evidence to Look For

Application source, path or chart, target revision, destination, values wiring, sync policy, waves, hooks, and drift rules.
Workload status, recent events, targeted logs, probes, resources, service discovery, ingress, certificates, and auth evidence.
Workload placement decision showing whether Kubernetes, Slurm, or Slurm on Kubernetes owns scheduling for each job class.
Slurm-on-Kubernetes contract covering Slurm nodes as Pods, physical Kubernetes Nodes, login nodes, compute nodes, storage, accounting, topology, and observability.

3 more in private context

Protected Preview

App-of-apps and ApplicationSet layout examples.
SUNK and Slurm-on-Kubernetes architecture-review templates.
SR-IOV, RDMA, and Slurm-on-Kubernetes network validation templates.
Training-platform workload-placement decision records.

2 more in private context

Further Resources

AI Infrastructure: Use this to connect platform control loops to GPU, storage, serving, and workload placement.
CoreWeave SUNK: Reference architecture for running Slurm on Kubernetes with shared compute, GitOps, scaling, and topology-aware scheduling.
Andromeda: Shameless plug: my current employer helps teams access and operate GPU clusters, including Slurm on Kubernetes paths for AI infrastructure.
Hermetic Build Workflows: Use this for pinned tools, checks, and repeatable operator environments around cluster work.
AI Observability: Use this when incidents need to connect infrastructure state to customer-visible AI behavior.

Apply to a Decision

Apply this to a product, infrastructure, or diligence decision.

If this resource matches a decision you need to make, these services turn the framework into a review, roadmap, validation plan, or risk assessment for a specific environment.

Hardware Infrastructure: Align Kubernetes, GitOps, Slurm, and operator workflows with the workloads the platform must support.
Engineering Leadership: Review platform-control-plane choices before delivery speed and reliability diverge.

Private Resources

Cluster-specific manifests, SUNK or Slurm-on-Kubernetes layouts, app-of-apps designs, incident timelines, and private GitOps repositories stay in the protected area.
