AI Infrastructure Basics for GPU Clusters and Model-Serving Systems

A practical map of the layers that shape AI behavior in production: workload shape, GPUs, topology, schedulers, storage, networks, cooling, serving patterns, batching, and validation evidence.

Context

This resource is written as a public field note, not a private runbook. The goal is to explain the operating model: what Kubernetes, GitOps, Slurm, SUNK-style patterns, SR-IOV, and RDMA each own, why they matter, and what evidence a team should collect before trusting the platform. Customer-specific manifests, private topology diagrams, repo paths, and support-only procedures belong in the protected area.

Kubernetes is useful because it gives platform teams a declarative control plane for lifecycle, policy, storage attachment, service discovery, observability, and repeatable rollout behavior. Argo CD adds the GitOps loop: the cluster should converge from source-controlled intent, and incidents should be explainable from source, generated manifests, live state, events, logs, metrics, and rollback history.
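The convergence idea above can be sketched as a comparison between declared intent and observed state. This is a minimal illustration only, not Argo CD's actual diffing logic; the `diff_state` helper and both example dictionaries are hypothetical.

```python
from typing import Any


def diff_state(desired: dict[str, Any], live: dict[str, Any], path: str = "") -> list[str]:
    """Return human-readable differences between desired and live state."""
    drift: list[str] = []
    for key in sorted(set(desired) | set(live)):
        where = f"{path}.{key}" if path else key
        if key not in live:
            drift.append(f"missing in live: {where}")
        elif key not in desired:
            drift.append(f"unmanaged in live: {where}")
        elif isinstance(desired[key], dict) and isinstance(live[key], dict):
            # Recurse into nested sections so the report names the exact field.
            drift.extend(diff_state(desired[key], live[key], where))
        elif desired[key] != live[key]:
            drift.append(f"changed: {where} desired={desired[key]!r} live={live[key]!r}")
    return drift


# Hypothetical intent (from source control) and observed cluster state.
desired = {"replicas": 3, "image": "model-server:1.4", "resources": {"gpu": 1}}
live = {"replicas": 2, "image": "model-server:1.4", "resources": {"gpu": 1}, "debug": True}

report = diff_state(desired, live)
for line in report:
    print(line)
```

An empty report means live state matches intent. A non-empty report is the kind of evidence that makes an incident explainable from source rather than from memory.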

Slurm solves a different problem. It gives researchers and training teams a job-oriented interface for queues, partitions, accounting, reservations, multi-node allocation, topology-aware placement, and familiar login-node workflows. That matters when a workload behaves like HPC: it may need a finite allocation, direct GPU accounting, MPI or distributed training steps, and a scheduler that understands waiting, placement, and fairness as first-class concerns.

Slurm on Kubernetes exists because modern AI platforms often need both worlds. Training teams want Slurm semantics, while operators want Kubernetes lifecycle, GitOps, storage integration, autoscaling hooks, policy, monitoring, and shared GPU pool management. SUNK-style architectures make that bridge explicit by running Slurm components inside Kubernetes and representing Slurm nodes as Pods, but that also means the platform must be clear about which control plane owns each decision.

The networking diagrams matter because GPU workloads consume more than compute. They move tensors, gradients, checkpoints, metadata, logs, and storage traffic through specific NICs and fabrics. Normal pod networking may be enough for many services, but multi-node training or storage-heavy jobs can need SR-IOV, RDMA, Multus, tuned MTU, correct GID selection, and a fabric policy that matches the job. If that path is wrong, the job can schedule successfully and still perform as if the architecture were broken.
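A rough calculation shows why the path matters. The sketch below assumes a ring all-reduce, where each rank sends 2 * (N - 1) / N times the payload, and it counts bandwidth only: latency, congestion, and overlap with compute are ignored. The payload size, rank count, and link speeds are illustrative assumptions, not measurements.

```python
def ring_allreduce_seconds(payload_bytes: float, ranks: int, link_gbps: float) -> float:
    """Bandwidth-only lower bound for one ring all-reduce.

    Each rank sends 2 * (ranks - 1) / ranks times the payload size.
    Latency, congestion, and compute overlap are deliberately ignored.
    """
    if ranks < 2:
        return 0.0
    bytes_sent_per_rank = 2 * (ranks - 1) / ranks * payload_bytes
    return bytes_sent_per_rank * 8 / (link_gbps * 1e9)


# Illustrative numbers only: 10 GB of gradients across 8 ranks.
slow_path = ring_allreduce_seconds(10e9, ranks=8, link_gbps=25)
fast_path = ring_allreduce_seconds(10e9, ranks=8, link_gbps=400)
print(f"25 Gb/s path:  {slow_path:.2f} s per all-reduce")
print(f"400 Gb/s path: {fast_path:.2f} s per all-reduce")
```

Under these assumptions the same job spends about 5.6 seconds per synchronization on the slower path and about 0.35 seconds on the faster one. Both runs would schedule successfully, and only the second would look like the hardware that was purchased.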

A good public architecture explanation should help a reader ask better questions without exposing reserved detail. The useful question is not simply "Kubernetes or Slurm?" It is what kind of workload is running, who owns scheduling and accounting, what network path carries the hot traffic, how storage is attached, how rollback works, and whether the evidence proves the control planes agree with the physical hardware.

What to Understand

  • Start from workload shape: training, fine-tuning, batch inference, online inference, retrieval, simulation, or mixed interactive use. Batch and real-time inference are different products from an infrastructure point of view.
  • GPU choice is only one layer. CPU, memory, PCIe topology, NVMe, shared storage, network fabric, rack power, and cooling decide usable performance.
  • Serving paths need product evidence: tail latency, concurrency, cold starts, model load behavior, request routing, batching strategy, cache behavior, and rollback behavior.
  • Batching is not only an optimization. It changes queueing, fairness, memory pressure, timeout behavior, GPU utilization, customer latency, and how failures are retried.
  • Model-serving patterns should match the workflow: synchronous user request, async job, streaming response, retrieval-augmented generation, offline enrichment, scheduled batch, or human-reviewed automation.
  • The deployment substrate matters. Kubernetes, VMs, and bare metal change scheduling, isolation, hardware visibility, rollout behavior, failure recovery, and how close the serving stack sits to GPUs, NICs, and storage.
  • Networking is part of the AI architecture. SDN, CNI behavior, service routing, east-west traffic, RDMA or Ethernet fabric design, and ingress paths decide whether model-serving latency is stable or only looks good in a single-node test.
  • When AI workloads run on Kubernetes, the data plane may need more than normal service networking. SR-IOV, RDMA, Multus attachments, host networking, MTU, GID selection, and NIC-to-GPU locality can decide whether a Pod has access to the performance the hardware purchase promised.
  • Storage is not one box on a diagram. Rook/Ceph, Ceph outside Kubernetes, VAST, WEKA, object stores, local NVMe, and shared filesystems each create different metadata, throughput, checkpoint, model-load, and recovery behavior.
  • Schedulers decide more than placement: fairness, draining, quota, failure recovery, noisy-neighbor behavior, and utilization quality.
  • Validation should separate failure domains before incidents: host, PCIe, fabric, storage, runtime, scheduler, model server, and application.
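The point that batching changes queueing and customer latency can be made concrete with a small simulation. This is a sketch under simple assumptions: a batch dispatches when it is full or when a timeout has elapsed since its first request, and inference time itself is left out. The `form_batches` and `queue_waits` helpers and the arrival times are hypothetical, and real model servers apply their own policies.

```python
def form_batches(arrivals_ms: list[int], max_batch: int, timeout_ms: int) -> list[tuple[int, list[int]]]:
    """Group sorted arrival times into (dispatch_time, members) batches.

    A batch dispatches when it reaches max_batch, or timeout_ms after
    its first request arrived, whichever comes first.
    """
    batches: list[tuple[int, list[int]]] = []
    current: list[int] = []
    for t in arrivals_ms:
        # The open batch timed out before this request arrived.
        if current and t > current[0] + timeout_ms:
            batches.append((current[0] + timeout_ms, current))
            current = []
        current.append(t)
        if len(current) == max_batch:
            batches.append((t, current))
            current = []
    if current:
        batches.append((current[0] + timeout_ms, current))
    return batches


def queue_waits(batches: list[tuple[int, list[int]]]) -> list[int]:
    """Time each request spent waiting for its batch to dispatch."""
    return [dispatch - arrival for dispatch, members in batches for arrival in members]


arrivals_ms = [0, 2, 4, 50, 51]  # hypothetical request arrival times
large = queue_waits(form_batches(arrivals_ms, max_batch=4, timeout_ms=10))
small = queue_waits(form_batches(arrivals_ms, max_batch=2, timeout_ms=10))
print("max_batch=4 waits (ms):", large)
print("max_batch=2 waits (ms):", small)
```

With the larger batch size, no batch ever fills, so every request inherits part of the timeout as queueing delay. With the smaller batch size most requests leave quickly, but the odd request without a partner still waits out the full timeout. Neither setting is correct on its own; the choice depends on the latency budget the product has promised.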

Common Failure Modes

  • The bill of materials is treated as the architecture, with no workload-specific validation plan.
  • Training benchmarks are used to justify inference infrastructure, or single-node results are generalized to multi-node behavior.
  • Batch and real-time inference are served by the same path even though they need different latency budgets, retry behavior, queue policy, and cost controls.
  • The team assumes Kubernetes, VMs, and bare metal are interchangeable deployment targets, then discovers that device plugins, passthrough, host drivers, storage mounts, network paths, and rollout semantics behave differently.
  • The network is treated as plumbing, so SDN policy, CNI choice, service routing, MTU, east-west traffic, ingress, DNS, and fabric congestion show up as model-server or application symptoms.
  • Pods schedule successfully, but SR-IOV, RDMA, network attachment, or NIC locality is wrong, so the workload runs through a path that cannot meet the expected latency or throughput.
  • Rook/Ceph is deployed because it is Kubernetes-native, but model loading, checkpoints, metadata pressure, recovery behavior, and operational ownership are not tested against the actual AI workload.
  • The storage platform is chosen by brand instead of workload behavior, so VAST, WEKA, Ceph, object storage, local NVMe, and shared filesystems are compared without model-load, checkpoint, metadata, and failure evidence.
  • Dynamic batching improves utilization but breaks user-facing latency, streaming behavior, or fairness because timeout and queue policies are not tied to product expectations.
  • The model server is treated as a black box, so routing, warmup, token streaming, cache misses, memory fragmentation, and model loading become invisible failure modes.
  • Storage, network, and scheduler bottlenecks are diagnosed only after the GPU purchase is already committed.
  • Teams optimize peak throughput while ignoring tail latency, operator access, upgrade paths, and evidence capture.
  • Utilization dashboards look healthy while customer-facing workflows wait on storage, routing, cold starts, or queue policy.
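Several of these failure modes hide behind averages. The sketch below uses made-up latency samples and a nearest-rank percentile to show how a mean can look acceptable while the tail does not. The `percentile` helper and the sample values are assumptions for illustration only.

```python
from statistics import mean


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in ms: most are fast, two hit a cold start.
latencies_ms = [40] * 98 + [2000] * 2

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"mean={avg:.1f} ms  p50={p50} ms  p99={p99} ms")
```

In this example the mean is 79.2 ms and the median is 40 ms, while the 99th percentile is 2000 ms. A dashboard built on the first two numbers would look healthy while some customers wait two seconds.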

What Good Looks Like

  • Each workload has an acceptance test for throughput, latency, cost, failure behavior, upgrade behavior, and operator evidence.
  • Batch, real-time, and streaming inference paths have separate SLOs, queue policies, autoscaling triggers, retry behavior, and rollback plans.
  • Batching decisions are explicit: maximum batch size, timeout, priority, fairness, memory headroom, warmup behavior, and customer-visible latency impact.
  • Model-serving diagrams show the real request path: gateway, router, scheduler, model server, cache, storage, accelerator, telemetry, and fallback path.
  • Deployment choices are explicit: Kubernetes for policy, rollout, service discovery, and elastic operations; VMs for stronger lifecycle and tenant boundaries; bare metal for direct hardware control, locality, and fewer abstraction layers.
  • Network design is visible in the serving plan: ingress, gateway, service routing, CNI or SDN policy, east-west traffic, MTU, fabric path, congestion signals, and failure isolation are part of acceptance testing.
  • Kubernetes networking decisions are proven with workload evidence: network attachments, SR-IOV or RDMA interfaces, MTU, GIDs, NIC locality, and representative NCCL, MPI, storage, or inference-path tests.
  • Storage decisions are workload-specific: model artifacts, embeddings, checkpoints, logs, cache, training data, and inference outputs can land on different storage systems with different SLOs.
  • Topology is known: GPU-to-GPU, GPU-to-NIC, NUMA, PCIe generation, storage path, fabric speed, MTU, and congestion behavior.
  • Platform teams can explain when to use bare metal, Kubernetes, VMs, model-serving frameworks, or a managed provider.
  • Capacity planning connects utilization to product value, not only GPU hours consumed.
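Separate SLOs per inference path can be written down as data and checked mechanically. The sketch below is one possible shape for that check. The `PathSLO` and `Measurement` types, the path names, and every threshold are hypothetical and would come from product requirements in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathSLO:
    name: str
    p99_latency_ms: float      # upper bound on tail latency
    min_throughput_rps: float  # lower bound on sustained throughput


@dataclass(frozen=True)
class Measurement:
    p99_latency_ms: float
    throughput_rps: float


def evaluate(slo: PathSLO, m: Measurement) -> list[str]:
    """Return a list of SLO violations; an empty list means the path passes."""
    failures: list[str] = []
    if m.p99_latency_ms > slo.p99_latency_ms:
        failures.append(f"{slo.name}: p99 {m.p99_latency_ms} ms exceeds {slo.p99_latency_ms} ms")
    if m.throughput_rps < slo.min_throughput_rps:
        failures.append(f"{slo.name}: throughput {m.throughput_rps} rps below {slo.min_throughput_rps} rps")
    return failures


# Hypothetical targets: the real-time path is latency-bound, the batch path is throughput-bound.
realtime = PathSLO("real-time", p99_latency_ms=300, min_throughput_rps=50)
batch = PathSLO("batch", p99_latency_ms=60_000, min_throughput_rps=500)

realtime_result = evaluate(realtime, Measurement(p99_latency_ms=420, throughput_rps=80))
batch_result = evaluate(batch, Measurement(p99_latency_ms=12_000, throughput_rps=650))
print(realtime_result or "real-time: pass")
print(batch_result or "batch: pass")
```

Keeping the two paths as separate records makes it harder to justify one path with the other's results.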

Field Notes

Public Checks and Protected Preview

These public snippets show the operating questions and evidence I look for. The protected area will add source-code context, diagrams, templates, and implementation examples when ready.

Quick Diagnostic

  • What workload is this for: training, fine-tuning, batch inference, online inference, retrieval, simulation, or mixed use?
  • Does this path need real-time latency, async batch throughput, streaming responses, offline enrichment, or human-reviewed automation?
  • Is the serving path running on Kubernetes, VMs, or bare metal, and what does that change about scheduling, isolation, device access, rollout, and recovery?
  • Which network layer owns the request path: ingress, gateway, service mesh, CNI, SDN policy, RDMA or Ethernet fabric, DNS, or external load balancer?
  • If the workload runs on Kubernetes, does it need normal pod networking, SR-IOV, RDMA, Multus, host networking, or a separate storage/fabric path?
  • Which storage path owns model artifacts, checkpoints, embeddings, logs, cache, training data, and inference outputs?
  • Where does batching happen, and what timeout, fairness, memory, and tail-latency budget does it obey?
  • Which layer is most likely to bottleneck the outcome: GPU, CPU, memory, PCIe, storage, fabric, scheduler, model server, or application?
  • What acceptance test proves the platform is ready before customer traffic depends on it?

Evidence to Look For

  • Serving-path diagram covering gateway, router, queue, scheduler, model server, cache, accelerator, storage, telemetry, and fallback behavior.
  • Deployment-substrate decision record comparing Kubernetes, VMs, and bare metal for hardware access, isolation, lifecycle, networking, storage, and operator workflow.
  • Network-path evidence for ingress, service routing, CNI or SDN policy, east-west traffic, MTU, congestion, DNS, and failure isolation.
  • SR-IOV or RDMA evidence covering NetworkAttachmentDefinitions, advertised device resources, NIC-to-GPU locality, MTU, GID selection, and representative workload tests.
  • Storage decision record comparing Rook/Ceph, external Ceph, VAST, WEKA, object storage, local NVMe, and shared filesystems against the actual workload.
  • Separate acceptance results for batch, real-time, and streaming inference paths.
  • Batching configuration notes for maximum batch size, timeout, priority, warmup, memory headroom, and retry behavior.
  • Topology summary covering GPU-to-GPU, GPU-to-NIC, NUMA, PCIe generation, storage path, fabric speed, MTU, and congestion behavior.
  • Throughput, latency, cost, failure, upgrade, and operator-evidence acceptance results.
  • Capacity plan that ties utilization to product value instead of raw GPU hours.
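Part of the topology evidence above can be reduced to a simple consistency check. The sketch below flags GPU and NIC pairings that cross NUMA nodes. It assumes the inventory has already been collected (on Linux, one possible source is the `numa_node` file under each PCI device in sysfs), and every device name, mapping, and pairing shown is hypothetical. NUMA node is also only a coarse proxy, since PCIe switch locality can matter within a single node.

```python
def cross_numa_pairs(
    gpu_numa: dict[str, int],
    nic_numa: dict[str, int],
    pairing: dict[str, str],
) -> list[str]:
    """List GPU-to-NIC pairings whose devices sit on different NUMA nodes."""
    problems: list[str] = []
    for gpu, nic in sorted(pairing.items()):
        if gpu_numa[gpu] != nic_numa[nic]:
            problems.append(
                f"{gpu} (NUMA {gpu_numa[gpu]}) uses {nic} (NUMA {nic_numa[nic]})"
            )
    return problems


# Hypothetical inventory for one host.
gpu_numa = {"gpu0": 0, "gpu1": 0, "gpu2": 1, "gpu3": 1}
nic_numa = {"nic0": 0, "nic1": 1}
pairing = {"gpu0": "nic0", "gpu1": "nic0", "gpu2": "nic1", "gpu3": "nic0"}

mismatches = cross_numa_pairs(gpu_numa, nic_numa, pairing)
for line in mismatches:
    print("locality warning:", line)
```

A check like this does not replace a representative NCCL, MPI, or storage test. It only narrows down where to look when such a test underperforms.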

Protected Preview

  • Provider-specific topology notes and benchmark logs.
  • Model-serving architecture reviews for batch, real-time, and streaming workloads.
  • Kubernetes, VM, and bare-metal deployment tradeoff templates for AI serving platforms.
  • SDN/CNI and storage architecture review templates for GPU serving platforms.
  • SR-IOV, RDMA, and Kubernetes data-plane validation templates for GPU infrastructure.
  • Rook/Ceph, VAST, WEKA, and local NVMe workload-fit checklists.
  • Batching and queue-policy templates tied to customer-facing SLOs.
  • Cluster validation templates and acceptance-test artifacts.
  • Operational runbook examples for GPU infrastructure failure domains.

Further Resources

Protected Resources

Provider-specific topology notes, benchmark logs, diagrams, and operational runbooks stay in the protected area.
