Run GPU workloads on your Ryvn-managed EKS clusters using compute pools to automatically provision and deprovision GPU nodes. Nodes spin up when a workload needs them and scale back to zero when idle, so you only pay for GPU compute while jobs are running.

How It Works

Ryvn’s AWS platform blueprint includes a built-in compute pool engine that handles node autoscaling. By defining a GPU compute pool, you tell Ryvn which GPU instances to provision, how to configure their storage, and when to scale down. When a pod requests nvidia.com/gpu resources, Ryvn automatically launches a matching instance. When the node is empty, Ryvn terminates it.
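As a concrete illustration, any pod that requests `nvidia.com/gpu` triggers this flow. A minimal smoke-test pod (the name and CUDA image here are illustrative, not Ryvn-specific) might look like:

```yaml
# Illustrative pod: the nvidia.com/gpu request is what causes the
# compute pool to launch a GPU instance if none is available.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]   # prints GPU info, then the pod completes
      resources:
        limits:
          nvidia.com/gpu: 1
```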

Node Provisioning Lifecycle

GPU nodes are ephemeral — they exist only while workloads need them. Here’s the full lifecycle from job submission to node termination:
  1. A workload is submitted with an nvidia.com/gpu resource request
  2. No matching node exists, so the pod stays Pending
  3. The compute pool detects the pending pod and launches a matching GPU instance
  4. The instance joins the cluster and the NVIDIA device plugin registers its GPUs
  5. Kubernetes schedules the pod onto the new node
  6. When the workload completes and the node is empty, the compute pool terminates it

Prerequisites

Before configuring GPU nodes, ensure you have:
  1. A Ryvn-managed EKS cluster provisioned from the aws-platform blueprint
  2. Access to edit the environment’s blueprint inputs (Custom Node Classes and Custom Node Pools)
  3. Sufficient AWS service quotas for the GPU instance types you plan to run

Configure a GPU Compute Pool

A compute pool has two parts: a pool configuration that controls instance-level settings (AMI, storage), and the pool itself that controls scaling behavior (instance types, limits, taints). The aws-platform blueprint exposes inputs for both.

Step 1: Define a Pool Configuration

A pool configuration controls the AMI, storage, and instance store policy for your GPU nodes. GPU training workloads typically need large data volumes for model weights, datasets, and checkpoints. Set the Custom Node Classes blueprint input:
gpu-node-class:
  amiSelectorTerms:
    - alias: bottlerocket@v1.52.0
  blockDeviceMappings:
    # Root volume
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 32Gi
        volumeType: gp3
        encrypted: true
    # Data volume — size this for your dataset + model weights + checkpoints
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 2000Gi
        volumeType: gp3
        encrypted: true
Bottlerocket includes NVIDIA driver support for GPU instances out of the box. For workloads that need a specific driver version, you can use a custom AMI instead of the alias selector.
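Assuming Ryvn’s node classes accept the same amiSelectorTerms shape as the alias example above, pinning a specific image could look like the following (the AMI ID is a placeholder):

```yaml
gpu-node-class:
  amiSelectorTerms:
    # Pin an exact AMI (e.g. an image built with a specific NVIDIA
    # driver version) instead of the Bottlerocket alias.
    # The ID below is a placeholder.
    - id: ami-0123456789abcdef0
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 32Gi
        volumeType: gp3
        encrypted: true
```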
For instances with NVMe instance stores (like p5.48xlarge), you can use instance store volumes instead of EBS for higher throughput:
gpu-node-class:
  amiSelectorTerms:
    - alias: bottlerocket@v1.52.0
  instanceStorePolicy: "RAID0"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 32Gi
        volumeType: gp3
        encrypted: true
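Under this policy, the striped instance-store array typically backs the node’s ephemeral storage, so workloads can consume it through an ordinary emptyDir scratch volume. A pod-spec sketch (image and sizes are illustrative):

```yaml
# Scratch space on node ephemeral storage, which the RAID0
# instance-store array backs under instanceStorePolicy: "RAID0".
# Data here is lost when the node terminates.
volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 1000Gi
containers:
  - name: trainer
    image: my-registry/trainer:latest   # placeholder image
    volumeMounts:
      - name: scratch
        mountPath: /scratch
```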

Step 2: Define a GPU Compute Pool

A compute pool controls which instance types Ryvn can launch and how nodes are managed. Set the Custom Node Pools blueprint input:
gpu-training:
  nodeClassName: gpu-node-class
  requirements:
    instanceCategories:
      - "p"
    instanceFamilies:
      - "p5"
    arch: "amd64"
    hypervisor: "nitro"
    minGeneration: 4
  labels:
    workload-type: gpu-training
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
  limits:
    cpu: 192
  disruption:
    consolidationPolicy: "WhenEmpty"
    consolidateAfter: "5m"
Key settings explained:
| Setting | Purpose |
| --- | --- |
| `instanceCategories: ["p"]` | Restricts to GPU-accelerated P-family instances |
| `instanceFamilies: ["p5"]` | Narrows to H100 instances specifically; use `p4d` for A100s, `g6` for L4s |
| `taints` | Prevents non-GPU workloads from landing on expensive GPU nodes |
| `limits.cpu: 192` | Caps the pool at one p5.48xlarge (192 vCPUs); increase for more nodes |
GPU instances are expensive. Set limits.cpu conservatively to prevent accidental over-provisioning.

AWS GPU Instance Reference

| Instance | GPUs | GPU Type | vCPUs | GPU Memory | Use Case |
| --- | --- | --- | --- | --- | --- |
| p5.48xlarge | 8 | H100 | 192 | 640 GB | Large-scale training |
| p4d.24xlarge | 8 | A100 | 96 | 320 GB | Training and fine-tuning |
| g6.xlarge | 1 | L4 | 4 | 24 GB | Inference, light fine-tuning |
| g6.48xlarge | 8 | L4 | 192 | 192 GB | Multi-GPU inference |
| g5.xlarge | 1 | A10G | 4 | 24 GB | Inference |

Install the NVIDIA Device Plugin

Kubernetes needs the NVIDIA device plugin to expose GPU resources to the scheduler. Install it as a Helm chart service in Ryvn:
kind: Service
metadata:
  name: nvidia-device-plugin
spec:
  type: helm-chart.v1
  chart:
    repository: https://nvidia.github.io/k8s-device-plugin
    name: nvidia-device-plugin
    version: "0.19.0"
Then install it into your environment’s kube-system namespace with this config:
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
nodeSelector:
  workload-type: gpu-training
This ensures the plugin only runs on GPU nodes and tolerates the GPU taint. If you have multiple GPU compute pools with different labels, adjust the nodeSelector to match all of them.
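For the multi-pool case, a plain nodeSelector cannot match more than one label value; assuming the chart exposes a standard `affinity` value (most charts do), node affinity with an In operator is one way to sketch it. The `gpu-inference` label below is a hypothetical second pool:

```yaml
# Sketch: run the plugin on nodes from any of several GPU pools.
# Assumes the chart supports a standard `affinity` value; the
# workload-type values must match the labels your pools define.
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: workload-type
              operator: In
              values:
                - gpu-training
                - gpu-inference   # hypothetical second pool
```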
The NVIDIA device plugin must be installed before submitting any GPU workloads. Without it, pods will fail to schedule on GPU nodes.
When a new GPU node joins the cluster, there is a brief window (~60–90 seconds) while the device plugin initializes and registers nvidia.com/gpu resources. Pods scheduled during this window may fail with OutOfnvidia.com/gpu but will recover once registration completes. If the error persists, check that the device plugin DaemonSet is running and healthy on the GPU node.

Deploy a Training Job

With a GPU compute pool configured, deploy a training workload as a Kubernetes Job. The Job requests GPU resources and tolerates the GPU taint — Ryvn handles the rest. Example Helm values for a training job:
podAnnotations:
  karpenter.sh/do-not-disrupt: "true"

resources:
  requests:
    nvidia.com/gpu: 1
    cpu: "4"
    memory: "32Gi"
  limits:
    nvidia.com/gpu: 1

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

nodeSelector:
  workload-type: gpu-training
The karpenter.sh/do-not-disrupt annotation prevents the compute pool from evicting the pod for any reason — including node expiry and drift. This is the primary mechanism for protecting long-running training jobs. The WhenEmpty consolidation policy provides a natural complement: while a training pod is running, the node is not empty and won’t be consolidated. When the job completes and the pod terminates, the node becomes empty and is reclaimed after consolidateAfter.

When this job is submitted:
  1. Kubernetes sees the nvidia.com/gpu resource request and the node selector
  2. No matching nodes exist, so the pod stays Pending
  3. The compute pool detects the pending pod and launches a matching GPU instance
  4. The instance joins the cluster, the NVIDIA device plugin registers the GPUs
  5. Kubernetes schedules the pod onto the new node
  6. When the job completes and the node is empty, the compute pool terminates it after consolidateAfter
Node startup time for GPU instances is typically 3-5 minutes, including instance launch, cluster join, and NVIDIA driver initialization.

Scaling to Multi-Node Training

For distributed training across multiple GPU nodes, increase the compute pool limits and deploy your workload with multiple replicas or a framework like PyTorch Distributed.
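One way to sketch this with plain Kubernetes primitives is an Indexed Job, where each completion index maps to one training rank. The image is a placeholder, and the pod/replica counts are illustrative; Kubernetes injects a JOB_COMPLETION_INDEX environment variable into each pod, which launchers such as torchrun can map to a node rank:

```yaml
# Illustrative two-node training job: one pod per GPU node,
# each requesting a full 8-GPU instance.
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  completionMode: Indexed
  completions: 2
  parallelism: 2
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"   # protect from node disruption
    spec:
      restartPolicy: Never
      nodeSelector:
        workload-type: gpu-training
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: trainer
          image: my-registry/trainer:latest   # placeholder image
          resources:
            requests:
              nvidia.com/gpu: 8
            limits:
              nvidia.com/gpu: 8
```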

EFA for High-Performance Networking

P5 and P4d instances include Elastic Fabric Adapter (EFA) network interfaces for high-bandwidth, low-latency inter-node communication. Without EFA, multi-node NCCL traffic falls back to standard TCP networking, which is dramatically slower for distributed training. To enable EFA:
  1. Security group: The compute pool’s security group must allow all traffic between nodes in the same pool. The aws-platform blueprint handles this automatically.
  2. EFA device plugin: Install the AWS EFA Kubernetes device plugin as a Helm chart service, similar to the NVIDIA device plugin:
kind: Service
metadata:
  name: aws-efa-k8s-device-plugin
spec:
  type: helm-chart.v1
  chart:
    repository: https://aws.github.io/eks-charts
    name: aws-efa-k8s-device-plugin
  3. Pod spec: Request EFA interfaces in your training job:
resources:
  requests:
    vpc.amazonaws.com/efa: 32   # p5.48xlarge has 32 EFA interfaces
  limits:
    vpc.amazonaws.com/efa: 32
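NCCL reaches EFA through the libfabric provider, so the training image must ship the aws-ofi-nccl plugin (AWS deep learning containers typically include it). Environment variables along these lines are commonly set to select EFA and confirm it is in use; exact settings vary by image and instance type, so treat this as a sketch:

```yaml
env:
  - name: FI_PROVIDER     # libfabric provider; "efa" selects the EFA transport
    value: "efa"
  - name: NCCL_DEBUG      # log NCCL initialization to confirm EFA was picked up
    value: "INFO"
```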

Scaling the Compute Pool

Increase the limits.cpu to allow Ryvn to provision multiple nodes:
gpu-training:
  nodeClassName: gpu-node-class
  requirements:
    instanceCategories:
      - "p"
    instanceFamilies:
      - "p5"
    arch: "amd64"
    hypervisor: "nitro"
    minGeneration: 4
  labels:
    workload-type: gpu-training
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
  limits:
    cpu: 1536    # Up to 8x p5.48xlarge nodes
  disruption:
    consolidationPolicy: "WhenEmpty"
    consolidateAfter: "30m"
For multi-node training, set consolidateAfter to a longer duration (e.g., 30m) to prevent the compute pool from terminating nodes between distributed training phases.

Cost Optimization

The WhenEmpty consolidation policy automatically terminates GPU nodes when no pods are scheduled. Set consolidateAfter to control how quickly nodes are reclaimed — 5m is a good default for interactive workloads, 30m for batch training with multiple phases.
Use the smallest GPU instance that meets your workload requirements. A single g6.xlarge (1x L4, ~$0.80/hr) is sufficient for inference and light fine-tuning. Reserve `p5.48xlarge` (8x H100, ~$98/hr) for large-scale training that genuinely needs it.
P5 and P4d instances include high-throughput NVMe instance stores. Set instanceStorePolicy: "RAID0" in your pool configuration to use these for scratch data instead of provisioning large EBS volumes. Instance store data is lost when the node terminates.
Always set limits.cpu on your GPU compute pool to prevent runaway provisioning. Calculate the limit as vCPUs per instance * max desired nodes.
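Applying that formula, a pool capped at four p5.48xlarge nodes would look like:

```yaml
gpu-training:
  limits:
    cpu: 768   # 192 vCPUs per p5.48xlarge x 4 nodes
```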