Run GPU workloads on your Ryvn-managed EKS clusters using compute pools to automatically provision and deprovision GPU nodes. Nodes spin up when a workload needs them and scale back to zero when idle, so you only pay for GPU compute while jobs are running.

How It Works

Ryvn’s AWS platform blueprint includes a built-in compute pool engine that handles node autoscaling. By defining a GPU compute pool, you tell Ryvn which GPU instances to provision, how to configure their storage, and when to scale down. When a pod requests nvidia.com/gpu resources, Ryvn automatically launches a matching instance. When the node is empty, Ryvn terminates it.
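As a concrete illustration, any pod that requests `nvidia.com/gpu` triggers this flow. A minimal smoke-test pod (the name and CUDA image here are illustrative, not Ryvn-specific) might look like:

```yaml
# Illustrative pod: the nvidia.com/gpu request is what causes the
# compute pool to launch a GPU instance if none is available.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]   # prints GPU info, then the pod completes
      resources:
        limits:
          nvidia.com/gpu: 1
```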

Node Provisioning Lifecycle

GPU nodes are ephemeral — they exist only while workloads need them. Here’s the full lifecycle from job submission to node termination:
  1. A workload is submitted with an nvidia.com/gpu resource request
  2. No matching node exists, so the pod stays Pending
  3. The compute pool detects the pending pod and launches a matching GPU instance
  4. The instance joins the cluster and the NVIDIA device plugin registers its GPUs
  5. Kubernetes schedules the pod onto the new node
  6. When the workload completes and the node is empty, the compute pool terminates it

Prerequisites

Before configuring GPU nodes, ensure you have:
  1. A Ryvn-managed EKS cluster provisioned from the aws-platform blueprint
  2. Access to edit the environment’s blueprint inputs (Custom Node Classes and Custom Node Pools)
  3. Sufficient AWS service quotas for the GPU instance types you plan to run

Configure a GPU Compute Pool

A compute pool has two parts: a pool configuration that controls instance-level settings (AMI, storage), and the pool itself that controls scaling behavior (instance types, limits, taints). The aws-platform blueprint exposes inputs for both.

Step 1: Define a Pool Configuration

A pool configuration controls the AMI, storage, and instance store policy for your GPU nodes. GPU training workloads typically need large data volumes for model weights, datasets, and checkpoints. Set the Custom Node Classes blueprint input:
gpu-node-class:
  amiSelectorTerms:
    - alias: bottlerocket@v1.52.0
  blockDeviceMappings:
    # Root volume
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 32Gi
        volumeType: gp3
        encrypted: true
    # Data volume — size this for your dataset + model weights + checkpoints
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 2000Gi
        volumeType: gp3
        encrypted: true
Bottlerocket includes NVIDIA driver support for GPU instances out of the box. For workloads that need a specific driver version, you can use a custom AMI instead of the alias selector.
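Assuming Ryvn’s node classes accept the same amiSelectorTerms shape as the alias example above, pinning a specific image could look like the following (the AMI ID is a placeholder):

```yaml
gpu-node-class:
  amiSelectorTerms:
    # Pin an exact AMI (e.g. an image built with a specific NVIDIA
    # driver version) instead of the Bottlerocket alias.
    # The ID below is a placeholder.
    - id: ami-0123456789abcdef0
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 32Gi
        volumeType: gp3
        encrypted: true
```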
For instances with NVMe instance stores (like p5.48xlarge), you can use instance store volumes instead of EBS for higher throughput:
gpu-node-class:
  amiSelectorTerms:
    - alias: bottlerocket@v1.52.0
  instanceStorePolicy: "RAID0"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 32Gi
        volumeType: gp3
        encrypted: true
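Under this policy, the striped instance-store array typically backs the node’s ephemeral storage, so workloads can consume it through an ordinary emptyDir scratch volume. A pod-spec sketch (image and sizes are illustrative):

```yaml
# Scratch space on node ephemeral storage, which the RAID0
# instance-store array backs under instanceStorePolicy: "RAID0".
# Data here is lost when the node terminates.
volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 1000Gi
containers:
  - name: trainer
    image: my-registry/trainer:latest   # placeholder image
    volumeMounts:
      - name: scratch
        mountPath: /scratch
```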

Step 2: Define a GPU Compute Pool

A compute pool controls which instance types Ryvn can launch and how nodes are managed. Set the Custom Node Pools blueprint input:
gpu-training:
  nodeClassName: gpu-node-class
  requirements:
    instanceCategories:
      - "p"
    instanceFamilies:
      - "p5"
    arch: "amd64"
    hypervisor: "nitro"
    minGeneration: 4
  labels:
    workload-type: gpu-training
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
  limits:
    cpu: 192
  disruption:
    consolidationPolicy: "WhenEmpty"
    consolidateAfter: "5m"
Key settings explained:
| Setting | Purpose |
| --- | --- |
| `instanceCategories: ["p"]` | Restricts to GPU-accelerated P-family instances |
| `instanceFamilies: ["p5"]` | Narrows to H100 instances specifically; use `p4d` for A100s, `g6` for L4s |
| `taints` | Prevents non-GPU workloads from landing on expensive GPU nodes |
| `limits.cpu: 192` | Caps the pool at one p5.48xlarge (192 vCPUs); increase for more nodes |
GPU instances are expensive. Set limits.cpu conservatively to prevent accidental over-provisioning.

AWS GPU Instance Reference

| Instance | GPUs | GPU Type | vCPUs | GPU Memory | Use Case |
| --- | --- | --- | --- | --- | --- |
| p5.48xlarge | 8 | H100 | 192 | 640 GB | Large-scale training |
| p4d.24xlarge | 8 | A100 | 96 | 320 GB | Training and fine-tuning |
| g6.xlarge | 1 | L4 | 4 | 24 GB | Inference, light fine-tuning |
| g6.48xlarge | 8 | L4 | 192 | 192 GB | Multi-GPU inference |
| g5.xlarge | 1 | A10G | 4 | 24 GB | Inference |

Install the NVIDIA Device Plugin

Kubernetes needs the NVIDIA device plugin to expose GPU resources to the scheduler. Install it as a Helm chart service in Ryvn:
kind: Service
metadata:
  name: nvidia-device-plugin
spec:
  type: helm-chart.v1
  chart:
    repository: https://nvidia.github.io/k8s-device-plugin
    name: nvidia-device-plugin
    version: "0.19.0"
Then install it into your environment’s kube-system namespace with this config:
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
nodeSelector:
  workload-type: gpu-training
This ensures the plugin only runs on GPU nodes and tolerates the GPU taint. If you have multiple GPU compute pools with different labels, adjust the nodeSelector to match all of them.
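For the multi-pool case, a plain nodeSelector cannot match more than one label value; assuming the chart exposes a standard `affinity` value (most charts do), node affinity with an In operator is one way to sketch it. The `gpu-inference` label below is a hypothetical second pool:

```yaml
# Sketch: run the plugin on nodes from any of several GPU pools.
# Assumes the chart supports a standard `affinity` value; the
# workload-type values must match the labels your pools define.
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: workload-type
              operator: In
              values:
                - gpu-training
                - gpu-inference   # hypothetical second pool
```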
The NVIDIA device plugin must be installed before submitting any GPU workloads. Without it, pods will fail to schedule on GPU nodes.
When a new GPU node joins the cluster, there is a brief window (~60–90 seconds) while the device plugin initializes and registers nvidia.com/gpu resources. Pods scheduled during this window may fail with OutOfnvidia.com/gpu but will recover once registration completes. If the error persists, check that the device plugin DaemonSet is running and healthy on the GPU node.

Deploy a Training Job

With a GPU compute pool configured, deploy a training workload as a Kubernetes Job. The Job requests GPU resources and tolerates the GPU taint — Ryvn handles the rest. Example Helm values for a training job:
podAnnotations:
  karpenter.sh/do-not-disrupt: "true"

resources:
  requests:
    nvidia.com/gpu: 1
    cpu: "4"
    memory: "32Gi"
  limits:
    nvidia.com/gpu: 1

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

nodeSelector:
  workload-type: gpu-training
The karpenter.sh/do-not-disrupt annotation prevents the compute pool from evicting the pod for any reason — including node expiry and drift. This is the primary mechanism for protecting long-running training jobs. The WhenEmpty consolidation policy provides a natural complement: while a training pod is running, the node is not empty and won’t be consolidated. When the job completes and the pod terminates, the node becomes empty and is reclaimed after consolidateAfter.

When this job is submitted:
  1. Kubernetes sees the nvidia.com/gpu resource request and the node selector
  2. No matching nodes exist, so the pod stays Pending
  3. The compute pool detects the pending pod and launches a matching GPU instance
  4. The instance joins the cluster, the NVIDIA device plugin registers the GPUs
  5. Kubernetes schedules the pod onto the new node
  6. When the job completes and the node is empty, the compute pool terminates it after consolidateAfter
Node startup time for GPU instances is typically 3-5 minutes, including instance launch, cluster join, and NVIDIA driver initialization.

Scaling to Multi-Node Training

For distributed training across multiple GPU nodes, increase the compute pool limits and deploy your workload with multiple replicas or a framework like PyTorch Distributed.
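One way to sketch this with plain Kubernetes primitives is an Indexed Job, where each completion index maps to one training rank. The image is a placeholder, and the pod/replica counts are illustrative; Kubernetes injects a JOB_COMPLETION_INDEX environment variable into each pod, which launchers such as torchrun can map to a node rank:

```yaml
# Illustrative two-node training job: one pod per GPU node,
# each requesting a full 8-GPU instance.
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  completionMode: Indexed
  completions: 2
  parallelism: 2
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"   # protect from node disruption
    spec:
      restartPolicy: Never
      nodeSelector:
        workload-type: gpu-training
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: trainer
          image: my-registry/trainer:latest   # placeholder image
          resources:
            requests:
              nvidia.com/gpu: 8
            limits:
              nvidia.com/gpu: 8
```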

EFA for High-Performance Networking

P5 and P4d instances include Elastic Fabric Adapter (EFA) network interfaces for high-bandwidth, low-latency inter-node communication. Without EFA, multi-node NCCL traffic falls back to standard TCP networking, which is dramatically slower for distributed training. To enable EFA:
  1. Security group: The compute pool’s security group must allow all traffic between nodes in the same pool. The aws-platform blueprint handles this automatically.
  2. EFA device plugin: Install the AWS EFA Kubernetes device plugin as a Helm chart service, similar to the NVIDIA device plugin:
kind: Service
metadata:
  name: aws-efa-k8s-device-plugin
spec:
  type: helm-chart.v1
  chart:
    repository: https://aws.github.io/eks-charts
    name: aws-efa-k8s-device-plugin
  3. Pod spec: Request EFA interfaces in your training job:
resources:
  requests:
    vpc.amazonaws.com/efa: 32   # p5.48xlarge has 32 EFA interfaces
  limits:
    vpc.amazonaws.com/efa: 32
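NCCL reaches EFA through the libfabric provider, so the training image must ship the aws-ofi-nccl plugin (AWS deep learning containers typically include it). Environment variables along these lines are commonly set to select EFA and confirm it is in use; exact settings vary by image and instance type, so treat this as a sketch:

```yaml
env:
  - name: FI_PROVIDER     # libfabric provider; "efa" selects the EFA transport
    value: "efa"
  - name: NCCL_DEBUG      # log NCCL initialization to confirm EFA was picked up
    value: "INFO"
```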

Scaling the Compute Pool

Increase the limits.cpu to allow Ryvn to provision multiple nodes:
gpu-training:
  nodeClassName: gpu-node-class
  requirements:
    instanceCategories:
      - "p"
    instanceFamilies:
      - "p5"
    arch: "amd64"
    hypervisor: "nitro"
    minGeneration: 4
  labels:
    workload-type: gpu-training
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
  limits:
    cpu: 1536    # Up to 8x p5.48xlarge nodes
  disruption:
    consolidationPolicy: "WhenEmpty"
    consolidateAfter: "30m"
For multi-node training, set consolidateAfter to a longer duration (e.g., 30m) to prevent the compute pool from terminating nodes between distributed training phases.

Cost Optimization

The WhenEmpty consolidation policy automatically terminates GPU nodes when no pods are scheduled. Set consolidateAfter to control how quickly nodes are reclaimed — 5m is a good default for interactive workloads, 30m for batch training with multiple phases.
Use the smallest GPU instance that meets your workload requirements. A single g6.xlarge (1x L4, ~$0.80/hr) is sufficient for inference and light fine-tuning. Reserve `p5.48xlarge` (8x H100, ~$98/hr) for large-scale training that genuinely needs it.
P5 and P4d instances include high-throughput NVMe instance stores. Set instanceStorePolicy: "RAID0" in your pool configuration to use these for scratch data instead of provisioning large EBS volumes. Instance store data is lost when the node terminates.
Always set limits.cpu on your GPU compute pool to prevent runaway provisioning. Calculate the limit as vCPUs per instance * max desired nodes.
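Applying that formula, a pool capped at four p5.48xlarge nodes would look like:

```yaml
gpu-training:
  limits:
    cpu: 768   # 192 vCPUs per p5.48xlarge x 4 nodes
```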