How It Works
Ryvn’s AWS platform blueprint includes a built-in compute pool engine that handles node autoscaling. By defining a GPU compute pool, you tell Ryvn which GPU instances to provision, how to configure their storage, and when to scale down. When a pod requests nvidia.com/gpu resources, Ryvn automatically launches a matching instance. When the node is empty, Ryvn terminates it.
Node Provisioning Lifecycle
GPU nodes are ephemeral — they exist only while workloads need them. Here’s the full lifecycle from job submission to node termination.

Prerequisites
Before configuring GPU nodes, ensure you have:
- An AWS environment provisioned with the aws-platform blueprint
- The NVIDIA device plugin installed in your cluster (see Install the NVIDIA Device Plugin)
Configure a GPU Compute Pool
A compute pool has two parts: a pool configuration that controls instance-level settings (AMI, storage), and the pool itself, which controls scaling behavior (instance types, limits, taints). The aws-platform blueprint exposes inputs for both.
Step 1: Define a Pool Configuration
A pool configuration controls the AMI, storage, and instance store policy for your GPU nodes. GPU training workloads typically need large data volumes for model weights, datasets, and checkpoints. Set the Custom Node Classes blueprint input accordingly.

Bottlerocket includes NVIDIA driver support for GPU instances out of the box. For workloads that need a specific driver version, you can use a custom AMI instead of the alias selector.

For instance types with local NVMe storage (such as the p5.48xlarge), you can use instance store volumes instead of EBS for higher throughput.
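The exact shape of the Custom Node Classes input depends on your blueprint version, but the settings above map onto Karpenter's EC2NodeClass fields, so a sketch along these lines is a reasonable starting point (the pool name and 500Gi volume size are illustrative choices, and Ryvn's blueprint supplies the networking and IAM wiring not shown here):

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-training          # illustrative name
spec:
  # Bottlerocket ships NVIDIA driver support for GPU instances out of the box
  amiSelectorTerms:
    - alias: bottlerocket@latest
  # Large data volume for model weights, datasets, and checkpoints
  # (/dev/xvdb is Bottlerocket's data volume)
  blockDeviceMappings:
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 500Gi     # size to fit your datasets and checkpoints
        volumeType: gp3
  # On instance types with local NVMe (p5, p4d), stripe the instance
  # store volumes into a single RAID0 array for high-throughput scratch
  instanceStorePolicy: RAID0
```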
Step 2: Define a GPU Compute Pool
A compute pool controls which instance types Ryvn can launch and how nodes are managed. Set the Custom Node Pools blueprint input.

| Setting | Purpose |
|---|---|
| instanceCategories: ["p"] | Restricts to GPU-accelerated P-family instances |
| instanceFamilies: ["p5"] | Narrows to H100 instances specifically. Use p4d for A100s, g6 for L4s |
| taints | Prevents non-GPU workloads from landing on expensive GPU nodes |
| limits.cpu: 192 | Caps the pool at one p5.48xlarge (192 vCPUs). Increase for more nodes |
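In Karpenter NodePool terms, which these settings correspond to, the pool might look like the following sketch (the name is illustrative; it references the pool configuration from Step 1):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training          # illustrative name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training    # the pool configuration from Step 1
      requirements:
        # Restrict to P-family, then narrow to p5 (H100)
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["p"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["p5"]
      # Keep non-GPU workloads off expensive GPU nodes
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  # Cap at one p5.48xlarge (192 vCPUs)
  limits:
    cpu: 192
  # Reclaim nodes shortly after they drain
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
```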
AWS GPU Instance Reference
| Instance Family | GPUs | GPU Type | vCPUs | GPU Memory | Use Case |
|---|---|---|---|---|---|
| p5.48xlarge | 8 | H100 | 192 | 640 GB | Large-scale training |
| p4d.24xlarge | 8 | A100 | 96 | 320 GB | Training and fine-tuning |
| g6.xlarge | 1 | L4 | 4 | 24 GB | Inference, light fine-tuning |
| g6.48xlarge | 8 | L4 | 192 | 192 GB | Multi-GPU inference |
| g5.xlarge | 1 | A10G | 4 | 24 GB | Inference |
Install the NVIDIA Device Plugin
Kubernetes needs the NVIDIA device plugin to expose GPU resources to the scheduler. Install it as a Helm chart service in Ryvn, deployed to the kube-system namespace. If your GPU node pools apply different labels, configure the plugin's nodeSelector to match all of them.
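Ryvn's service definition format is not shown here, but the upstream chart is nvidia-device-plugin (repo: https://nvidia.github.io/k8s-device-plugin). Values along these lines restrict the plugin to GPU nodes and let it tolerate the GPU taint (the node label is an assumption — match whatever labels your GPU pool actually applies):

```yaml
# Helm values for the nvidia-device-plugin chart
# Run only on GPU nodes; the label below is an assumed example
nodeSelector:
  karpenter.sh/nodepool: gpu-training
# Tolerate the GPU taint so the plugin can land on tainted nodes
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```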
When a new GPU node joins the cluster, there is a brief window (~60–90 seconds) while the device
plugin initializes and registers nvidia.com/gpu resources. Pods scheduled during this window may
fail with OutOfnvidia.com/gpu but will recover once registration completes. If the error persists,
check that the device plugin DaemonSet is running and healthy on the GPU node.
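To check this from a terminal with cluster access (the DaemonSet and node names are placeholders for whatever your install produced):

```
# Is the device plugin DaemonSet running and healthy?
kubectl get daemonset -n kube-system

# Has the node registered GPU capacity? (replace <node-name>)
kubectl describe node <node-name> | grep nvidia.com/gpu
```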
Deploy a Training Job
With a GPU compute pool configured, deploy a training workload as a Kubernetes Job. The Job requests GPU resources and tolerates the GPU taint; Ryvn handles the rest.

The karpenter.sh/do-not-disrupt annotation prevents the compute pool from evicting the pod for any reason — including node expiry and drift. This is the primary mechanism for protecting long-running training jobs. The WhenEmpty consolidation policy provides a natural complement: while a training pod is running, the node is not empty and won’t be consolidated. When the job completes and the pod terminates, the node becomes empty and is reclaimed after consolidateAfter.
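A Job manifest with the pieces described above might look like the following sketch (the Job name, image, and node label are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-finetune               # placeholder name
spec:
  backoffLimit: 0
  template:
    metadata:
      annotations:
        # Protect the long-running training pod from eviction
        karpenter.sh/do-not-disrupt: "true"
    spec:
      restartPolicy: Never
      # Assumed label; match your GPU pool's node labels
      nodeSelector:
        karpenter.sh/nodepool: gpu-training
      # Tolerate the GPU taint from the compute pool
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: train
          image: my-registry/trainer:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 8   # all eight H100s on a p5.48xlarge
```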
When this job is submitted:
- Kubernetes sees the nvidia.com/gpu resource request and the node selector
- No matching nodes exist, so the pod stays Pending
- The compute pool detects the pending pod and launches a matching GPU instance
- The instance joins the cluster, and the NVIDIA device plugin registers the GPUs
- Kubernetes schedules the pod onto the new node
- When the job completes and the node is empty, the compute pool terminates it after consolidateAfter
Node startup time for GPU instances is typically 3-5 minutes, including instance launch,
cluster join, and NVIDIA driver initialization.
Scaling to Multi-Node Training
For distributed training across multiple GPU nodes, increase the compute pool limits and deploy your workload with multiple replicas or a framework like PyTorch Distributed.

EFA for High-Performance Networking
P5 and P4d instances include Elastic Fabric Adapter (EFA) network interfaces for high-bandwidth, low-latency inter-node communication. Without EFA, multi-node NCCL traffic falls back to standard TCP networking, which is dramatically slower for distributed training. To enable EFA:
- Security group: The compute pool’s security group must allow all traffic between nodes in the same pool. The aws-platform blueprint handles this automatically.
- EFA device plugin: Install the AWS EFA Kubernetes device plugin as a Helm chart service, similar to the NVIDIA device plugin.
- Pod spec: Request EFA interfaces in your training job.
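The EFA device plugin advertises interfaces as the vpc.amazonaws.com/efa extended resource, so a container's resource block would request them alongside GPUs, roughly like this (interface count varies by instance type; a p5.48xlarge exposes 32):

```yaml
resources:
  limits:
    nvidia.com/gpu: 8
    vpc.amazonaws.com/efa: 32   # EFA interfaces on a p5.48xlarge
```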
Scaling the Compute Pool
Increase the limits.cpu value to allow Ryvn to provision multiple nodes.
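For example, to allow up to three p5.48xlarge nodes:

```yaml
limits:
  cpu: 576   # 192 vCPUs per p5.48xlarge * 3 nodes
```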
For multi-node training, set consolidateAfter to a longer duration (e.g., 30m) to prevent the compute pool from terminating nodes between distributed training phases.

Cost Optimization
Scale to zero when idle
The WhenEmpty consolidation policy automatically terminates GPU nodes when no pods are scheduled. Set consolidateAfter to control how quickly nodes are reclaimed: 5m is a good default for interactive workloads, 30m for batch training with multiple phases.
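In Karpenter NodePool terms, that is the disruption block:

```yaml
disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 5m   # use 30m for multi-phase batch training
```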
Right-size your instances
Use the smallest GPU instance that meets your workload requirements. A single g6.xlarge (1x L4, roughly $0.80/hr on-demand) covers many inference workloads; reserve a p5.48xlarge (roughly $98/hr) for large-scale training that genuinely needs it.
Use instance store for ephemeral data
P5 and P4d instances include high-throughput NVMe instance stores. Set instanceStorePolicy: "RAID0" in your pool configuration to use these for scratch data instead of provisioning large EBS volumes. Instance store data is lost when the node terminates.
Set strict compute pool limits
Always set limits.cpu on your GPU compute pool to prevent runaway provisioning. Calculate the limit as vCPUs per instance * max desired nodes (e.g., 192 * 3 = 576 for up to three p5.48xlarge nodes).