Still heavily WIP!
A Kubernetes operator for ephemeral GPU provisioning across cloud providers, using Tailscale mesh networking. It provisions GPU instances on demand and joins them to Talos Kubernetes clusters.
## Features

- Multi-cloud support: Google Cloud Platform, Vultr
- Price-optimised GPU instance selection
- Instance lifecycle management
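Price-optimised selection can be pictured as choosing the cheapest matching offer across enabled providers. A minimal sketch, assuming hypothetical types (`Offer` and `cheapest` are illustrative, not the operator's actual API):

```go
package main

import "fmt"

// Offer is an illustrative GPU instance offer from one provider.
type Offer struct {
	Provider  string
	GPUType   string
	HourlyUSD float64
}

// cheapest returns the lowest-priced offer matching the wanted GPU type
// while staying under the hourly budget; ok is false if none match.
func cheapest(offers []Offer, gpuType string, maxHourly float64) (best Offer, ok bool) {
	for _, o := range offers {
		if o.GPUType != gpuType || o.HourlyUSD > maxHourly {
			continue
		}
		if !ok || o.HourlyUSD < best.HourlyUSD {
			best, ok = o, true
		}
	}
	return best, ok
}

func main() {
	// Sample offers for illustration only.
	offers := []Offer{
		{Provider: "gcp", GPUType: "RTX4090", HourlyUSD: 1.20},
		{Provider: "vultr", GPUType: "RTX4090", HourlyUSD: 0.95},
		{Provider: "vultr", GPUType: "A100", HourlyUSD: 2.40},
	}
	best, ok := cheapest(offers, "RTX4090", 2.0)
	fmt.Println(ok, best.Provider, best.HourlyUSD) // true vultr 0.95
}
```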
## Prerequisites

- Talos Kubernetes cluster
- Mesh networking (e.g., Tailscale), a publicly exposed control plane, or Omni
- Cloud provider credentials
## Installation

```sh
# Install the TGP operator
helm install tgp-operator oci://ghcr.io/solanyn/charts/tgp-operator \
  --namespace tgp-system \
  --create-namespace
```

The operator provides two resource types:

- `GPUNodeClass` - Cluster-scoped infrastructure templates
- `GPUNodePool` - Namespaced provisioning requests
## Credentials

```sh
# Create provider credentials secret
kubectl create secret generic tgp-operator-secret \
  --from-literal=GOOGLE_APPLICATION_CREDENTIALS_JSON='{"type":"service_account","project_id":"your-project",...}' \
  --from-literal=VULTR_API_KEY=your-vultr-api-key \
  --from-literal=client-id=your-tailscale-oauth-client-id \
  --from-literal=client-secret=your-tailscale-oauth-client-secret \
  -n tgp-system
```

Required credentials:

- Google Cloud service account JSON with IAM permissions
- Vultr API key from the account API section
- Tailscale OAuth credentials from the admin console
## GCP Setup

Prepare Talos Linux images in your GCP project.

### Option 1: Manual upload
```sh
# Download the Talos GCP image
wget https://github.com/siderolabs/talos/releases/download/v1.10.5/gcp-amd64.tar.gz

# Create a temporary GCS bucket (if you don't have one)
gsutil mb gs://YOUR-BUCKET-NAME

# Upload to the GCS bucket
gsutil cp gcp-amd64.tar.gz gs://YOUR-BUCKET-NAME/talos-v1.10.5.tar.gz

# Create a compute image from the bucket
gcloud compute images create talos-linux-latest \
  --source-uri gs://YOUR-BUCKET-NAME/talos-v1.10.5.tar.gz \
  --family talos-linux \
  --description "Talos Linux v1.10.5 for GPU workloads"

# Clean up temporary files
rm gcp-amd64.tar.gz
gsutil rm gs://YOUR-BUCKET-NAME/talos-v1.10.5.tar.gz
```

### Option 2: Custom image reference

Specify the image URL in the `GPUNodeClass`:

```yaml
spec:
  talosConfig:
    image: "projects/MY-PROJECT/global/images/my-custom-talos-image"
```

Required GCP IAM roles:

- Compute Instance Admin (v1)
- Compute Image User
- Service Account User
## Vultr Setup

- Get an API key from the Vultr Control Panel → Account → API
- Talos Linux is available via the marketplace (OS ID: 2284)
- GPU types: H100, L40S, A100, A40, A16, MI325X, MI300X

Required permissions:

- Instance management
- Plan access
- Region access
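On Vultr, Talos is selected by the marketplace OS ID (2284) and an instance plan with the desired GPU. Filtering plans by GPU type can be sketched like this; the plan entries and field names below are illustrative sample data, not a live Vultr API response:

```go
package main

import "fmt"

// Plan is an illustrative subset of a Vultr plan entry.
type Plan struct {
	ID          string
	GPUType     string
	MonthlyCost float64
}

// talosOSID is the Talos Linux marketplace OS ID noted above.
const talosOSID = 2284

// gpuPlans keeps only the plans offering the requested GPU type.
func gpuPlans(plans []Plan, gpuType string) []Plan {
	var out []Plan
	for _, p := range plans {
		if p.GPUType == gpuType {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Hypothetical plan IDs for illustration.
	sample := []Plan{
		{ID: "sample-a100-plan", GPUType: "A100", MonthlyCost: 90},
		{ID: "sample-l40s-plan", GPUType: "L40S", MonthlyCost: 250},
	}
	matches := gpuPlans(sample, "A100")
	fmt.Println(len(matches), matches[0].ID, talosOSID)
}
```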
## Machine Configuration Template

A `GPUNodeClass` requires a Talos machine configuration template with these variables:

- `{{.MachineToken}}`
- `{{.ClusterCA}}`
- `{{.ClusterID}}`
- `{{.ClusterSecret}}`
- `{{.ControlPlaneEndpoint}}`
- `{{.ClusterName}}`
- `{{.TailscaleAuthKey}}`
- `{{.NodeName}}` - Generated node name
- `{{.NodePool}}` - NodePool name
- `{{.NodeIndex}}` - Node index in pool
### Example: GPUNodeClass

```yaml
apiVersion: tgp.io/v1
kind: GPUNodeClass
metadata:
  name: standard-gpu-class
spec:
  providers:
    - name: gcp
      priority: 1
      enabled: true
      credentialsRef:
        name: tgp-operator-secret
        key: GOOGLE_APPLICATION_CREDENTIALS_JSON
  talosConfig:
    image: "ghcr.io/siderolabs/talos:v1.10.5"
    machineConfigTemplate: |
      version: v1alpha1
      debug: false
      persist: true
      machine:
        token: {{.MachineToken}}
        ca:
          crt: {{.ClusterCA}}
        certSANs:
          - 127.0.0.1
        kubelet:
          extraMounts:
            - destination: /var/mnt/extra
              type: bind
              source: /var/mnt/extra
              options: [bind, rshared, rw]
        files:
          - path: /etc/tailscale/authkey
            permissions: 0o600
            op: create
            content: {{.TailscaleAuthKey}}
          - path: /etc/systemd/system/tailscaled.service
            op: create
            content: |
              [Unit]
              Description=Tailscale VPN
              After=network.target
              [Service]
              Type=notify
              ExecStart=/usr/bin/tailscaled --state=/var/lib/tailscale/tailscaled.state
              ExecStartPost=/usr/bin/tailscale up --authkey-file=/etc/tailscale/authkey --hostname={{.NodeName}}
              Restart=always
              [Install]
              WantedBy=multi-user.target
        systemd:
          services:
            - name: tailscaled.service
              enabled: true
        nodeLabels:
          tgp.io/nodepool: {{.NodePool}}
          tgp.io/provisioned: "true"
          node.kubernetes.io/instance-type: "gpu"
      cluster:
        id: {{.ClusterID}}
        secret: {{.ClusterSecret}}
        controlPlane:
          endpoint: {{.ControlPlaneEndpoint}}
        clusterName: {{.ClusterName}}
  tailscaleConfig:
    tags: ["tag:k8s", "tag:gpu"]
    ephemeral: true
    acceptRoutes: true
  instanceRequirements:
    gpuTypes: ["RTX4090", "RTX3090"]
    spotAllowed: true
  limits:
    maxNodes: 10
    maxHourlyCost: "50.0"
```

### Example: GPUNodePool

```yaml
apiVersion: tgp.io/v1
kind: GPUNodePool
metadata:
  name: ml-workload-pool
  namespace: default
spec:
  nodeClassRef:
    kind: GPUNodeClass
    name: standard-gpu-class
  template:
    spec:
      requirements:
        - key: "tgp.io/gpu-type"
          operator: In
          values: ["RTX4090"]
        - key: "tgp.io/region"
          operator: In
          values: ["us-west", "us-east"]
      taints:
        - key: "gpu-node"
          value: "true"
          effect: NoSchedule
  maxHourlyPrice: "2.0"
  weight: 10
```
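The pool's Karpenter-style requirements act as label selectors over candidate instances. A sketch of how the `In` operator could be evaluated (the `Requirement` type and `matches` helper are illustrative, not the operator's actual code):

```go
package main

import "fmt"

// Requirement mirrors the Karpenter-style entries in the pool spec.
type Requirement struct {
	Key    string
	Values []string // operator In: the label value must be one of these
}

// matches reports whether the instance labels satisfy every In requirement.
func matches(labels map[string]string, reqs []Requirement) bool {
	for _, r := range reqs {
		ok := false
		for _, v := range r.Values {
			if labels[r.Key] == v {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	// Requirements from the example GPUNodePool above.
	reqs := []Requirement{
		{Key: "tgp.io/gpu-type", Values: []string{"RTX4090"}},
		{Key: "tgp.io/region", Values: []string{"us-west", "us-east"}},
	}
	candidate := map[string]string{
		"tgp.io/gpu-type": "RTX4090",
		"tgp.io/region":   "us-west",
	}
	fmt.Println(matches(candidate, reqs)) // true
}
```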
## Monitoring

```sh
# Check node classes and pools
kubectl get gpunodeclass
kubectl get gpunodepool -A

# Check specific resources
kubectl describe gpunodeclass standard-gpu-class
kubectl describe gpunodepool ml-workload-pool -n default

# Monitor operator logs
kubectl logs -n tgp-system deployment/tgp-operator-controller-manager -f
```

## Architecture

This operator exposes CRDs modeled on Karpenter's:
- `GPUNodeClass` - Infrastructure configuration:
  - Provider credentials and settings
  - Talos OS and Tailscale configuration
  - Instance requirements and cost limits
  - Security policies
- `GPUNodePool` - Provisioning requests:
  - References a `GPUNodeClass`
  - Node requirements and constraints
  - Lifecycle and disruption policies
## Contributing

See DEVELOPMENT.md for development setup, testing and contribution guidelines.
## License

MIT License - see LICENSE file for details.