A Kubernetes controller for managing automated upgrades of Talos Linux and Kubernetes.
- Automated Talos node upgrades with intelligent orchestration
- Kubernetes upgrades - roll the cluster to a newer Kubernetes version
- Safe upgrade execution - upgrades always run from healthy nodes (never self-upgrade)
- Built-in health checks - CEL-based expressions for custom cluster validation
- Configurable reboot modes - default or powercycle options
- Comprehensive status tracking with real-time progress reporting
- Resilient job execution with automatic retry and pod replacement
- Prometheus metrics - detailed monitoring of upgrade progress and health
- Talos cluster with API access configured
- Namespace for the controller (e.g., `system-upgrade`)
Allow Talos API access from the controller's namespace by applying this config to all of your nodes:
```yaml
machine:
  features:
    kubernetesTalosAPIAccess:
      enabled: true
      allowedRoles:
        - os:admin
      allowedKubernetesNamespaces:
        - system-upgrade # or the namespace the controller will be installed to
```
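One way to roll this out, sketched on the assumption that the snippet above is saved as `talos-api-access.yaml` and `talosctl` is already configured for your cluster:

```sh
# Apply the snippet as a machine config patch on each node
talosctl patch mc --nodes <node-ip> --patch @talos-api-access.yaml
```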
Install the Helm chart:
```sh
# Install via Helm
helm install tuppr oci://ghcr.io/home-operations/charts/tuppr \
  --version 0.1.0 \
  --namespace system-upgrade
```
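Then verify the controller is running before creating any upgrade resources (the label selector matches the one used in the troubleshooting section below):

```sh
kubectl get pods -n system-upgrade -l app.kubernetes.io/name=tuppr
```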
Create a `TalosUpgrade` resource:
```yaml
apiVersion: tuppr.home-operations.com/v1alpha1
kind: TalosUpgrade
metadata:
  name: cluster
spec:
  talos:
    # renovate: datasource=docker depName=ghcr.io/siderolabs/installer
    version: v1.11.0 # Required - target Talos version
  policy:
    debug: true # Optional, verbose logging
    force: false # Optional, skip etcd health checks
    rebootMode: default # Optional, default|powercycle
    placement: soft # Optional, hard|soft
  # Custom health checks (optional)
  healthChecks:
    - apiVersion: v1
      kind: Node
      expr: status.conditions.exists(c, c.type == "Ready" && c.status == "True")
  # Talosctl configuration (optional)
  talosctl:
    image:
      repository: ghcr.io/siderolabs/talosctl # Optional, default
      tag: v1.11.0 # Optional, auto-detected
      pullPolicy: IfNotPresent # Optional, default
```
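Apply it and watch progress (the manifest filename matches the one used in the scenarios later in this README):

```sh
kubectl apply -f talos-upgrade.yaml
kubectl get talosupgrade cluster -w
```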
Create a `KubernetesUpgrade` resource:
```yaml
apiVersion: tuppr.home-operations.com/v1alpha1
kind: KubernetesUpgrade
metadata:
  name: kubernetes
spec:
  kubernetes:
    # renovate: datasource=docker depName=ghcr.io/siderolabs/kubelet
    version: v1.34.0 # Required - target Kubernetes version
  # Custom health checks (optional)
  healthChecks:
    - apiVersion: v1
      kind: Node
      expr: status.conditions.exists(c, c.type == "Ready" && c.status == "True")
      timeout: 10m
  # Talosctl configuration (optional)
  talosctl:
    image:
      repository: ghcr.io/siderolabs/talosctl # Optional, default
      tag: v1.11.0 # Optional, auto-detected
      pullPolicy: IfNotPresent # Optional, default
```
Define custom health checks using CEL expressions. These health checks are evaluated before each upgrade and run concurrently.
```yaml
healthChecks:
  # Check all nodes are ready
  - apiVersion: v1
    kind: Node
    expr: |
      status.conditions.filter(c, c.type == "Ready").all(c, c.status == "True")
    timeout: 10m
  # Check specific deployment replicas
  - apiVersion: apps/v1
    kind: Deployment
    name: critical-app
    namespace: production
    expr: status.readyReplicas == status.replicas
  # Check custom resources
  - apiVersion: ceph.rook.io/v1
    kind: CephCluster
    name: rook-ceph
    namespace: rook-ceph
    expr: status.ceph.health in ["HEALTH_OK"]
```
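The same pattern works for any resource whose status you can express in CEL. For example, a sketch of a DaemonSet rollout check (the workload name is hypothetical; the status fields are standard `apps/v1`):

```yaml
healthChecks:
  # Ensure a DaemonSet is fully rolled out before proceeding
  - apiVersion: apps/v1
    kind: DaemonSet
    name: cilium # hypothetical workload name
    namespace: kube-system
    expr: status.numberReady == status.desiredNumberScheduled
```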
Fine-tune upgrade behavior:
```yaml
policy:
  # Enable debug logging for troubleshooting
  debug: true
  # Force upgrade even if etcd is unhealthy (dangerous!)
  force: true
  # Controls how strictly upgrade jobs avoid the target node
  placement: hard # or "soft"
  # Use powercycle reboot for problematic nodes
  rebootMode: powercycle # or "default"
```
Tuppr exposes comprehensive Prometheus metrics for monitoring upgrade progress, health check performance, and job execution:
```text
# Current phase of Talos upgrades (0=Pending, 1=InProgress, 2=Completed, 3=Failed)
tuppr_talos_upgrade_phase{name="cluster", phase="InProgress"} 1

# Node counts for Talos upgrades
tuppr_talos_upgrade_nodes_total{name="cluster"} 5
tuppr_talos_upgrade_nodes_completed{name="cluster"} 3
tuppr_talos_upgrade_nodes_failed{name="cluster"} 0

# Duration of Talos upgrade phases (histogram)
tuppr_talos_upgrade_duration_seconds{name="cluster", phase="InProgress"}

# Current phase of Kubernetes upgrades (0=Pending, 1=InProgress, 2=Completed, 3=Failed)
tuppr_kubernetes_upgrade_phase{name="kubernetes", phase="Completed"} 2

# Duration of Kubernetes upgrade phases (histogram)
tuppr_kubernetes_upgrade_duration_seconds{name="kubernetes", phase="InProgress"}

# Time taken for health checks to pass (histogram)
tuppr_health_check_duration_seconds{upgrade_type="talos", upgrade_name="cluster"}

# Total number of health check failures (counter)
tuppr_health_check_failures_total{upgrade_type="talos", upgrade_name="cluster", check_index="0"}

# Number of active upgrade jobs
tuppr_upgrade_jobs_active{upgrade_type="talos"} 1

# Time taken for upgrade jobs to complete (histogram)
tuppr_upgrade_job_duration_seconds{upgrade_type="talos", node_name="worker-01", result="success"}
```
```promql
# Upgrade phase status
tuppr_talos_upgrade_phase or tuppr_kubernetes_upgrade_phase

# Node upgrade progress for Talos (percent complete)
tuppr_talos_upgrade_nodes_completed / tuppr_talos_upgrade_nodes_total * 100

# No health check failures over the last 5 minutes
rate(tuppr_health_check_failures_total[5m]) == 0

# 95th percentile health check duration
histogram_quantile(0.95, rate(tuppr_health_check_duration_seconds_bucket[5m]))

# Successful job completion rate
rate(tuppr_upgrade_job_duration_seconds_count{result="success"}[5m])

# Active jobs by type
sum(tuppr_upgrade_jobs_active) by (upgrade_type)
```
Example Prometheus alerting rules:
```yaml
groups:
  - name: tuppr
    rules:
      - alert: TupprUpgradeFailed
        expr: tuppr_talos_upgrade_phase{phase="Failed"} == 3 or tuppr_kubernetes_upgrade_phase{phase="Failed"} == 3
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Tuppr upgrade failed"
          description: "Upgrade {{ $labels.name }} has failed"

      - alert: TupprUpgradeStuck
        expr: |
          (
            tuppr_talos_upgrade_phase{phase="InProgress"} == 1 and
            increase(tuppr_talos_upgrade_nodes_completed[30m]) == 0
          ) or (
            tuppr_kubernetes_upgrade_phase{phase="InProgress"} == 1
          )
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Tuppr upgrade appears stuck"
          description: "Upgrade {{ $labels.name }} has been in progress for 30+ minutes without completing nodes"

      - alert: TupprHealthCheckFailures
        expr: rate(tuppr_health_check_failures_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High health check failure rate"
          description: "Health checks for {{ $labels.upgrade_name }} are failing frequently"
```
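If you run the Prometheus Operator (e.g., via kube-prometheus-stack), the same rules can be shipped as a `PrometheusRule` resource instead of a static rules file. A minimal sketch, assuming the operator's CRDs are installed and it discovers rules in the `system-upgrade` namespace:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tuppr
  namespace: system-upgrade
spec:
  groups:
    - name: tuppr
      rules:
        - alert: TupprUpgradeFailed
          expr: tuppr_talos_upgrade_phase{phase="Failed"} == 3 or tuppr_kubernetes_upgrade_phase{phase="Failed"} == 3
          labels:
            severity: critical
          annotations:
            summary: "Tuppr upgrade failed"
```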
```sh
# Watch Talos upgrade progress
kubectl get talosupgrade -w

# Watch Kubernetes upgrade progress
kubectl get kubernetesupgrade -w

# Check detailed status
kubectl describe talosupgrade cluster
kubectl describe kubernetesupgrade kubernetes

# View upgrade logs
kubectl logs -f deployment/tuppr -n system-upgrade

# Check metrics endpoint
kubectl port-forward -n system-upgrade deployment/tuppr 8080:8080
curl http://localhost:8080/metrics | grep tuppr_
```
Suspending upgrades is useful if you want to upgrade manually without the controller interfering.
```sh
# Suspend Talos upgrade
kubectl annotate talosupgrade cluster tuppr.home-operations.com/suspend="true"

# Suspend Kubernetes upgrade
kubectl annotate kubernetesupgrade kubernetes tuppr.home-operations.com/suspend="true"

# Remove the suspend annotation to resume
kubectl annotate talosupgrade cluster tuppr.home-operations.com/suspend-
kubectl annotate kubernetesupgrade kubernetes tuppr.home-operations.com/suspend-
```
```sh
# Reset failed Talos upgrade
kubectl annotate talosupgrade cluster tuppr.home-operations.com/reset="$(date)"

# Reset failed Kubernetes upgrade
kubectl annotate kubernetesupgrade kubernetes tuppr.home-operations.com/reset="$(date)"
```
```sh
# Check job logs
kubectl logs job/tuppr-xyz -n system-upgrade

# Check controller health
kubectl get pods -n system-upgrade -l app.kubernetes.io/name=tuppr

# View metrics for debugging
kubectl port-forward -n system-upgrade deployment/tuppr 8080:8080
curl http://localhost:8080/metrics | grep -E "(tuppr_.*_phase|tuppr_.*_duration)"

# Pause all upgrades (scale down controller)
kubectl scale deployment tuppr --replicas=0 -n system-upgrade

# Emergency cleanup
kubectl delete talosupgrade --all
kubectl delete kubernetesupgrade --all
kubectl delete jobs -l app.kubernetes.io/name=talos-upgrade -n system-upgrade
kubectl delete jobs -l app.kubernetes.io/name=kubernetes-upgrade -n system-upgrade

# Resume operations
kubectl scale deployment tuppr --replicas=1 -n system-upgrade
```
| Feature | TalosUpgrade | KubernetesUpgrade |
| --- | --- | --- |
| Scope | Talos nodes | Kubernetes cluster |
| Multiple CRs | Only one per cluster | Only one per cluster |
| Execution | Sequential node-by-node | Single controller node |
| Reboot Required | Yes | No |
| Health Checks | Before each node | Before upgrade |
| Concurrent Execution | Blocked by other upgrades | Blocked by other upgrades |
| Handling Failures | Manual | Manual |
| Metrics | Comprehensive | Comprehensive |
- TalosUpgrade: Only one `TalosUpgrade` resource is allowed per cluster. This constraint simplifies the upgrade orchestration by processing all nodes sequentially in a single upgrade workflow, eliminating the need for complex queueing and coordination between multiple upgrade resources. The admission webhook will reject attempts to create additional `TalosUpgrade` resources.
- KubernetesUpgrade: Only one `KubernetesUpgrade` resource is allowed per cluster. This constraint exists because Kubernetes upgrades affect the entire cluster, and multiple concurrent upgrades would conflict with each other. The admission webhook will reject attempts to create additional `KubernetesUpgrade` resources.
- Cross-Upgrade Coordination: TalosUpgrade and KubernetesUpgrade resources cannot run concurrently. If one upgrade is in progress (status.phase == "InProgress"), the other will wait in a "Pending" state until the active upgrade completes. This prevents conflicts between Talos node changes and Kubernetes cluster changes that could destabilize the cluster.
```yaml
# Valid: Single TalosUpgrade resource
apiVersion: tuppr.home-operations.com/v1alpha1
kind: TalosUpgrade
metadata:
  name: talos
spec:
  talos:
    version: v1.11.0
---
# Invalid: Second TalosUpgrade will be rejected by webhook
apiVersion: tuppr.home-operations.com/v1alpha1
kind: TalosUpgrade
metadata:
  name: another-talos # This will fail validation
spec:
  talos:
    version: v1.11.1
---
# Valid: Single KubernetesUpgrade resource
apiVersion: tuppr.home-operations.com/v1alpha1
kind: KubernetesUpgrade
metadata:
  name: kubernetes
spec:
  kubernetes:
    version: v1.34.0
---
# Invalid: Second KubernetesUpgrade will be rejected by webhook
apiVersion: tuppr.home-operations.com/v1alpha1
kind: KubernetesUpgrade
metadata:
  name: another-kubernetes # This will fail validation
spec:
  kubernetes:
    version: v1.35.0
```
Scenario 1: TalosUpgrade starts first

```sh
kubectl apply -f talos-upgrade.yaml
# TalosUpgrade starts immediately (phase: InProgress)

kubectl apply -f kubernetes-upgrade.yaml
# KubernetesUpgrade waits (phase: Pending)
# message: "Waiting for Talos upgrade 'talos' to complete before starting Kubernetes upgrade"

# After TalosUpgrade completes (phase: Completed),
# KubernetesUpgrade starts automatically (phase: InProgress)
```

Scenario 2: KubernetesUpgrade starts first

```sh
kubectl apply -f kubernetes-upgrade.yaml
# KubernetesUpgrade starts immediately (phase: InProgress)

kubectl apply -f talos-upgrade.yaml
# TalosUpgrade waits (phase: Pending)
# message: "Waiting for Kubernetes upgrade 'kubernetes' to complete before starting Talos upgrade"

# After KubernetesUpgrade completes (phase: Completed),
# TalosUpgrade starts automatically (phase: InProgress)
```

Scenario 3: Only one upgrade type needed

```sh
# If you only need Talos upgrades
kubectl apply -f talos-upgrade.yaml
# Starts immediately - no blocking

# If you only need Kubernetes upgrades
kubectl apply -f kubernetes-upgrade.yaml
# Starts immediately - no blocking
```
We welcome contributions! Please see our Contributing Guide for details.
This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE file for details.
- Talos Linux - The modern OS for Kubernetes that inspired this project
- System Upgrade Controller - Inspiration for upgrade orchestration patterns
- Kubebuilder - Excellent framework for building Kubernetes controllers
- Controller Runtime - Powerful runtime for Kubernetes controllers
- CEL - Common Expression Language for flexible health checks
- Prometheus - Monitoring and alerting toolkit for metrics collection
⭐ If this project helps you, please consider giving it a star!
For questions, issues, or feature requests, please visit our GitHub Issues.