Skip to content

home-operations/tuppr

Repository files navigation

tuppr - Talos Linux Upgrade Controller

A Kubernetes controller for managing automated upgrades of Talos Linux and Kubernetes.

✨ Features

Core Capabilities

  • πŸš€ Automated Talos node upgrades with intelligent orchestration
  • 🎯 Kubernetes upgrades - upgrade Kubernetes to newer versions
  • πŸ”’ Safe upgrade execution - upgrades always run from healthy nodes (never self-upgrade)
  • πŸ“Š Built-in health checks - CEL-based expressions for custom cluster validation
  • πŸ”„ Configurable reboot modes - default or powercycle options
  • πŸ“‹ Comprehensive status tracking with real-time progress reporting
  • ⚑ Resilient job execution with automatic retry and pod replacement
  • πŸ“ˆ Prometheus metrics - detailed monitoring of upgrade progress and health

πŸš€ Quick Start

Prerequisites

  1. Talos cluster with API access configured
  2. Namespace for the controller (e.g., system-upgrade)

Installation

Allow Talos API access to the desired namespace by applying this config to all of you nodes:

machine:
  features:
    kubernetesTalosAPIAccess:
      allowedKubernetesNamespaces:
        - system-upgrade # or the namespace the controller will be installed to
      allowedRoles:
        - os:admin
      enabled: true

Install the Helm chart:

# Install via Helm
helm install tuppr oci://ghcr.io/home-operations/charts/tuppr \
  --version 0.1.0 \
  --namespace system-upgrade

Basic Usage

Talos Node Upgrades

Create a TalosUpgrade resource:

apiVersion: tuppr.home-operations.com/v1alpha1
kind: TalosUpgrade
metadata:
  name: cluster
spec:
  talos:
    # renovate: datasource=docker depName=ghcr.io/siderolabs/installer
    version: v1.11.0  # Required - target Talos version

  policy:
    debug: true          # Optional, verbose logging
    force: false         # Optional, skip etcd health checks
    rebootMode: default  # Optional, default|powercycle
    placement: soft      # Optional, hard|soft

  # Custom health checks (optional)
  healthChecks:
    - apiVersion: v1
      kind: Node
      expr: status.conditions.exists(c, c.type == "Ready" && c.status == "True")

  # Talosctl configuration (optional)
  talosctl:
    image:
      repository: ghcr.io/siderolabs/talosctl  # Optional, default
      tag: v1.11.0                             # Optional, auto-detected
      pullPolicy: IfNotPresent                 # Optional, default

Kubernetes Upgrades

Create a KubernetesUpgrade resource:

apiVersion: tuppr.home-operations.com/v1alpha1
kind: KubernetesUpgrade
metadata:
  name: kubernetes
spec:
  kubernetes:
    # renovate: datasource=docker depName=ghcr.io/siderolabs/kubelet
    version: v1.34.0  # Required - target Kubernetes version

  # Custom health checks (optional)
  healthChecks:
    - apiVersion: v1
      kind: Node
      expr: status.conditions.exists(c, c.type == "Ready" && c.status == "True")
      timeout: 10m

  # Talosctl configuration (optional)
  talosctl:
    image:
      repository: ghcr.io/siderolabs/talosctl  # Optional, default
      tag: v1.11.0                             # Optional, auto-detected
      pullPolicy: IfNotPresent                 # Optional, default

🎯 Advanced Configuration

Health Checks

Define custom health checks using CEL expressions. These health checks are evaluated before each upgrade and run concurrently.

healthChecks:
  # Check all nodes are ready
  - apiVersion: v1
    kind: Node
    expr: |
      status.conditions.filter(c, c.type == "Ready").all(c, c.status == "True")
    timeout: 10m

  # Check specific deployment replicas
  - apiVersion: apps/v1
    kind: Deployment
    name: critical-app
    namespace: production
    expr: status.readyReplicas == status.replicas

  # Check custom resources
  - apiVersion: ceph.rook.io/v1
    kind: CephCluster
    name: rook-ceph
    namespace: rook-ceph
    expr: status.ceph.health in ["HEALTH_OK"]

Upgrade Policies (TalosUpgrade only)

Fine-tune upgrade behavior:

policy:
  # Enable debug logging for troubleshooting
  debug: true

  # Force upgrade even if etcd is unhealthy (dangerous!)
  force: true

  # Controls how strictly upgrade jobs avoid the target node
  placement: hard  # or "soft"

  # Use powercycle reboot for problematic nodes
  rebootMode: powercycle  # or "default"

πŸ“Š Monitoring & Metrics

Prometheus Metrics

Tuppr exposes comprehensive Prometheus metrics for monitoring upgrade progress, health check performance, and job execution:

Talos Upgrade Metrics

# Current phase of Talos upgrades (0=Pending, 1=InProgress, 2=Completed, 3=Failed)
tuppr_talos_upgrade_phase{name="cluster", phase="InProgress"} 1

# Node counts for Talos upgrades
tuppr_talos_upgrade_nodes_total{name="cluster"} 5
tuppr_talos_upgrade_nodes_completed{name="cluster"} 3
tuppr_talos_upgrade_nodes_failed{name="cluster"} 0

# Duration of Talos upgrade phases (histogram)
tuppr_talos_upgrade_duration_seconds{name="cluster", phase="InProgress"}

Kubernetes Upgrade Metrics

# Current phase of Kubernetes upgrades (0=Pending, 1=InProgress, 2=Completed, 3=Failed)
tuppr_kubernetes_upgrade_phase{name="kubernetes", phase="Completed"} 2

# Duration of Kubernetes upgrade phases (histogram)
tuppr_kubernetes_upgrade_duration_seconds{name="kubernetes", phase="InProgress"}

Health Check Metrics

# Time taken for health checks to pass (histogram)
tuppr_health_check_duration_seconds{upgrade_type="talos", upgrade_name="cluster"}

# Total number of health check failures (counter)
tuppr_health_check_failures_total{upgrade_type="talos", upgrade_name="cluster", check_index="0"}

Job Execution Metrics

# Number of active upgrade jobs
tuppr_upgrade_jobs_active{upgrade_type="talos"} 1

# Time taken for upgrade jobs to complete (histogram)
tuppr_upgrade_job_duration_seconds{upgrade_type="talos", node_name="worker-01", result="success"}

Grafana Dashboard Examples

Upgrade Progress Panel

# Upgrade phase status
tuppr_talos_upgrade_phase or tuppr_kubernetes_upgrade_phase

# Node upgrade progress for Talos
tuppr_talos_upgrade_nodes_completed / tuppr_talos_upgrade_nodes_total * 100

# Health check success rate
rate(tuppr_health_check_failures_total[5m]) == 0

Performance Monitoring

# Average health check duration
histogram_quantile(0.95, rate(tuppr_health_check_duration_seconds_bucket[5m]))

# Job completion rate
rate(tuppr_upgrade_job_duration_seconds_count{result="success"}[5m])

# Active jobs by type
sum(tuppr_upgrade_jobs_active) by (upgrade_type)

Alerting Rules

Example Prometheus alerting rules:

groups:
  - name: tuppr
    rules:
      - alert: TupprUpgradeFailed
        expr: tuppr_talos_upgrade_phase{phase="Failed"} == 3 or tuppr_kubernetes_upgrade_phase{phase="Failed"} == 3
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Tuppr upgrade failed"
          description: "Upgrade {{ $labels.name }} has failed"

      - alert: TupprUpgradeStuck
        expr: |
          (
            tuppr_talos_upgrade_phase{phase="InProgress"} == 1 and
            increase(tuppr_talos_upgrade_nodes_completed[30m]) == 0
          ) or (
            tuppr_kubernetes_upgrade_phase{phase="InProgress"} == 1
          )
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Tuppr upgrade appears stuck"
          description: "Upgrade {{ $labels.name }} has been in progress for 30+ minutes without completing nodes"

      - alert: TupprHealthCheckFailures
        expr: rate(tuppr_health_check_failures_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High health check failure rate"
          description: "Health checks for {{ $labels.upgrade_name }} are failing frequently"

πŸ”§ Operations

Monitoring Upgrades

# Watch Talos upgrade progress
kubectl get talosupgrade -w

# Watch Kubernetes upgrade progress
kubectl get kubernetesupgrade -w

# Check detailed status
kubectl describe talosupgrade cluster-upgrade
kubectl describe kubernetesupgrade kubernetes

# View upgrade logs
kubectl logs -f deployment/tuppr -n system-upgrade

# Check metrics endpoint
kubectl port-forward -n system-upgrade deployment/tuppr 8080:8080
curl http://localhost:8080/metrics | grep tuppr_

Suspending Upgrades

Suspending upgrades can be useful if you want to upgrade manually and not have the controller interfere.

# Suspend Talos upgrade
kubectl annotate talosupgrade cluster-upgrade tuppr.home-operations.com/suspend="true"

# Suspend Kubernetes upgrade
kubectl annotate kubernetesupgrade kubernetes tuppr.home-operations.com/suspend="true"

# Remove the suspend annotation to resume
kubectl annotate talosupgrade cluster-upgrade tuppr.home-operations.com/suspend-
kubectl annotate kubernetesupgrade kubernetes tuppr.home-operations.com/suspend-

Troubleshooting

# Reset failed Talos upgrade
kubectl annotate talosupgrade cluster-upgrade tuppr.home-operations.com/reset="$(date)"

# Reset failed Kubernetes upgrade
kubectl annotate kubernetesupgrade kubernetes tuppr.home-operations.com/reset="$(date)"

# Check job logs
kubectl logs job/tuppr-xyz -n system-upgrade

# Check controller health
kubectl get pods -n system-upgrade -l app.kubernetes.io/name=tuppr

# View metrics for debugging
kubectl port-forward -n system-upgrade deployment/tuppr 8080:8080
curl http://localhost:8080/metrics | grep -E "(tuppr_.*_phase|tuppr_.*_duration)"

Emergency Procedures

# Pause all upgrades (scale down controller)
kubectl scale deployment tuppr --replicas=0 -n system-upgrade

# Emergency cleanup
kubectl delete talosupgrade --all
kubectl delete kubernetesupgrade --all
kubectl delete jobs -l app.kubernetes.io/name=talos-upgrade -n system-upgrade
kubectl delete jobs -l app.kubernetes.io/name=kubernetes-upgrade -n system-upgrade

# Resume operations
kubectl scale deployment tuppr --replicas=1 -n system-upgrade

πŸ“‹ Upgrade Comparison

Feature TalosUpgrade KubernetesUpgrade
Scope Talos nodes Kubernetes cluster
Multiple CRs ❌ Only one per cluster ❌ Only one per cluster
Execution Sequential node-by-node Single controller node
Reboot Required βœ… Yes ❌ No
Health Checks βœ… Before each node βœ… Before upgrade
Concurrent Execution ❌ Blocked by other upgrades ❌ Blocked by other upgrades
Handling Failures ❌ Manual ❌ Manual
Metrics βœ… Comprehensive βœ… Comprehensive

Important Resource Constraints

  • TalosUpgrade: Only one TalosUpgrade resource is allowed per cluster. This constraint simplifies the upgrade orchestration by processing all nodes sequentially in a single upgrade workflow, eliminating the need for complex queueing and coordination between multiple upgrade resources. The admission webhook will reject attempts to create additional TalosUpgrade resources.

  • KubernetesUpgrade: Only one KubernetesUpgrade resource is allowed per cluster. This constraint exists because Kubernetes upgrades affect the entire cluster, and multiple concurrent upgrades would conflict with each other. The admission webhook will reject attempts to create additional KubernetesUpgrade resources.

  • Cross-Upgrade Coordination: TalosUpgrade and KubernetesUpgrade resources cannot run concurrently. If one upgrade is in progress (status.phase == "InProgress"), the other will wait in a "Pending" state until the active upgrade completes. This prevents conflicts between Talos node changes and Kubernetes cluster changes that could destabilize the cluster.

Upgrade Coordination Examples

# βœ… Valid: Single TalosUpgrade resource
apiVersion: tuppr.home-operations.com/v1alpha1
kind: TalosUpgrade
metadata:
  name: talos
spec:
  talos:
    version: v1.11.0
---
# ❌ Invalid: Second TalosUpgrade will be rejected by webhook
apiVersion: tuppr.home-operations.com/v1alpha1
kind: TalosUpgrade
metadata:
  name: another-talos  # This will fail validation
spec:
  talos:
    version: v1.11.1
---
# βœ… Valid: Single KubernetesUpgrade resource
apiVersion: tuppr.home-operations.com/v1alpha1
kind: KubernetesUpgrade
metadata:
  name: kubernetes
spec:
  kubernetes:
    version: v1.34.0
---
# ❌ Invalid: Second KubernetesUpgrade will be rejected by webhook
apiVersion: tuppr.home-operations.com/v1alpha1
kind: KubernetesUpgrade
metadata:
  name: another-kubernetes  # This will fail validation
spec:
  kubernetes:
    version: v1.35.0

Cross-Upgrade Coordination Behavior

Scenario 1: TalosUpgrade starts first

kubectl apply -f talos-upgrade.yaml
# βœ… TalosUpgrade starts immediately (phase: InProgress)

kubectl apply -f kubernetes-upgrade.yaml
# ⏳ KubernetesUpgrade waits (phase: Pending)
#    message: "Waiting for Talos upgrade 'talos' to complete before starting Kubernetes upgrade"

# After TalosUpgrade completes (phase: Completed)
# βœ… KubernetesUpgrade starts automatically (phase: InProgress)

Scenario 2: KubernetesUpgrade starts first

kubectl apply -f kubernetes-upgrade.yaml
# βœ… KubernetesUpgrade starts immediately (phase: InProgress)

kubectl apply -f talos-upgrade.yaml
# ⏳ TalosUpgrade waits (phase: Pending)
#    message: "Waiting for Kubernetes upgrade 'kubernetes' to complete before starting Talos upgrade"

# After KubernetesUpgrade completes (phase: Completed)
# βœ… TalosUpgrade starts automatically (phase: InProgress)

Scenario 3: Only one upgrade type needed

# If you only need Talos upgrades
kubectl apply -f talos-upgrade.yaml
# βœ… Starts immediately - no blocking

# If you only need Kubernetes upgrades
kubectl apply -f kubernetes-upgrade.yaml
# βœ… Starts immediately - no blocking

🀝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

πŸ“„ License

This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Talos Linux - The modern OS for Kubernetes that inspired this project
  • System Upgrade Controller - Inspiration for upgrade orchestration patterns
  • Kubebuilder - Excellent framework for building Kubernetes controllers
  • Controller Runtime - Powerful runtime for Kubernetes controllers
  • CEL - Common Expression Language for flexible health checks
  • Prometheus - Monitoring and alerting toolkit for metrics collection

⭐ If this project helps you, please consider giving it a star!

For questions, issues, or feature requests, please visit our GitHub Issues.

About

Kubernetes controller to upgrade Talos and Kubernetes

Topics

Resources

License

Stars

Watchers

Forks

Packages