Expose Cluster Autoscaler Node Group Backoff Configuration #370

@FinnHuelsbusch

Description

Summary

This enhancement request proposes exposing additional cluster autoscaler configuration flags in Gardener's Shoot API, specifically initialNodeGroupBackoffDuration, to provide more fine-grained control over the autoscaler's backoff behavior when node provisioning fails. Exposing this configuration would enable more efficient scaling in environments where provisioning timeouts occur regularly: by allowing longer initial backoff periods, operators can significantly reduce time-to-scale for critical workloads (no ping-pong between failing node groups during retries) while minimizing the resource waste of repeated failed provisioning attempts.

Motivation

Currently, Gardener exposes several cluster autoscaler flags in the Shoot API. The only exposed flag relevant to scale-up and backoff behaviour is maxNodeProvisionTime. The backoff behavior for failed node group scaling attempts, however, is not configurable, which can lead to suboptimal scaling decisions in scenarios where different node types have varying availability patterns.

Requested Feature

I would like to request the exposure of the following cluster autoscaler flag:

  • initialNodeGroupBackoffDuration - the initial backoff duration when a node group fails to scale up (default: 5 minutes)

Additionally, for completeness, it would be beneficial to also expose:

  • maxNodeGroupBackoffDuration - the maximum backoff duration (default: 30 minutes)
  • nodeGroupBackoffResetTimeout - the timeout after which the backoff is reset (default: 3 hours)
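In the Shoot spec, the new settings could plausibly sit next to the already-exposed clusterAutoscaler fields. The excerpt below is a sketch of what this issue is asking for; the three backoff field names are hypothetical (mirroring the upstream flag names) and do not exist in Gardener today:

```yaml
# Hypothetical Shoot excerpt - only maxNodeProvisionTime is exposed today;
# the three backoff fields are the requested additions.
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    clusterAutoscaler:
      maxNodeProvisionTime: 15m              # already exposed
      initialNodeGroupBackoffDuration: 20m   # requested (upstream default: 5m)
      maxNodeGroupBackoffDuration: 30m       # requested (upstream default: 30m)
      nodeGroupBackoffResetTimeout: 3h       # requested (upstream default: 3h)
```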

Use Case Example

Assume the following scenario: a cluster is configured with two node types, NodeType-A and NodeType-B. NodeType-A is the preferred node type and NodeType-B is the fallback. NodeType-A is available in two availability zones (AZs), while NodeType-B is only available in one AZ. To reflect these priorities, the priority expander is set up to try NodeType-A first in both AZs and to use NodeType-B as a fallback:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*
    20:
      - NodeType-B
    30:
      - NodeType-A-AZ-[1,2]

Behaviour with default settings

Assume that one node is requested and the following events happen:

  • CA requests a node of NodeType-A-AZ-1.
  • The hyperscaler acknowledges that the node is created.
  • CA waits for the node to join the cluster and become healthy.
  • 15 minutes pass and the node is still not healthy.
  • CA runs into a timeout for the requested node:
    • CA disables NodeType-A-AZ-1 for 5 minutes (default initial backoff).
    • CA removes the requested node from NodeType-A-AZ-1.
  • As there is now an insufficient number of nodes, the CA scales up again. Since NodeType-A-AZ-1 is disabled, NodeType-A-AZ-2 is chosen in the final scale-up plan.
  • The hyperscaler acknowledges that the node is created.
  • CA waits for the node to join the cluster and become healthy.
  • 5 minutes pass and NodeType-A-AZ-1 is re-enabled.
  • 10 more minutes pass and the requested node is still not healthy.
  • CA runs into a timeout for the requested node:
    • CA disables NodeType-A-AZ-2 for 5 minutes (default initial backoff).
    • CA removes the requested node from NodeType-A-AZ-2.
  • As there is now an insufficient number of nodes, the CA scales up again. Since NodeType-A-AZ-2 is disabled, NodeType-A-AZ-1 is chosen in the final scale-up plan as it is the only remaining option of priority 30.
  • The hyperscaler acknowledges that the node is created.
  • CA waits for the node to join the cluster and become healthy.
  • 5 minutes pass and NodeType-A-AZ-2 is re-enabled.
  • 10 more minutes pass and the node is still not healthy.
  • CA runs into a timeout for the requested node:
    • CA disables NodeType-A-AZ-1 for 10 minutes (exponential backoff).
    • CA removes the requested node from NodeType-A-AZ-1.
  • As there is now an insufficient number of nodes, the CA scales up again. Since NodeType-A-AZ-1 is disabled, NodeType-A-AZ-2 is chosen in the final scale-up plan as it is the only remaining option of priority 30.
  • The hyperscaler acknowledges that the node is created.
  • CA waits for the node to join the cluster and become healthy.
  • 10 minutes pass and NodeType-A-AZ-1 is re-enabled.
  • 5 more minutes pass and the node is still not healthy.
  • CA runs into a timeout for the requested node:
    • CA disables NodeType-A-AZ-2 for 10 minutes (exponential backoff).
    • CA removes the requested node from NodeType-A-AZ-2.
  • As there is now an insufficient number of nodes, the CA scales up again. Since NodeType-A-AZ-2 is disabled, NodeType-A-AZ-1 is chosen in the final scale-up plan as it is the only remaining option of priority 30.
  • The hyperscaler acknowledges that the node is created.
  • CA waits for the node to join the cluster and become healthy.
  • 10 minutes pass and NodeType-A-AZ-2 is re-enabled.
  • 5 more minutes pass and the node is still not healthy.
  • CA runs into a timeout for the requested node:
    • CA disables NodeType-A-AZ-1 for 20 minutes (exponential backoff).
    • CA removes the requested node from NodeType-A-AZ-1.
  • As there is now an insufficient number of nodes, the CA scales up again. Since NodeType-A-AZ-1 is disabled, NodeType-A-AZ-2 is chosen in the final scale-up plan as it is the only remaining option of priority 30.
  • The hyperscaler acknowledges that the node is created.
  • CA waits for the node to join the cluster and become healthy.
  • 15 minutes pass and the node is still not healthy.
  • CA runs into a timeout for the requested node:
    • CA disables NodeType-A-AZ-2 for 20 minutes (exponential backoff).
    • CA removes the requested node from NodeType-A-AZ-2.
  • As there is now an insufficient number of nodes, the CA scales up again. Since both NodeType-A-AZ-1 and NodeType-A-AZ-2 are disabled, NodeType-B-AZ-1 is chosen in the final scale-up plan as it is the highest remaining option with a priority of 20.

Overall, it took 90 minutes until the CA finally decided to use NodeType-B, after attempting NodeType-A six times. While we acknowledge this behavior works as designed, it creates operational challenges in production environments where rapid failover to available node types is critical for maintaining service availability. The default backoff behavior results in repeated attempts to provision nodes from the same consistently failing node groups, significantly delaying the transition to available alternatives. While reducing maxNodeProvisionTime might appear to offer a solution, that approach is not viable in our environment: we regularly see legitimate node provisioning times exceeding 10 minutes for the specialized instance types our workloads require.

Desired Behaviour with configurable backoff settings

Assume that initialNodeGroupBackoffDuration is increased to 20 minutes from the default of 5 minutes. The other two settings are kept at their defaults. The same scenario as above happens:

  • CA requests a node of NodeType-A-AZ-1.
  • The hyperscaler acknowledges that the node is created.
  • CA waits for the node to join the cluster and become healthy.
  • 15 minutes pass and the node is still not healthy.
  • CA runs into a timeout for the requested node:
    • CA disables NodeType-A-AZ-1 for 20 minutes (configured backoff).
    • CA removes the requested node from NodeType-A-AZ-1.
  • As there is now an insufficient number of nodes, the CA scales up again. Since NodeType-A-AZ-1 is disabled, NodeType-A-AZ-2 is chosen in the final scale-up plan as it is the only remaining option of priority 30.
  • The hyperscaler acknowledges that the node is created.
  • CA waits for the node to join the cluster and become healthy.
  • 15 minutes pass and the node is still not healthy.
  • CA runs into a timeout for the requested node:
    • CA disables NodeType-A-AZ-2 for 20 minutes (configured backoff).
    • CA removes the requested node from NodeType-A-AZ-2.
  • As there is now an insufficient number of nodes, the CA scales up again. Since both NodeType-A-AZ-1 and NodeType-A-AZ-2 are disabled, NodeType-B-AZ-1 is chosen in the final scale-up plan as it is the highest remaining option.

In this scenario, the CA reaches the fallback node type NodeType-B after only 30 minutes and two attempts at NodeType-A, reducing unproductive retries from six to two. This represents a significant improvement over the default behavior and aligns with our operational requirements: cloud provider problems that prevent node provisioning are typically not resolved within short time windows, so longer backoff periods reduce futile retry attempts and enable faster transitions to available alternatives.
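Both walkthroughs above can be reproduced with a small back-of-the-envelope simulation. This is a simplified model, not cluster autoscaler code: it assumes every attempt fails after a flat 15-minute provision timeout, models per-group exponential doubling of the backoff (capped at the 30-minute maximum), and ignores nodeGroupBackoffResetTimeout.

```python
def time_to_fallback(initial_backoff, provision_timeout=15, max_backoff=30):
    """Minutes until the CA gives up on NodeType-A and falls back to NodeType-B.

    Simplified model: every provisioning attempt fails after
    `provision_timeout` minutes; each failing group is backed off with
    per-group exponential doubling, capped at `max_backoff`.
    Returns (minutes until fallback, number of failed NodeType-A attempts).
    """
    backoff_until = {"AZ-1": 0, "AZ-2": 0}  # time each group is disabled until
    next_backoff = dict.fromkeys(backoff_until, initial_backoff)
    t = attempts = 0
    while True:
        # Pick any NodeType-A group that is not currently backed off.
        available = [g for g, until in backoff_until.items() if until <= t]
        if not available:
            return t, attempts  # both AZs disabled -> CA picks NodeType-B
        group = available[0]
        attempts += 1
        t += provision_timeout  # the node never becomes healthy
        backoff_until[group] = t + next_backoff[group]
        next_backoff[group] = min(next_backoff[group] * 2, max_backoff)


print(time_to_fallback(5))   # default 5m initial backoff  -> (90, 6)
print(time_to_fallback(20))  # proposed 20m initial backoff -> (30, 2)
```

The simulation reproduces the two timelines: 90 minutes and six attempts with the default 5-minute initial backoff, 30 minutes and two attempts with a 20-minute initial backoff.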

Additional Configuration Options

For enhanced control over the backoff mechanism, it would be valuable to also expose the complementary settings maxNodeGroupBackoffDuration and nodeGroupBackoffResetTimeout. While not essential for the primary use case, these parameters would provide complete flexibility for fine-tuning autoscaler backoff behavior across diverse operational environments.
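For reference, the three settings correspond to the upstream cluster-autoscaler command-line flags (shown here with their upstream defaults, as stated above):

```
--initial-node-group-backoff-duration=5m   # initial backoff after a failed scale-up
--max-node-group-backoff-duration=30m      # ceiling for the exponential backoff
--node-group-backoff-reset-timeout=3h      # backoff resets after this long without failures
```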

    Labels

    exp/intermediate - Issue that requires some project experience
    kind/enhancement - Enhancement, improvement, extension
    priority/3 - Priority (lower number equals higher priority)
