Expose Cluster Autoscaler Node Group Backoff Configuration #370

@FinnHuelsbusch

Description

Summary

This enhancement request proposes exposing additional cluster autoscaler configuration flags in Gardener's Shoot API, specifically initialNodeGroupBackoffDuration, to provide more fine-grained control over the autoscaler's backoff behavior when node provisioning fails. Exposing this configuration would enable more efficient scaling in environments where provisioning timeouts occur regularly: by allowing longer initial backoff periods, operators can significantly reduce time-to-scale for critical workloads (no ping-pong between failing node groups during retries) while minimizing the resource waste of repeated failed provisioning attempts.

Motivation

Currently, Gardener exposes several cluster autoscaler flags in the Shoot API. The only exposed flag relevant to scale-up and backoff behaviour is maxNodeProvisionTime. The backoff behavior for failed node group scaling attempts, however, is not configurable, which can lead to suboptimal scaling decisions in scenarios where different node types have varying availability patterns.

Requested Feature

I would like to request the exposure of the following cluster autoscaler flag:

  • initialNodeGroupBackoffDuration - the initial backoff duration when a node group fails to scale up (default: 5 minutes)

Additionally, for completeness, it would be beneficial to also expose:

  • maxNodeGroupBackoffDuration - the maximum backoff duration (default: 30 minutes)
  • nodeGroupBackoffResetTimeout - the timeout after which the backoff is reset (default: 3 hours)
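In the Shoot spec, the new settings could plausibly sit next to the already-exposed clusterAutoscaler fields. The excerpt below is a sketch of what this issue is asking for; the three backoff field names are hypothetical (mirroring the upstream flag names) and do not exist in Gardener today:

```yaml
# Hypothetical Shoot excerpt - only maxNodeProvisionTime is exposed today;
# the three backoff fields are the requested additions.
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    clusterAutoscaler:
      maxNodeProvisionTime: 15m              # already exposed
      initialNodeGroupBackoffDuration: 20m   # requested (upstream default: 5m)
      maxNodeGroupBackoffDuration: 30m       # requested (upstream default: 30m)
      nodeGroupBackoffResetTimeout: 3h       # requested (upstream default: 3h)
```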

Use Case Example

Assume the following scenario: a cluster is configured with two node types, NodeType-A and NodeType-B. NodeType-A is the preferred node type and NodeType-B is the fallback. NodeType-A is available in two availability zones (AZs), while NodeType-B is only available in one AZ. To reflect these priorities, the priority expander is set up to try NodeType-A first in both AZs and to use NodeType-B as a fallback:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*
    20:
      - NodeType-B
    30:
      - NodeType-A-AZ-[1,2]

Behaviour with default settings

Assume that one node is requested and the following events happen:

  • CA requests a node of NodeType-A-AZ-1.
  • The hyperscaler acknowledges that the node is created.
  • CA waits for the node to join the cluster and become healthy.
  • 15 minutes pass and the node is still not healthy.
  • CA runs into a timeout for the requested node:
    • CA disables NodeType-A-AZ-1 for 5 minutes (default initial backoff).
    • CA removes the requested node from NodeType-A-AZ-1.
  • As there is now an insufficient number of nodes, the CA scales up again. Since NodeType-A-AZ-1 is disabled, NodeType-A-AZ-2 is chosen in the final scale-up plan.
  • The hyperscaler acknowledges that the node is created.
  • CA waits for the node to join the cluster and become healthy.
  • 5 minutes pass and NodeType-A-AZ-1 is re-enabled.
  • 10 more minutes pass and the requested node is still not healthy.
  • CA runs into a timeout for the requested node:
    • CA disables NodeType-A-AZ-2 for 5 minutes (default initial backoff).
    • CA removes the requested node from NodeType-A-AZ-2.
  • As there is now an insufficient number of nodes, the CA scales up again. Since NodeType-A-AZ-2 is disabled, NodeType-A-AZ-1 is chosen in the final scale-up plan as it is the only remaining option of priority 30.
  • The hyperscaler acknowledges that the node is created.
  • CA waits for the node to join the cluster and become healthy.
  • 5 minutes pass and NodeType-A-AZ-2 is re-enabled.
  • 10 more minutes pass and the node is still not healthy.
  • CA runs into a timeout for the requested node:
    • CA disables NodeType-A-AZ-1 for 10 minutes (exponential backoff).
    • CA removes the requested node from NodeType-A-AZ-1.
  • As there is now an insufficient number of nodes, the CA scales up again. Since NodeType-A-AZ-1 is disabled, NodeType-A-AZ-2 is chosen in the final scale-up plan as it is the only remaining option of priority 30.
  • The hyperscaler acknowledges that the node is created.
  • CA waits for the node to join the cluster and become healthy.
  • 10 minutes pass and NodeType-A-AZ-1 is re-enabled.
  • 5 more minutes pass and the node is still not healthy.
  • CA runs into a timeout for the requested node:
    • CA disables NodeType-A-AZ-2 for 10 minutes (exponential backoff).
    • CA removes the requested node from NodeType-A-AZ-2.
  • As there is now an insufficient number of nodes, the CA scales up again. Since NodeType-A-AZ-2 is disabled, NodeType-A-AZ-1 is chosen in the final scale-up plan as it is the only remaining option of priority 30.
  • The hyperscaler acknowledges that the node is created.
  • CA waits for the node to join the cluster and become healthy.
  • 10 minutes pass and NodeType-A-AZ-2 is re-enabled.
  • 5 more minutes pass and the node is still not healthy.
  • CA runs into a timeout for the requested node:
    • CA disables NodeType-A-AZ-1 for 20 minutes (exponential backoff).
    • CA removes the requested node from NodeType-A-AZ-1.
  • As there is now an insufficient number of nodes, the CA scales up again. Since NodeType-A-AZ-1 is disabled, NodeType-A-AZ-2 is chosen in the final scale-up plan as it is the only remaining option of priority 30.
  • The hyperscaler acknowledges that the node is created.
  • CA waits for the node to join the cluster and become healthy.
  • 15 minutes pass and the node is still not healthy.
  • CA runs into a timeout for the requested node:
    • CA disables NodeType-A-AZ-2 for 20 minutes (exponential backoff).
    • CA removes the requested node from NodeType-A-AZ-2.
  • As there is now an insufficient number of nodes, the CA scales up again. Since both NodeType-A-AZ-1 and NodeType-A-AZ-2 are disabled, NodeType-B-AZ-1 is chosen in the final scale-up plan as it is the highest remaining option with a priority of 20.

Overall, it took 90 minutes until the CA finally decided to use NodeType-B, after attempting NodeType-A six times. While we acknowledge this behavior works as designed, it creates operational challenges in production environments where rapid failover to available node types is critical for maintaining service availability. The default backoff behavior results in repeated attempts to provision nodes from the same consistently failing node groups, significantly delaying the transition to available alternatives. While reducing maxNodeProvisionTime might appear to offer a solution, that approach is not viable in our environment: we regularly see legitimate node provisioning times exceeding 10 minutes for the specialized instance types our workloads require.

Desired Behaviour with configurable backoff settings

Assume that initialNodeGroupBackoffDuration is increased to 20 minutes from the default of 5 minutes. The other two settings are kept at their defaults. The same scenario as above happens:

  • CA requests a node of NodeType-A-AZ-1.
  • The hyperscaler acknowledges that the node is created.
  • CA waits for the node to join the cluster and become healthy.
  • 15 minutes pass and the node is still not healthy.
  • CA runs into a timeout for the requested node:
    • CA disables NodeType-A-AZ-1 for 20 minutes (configured backoff).
    • CA removes the requested node from NodeType-A-AZ-1.
  • As there is now an insufficient number of nodes, the CA scales up again. Since NodeType-A-AZ-1 is disabled, NodeType-A-AZ-2 is chosen in the final scale-up plan as it is the only remaining option of priority 30.
  • The hyperscaler acknowledges that the node is created.
  • CA waits for the node to join the cluster and become healthy.
  • 15 minutes pass and the node is still not healthy.
  • CA runs into a timeout for the requested node:
    • CA disables NodeType-A-AZ-2 for 20 minutes (configured backoff).
    • CA removes the requested node from NodeType-A-AZ-2.
  • As there is now an insufficient number of nodes, the CA scales up again. Since both NodeType-A-AZ-1 and NodeType-A-AZ-2 are disabled, NodeType-B-AZ-1 is chosen in the final scale-up plan as it is the highest remaining option.

In this scenario, the CA reaches the fallback node type NodeType-B after only 30 minutes and two attempts at NodeType-A, reducing unproductive retries from six to two. This represents a significant improvement over the default behavior and aligns with our operational requirements: cloud provider problems that prevent node provisioning are typically not resolved within short time windows, so longer backoff periods reduce futile retry attempts and enable faster transitions to available alternatives.
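Both walkthroughs above can be reproduced with a small back-of-the-envelope simulation. This is a simplified model, not cluster autoscaler code: it assumes every attempt fails after a flat 15-minute provision timeout, models per-group exponential doubling of the backoff (capped at the 30-minute maximum), and ignores nodeGroupBackoffResetTimeout.

```python
def time_to_fallback(initial_backoff, provision_timeout=15, max_backoff=30):
    """Minutes until the CA gives up on NodeType-A and falls back to NodeType-B.

    Simplified model: every provisioning attempt fails after
    `provision_timeout` minutes; each failing group is backed off with
    per-group exponential doubling, capped at `max_backoff`.
    Returns (minutes until fallback, number of failed NodeType-A attempts).
    """
    backoff_until = {"AZ-1": 0, "AZ-2": 0}  # time each group is disabled until
    next_backoff = dict.fromkeys(backoff_until, initial_backoff)
    t = attempts = 0
    while True:
        # Pick any NodeType-A group that is not currently backed off.
        available = [g for g, until in backoff_until.items() if until <= t]
        if not available:
            return t, attempts  # both AZs disabled -> CA picks NodeType-B
        group = available[0]
        attempts += 1
        t += provision_timeout  # the node never becomes healthy
        backoff_until[group] = t + next_backoff[group]
        next_backoff[group] = min(next_backoff[group] * 2, max_backoff)


print(time_to_fallback(5))   # default 5m initial backoff  -> (90, 6)
print(time_to_fallback(20))  # proposed 20m initial backoff -> (30, 2)
```

The simulation reproduces the two timelines: 90 minutes and six attempts with the default 5-minute initial backoff, 30 minutes and two attempts with a 20-minute initial backoff.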

Additional Configuration Options

For enhanced control over the backoff mechanism, it would be valuable to also expose the complementary settings maxNodeGroupBackoffDuration and nodeGroupBackoffResetTimeout. While not essential for the primary use case, these parameters would provide complete flexibility for fine-tuning autoscaler backoff behavior across diverse operational environments.
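For reference, the three settings correspond to the upstream cluster-autoscaler command-line flags (shown here with their upstream defaults, as stated above):

```
--initial-node-group-backoff-duration=5m   # initial backoff after a failed scale-up
--max-node-group-backoff-duration=30m      # ceiling for the exponential backoff
--node-group-backoff-reset-timeout=3h      # backoff resets after this long without failures
```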

    Labels

    exp/intermediate - Issue that requires some project experience
    kind/enhancement - Enhancement, improvement, extension
    priority/3 - Priority (lower number equals higher priority)
