Summary
This enhancement request proposes exposing additional cluster autoscaler configuration flags in Gardener's Shoot API, specifically initialNodeGroupBackoffDuration, to provide more fine-grained control over autoscaler backoff behavior when node provisioning fails. Exposing this configuration would enable more efficient scaling behavior in environments where provisioning timeouts occur regularly. By allowing longer initial backoff periods, operators can significantly reduce time-to-scale for critical workloads (no ping-pong between node groups during retries) while minimizing the resource waste caused by repeated failed provisioning attempts.
Motivation
Currently, Gardener exposes several cluster autoscaler flags (see here for details). The only exposed flag relevant for the scale-up and backoff behaviour is maxNodeProvisionTime. However, the backoff behavior for failed node group scaling attempts is not configurable, which can lead to suboptimal scaling decisions in scenarios where different node types have varying availability patterns.
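To illustrate, this is roughly where that flag lives today. The excerpt is only a sketch based on the currently documented Shoot API (field placement assumed; check the Gardener documentation for the authoritative schema), with values chosen to match the scenario further down:

# Illustrative Shoot excerpt (assumed field placement). Of these settings, only
# maxNodeProvisionTime influences the scale-up timeout and backoff behaviour today.
spec:
  kubernetes:
    clusterAutoscaler:
      expander: priority
      maxNodeProvisionTime: 15m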
Requested Feature
I would like to request the exposure of the following cluster autoscaler flag:
- initialNodeGroupBackoffDuration: the initial backoff duration when a node group fails to scale up (default: 5 minutes)

Additionally, for completeness, it would be beneficial to also expose:

- maxNodeGroupBackoffDuration: the maximum backoff duration (default: 30 minutes)
- nodeGroupBackoffResetTimeout: the timeout after which the backoff is reset (default: 3 hours)
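Purely as an illustration of the requested shape, the new settings could sit next to the existing clusterAutoscaler options in the Shoot spec. The field names below simply mirror the upstream flag names and are hypothetical until the actual API change is designed:

# Hypothetical Shoot excerpt -- none of the three backoff fields exist in the Shoot
# API today; they are shown only to illustrate the requested shape of the API.
spec:
  kubernetes:
    clusterAutoscaler:
      maxNodeProvisionTime: 15m              # already exposed today
      initialNodeGroupBackoffDuration: 20m   # requested (upstream default: 5m)
      maxNodeGroupBackoffDuration: 30m       # requested (upstream default: 30m)
      nodeGroupBackoffResetTimeout: 3h       # requested (upstream default: 3h)

Internally, these would presumably be passed through to the upstream cluster autoscaler as --initial-node-group-backoff-duration, --max-node-group-backoff-duration, and --node-group-backoff-reset-timeout.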
Use Case Example
Assume the following scenario: a cluster is configured with two node types, NodeType-A and NodeType-B. NodeType-A is the preferred node type and NodeType-B is the fallback. NodeType-A is available in two availability zones (AZs), while NodeType-B is only available in one AZ. To reflect these priorities, the priority expander is set up to try to get nodes from NodeType-A in both AZs first and to fall back to NodeType-B:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*
    20:
      - NodeType-B
    30:
      - NodeType-A-AZ-[1,2]
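For readers less familiar with the priority expander: higher numbers win, and each entry is a regular expression matched against node group names. Effectively, the two NodeType-A AZs (priority 30) are tried before NodeType-B (priority 20), and the catch-all .* (priority 10) serves as the lowest-priority fallback for any group not matched by a more specific entry.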
Behaviour with default settings
Assume that one node is requested and the following events happen:
- CA requests a node of NodeType-A-AZ-1.
- Hyperscaler assures that the node is created.
- CA waits for the node to join the cluster and become healthy.
- 15 minutes pass and the node is still not healthy.
- CA runs into a timeout for the requested node:
  - The CA disables NodeType-A-AZ-1 for 5 minutes (default initial backoff).
  - The CA removes the requested node from NodeType-A-AZ-1.
- As there is now an insufficient amount of nodes the CA scales again. As NodeType-A-AZ-1 is disabled, NodeType-A-AZ-2 is chosen in the final scale-up plan.
- Hyperscaler assures that the node is created.
- CA waits for the node to join the cluster and become healthy.
- 5 minutes pass, NodeType-A-AZ-1 is re-enabled.
- 10 more minutes pass and the requested node is still not healthy.
- CA runs into a timeout for the requested node:
  - The CA disables NodeType-A-AZ-2 for 5 minutes (default initial backoff).
  - The CA removes the requested node from NodeType-A-AZ-2.
- As there is now an insufficient amount of nodes the CA scales again. As NodeType-A-AZ-2 is disabled, NodeType-A-AZ-1 is chosen in the final scale-up plan as it is the only remaining option of priority 30.
- Hyperscaler assures that the node is created.
- CA waits for the node to join the cluster and become healthy.
- 5 minutes pass, NodeType-A-AZ-2 is re-enabled.
- 10 more minutes pass and the node is still not healthy.
- CA runs into a timeout for the requested node:
  - The CA disables NodeType-A-AZ-1 for 10 minutes (exponential backoff).
  - The CA removes the requested node from NodeType-A-AZ-1.
- As there is now an insufficient amount of nodes the CA scales again. As NodeType-A-AZ-1 is disabled, NodeType-A-AZ-2 is chosen in the final scale-up plan as it is the only remaining option of priority 30.
- Hyperscaler assures that the node is created.
- CA waits for the node to join the cluster and become healthy.
- 10 minutes pass, NodeType-A-AZ-1 is re-enabled.
- 5 more minutes pass and the node is still not healthy.
- CA runs into a timeout for the requested node:
  - The CA disables NodeType-A-AZ-2 for 10 minutes (exponential backoff).
  - The CA removes the requested node from NodeType-A-AZ-2.
- As there is now an insufficient amount of nodes the CA scales again. As NodeType-A-AZ-2 is disabled, NodeType-A-AZ-1 is chosen in the final scale-up plan as it is the only remaining option of priority 30.
- Hyperscaler assures that the node is created.
- CA waits for the node to join the cluster and become healthy.
- 10 minutes pass, NodeType-A-AZ-2 is re-enabled.
- 5 more minutes pass and the node is still not healthy.
- CA runs into a timeout for the requested node:
  - The CA disables NodeType-A-AZ-1 for 20 minutes (exponential backoff).
  - The CA removes the requested node from NodeType-A-AZ-1.
- As there is now an insufficient amount of nodes the CA scales again. As NodeType-A-AZ-1 is disabled, NodeType-A-AZ-2 is chosen in the final scale-up plan as it is the only remaining option of priority 30.
- Hyperscaler assures that the node is created.
- CA waits for the node to join the cluster and become healthy.
- 15 minutes pass and the node is still not healthy.
- CA runs into a timeout for the requested node:
  - The CA disables NodeType-A-AZ-2 for 20 minutes (exponential backoff).
  - The CA removes the requested node from NodeType-A-AZ-2.
- As there is now an insufficient amount of nodes the CA scales again. As NodeType-A-AZ-2 and NodeType-A-AZ-1 are both disabled, NodeType-B-AZ-1 is chosen in the final scale-up plan as it is the highest remaining option with a priority of 20.
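To make the 90 minutes explicit (the timeouts above correspond to a maxNodeProvisionTime of 15 minutes):

- Attempt 1: NodeType-A-AZ-1, times out after 15 min, 5 min backoff
- Attempt 2: NodeType-A-AZ-2, times out after 15 min, 5 min backoff
- Attempt 3: NodeType-A-AZ-1, times out after 15 min, 10 min backoff
- Attempt 4: NodeType-A-AZ-2, times out after 15 min, 10 min backoff
- Attempt 5: NodeType-A-AZ-1, times out after 15 min, 20 min backoff
- Attempt 6: NodeType-A-AZ-2, times out after 15 min, 20 min backoff

That is 6 × 15 min = 90 min before NodeType-B-AZ-1 is used. The 5 and 10 minute backoffs never change the outcome because they expire while the CA is still waiting out the 15 minute provisioning attempt in the other AZ; only the 20 minute backoff is long enough to keep both NodeType-A groups disabled at the same time.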
Overall it took 90 minutes until the CA finally decided to use NodeType-B, after attempting NodeType-A six times. While we acknowledge this behavior is working as designed, it creates operational challenges in production environments where rapid failover to available node types is critical for maintaining service availability. The current default backoff behavior results in repeated attempts to provision nodes from the same consistently failing node groups, significantly delaying the transition to available alternatives. While reducing maxNodeProvisionTime might appear to offer a solution, this approach is not viable in our environment, as we regularly experience legitimate node provisioning times exceeding 10 minutes for the specialized instance types required by our workloads.
Desired Behaviour with configurable backoff settings
Assume that initialNodeGroupBackoffDuration is increased to 20 minutes from the default of 5 minutes, while the other two settings are kept at their defaults.
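For illustration, this corresponds to starting the autoscaler with roughly the following flags. The flag names are the upstream cluster autoscaler ones; wiring them through the Shoot API is exactly what this issue requests, so treat the excerpt as an illustrative sketch:

# Illustrative container args for the cluster-autoscaler deployment; only the backoff
# duration deviates from its upstream default in this scenario.
command:
  - ./cluster-autoscaler
  - --max-node-provision-time=15m
  - --initial-node-group-backoff-duration=20m
  - --max-node-group-backoff-duration=30m
  - --node-group-backoff-reset-timeout=3h

With this configuration, the same scenario as above plays out as follows: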
- CA requests a node of NodeType-A-AZ-1.
- Hyperscaler assures that the node is created.
- CA waits for the node to join the cluster and become healthy.
- 15 minutes pass and the node is still not healthy.
- CA runs into a timeout for the requested node:
  - The CA disables NodeType-A-AZ-1 for 20 minutes (configured backoff).
  - The CA removes the requested node from NodeType-A-AZ-1.
- As there is now an insufficient amount of nodes the CA scales again. As NodeType-A-AZ-1 is disabled, NodeType-A-AZ-2 is chosen in the final scale-up plan as it is the only remaining option of priority 30.
- Hyperscaler assures that the node is created.
- CA waits for the node to join the cluster and become healthy.
- 15 minutes pass and the node is still not healthy.
- CA runs into a timeout for the requested node:
  - The CA disables NodeType-A-AZ-2 for 20 minutes (configured backoff).
  - The CA removes the requested node from NodeType-A-AZ-2.
- As there is now an insufficient amount of nodes the CA scales again. As NodeType-A-AZ-2 and NodeType-A-AZ-1 are both disabled, NodeType-B-AZ-1 is chosen in the final scale-up plan as it is the highest remaining option.
In this scenario, the CA reaches the fallback node type NodeType-B after only 30 minutes and two attempts on NodeType-A. This is a significant improvement over the default behavior and reduces the unproductive retry attempts from six to two. The optimized behavior aligns with our operational requirements, as cloud provider problems that prevent node provisioning are typically not resolved within short time windows. Longer backoff periods therefore allow for more efficient resource allocation by reducing futile retry attempts and enabling faster transitions to available alternatives.
Additional Configuration Options
For enhanced control over the backoff mechanism, it would be valuable to also expose the complementary settings maxNodeGroupBackoffDuration and nodeGroupBackoffResetTimeout. While not essential for addressing the primary use case, these additional parameters would provide complete flexibility for fine-tuning autoscaler behavior across diverse operational environments.