The cluster administration overview is for anyone creating or administering a Kubernetes cluster.
It assumes some familiarity with core Kubernetes concepts.
Planning a cluster
See the guides in Setup for examples of how to plan, set up, and configure
Kubernetes clusters. The solutions listed in this article are called distros.
Note:
Not all distros are actively maintained. Choose distros which have been tested with a recent
version of Kubernetes.
Before choosing a guide, here are some considerations:
Do you want to try out Kubernetes on your computer, or do you want to build a high-availability,
multi-node cluster? Choose distros best suited for your needs.
Will you be using a hosted Kubernetes cluster, such as
Google Kubernetes Engine, or hosting your own cluster?
Will your cluster be on-premises, or in the cloud (IaaS)? Kubernetes does not directly
support hybrid clusters. Instead, you can set up multiple clusters.
If you are configuring Kubernetes on-premises, consider which
networking model fits best.
Will you be running Kubernetes on "bare metal" hardware or on virtual machines (VMs)?
Do you want to run a cluster, or do you expect to do active development of Kubernetes project code?
If the latter, choose an actively-developed distro. Some distros only use binary releases, but
offer a greater variety of choices.
Familiarize yourself with the components needed to run a cluster.
Authenticating explains authentication in
Kubernetes, including the various authentication options.
Authorization is separate from
authentication, and controls how HTTP calls are handled.
Using Admission Controllers
explains plug-ins which intercept requests to the Kubernetes API server after authentication
and authorization.
Admission Webhook Good Practices
provides good practices and considerations when designing mutating admission
webhooks and validating admission webhooks.
1 - Node Shutdowns
In a Kubernetes cluster, a node
can be shut down in a planned graceful way or unexpectedly because of reasons such
as a power outage or something else external. A node shutdown could lead to workload
failure if the node is not drained before the shutdown. A node shutdown can be
either graceful or non-graceful.
Graceful node shutdown
The kubelet attempts to detect node system shutdown and terminates pods running on the node.
Kubelet ensures that pods follow the normal
pod termination process
during the node shutdown. During node shutdown, the kubelet does not accept new
Pods (even if those Pods are already bound to the node).
FEATURE STATE: Kubernetes v1.21 [beta] (enabled by default: true)
On Linux, the graceful node shutdown feature is controlled with the GracefulNodeShutdown feature gate,
which is enabled by default in 1.21.
Note:
The graceful node shutdown feature depends on systemd since it takes advantage of
systemd inhibitor locks to
delay the node shutdown with a given duration.
FEATURE STATE: Kubernetes v1.34 [beta] (enabled by default: true)
On Windows, the graceful node shutdown feature is controlled with the WindowsGracefulNodeShutdown feature gate,
which was introduced in 1.32 as an alpha feature. In Kubernetes 1.34 the feature is beta
and is enabled by default.
Note:
The Windows graceful node shutdown feature depends on the kubelet running as a Windows service;
the service then has a registered service control handler
that delays the preshutdown event by a given duration.
Windows graceful node shutdown cannot be cancelled.
If the kubelet is not running as a Windows service, it will not be able to set and monitor
the Preshutdown event, and
the node will have to go through the non-graceful node shutdown procedure described below.
In the case where the Windows graceful node shutdown feature is enabled, but the kubelet is not
running as a Windows service, the kubelet will continue running instead of failing. However,
it will log an error indicating that it needs to be run as a Windows service.
Configuring graceful node shutdown
Note that by default, both configuration options described below,
shutdownGracePeriod and shutdownGracePeriodCriticalPods, are set to zero,
thus not activating the graceful node shutdown functionality.
To activate the feature, both options should be configured appropriately and
set to non-zero values.
Once the kubelet is notified of a node shutdown, it sets a NotReady condition on
the Node, with the reason set to "node is shutting down". The kube-scheduler honors this condition
and does not schedule any Pods onto the affected node; other third-party schedulers are
expected to follow the same logic. This means that new Pods won't be scheduled onto that node
and therefore none will start.
The kubelet also rejects Pods during the PodAdmission phase if an ongoing
node shutdown has been detected, so that even Pods with a
toleration for
node.kubernetes.io/not-ready:NoSchedule do not start there.
When kubelet is setting that condition on its Node via the API,
the kubelet also begins terminating any Pods that are running locally.
During a graceful shutdown, the kubelet terminates pods in two phases:
1. Terminate regular pods running on the node.
2. Terminate critical pods running on the node.
The graceful node shutdown feature is configured with two
KubeletConfiguration options:
shutdownGracePeriod:
Specifies the total duration that the node should delay the shutdown by. This is the total
grace period for pod termination for both regular and
critical pods.
shutdownGracePeriodCriticalPods:
Specifies the duration used to terminate
critical pods
during a node shutdown. This value should be less than shutdownGracePeriod.
Note:
There are cases when Node termination was cancelled by the system (or perhaps manually
by an administrator). In either of those situations the Node will return to the Ready state.
However, Pods which already started the process of termination will not be restored by kubelet
and will need to be re-scheduled.
For example, if shutdownGracePeriod=30s, and
shutdownGracePeriodCriticalPods=10s, kubelet will delay the node shutdown by
30 seconds. During the shutdown, the first 20 (30-10) seconds would be reserved
for gracefully terminating normal pods, and the last 10 seconds would be
reserved for terminating critical pods.
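As a sketch, the corresponding KubeletConfiguration fragment for that example would look like
the following (assuming the standard kubelet.config.k8s.io/v1beta1 configuration file format):

# kubelet configuration file fragment matching the example above
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 30s
shutdownGracePeriodCriticalPods: 10s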
Note:
When pods are evicted during the graceful node shutdown, they are marked as shutdown.
Running kubectl get pods shows the status of the evicted pods as Terminated,
and kubectl describe pod indicates that the pod was evicted because of node shutdown:
Reason: Terminated
Message: Pod was terminated in response to imminent node shutdown.
Pod Priority based graceful node shutdown
FEATURE STATE: Kubernetes v1.24 [beta] (enabled by default: true)
To provide more flexibility during graceful node shutdown around the ordering
of pods during shutdown, graceful node shutdown honors the PriorityClass for
Pods, provided that you enabled this feature in your cluster. The feature
allows cluster administrators to explicitly define the ordering of pods
during graceful node shutdown based on
priority classes.
The Graceful Node Shutdown feature, as described
above, shuts down pods in two phases, non-critical pods, followed by critical
pods. If additional flexibility is needed to explicitly define the ordering of
pods during shutdown in a more granular way, pod priority based graceful
shutdown can be used.
When graceful node shutdown honors pod priorities, this makes it possible to do
graceful node shutdown in multiple phases, each phase shutting down a
particular priority class of pods. The kubelet can be configured with the exact
phases and shutdown time per phase.
Assuming the following custom pod
priority classes
in a cluster,
Pod priority class name    Pod priority class value
custom-class-a             100000
custom-class-b             10000
custom-class-c             1000
regular/unset              0
Within the kubelet configuration
the settings for shutdownGracePeriodByPodPriority could look like:
Pod priority class value    Shutdown period
100000                      10 seconds
10000                       180 seconds
1000                        120 seconds
0                           60 seconds
The corresponding kubelet config YAML configuration would be:
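(Reconstructed from the table above; the per-entry field names priority and
shutdownGracePeriodSeconds follow the KubeletConfiguration API.)

shutdownGracePeriodByPodPriority:
  - priority: 100000
    shutdownGracePeriodSeconds: 10
  - priority: 10000
    shutdownGracePeriodSeconds: 180
  - priority: 1000
    shutdownGracePeriodSeconds: 120
  - priority: 0
    shutdownGracePeriodSeconds: 60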
The above table implies that any pod with priority value >= 100000 will get
just 10 seconds to shut down, any pod with value >= 10000 and < 100000 will get 180
seconds to shut down, any pod with value >= 1000 and < 10000 will get 120 seconds to shut down.
Finally, all other pods will get 60 seconds to shut down.
One doesn't have to specify values corresponding to all of the classes. For
example, you could instead use these settings:
Pod priority class value    Shutdown period
100000                      300 seconds
1000                        120 seconds
0                           60 seconds
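Under the same field layout assumed above, a sketch of the corresponding sparse configuration
would be:

shutdownGracePeriodByPodPriority:
  - priority: 100000
    shutdownGracePeriodSeconds: 300
  - priority: 1000
    shutdownGracePeriodSeconds: 120
  - priority: 0
    shutdownGracePeriodSeconds: 60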
In the above case, the pods with custom-class-b will go into the same bucket
as custom-class-c for shutdown.
If there are no pods in a particular range, then the kubelet does not wait
for pods in that priority range. Instead, the kubelet immediately skips to the
next priority class value range.
If this feature is enabled and no configuration is provided, then no ordering
action will be taken.
Using this feature requires enabling the GracefulNodeShutdownBasedOnPodPriority feature gate,
and setting shutdownGracePeriodByPodPriority in the
kubelet config
to the desired configuration containing the pod priority class values and
their respective shutdown periods.
Note:
The ability to take Pod priority into account during graceful node shutdown was introduced
as an Alpha feature in Kubernetes v1.23. In Kubernetes 1.34
the feature is Beta and is enabled by default.
Metrics graceful_shutdown_start_time_seconds and graceful_shutdown_end_time_seconds
are emitted under the kubelet subsystem to monitor node shutdowns.
Non-graceful node shutdown handling
FEATURE STATE: Kubernetes v1.28 [stable] (enabled by default: true)
A node shutdown action may not be detected by kubelet's Node Shutdown Manager,
either because the command does not trigger the inhibitor locks mechanism used by
kubelet or because of a user error, i.e., the ShutdownGracePeriod and
ShutdownGracePeriodCriticalPods are not configured properly. Please refer to above
section Graceful Node Shutdown for more details.
When a node is shutdown but not detected by kubelet's Node Shutdown Manager, the pods
that are part of a StatefulSet
will be stuck in terminating status on the shutdown node and cannot move to a new running node.
This is because kubelet on the shutdown node is not available to delete the pods so
the StatefulSet cannot create a new pod with the same name. If there are volumes used by the pods,
the VolumeAttachments will not be deleted from the original shutdown node so the volumes
used by these pods cannot be attached to a new running node. As a result, the
application running on the StatefulSet cannot function properly. If the original
shutdown node comes up, the pods will be deleted by kubelet and new pods will be
created on a different running node. If the original shutdown node does not come up,
these pods will be stuck in terminating status on the shutdown node forever.
To mitigate the above situation, a user can manually add the taint node.kubernetes.io/out-of-service
with either NoExecute or NoSchedule effect to a Node marking it out-of-service.
If a Node is marked out-of-service with this taint, the pods on the node will be forcefully deleted
if there are no matching tolerations on it and volume detach operations for the pods terminating on
the node will happen immediately. This allows the Pods on the out-of-service node to recover quickly
on a different node.
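For example, to mark a node out-of-service (the taint value nodeshutdown below is arbitrary; only
the key and the effect matter here):

kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute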
During a non-graceful shutdown, Pods are terminated in two phases:
1. Force delete the Pods that do not have matching out-of-service tolerations.
2. Immediately perform detach volume operations for such pods.
Note:
Before adding the taint node.kubernetes.io/out-of-service, it should be verified
that the node is already in shutdown or power off state (not in the middle of restarting).
The user is required to manually remove the out-of-service taint after the pods are
moved to a new node and the user has checked that the shutdown node has been
recovered since the user was the one who originally added the taint.
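Once the pods have moved and the node has recovered, the taint can be removed by key, for example:

kubectl taint nodes <node-name> node.kubernetes.io/out-of-service-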
Forced storage detach on timeout
In any situation where a pod deletion has not succeeded for 6 minutes, Kubernetes will
force detach volumes being unmounted if the node is unhealthy at that instant. Any
workload still running on the node that uses a force-detached volume will cause a
violation of the
CSI specification,
which states that ControllerUnpublishVolume "must be called after all
NodeUnstageVolume and NodeUnpublishVolume on the volume are called and succeed".
In such circumstances, volumes on the node in question might encounter data corruption.
The forced storage detach behaviour is optional; users might opt to use the "Non-graceful
node shutdown" feature instead.
Force storage detach on timeout can be disabled by setting the disable-force-detach-on-timeout
config field in kube-controller-manager. Disabling the force detach on timeout feature means
that a volume that is hosted on a node that is unhealthy for more than 6 minutes will not have
its associated
VolumeAttachment
deleted.
After this setting has been applied, unhealthy pods still attached to volumes must be recovered
via the Non-Graceful Node Shutdown procedure mentioned above.
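As a sketch, assuming your kube-controller-manager version exposes this option as a command-line
flag, disabling the behaviour could look like the following (verify the flag against your version):

# kube-controller-manager invocation fragment
kube-controller-manager --disable-force-detach-on-timeout=true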
2 - Swap Memory Management
Kubernetes can be configured to use swap memory on a node,
allowing the kernel to free up physical memory by swapping out pages to backing storage.
This is useful for multiple use-cases.
For example, nodes running workloads that can benefit from using swap,
such as those that have large memory footprints but only access a portion of that memory at any given time.
It also helps prevent Pods from being terminated during memory pressure spikes,
shields nodes from system-level memory spikes that might compromise their stability,
allows for more flexible memory management on the node, and much more.
How to use it?
Prerequisites
Swap must be enabled and provisioned on the node.
The node must run a Linux operating system.
The node must use cgroup v2. Kubernetes does not support swap on cgroup v1 nodes.
Enabling swap for Kubernetes Workloads
To allow Kubernetes workloads to use swap,
you must disable the kubelet's default behavior of failing when swap is detected,
and specify memory-swap behavior as LimitedSwap:
Update kubelet configuration:
# this fragment goes into the kubelet's configuration file
failSwapOn: false
memorySwap:
  swapBehavior: LimitedSwap
The available choices for swapBehavior are:
NoSwap (default): Kubernetes workloads cannot use swap. However, processes
outside of Kubernetes' scope, like system daemons (such as kubelet itself!) can utilize swap.
This behavior is beneficial for protecting the node from system-level memory spikes,
but it does not safeguard the workloads themselves from such spikes.
LimitedSwap: Kubernetes workloads can utilize swap memory.
The amount of swap available to a Pod is determined automatically.
For more details, see the section below.
If configuration for memorySwap is not specified,
by default the kubelet will apply the same behaviour as the NoSwap setting.
Bear in mind that the following pods would be excluded from swap access
(see more info in the section below):
Pods that are not classified as Burstable QoS.
High-priority pods.
Containers with a memory limit equal to the memory request.
Note:
Kubernetes only supports swap for Linux nodes.
How does it work?
There are a number of possible ways that one could envision swap use on a node.
If kubelet is already running on a node, it would need to be restarted after swap is provisioned in order to identify it.
When kubelet starts on a node in which swap is provisioned and available
(with the failSwapOn: false configuration), kubelet will:
Be able to start on this swap-enabled node.
Direct the Container Runtime Interface (CRI) implementation, often referred to as the container runtime,
to allocate zero swap memory to Kubernetes workloads by default.
Swap configuration on a node is exposed to a cluster admin via the
memorySwap in the KubeletConfiguration.
As a cluster administrator, you can specify the node's behaviour in the
presence of swap memory by setting memorySwap.swapBehavior.
The kubelet uses the container runtime API, and directs the container runtime to
apply specific configuration (for example, in the cgroup v2 case, memory.swap.max) in a manner that will
enable the desired swap configuration for a container. For runtimes that use control groups, or cgroups,
the container runtime is then responsible for writing these settings to the container-level cgroup.
Observability for swap use
Node and container level metric statistics
Kubelet now collects node and container level metric statistics,
which can be accessed at the /metrics/resource (which is used mainly by monitoring
tools like Prometheus) and /stats/summary (which is used mainly by Autoscalers) kubelet HTTP endpoints.
This allows clients who can directly request the kubelet to
monitor swap usage and remaining swap memory when using LimitedSwap.
Additionally, a machine_swap_bytes metric has been added to cadvisor to show
the total physical swap capacity of the machine.
See this page for more info.
For example, these /metrics/resource are supported:
node_swap_usage_bytes: Current swap usage of the node in bytes.
container_swap_usage_bytes: Current amount of the container swap usage in bytes.
container_swap_limit_bytes: Current amount of the container swap limit in bytes.
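For example, a client with access to the API server can read these kubelet metrics through the node
proxy subresource (the node name is a placeholder):

# query the kubelet's resource metrics endpoint via the API server proxy
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/resource" | grep swap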
Using kubectl top --show-swap
Querying metrics is valuable, but somewhat cumbersome, as these metrics
are designed to be used by software rather than humans.
In order to consume this data in a more user-friendly way,
the kubectl top command has been extended to support swap metrics, using the --show-swap flag.
In order to receive information about swap usage on nodes, kubectl top nodes --show-swap can be used:
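The output might look like the following; the node names and values shown here are purely
illustrative:

NAME          CPU(cores)   CPU(%)   MEMORY(bytes)   MEMORY(%)   SWAP(bytes)   SWAP(%)
k8s-worker1   500m         25%      2000Mi          50%         1000Mi        25%
k8s-worker2   800m         40%      3000Mi          75%         2000Mi        50%
k8s-worker3   300m         15%      1500Mi          38%         <unknown>     <unknown>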
The <unknown> value indicates that the .status.nodeInfo.swap.capacity field is not set for that Node.
This probably means that the node does not have swap provisioned, or less likely,
that the kubelet is not able to determine the swap capacity of the node.
Swap discovery using Node Feature Discovery (NFD)
Node Feature Discovery
is a Kubernetes addon for detecting hardware features and configuration.
It can be utilized to discover which nodes are provisioned with swap.
As an example, to figure out which nodes are provisioned with swap,
use the following command:
kubectl get nodes -o jsonpath='{range .items[?(@.metadata.labels.feature\.node\.kubernetes\.io/memory-swap)]}{.metadata.name}{"\t"}{.metadata.labels.feature\.node\.kubernetes\.io/memory-swap}{"\n"}{end}'
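Illustrative output for such a cluster could look like this (the label values depend on your NFD
configuration):

k8s-worker1	true
k8s-worker2	true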
In this example, swap is provisioned on nodes k8s-worker1 and k8s-worker2, but not on k8s-worker3.
Risks and caveats
Caution:
It is strongly encouraged to encrypt the swap space.
See the Memory-backed volumes section below for more info.
Having swap available on a system reduces predictability.
While swap can enhance performance by making more RAM available, swapping data
back to memory is a heavy operation, sometimes slower by many orders of magnitude,
which can cause unexpected performance regressions.
Furthermore, swap changes a system's behaviour under memory pressure.
Enabling swap increases the risk of noisy neighbors,
where Pods that frequently use their RAM may cause other Pods to swap.
In addition, since swap allows for greater memory usage for workloads in Kubernetes that cannot be predictably accounted for,
and due to unexpected packing configurations,
the scheduler currently does not account for swap memory usage.
This heightens the risk of noisy neighbors.
The performance of a node with swap memory enabled depends on the underlying physical storage.
When swap memory is in use, performance will be significantly worse in an I/O
operations per second (IOPS) constrained environment, such as a cloud VM with
I/O throttling, when compared to faster storage mediums like solid-state drives
or NVMe.
As swap might cause IO pressure, it is recommended to give a higher IO latency
priority to system critical daemons. See the relevant section in the
recommended practices section below.
Memory-backed volumes
On Linux nodes, memory-backed volumes (such as secret
volume mounts, or emptyDir with medium: Memory)
are implemented with a tmpfs filesystem.
The contents of such volumes should remain in memory at all times, hence should
not be swapped to disk.
To ensure the contents of such volumes remain in memory, the noswap tmpfs option
is being used.
The Linux kernel officially supports the noswap option from version 6.3 (more info
can be found in Linux Kernel Version Requirements).
However, the different distributions often choose to backport this mount option to older
Linux versions as well.
In order to verify whether the node supports the noswap option, the kubelet will do the following:
If the kernel's version is above 6.3 then the noswap option will be assumed to be supported.
Otherwise, the kubelet will try to mount a dummy tmpfs with the noswap option at startup.
If the kubelet fails with an error indicating an unknown option, noswap will be assumed
to not be supported, hence will not be used.
A kubelet log entry will be emitted to warn the user that memory-backed volumes might swap to disk.
If the kubelet succeeds, the dummy tmpfs will be deleted and the noswap option will be used.
If the noswap option is not supported, kubelet will emit a warning log entry,
then continue its execution.
See the section above with an example for setting unencrypted swap.
However, handling encrypted swap is not within the scope of kubelet;
rather, it is a general OS configuration concern and should be addressed at that level.
It is the administrator's responsibility to provision encrypted swap to mitigate this risk.
Evictions
Configuring memory eviction thresholds for swap-enabled nodes can be tricky.
With swap being disabled, it is reasonable to configure kubelet's eviction thresholds
to be a bit lower than the node's memory capacity.
The rationale is that we want Kubernetes to start evicting Pods before the node runs out of memory
and invokes the Out Of Memory (OOM) killer, since the OOM killer is not Kubernetes-aware,
therefore does not consider things like QoS, pod priority, or other Kubernetes-specific factors.
With swap enabled, the situation is more complex.
In Linux, the vm.min_free_kbytes parameter defines the memory threshold for the kernel
to start aggressively reclaiming memory, which includes swapping out pages.
If the kubelet's eviction thresholds are set in a way that eviction would take place
before the kernel starts reclaiming memory, it could lead to workloads never
being able to swap out during node memory pressure.
However, setting the eviction thresholds too high could result in the node running out of memory
and invoking the OOM killer, which is not ideal either.
To address this, it is recommended to set the kubelet's eviction thresholds
to be slightly lower than the vm.min_free_kbytes value.
This way, the node can start swapping before kubelet would start evicting Pods,
allowing workloads to swap out unused data and preventing evictions from happening.
On the other hand, since it is just slightly lower, kubelet is likely to start evicting Pods
before the node runs out of memory, thus avoiding the OOM killer.
The value of vm.min_free_kbytes can be determined by running the following command on the node:
cat /proc/sys/vm/min_free_kbytes
Unutilized swap space
Under the LimitedSwap behavior, the amount of swap available to a Pod is determined automatically,
based on the proportion of the memory requested relative to the node's total memory
(For more details, see the section below).
This design means that usually there would be some portion of swap that will remain
restricted for Kubernetes workloads.
For example, since Guaranteed QoS pods are currently not permitted to use swap,
the amount of swap that's proportional to the memory request will remain unused
by Kubernetes workloads.
This behavior carries some risk in a situation where many pods are not eligible for swapping.
On the other hand, it effectively keeps some system-reserved amount of swap memory that can be used by processes
outside of Kubernetes' scope, such as system daemons and even kubelet itself.
Good practice for using swap in a Kubernetes cluster
Disable swap for system-critical daemons
During the testing phase and based on user feedback, it was observed that the performance
of system-critical daemons and services might degrade.
This implies that system daemons, including the kubelet, could operate slower than usual.
If this issue is encountered, it is advisable to configure the cgroup of the system slice
to prevent swapping (i.e., set memory.swap.max=0).
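For systemd-managed nodes, one way to do this is a drop-in for the system slice; a minimal sketch
(the drop-in path is just an example):

# /etc/systemd/system/system.slice.d/99-no-swap.conf
[Slice]
MemorySwapMax=0

After adding the drop-in, run systemctl daemon-reload for the setting to take effect.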
Protect system-critical daemons for I/O latency
Swap can increase the I/O load on a node.
When memory pressure causes the kernel to rapidly swap pages in and out,
system-critical daemons and services that rely on I/O operations may
experience performance degradation.
To mitigate this, it is recommended for systemd users to prioritize the system slice in terms of I/O latency.
For non-systemd users,
setting up a dedicated cgroup for system daemons and processes and prioritizing I/O latency in the same way is advised.
This can be achieved by setting io.latency for the system slice,
thereby granting it higher I/O priority.
See cgroup's documentation for more info.
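For systemd users, a minimal sketch is to set an I/O latency target for the system slice on the
device that backs swap (the device path and target value below are examples only):

# /etc/systemd/system/system.slice.d/99-io-latency.conf
[Slice]
IODeviceLatencyTargetSec=/dev/sda 50ms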
Swap and control plane nodes
The Kubernetes project recommends running control plane nodes without any swap space configured.
The control plane primarily hosts Guaranteed QoS Pods, so swap can generally be disabled.
The main concern is that swapping critical services on the control plane could negatively impact performance.
Use of a dedicated disk for swap
The Kubernetes project recommends using encrypted swap, whenever you run nodes with swap enabled.
If swap resides on a partition or the root filesystem, workloads may interfere
with system processes that need to write to disk.
When they share the same disk, processes can overwhelm swap,
disrupting the I/O of kubelet, container runtime, and systemd, which would impact other workloads.
Since swap space is located on a disk, it is crucial to ensure the disk is fast enough for the intended use cases.
Alternatively, one can configure I/O priorities between different mapped areas of a single backing device.
Swap-aware scheduling
Kubernetes 1.34 does not support allocating Pods to nodes in a way that accounts
for swap memory usage. The scheduler typically uses requests for infrastructure resources
to guide Pod placement, and Pods do not request swap space; they just request memory.
This means that the scheduler does not consider swap memory when making scheduling decisions.
While this is something we are actively working on, it is not yet implemented.
To ensure that Pods are not scheduled on nodes with swap memory
unless they are specifically intended to use it,
administrators can taint nodes that have swap available to protect against this problem.
Taints will ensure that workloads which tolerate swap will not spill onto nodes without swap under load.
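For example, an administrator might taint swap-enabled nodes with a cluster-specific key (the key
below is hypothetical) and add a matching toleration only to workloads that are meant to use swap:

# hypothetical taint key; pick one that fits your cluster's conventions
kubectl taint nodes <node-name> example.com/swap=true:NoSchedule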
Selecting storage for optimal performance
The storage device designated for swap space is critical to maintaining system responsiveness
during high memory usage.
Rotational hard disk drives (HDDs) are ill-suited for this task as their mechanical nature introduces significant latency,
leading to severe performance degradation and system thrashing.
For modern performance needs, a device such as a Solid State Drive (SSD) is probably the appropriate choice for swap,
as its low-latency electronic access minimizes the slowdown.
Swap Behavior Details
How is the swap limit being determined with LimitedSwap?
The configuration of swap memory, including its limitations, presents a significant
challenge. Not only is it prone to misconfiguration, but as a system-level property, any
misconfiguration could potentially compromise the entire node rather than just a specific
workload. To mitigate this risk and ensure the health of the node, we have implemented
Swap with automatic configuration of limitations.
With LimitedSwap, Pods that do not fall under the Burstable QoS classification (i.e.
BestEffort/Guaranteed QoS Pods) are prohibited from utilizing swap memory.
BestEffort QoS Pods exhibit unpredictable memory consumption patterns and lack
information regarding their memory usage, making it difficult to determine a safe
allocation of swap memory.
Conversely, Guaranteed QoS Pods are typically employed for applications that rely on the
precise allocation of resources specified by the workload, with memory being immediately available.
To maintain the aforementioned security and node health guarantees,
these Pods are not permitted to use swap memory when LimitedSwap is in effect.
In addition, high-priority pods are not permitted to use swap in order to ensure that the memory
they consume always remains resident in RAM, hence ready to use.
Prior to detailing the calculation of the swap limit, it is necessary to define the following terms:
nodeTotalMemory: The total amount of physical memory available on the node.
totalPodsSwapAvailable: The total amount of swap memory on the node that is available for use by Pods (some swap memory may be reserved for system use).
containerMemoryRequest: The container's memory request.
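Putting these terms together, the automatic limit described below amounts to (as designed in
KEP-2400):

containerSwapLimit = (containerMemoryRequest / nodeTotalMemory) × totalPodsSwapAvailable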
In other words, the amount of swap that a container is able to use is proportionate to its
memory request, the node's total physical memory and the total amount of swap memory on
the node that is available for use by Pods.
It is important to note that, for containers within Burstable QoS Pods, it is possible to
opt-out of swap usage by specifying memory requests that are equal to memory limits.
Containers configured in this manner will not have access to swap memory.
For more information, please see the original KEP, KEP-2400,
and its design.
3 - Node Autoscaling
Automatically provision and consolidate the Nodes in your cluster to adapt to demand and optimize cost.
In order to run workloads in your cluster, you need
Nodes. Nodes in your cluster can be autoscaled -
dynamically provisioned, or consolidated to provide needed
capacity while optimizing cost. Autoscaling is performed by Node autoscalers.
Node provisioning
If there are Pods in a cluster that can't be scheduled on existing Nodes, new Nodes can be
automatically added to the cluster—provisioned—to accommodate the Pods. This is
especially useful if the number of Pods changes over time, for example as a result of
combining horizontal workload autoscaling with Node autoscaling.
Autoscalers provision the Nodes by creating and deleting cloud provider resources backing them. Most
commonly, the resources backing the Nodes are Virtual Machines.
The main goal of provisioning is to make all Pods schedulable. This goal is not always attainable
because of various limitations, including reaching configured provisioning limits, provisioning
configuration not being compatible with a particular set of pods, or the lack of cloud provider
capacity. While provisioning, Node autoscalers often try to achieve additional goals (for example
minimizing the cost of the provisioned Nodes or balancing the number of Nodes between failure
domains).
Autoscaler configuration may also include other Node provisioning triggers (for example the number
of Nodes falling below a configured minimum limit).
Note:
Provisioning was formerly known as scale-up in Cluster Autoscaler.
Pod scheduling constraints
Pods can express scheduling constraints to
impose limitations on the kind of Nodes they can be scheduled on. Node autoscalers take these
constraints into account to ensure that the pending Pods can be scheduled on the provisioned Nodes.
The most common kind of scheduling constraints are the resource requests specified by Pod
containers. Autoscalers will make sure that the provisioned Nodes have enough resources to satisfy
the requests. However, they don't directly take into account the real resource usage of the Pods
after they start running. In order to autoscale Nodes based on actual workload resource usage, you
can combine horizontal workload autoscaling with Node
autoscaling.
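For instance, a Pod that declares container resource requests, as in this minimal sketch (the name,
image, and values are illustrative), gives an autoscaler the information it needs to pick or
provision a suitably sized Node:

apiVersion: v1
kind: Pod
metadata:
  name: web               # illustrative name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # illustrative image
    resources:
      requests:
        cpu: "500m"
        memory: 1Gi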
Node constraints imposed by autoscaler configuration
The specifics of the provisioned Nodes (for example the amount of resources, the presence of a given
label) depend on autoscaler configuration. Autoscalers can either choose them from a pre-defined set
of Node configurations, or use auto-provisioning.
Auto-provisioning
Node auto-provisioning is a mode of provisioning in which a user doesn't have to fully configure the
specifics of the Nodes that can be provisioned. Instead, the autoscaler dynamically chooses the Node
configuration based on the pending Pods it's reacting to, as well as pre-configured constraints (for
example, the minimum amount of resources or the need for a given label).
Node consolidation
The main consideration when running a cluster is ensuring that all schedulable pods are running,
whilst keeping the cost of the cluster as low as possible. To achieve this, the Pods' resource
requests should utilize as much of the Nodes' resources as possible. From this perspective, the
overall Node utilization in a cluster can be used as a proxy for how cost-effective the cluster is.
Note:
Correctly setting the resource requests of your Pods is as important to the overall
cost-effectiveness of a cluster as optimizing Node utilization.
Combining Node autoscaling with vertical workload autoscaling can
help you achieve this.
Nodes in your cluster can be automatically consolidated in order to improve the overall Node
utilization, and in turn the cost-effectiveness of the cluster. Consolidation happens through
removing a set of underutilized Nodes from the cluster. Optionally, a different set of Nodes can
be provisioned to replace them.
Consolidation, like provisioning, only considers Pod resource requests and not real resource usage
when making decisions.
For the purpose of consolidation, a Node is considered empty if it only has DaemonSet and static
Pods running on it. Removing empty Nodes during consolidation is more straightforward than non-empty
ones, and autoscalers often have optimizations designed specifically for consolidating empty Nodes.
Removing non-empty Nodes during consolidation is disruptive—the Pods running on them are
terminated, and possibly have to be recreated (for example by a Deployment). However, all such
recreated Pods should be able to schedule on existing Nodes in the cluster, or the replacement Nodes
provisioned as part of consolidation. No Pods should normally become pending as a result of
consolidation.
Note:
Autoscalers predict how a recreated Pod will likely be scheduled after a Node is provisioned or
consolidated, but they don't control the actual scheduling. Because of this, some Pods might
become pending as a result of consolidation - if for example a completely new Pod appears while
consolidation is being performed.
Autoscaler configuration may also enable triggering consolidation by other conditions (for example,
the time elapsed since a Node was created), in order to optimize different properties (for example,
the maximum lifespan of Nodes in a cluster).
The details of how consolidation is performed depend on the configuration of a given autoscaler.
Note:
Consolidation was formerly known as scale-down in Cluster Autoscaler.
Autoscalers
The functionalities described in previous sections are provided by Node autoscalers. In addition
to the Kubernetes API, autoscalers also need to interact with cloud provider APIs to provision and
consolidate Nodes. This means that they need to be explicitly integrated with each supported cloud
provider. The performance and feature set of a given autoscaler can differ between cloud provider
integrations.
graph TD
na[Node autoscaler]
k8s[Kubernetes]
cp[Cloud Provider]
k8s --> |get Pods/Nodes|na
na --> |drain Nodes|k8s
na --> |create/remove resources backing Nodes|cp
cp --> |get resources backing Nodes|na
classDef white_on_blue fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef blue_on_white fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class na blue_on_white;
class k8s,cp white_on_blue;
The Kubernetes project supports two Node autoscalers: Cluster Autoscaler and Karpenter, both
described below. From the perspective of a cluster user, both autoscalers should provide a similar
Node autoscaling experience. Both will provision new Nodes for unschedulable Pods, and both will
consolidate the Nodes that are no longer optimally utilized.
Different autoscalers may also provide features outside the Node autoscaling scope described on this
page, and those additional features may differ between them.
Consult the sections below, and the linked documentation for the individual autoscalers to decide
which autoscaler fits your use case better.
Cluster Autoscaler
Cluster Autoscaler adds or removes Nodes to pre-configured Node groups. Node groups generally map
to some sort of cloud provider resource group (most commonly a Virtual Machine group). A single
instance of Cluster Autoscaler can simultaneously manage multiple Node groups. When provisioning,
Cluster Autoscaler will add Nodes to the group that best fits the requests of pending Pods. When
consolidating, Cluster Autoscaler always selects specific Nodes to remove, as opposed to just
resizing the underlying cloud provider resource group.
Karpenter
Karpenter auto-provisions Nodes based on NodePool
configurations provided by the cluster operator. Karpenter handles all aspects of node lifecycle,
not just autoscaling. This includes automatically refreshing Nodes once they reach a certain
lifetime, and auto-upgrading Nodes when new worker Node images are released. It works directly with
individual cloud provider resources (most commonly individual Virtual Machines), and doesn't rely on
cloud provider resource groups.
Main differences between Cluster Autoscaler and Karpenter:
Cluster Autoscaler provides features related to just Node autoscaling. Karpenter has a wider
scope, and also provides features intended for managing Node lifecycle altogether (for example,
utilizing disruption to auto-recreate Nodes once they reach a certain lifetime, or auto-upgrade
them to new versions).
Cluster Autoscaler doesn't support auto-provisioning, the Node groups it can provision from have
to be pre-configured. Karpenter supports auto-provisioning, so the user only has to configure a
set of constraints for the provisioned Nodes, instead of fully configuring homogenous groups.
Cluster Autoscaler provides cloud provider integrations directly, which means that they're a part
of the Kubernetes project. For Karpenter, the Kubernetes project publishes Karpenter as a library
that cloud providers can integrate with to build a Node autoscaler.
Cluster Autoscaler provides integrations with numerous cloud providers, including smaller and less
popular providers. Fewer cloud providers integrate with Karpenter; current integrations include
AWS and
Azure.
Combine workload and Node autoscaling
Horizontal workload autoscaling
Node autoscaling usually works in response to Pods—it provisions new Nodes to accommodate
unschedulable Pods, and then consolidates the Nodes once they're no longer needed.
Horizontal workload autoscaling
automatically scales the number of workload replicas to maintain a desired average resource
utilization across the replicas. In other words, it automatically creates new Pods in response to
application load, and then removes the Pods once the load decreases.
You can use Node autoscaling together with horizontal workload autoscaling to autoscale the Nodes in
your cluster based on the average real resource utilization of your Pods.
If the application load increases, the average utilization of its Pods should also increase,
prompting workload autoscaling to create new Pods. Node autoscaling should then provision new Nodes
to accommodate the new Pods.
Once the application load decreases, workload autoscaling should remove unnecessary Pods. Node
autoscaling should, in turn, consolidate the Nodes that are no longer needed.
If configured correctly, this pattern ensures that your application always has the Node capacity to
handle load spikes if needed, but you don't have to pay for the capacity when it's not needed.
Vertical workload autoscaling
When using Node autoscaling, it's important to set Pod resource requests correctly. If the requests
of a given Pod are too low, provisioning a new Node for it might not help the Pod actually run.
If the requests of a given Pod are too high, it might incorrectly prevent consolidating its Node.
Vertical workload autoscaling
automatically adjusts the resource requests of your Pods based on their historical resource usage.
You can use Node autoscaling together with vertical workload autoscaling in order to adjust the
resource requests of your Pods while preserving Node autoscaling capabilities in your cluster.
Caution:
When using Node autoscaling, it's not recommended to set up vertical workload autoscaling for
DaemonSet Pods. Autoscalers have to predict what DaemonSet Pods on a new Node will look like in
order to predict available Node resources. Vertical workload autoscaling might make these
predictions unreliable, leading to incorrect scaling decisions.
Related components
This section describes components providing functionality related to Node autoscaling.
Descheduler
The descheduler is a component providing Node
consolidation functionality based on custom policies, as well as other features related to
optimizing Nodes and Pods (for example deleting frequently restarting Pods).
4 - Certificates
To learn how to generate certificates for your cluster, see Certificates.
5 - Cluster Networking
Networking is a central part of Kubernetes, but it can be challenging to
understand exactly how it is expected to work. There are 4 distinct networking
problems to address:
Highly-coupled container-to-container communications: this is solved by
Pods and localhost communications.
Pod-to-Pod communications: this is the primary focus of this document.
Pod-to-Service communications: this is covered by Services.
External-to-Service communications: this is also covered by Services.
Kubernetes is all about sharing machines among applications. Typically,
sharing machines requires ensuring that two applications do not try to use the
same ports. Coordinating ports across multiple developers is very difficult to
do at scale and exposes users to cluster-level issues outside of their control.
Dynamic port allocation brings a lot of complications to the system - every
application has to take ports as flags, the API servers have to know how to
insert dynamic port numbers into configuration blocks, services have to know
how to find each other, etc. Rather than deal with this, Kubernetes takes a
different approach.
To learn about the Kubernetes networking model, see here.
Kubernetes IP address ranges
Kubernetes clusters require non-overlapping IP addresses to be allocated for Pods, Services and Nodes,
from a range of available addresses configured in the following components:
The network plugin is configured to assign IP addresses to Pods.
The kube-apiserver is configured to assign IP addresses to Services.
The kubelet or the cloud-controller-manager is configured to assign IP addresses to Nodes.
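For illustration, these ranges are commonly configured with flags such as the following (the CIDR
values are placeholders):

# Services: ClusterIP range assigned by the kube-apiserver
kube-apiserver --service-cluster-ip-range=10.96.0.0/16
# Pods: cluster CIDR from which per-Node Pod ranges are allocated
kube-controller-manager --cluster-cidr=10.244.0.0/16 --allocate-node-cidrs=true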
Cluster networking types
Depending on the IP families configured, Kubernetes clusters can be categorized as:
IPv4 only: The network plugin, kube-apiserver and kubelet/cloud-controller-manager are configured to assign only IPv4 addresses.
IPv6 only: The network plugin, kube-apiserver and kubelet/cloud-controller-manager are configured to assign only IPv6 addresses.
IPv4/IPv6 dual-stack: The network plugin is configured to assign both IPv4 and IPv6 addresses,
the kube-apiserver is configured to assign both IPv4 and IPv6 addresses,
and the kubelet or cloud-controller-manager is configured to assign both IPv4 and IPv6 addresses.
All components must agree on the configured primary IP family.
Kubernetes clusters only consider the IP families present on the Pods, Services and Nodes objects,
independently of the existing IPs of the represented objects. For example, a server or a pod can have multiple
IP addresses on its interfaces, but only the IP addresses in node.status.addresses or pod.status.ips are
considered for implementing the Kubernetes network model and defining the type of the cluster.
How to implement the Kubernetes network model
The network model is implemented by the container runtime on each node. The most common container
runtimes use Container Network Interface (CNI)
plugins to manage their network and security capabilities. Many different CNI plugins exist from
many different vendors. Some of these provide only basic features of adding and removing network
interfaces, while others provide more sophisticated solutions, such as integration with other
container orchestration systems, running multiple CNI plugins, advanced IPAM features etc.
See this page
for a non-exhaustive list of networking addons supported by Kubernetes.
What's next
The early design of the networking model and its rationale are described in more detail in the
networking design document.
For future plans and some on-going efforts that aim to improve Kubernetes networking, please
refer to the SIG-Network
KEPs.
6 - Admission Webhook Good Practices
Recommendations for designing and deploying admission webhooks in Kubernetes.
This page provides good practices and considerations when designing
admission webhooks in Kubernetes. This information is intended for
cluster operators who run admission webhook servers or third-party applications
that modify or validate your API requests.
Before reading this page, ensure that you're familiar with the following
concepts: