Skip to content

Conversation

tobschli
Copy link
Member

@tobschli tobschli commented Sep 9, 2025

How to categorize this PR?

/area robustness
/kind enhancement

What this PR does / why we need it:

This PR adds the possibility to restrict the total amount of non-namespaced resources in the virtual garden cluster.
To achieve this, I enhanced the already existing resource-size webhook to also allow setting a count field.

When set, the webhook checks the total amount of already existing resources in the virtual-garden cluster and permits creating resources above this limit.

Just like the size limit, Kubernetes subjects can be configured that are exempt from this limit.


In addition to that, I added some minor fixes that improve developer experience.

  • Making the admission-local plugin debuggable
  • Fixing one condition in a Helm template that could result in invalid YAML

Which issue(s) this PR fixes:
Fixes #10878

Special notes for your reviewer:

There are two points, where I would like to know your opinion about:

  1. In order to allow this webhook to be usable for non-namespaced resources, I needed to remove the namespaceSelector for it.
    This is not really clean in my opinion, but apart from moving this logic in a standalone webhook (which I also did not prefer), I could not come up with a better solution at the moment
  2. The name of the webhook now does not 100% reflect it's usage, but I am not entirely sure about the potential side effects of renaming the endpoint. Should we keep it as-is? Or should we rename it?

Release note:

It is now possible to restrict the total count of objects for non-namespaced resources. You can set it through the admission controller configuration's `server.resourceAdmissionConfiguration.limits[].count` field.

@gardener-prow gardener-prow bot added area/robustness Robustness, reliability, resilience related kind/enhancement Enhancement, improvement, extension labels Sep 9, 2025
@gardener-prow gardener-prow bot requested review from ary1992 and marc1404 September 9, 2025 09:03
@gardener-prow gardener-prow bot added cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 9, 2025
@timuthy
Copy link
Member

timuthy commented Sep 10, 2025

/assign

Copy link
Member

@timuthy timuthy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feature! Besides the comments, could you please enhance the documentation?

Regarding your second question:

The name of the webhook now does not 100% reflect it's usage, but I am not entirely sure about the potential side effects of renaming the endpoint. Should we keep it as-is? Or should we rename it?

Renaming of the webhook should be fine, but renaming the endpoint isn't disruption free. If we want to completely rename, a new webhook needs to be introduced and the old one removed.

@tobschli
Copy link
Member Author

I just noticed that it would probably also make sense to add an integration test for this.
The unit tests for this feature only test the case where the count limit is 0.
This works because there is no actual rest mapper, which means the resolution to the GVR fails, returning a count of 0, which violates the limit.

I would implement this, once we decided to use a separate webhook or not :)

@gardener-prow gardener-prow bot added cla: no Indicates the PR's author has not signed the cla-assistant.io CLA. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. and removed cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. cla: no Indicates the PR's author has not signed the cla-assistant.io CLA. labels Sep 16, 2025
@tobschli
Copy link
Member Author

tobschli commented Sep 18, 2025

In a talk with @timuthy we decided to move the namespace selector into the component-config of this webhook. - non-namespaced resources will be size / count checked all the time - namespaced resources will be ignored in case they don't match the namespaceSelector

We gain:

  • We do not need to introduce a new webhook, where we would need to drop all namespaced resources
  • We can also size restrict non-namespaced resources now 🥳

It turns out, that non-namespaced resources are not skipped, even though there is a namespaceSelector present.
This is also explained in the doc-string of the ValidatingWebhoolConfiguration's NamespaceSelector (ref):

If the object is another cluster scoped resource, it never skips the webhook.

I will update this PR accordingly in the near future.

Copy link
Member

@timuthy timuthy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the adjustments!

@tobschli
Copy link
Member Author

Because I somehow cannot comment on the one PR review comment:

Can we only log this in case a limit is really configured for the affected resource?

We already return earlier here

@tobschli
Copy link
Member Author

tobschli commented Sep 22, 2025

BTW. pull-gardener-e2e-kind-ha-multi-zone now failed 5 times in a row.. I am beginning to think that this is not flaky behavior, but something related to this PR. But the test logs, as well as the artifacts do not really give me a clue on what this could be...

When checking the diff between the commit where it was working and where the failing started, I also cannot point out anything that looks off: https://github.com/tobschli/gardener/compare/5991c265aa5732529b289347e376e9459c5b5d7c..924f3e9db980e5594090792d6888a78c4692669c

Copy link
Member

@timuthy timuthy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Sep 23, 2025
Copy link
Contributor

gardener-prow bot commented Sep 23, 2025

LGTM label has been added.

Git tree hash: b1ee9c4886f6e35412e29a0bbb6e177cdcdcc268

@tobschli
Copy link
Member Author

I will take a look later on why the pull-gardener-e2e-kind-ha-multi-zone consistently fails.

@tobschli
Copy link
Member Author

Booting up the scenario locally works without a problem.
My suspicion is that this behavior is related to this flake: #12919

I will take a look why this change here provokes this even more.

@gardener-prow gardener-prow bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 2, 2025
@gardener-prow gardener-prow bot requested a review from timuthy October 2, 2025 11:27
Copy link
Contributor

gardener-prow bot commented Oct 2, 2025

New changes are detected. LGTM label has been removed.

@tobschli
Copy link
Member Author

tobschli commented Oct 6, 2025

/test pull-gardener-e2e-kind-ha-multi-zone

Size *resource.Quantity `json:"size,omitempty"`
// Count specifies the maximum number of resources of the given kind.
// +optional
Count *int64 `json:"count,omitempty"`
Copy link
Member

@rfranzke rfranzke Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could adapt L72 now

Size *resource.Quantity `json:"size,omitempty"`
// Count specifies the maximum number of resources of the given kind.
// +optional
Count *int64 `json:"count,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +116 to +124
if limit.Count != nil && apivalidation.ValidateNonnegativeField(*limit.Count, fld.Child("count")).ToAggregate() != nil {
allErrs = append(allErrs, field.Invalid(fld.Child("count"), limit.Count, "value must not be negative"))
}

if limit.Size != nil && limit.Size.Cmp(resource.Quantity{}) < 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess one of count or size should be set (both must not be unset at the same time).

APIVersions: limit.APIVersions,
Resources: limit.Resources,
Size: limit.Size,
Size: &limit.Size,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need to add Count here now as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ResourceLimit type seems to be equally defined in two places:

https://github.com/tobschli/gardener/blob/13363fa6facf8e75776416b567d4a8e6ea9bf27b/pkg/admissioncontroller/apis/config/v1alpha1/types.go#L73-L88

type ResourceLimit struct {
// APIGroups is the name of the APIGroup that contains the limited resource. WildcardAll represents all groups.
// +optional
APIGroups []string `json:"apiGroups,omitempty"`
// APIVersions is the version of the resource. WildcardAll represents all versions.
// +optional
APIVersions []string `json:"apiVersions,omitempty"`
// Resources is the name of the resource this rule applies to. WildcardAll represents all resources.
Resources []string `json:"resources"`
// Size specifies the imposed limit.
Size resource.Quantity `json:"size"`
}

Do we want to drop one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we didn't want to import the component config API into the operator.gardener.cloud/v1alpha1 package, hence the duplication.

skaffold.yaml Outdated
Comment on lines 1644 to 1646
# TEST: Check if this causes the flake in CI
# insecureRegistries:
# - garden.local.gardener.cloud:5001
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove this again :)

}

func (h *Handler) getKindFromGVR(gvr *metav1.GroupVersionResource) (string, error) {
if h.RESTMapper == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check really needed? Usually, we don't do these checks if we control the field 👀

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps not, but I thought it would be better to have it, than to not have it.
In order keep everything aligned, I will remove it :)

}

func (h *Handler) existingResourcesExceedLimit(_ admission.Request, gvr *metav1.GroupVersionResource, limit int64) (bool, error) {
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we pass the context.Context from the Handle function down here? This would be better instead of using context.Background

})

// List all resources of this type cluster-wide (since we only count non-namespaced resources)
if err := h.Client.List(ctx, list, client.Limit(limit+1)); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Would it make sense to use an APIReader (that is uncached)? Otherwise, if many objects were created at the same time, the GAC's cache might be stale and (unintentionally) allow exceeding the configured resource limit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the GAC's manager use a cached client? I could not find the cache settings for it, so I would assume it is uncached?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the controller-runtime client is cached by default, unless you explicitly disable it.

return fmt.Errorf("failed to determine if count exceeds limit: %w", err)
}

if exceedsLimit {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIt: You could make an early exit (to prevent indenting) if exceedsLimit is false.

port: {{ required ".Values.global.admission.config.server.metrics.port is required" .Values.global.admission.config.server.metrics.port }}
{{- if .Values.global.admission.config.server.resourceAdmissionConfiguration }}
resourceAdmissionConfiguration:
{{- if .Values.global.admission.config.server.resourceAdmissionConfiguration.limits }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: You could adapt the example in

# resourceAdmissionConfiguration:
# limits:
# - apiGroups: ["core.gardener.cloud"]
# apiVersions: ["*"]
# resources: ["shoots"]
# size: 100Ki

…en values are empty.

Before this, when not specifying e.g. `unrestrictedSubjects`, the key would be added to the YAML, leading to invalid YAML
This also included some refactorings, as we did not return an error when the evaluation of the current count of objects fails, to prevent this feature from blocking operations on the objects.
Copy link
Contributor

gardener-prow bot commented Oct 9, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rfranzke. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

gardener-prow bot commented Oct 9, 2025

@tobschli: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-gardener-e2e-kind-operator-seed 87ff3b4 link true /test pull-gardener-e2e-kind-operator-seed
pull-gardener-apidiff a7a3d09 link false /test pull-gardener-apidiff
pull-gardener-unit a7a3d09 link true /test pull-gardener-unit
pull-gardener-e2e-kind-migration a7a3d09 link true /test pull-gardener-e2e-kind-migration

Full PR test history. Your PR dashboard. Command help for this repository.
Please help us cut down on flakes by linking this test failure to an open flake report or filing a new flake report if you can't find an existing one. Also see our testing guideline for how to avoid and hunt flakes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/robustness Robustness, reliability, resilience related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add quota for project resources

3 participants