
Conversation

@hiboyang (Contributor) commented Dec 11, 2025

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This is a follow-up to PR #8082, which adds support for RayJob with autoscaling. This PR improves the e2e test to wait for scale-up and scale-down.

Which issue(s) this PR fixes:

Improves the e2e test from PR #8082.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Dec 11, 2025
@netlify bot commented Dec 11, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 88b37dc
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6941946b5865b900088e1b6d

@k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 11, 2025
@k8s-ci-robot (Contributor)

Hi @hiboyang. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 11, 2025
@hiboyang (Contributor Author)

@mimowo @yaroslava-serdiuk this is the PR to improve e2e test for RayJob with autoscaling.

@mimowo (Contributor) commented Dec 11, 2025

Thank you! I will leave the first review pass to Yaroslava

@mimowo (Contributor) commented Dec 11, 2025

/ok-to-test

@k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 11, 2025
# run tasks in parallel to trigger autoscaling (scaling up)
print(ray.get([my_task.remote(i, 10) for i in range(10)]))
# run tasks in sequence to trigger scaling down
print([ray.get(my_task.remote(i, 1)) for i in range(40)])`,
Contributor

Is there a reason why you put such high numbers? I'd anticipate a maximum of around 5 being adequate here. If we want to run with 20 or 40 iterations, I suggest we reduce the sleep time accordingly to optimize the test's execution speed.

Contributor Author

The high numbers give the RayJob enough time to scale up and down. Let me tune them to be a bit lower.
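For context, the numbers being discussed live inside the entrypoint script that the Go test embeds as a raw string in the RayJob spec. A minimal sketch of that shape is below; the my_task definition is an assumption (it is not part of this diff), and the sleep and iteration values are illustrative rather than the final tuned ones.

package main

import "fmt"

func main() {
	// Sketch only: the real e2e test embeds a similar entrypoint in the RayJob spec.
	entrypoint := `
import time
import ray

ray.init()

@ray.remote
def my_task(i, seconds):
    # Sleeping keeps tasks pending long enough for the Ray autoscaler to react.
    time.sleep(seconds)
    return i

# run tasks in parallel to trigger autoscaling (scaling up)
print(ray.get([my_task.remote(i, 10) for i in range(10)]))
# run tasks in sequence to trigger scaling down
print([ray.get(my_task.remote(i, 1)) for i in range(40)])
`
	fmt.Println(entrypoint)
}

Shrinking the sleep argument and the iteration counts speeds the test up, but only down to the point where the autoscaler still has time to add and remove workers, which is the trade-off behind the values settled on below.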

@hiboyang (Contributor Author) commented Dec 11, 2025

@yaroslava-serdiuk I updated the tests to shorten the waiting time in the RayJob. I tried different values several times; if they are too short, the test becomes unstable and may fail randomly. These are now the final values. Would you take a look?

Contributor Author

Thanks @yaroslava-serdiuk for checking! I updated the PR again due to a git conflict. Would you help approve it again?

Contributor

Sure, added lgtm. Passing to @mimowo for approval.

@yaroslava-serdiuk (Contributor)

/release-note-none

@k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Dec 11, 2025
@hiboyang (Contributor Author)

/retest

@yaroslava-serdiuk (Contributor) left a comment

/lgtm

@hiboyang (Contributor Author)

Actually, I have these two comments: https://github.com/kubernetes-sigs/kueue/pull/8174/changes#r2620439444 and #8174 (comment). PTAL.

Thanks for the comments; I updated the PR accordingly.

Comment on lines 337 to 339
g.Expect(verifyPodNamesAreSuperset(currentPodNames, initialPodNames)).To(gomega.BeTrue(),
"Current worker pod names should be a superset of initial pod names. "+
"Initial pods: %v, Current pods: %v", initialPodNames, currentPodNames)
Contributor

Nit: instead of the helper functions, let's use the gomega way:

g.Expect(currentPodNames.UnsortedList()).To(gomega.ContainElements(initialPodNames.UnsortedList()))

This is less code and produces a better failure message when the assertion fails.
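As an illustration of the failure-message point, here is a small self-contained sketch; it is not the PR's actual test, and the pod-name values are made up.

package kuberay_test

import (
	"testing"

	"github.com/onsi/gomega"
	"k8s.io/apimachinery/pkg/util/sets"
)

// On failure, ContainElements prints both lists and the missing elements,
// whereas asserting on a boolean helper only reports "expected true, got false".
func TestWorkerPodNamesSuperset(t *testing.T) {
	g := gomega.NewWithT(t)

	initialPodNames := sets.New("rayjob-workers-0", "rayjob-workers-1")
	currentPodNames := sets.New("rayjob-workers-0", "rayjob-workers-1", "rayjob-workers-2")

	// Every initial worker pod name should still be present after scaling up.
	g.Expect(currentPodNames.UnsortedList()).To(gomega.ContainElements(initialPodNames.UnsortedList()))
}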

Contributor Author

Good suggestion, updated PR!

g.Expect(workerPodNames).To(gomega.HaveLen(1), "Expected exactly 1 pods with 'workers' in the name")

// Verify that the previous scaled-up pod names are a superset of the current pod names
g.Expect(verifyPodNamesAreSuperset(scaledUpPodNames, workerPodNames)).To(gomega.BeTrue(),
Contributor

Here also, let's replace that with an assertion using gomega.ContainElements.

Contributor Author

Updated

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hiboyang, mbobrovskyi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 16, 2025
@mimowo (Contributor) left a comment

Thank you 👍
/lgtm
/cherrypick release-0.15

@k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 16, 2025
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: f209c3591399228f9e97dbe9c96f49273f0dc9ea

@hiboyang (Contributor Author)

/retest

1 similar comment
@hiboyang (Contributor Author)

/retest

gomega.Expect(k8sClient.Create(ctx, configMap)).Should(gomega.Succeed())
})
ginkgo.DeferCleanup(func() {
gomega.Expect(k8sClient.Delete(ctx, configMap)).Should(gomega.Succeed())
Contributor

Suggested change:
-	gomega.Expect(k8sClient.Delete(ctx, configMap)).Should(gomega.Succeed())
+	util.ExpectObjectToBeDeleted(ctx, k8sClient, configMap, true)
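For what it's worth, the benefit of the util helper is presumably that it waits for the object to actually disappear instead of only issuing the delete. A rough sketch of that pattern, assuming a controller-runtime client (the real util.ExpectObjectToBeDeleted may differ in its details):

package util

import (
	"context"
	"time"

	"github.com/onsi/gomega"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// expectObjectDeleted is a sketch: issue the delete, then poll until the API
// server reports NotFound, so later steps cannot race with a terminating object.
func expectObjectDeleted(ctx context.Context, c client.Client, obj client.Object) {
	gomega.Expect(client.IgnoreNotFound(c.Delete(ctx, obj))).To(gomega.Succeed())
	gomega.Eventually(func() bool {
		err := c.Get(ctx, client.ObjectKeyFromObject(obj), obj)
		return apierrors.IsNotFound(err)
	}, 30*time.Second, time.Second).Should(gomega.BeTrue())
}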

Contributor Author

I updated DeleteNamespace() to delete the ConfigMap, to keep cleanup consistent with the existing code inside DeleteNamespace().
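A minimal sketch of that idea, assuming a controller-runtime client (the actual DeleteNamespace helper in test/util may implement this differently):

package util

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteConfigMapsInNamespace is a sketch of the extra cleanup step: remove all
// ConfigMaps in the namespace during namespace teardown, so individual tests do
// not need their own DeferCleanup for the entrypoint ConfigMap.
func deleteConfigMapsInNamespace(ctx context.Context, c client.Client, ns string) error {
	return c.DeleteAllOf(ctx, &corev1.ConfigMap{}, client.InNamespace(ns))
}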

Contributor

Thank you!

@k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 16, 2025
@mbobrovskyi (Contributor)

/lgtm

@k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 16, 2025
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 4155b7e1abb71c7ad3890b8daa87e87b2d5bd033

@k8s-ci-robot merged commit 8b0b091 into kubernetes-sigs:main Dec 16, 2025
28 checks passed
@k8s-ci-robot added this to the v0.16 milestone Dec 16, 2025
hiboyang added a commit to hiboyang/kueue_oss that referenced this pull request Dec 16, 2025
* Update rayjob autoscaling e2e test to check number of workers

* Refactor to extra helper function countWorkerPods

* Remove TODO comment

* Update test message

* Check running work pods

* Check workloads

* Reduce rayjob sleeping time

* Run 16 times in last step

* Run 16 times in middle step

* Update test/e2e/singlecluster/kuberay_test.go

Co-authored-by: Yaroslava Serdiuk <[email protected]>

* Add DeleteAllRayJobSetsInNamespace

* Use LongTimeout when possible

* Add clean up for configmap and rayjob

* Update test/util/util.go

Co-authored-by: Mykhailo Bobrovskyi <[email protected]>

* Check scaled up pod names superset of initial pod names

* Set TerminationGracePeriodSeconds to 5 seconds

* Use gomega.ContainElements

* Delete configmap inside DeleteNamespace

---------

Co-authored-by: Yaroslava Serdiuk <[email protected]>
Co-authored-by: Mykhailo Bobrovskyi <[email protected]>
hiboyang added a commit to hiboyang/kueue_oss that referenced this pull request Dec 16, 2025
k8s-ci-robot pushed a commit that referenced this pull request Dec 16, 2025
…by using the ElasticJobsViaWorkloadSlices feature (#8282)

* Ray: Support RayJob InTreeAutoscaling by using the ElasticJobsViaWorkloadSlices feature (#8082)

* Support RayJob InTreeAutoscaling: update RayCluster.IsTopLevel() and RayJob.Skip()

* Do not suspend RayJob in webhook when autoscaling is enabled

* Check workloadslicing.Enabled in rayjob webhook

* Simplify code in RayCluster.IsTopLevel

* Fix unit test "invalid managed - has auto scaler"

* Fix workloadslicing_test.go compile issue

* Check RayClusterSpec not nil in RayJob.Skip()

* Check RayClusterSpec nil in rayjob webook before applying default for suspend

* Fix file format issues

* Run golangci-lint run --fix

* Add log before removing scheduling gate

* Return error if rayjob is initially suspended with autoscaling

* Add e2e test for RayJob with InTreeAutoscaling

* Run gci lint for kuberay_test.go

* Update formatting in kuberay_test.go

* Fix compile issues in utils_test.go

* Fix duplicate import in kuberay_test.go

* Implement IsTopLevel for Job

* Revert "Implement IsTopLevel for Job"

This reverts commit 16b53a9.

* Reapply "Implement IsTopLevel for Job"

This reverts commit 6faefd3.

* Remove JobWithCustomWorkloadRetriever interface

* Make CopyLabelAndAnnotationFromOwner only copy for ray submitter job

* Update unit test TestCopyLabelAndAnnotationFromOwner

* Fix lint formatting

* Refactor: add function isRaySubmitterJobWithAutoScaling

* Rename CopyLabelAndAnnotationFromOwner to RaySubmitterJobCopyLabelAndAnnotationFromOwner

* Move RaySubmitterJobCopyLabelAndAnnotationFromOwner from jobframework package to job_controller

* Update pkg/controller/jobs/job/job_controller.go

Co-authored-by: Michał Woźniak <[email protected]>

* Update pkg/util/testingjobs/rayjob/wrappers.go

Co-authored-by: Michał Woźniak <[email protected]>

* Fix compile issue

* Small update according to the comment

* Check EnableInTreeAutoscaling in rayjob webhook

* Update code based on comments

* Make return error

* Create new file ray_utils.go

* Update kuberay_test.go

* Update logging leve to 5

* Make copyRaySubmitterJobMetadata return error

* Set idleTimeoutSeconds to 5 in EnableInTreeAutoscaling

* Check copyRaySubmitterJobMetadata error in test code

* Use inline if error check when calling copyRaySubmitterJobMetadata

* Fix test name to use TestCopyRaySubmitterJobMetadata

---------

Co-authored-by: Michał Woźniak <[email protected]>

* Fix compile issue: workloadSliceEnabled() -> WorkloadSliceEnabled()

* Improve RayJob InTreeAutoscaling e2e test (#8174)

* Update rayjob autoscaling e2e test to check number of workers

* Refactor to extra helper function countWorkerPods

* Remove TODO comment

* Update test message

* Check running work pods

* Check workloads

* Reduce rayjob sleeping time

* Run 16 times in last step

* Run 16 times in middle step

* Update test/e2e/singlecluster/kuberay_test.go

Co-authored-by: Yaroslava Serdiuk <[email protected]>

* Add DeleteAllRayJobSetsInNamespace

* Use LongTimeout when possible

* Add clean up for configmap and rayjob

* Update test/util/util.go

Co-authored-by: Mykhailo Bobrovskyi <[email protected]>

* Check scaled up pod names superset of initial pod names

* Set TerminationGracePeriodSeconds to 5 seconds

* Use gomega.ContainElements

* Delete configmap inside DeleteNamespace

---------

Co-authored-by: Yaroslava Serdiuk <[email protected]>
Co-authored-by: Mykhailo Bobrovskyi <[email protected]>

---------

Co-authored-by: Michał Woźniak <[email protected]>
Co-authored-by: Yaroslava Serdiuk <[email protected]>
Co-authored-by: Mykhailo Bobrovskyi <[email protected]>
k8s-ci-robot pushed a commit that referenced this pull request Dec 16, 2025
…by using the ElasticJobsViaWorkloadSlices feature (#8284)

olekzabl pushed a commit to olekzabl/kueue that referenced this pull request Dec 18, 2025

Labels

approved (Indicates a PR has been approved by an approver from all required OWNERS files.)
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
kind/cleanup (Categorizes issue or PR as related to cleaning up code, process, or technical debt.)
lgtm ("Looks good to me", indicates that a PR is ready to be merged.)
ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
release-note-none (Denotes a PR that doesn't merit a release note.)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)


5 participants