@kwilczynski
Contributor

@kwilczynski kwilczynski commented Mar 7, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

Currently, when CRI-O attempts to stop a container where the process within, especially an init process (the so-called "PID 1"), is in an uninterruptible blocking state (for example, sleeping while it waits for disk I/O to complete), CRI-O will enter a broken state where it tries to deliver termination signals to such a process as fast as possible.

However, a blocked process might not promptly respond to the signals delivered, causing CRI-O to enter a "busy loop" in which it repeatedly retries signal delivery. This seemingly unbounded loop can render CRI-O unresponsive and cause high CPU usage for as long as it lasts.

Thus, add exponential backoff support to the container stop loop to fix the possible busy loop issue regardless of the current state of the process to be terminated. The exponential backoff will stagger the delivery of termination signals for as long as the process is still running, allowing it to eventually terminate of its own volition (or crash, whichever comes first).
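For illustration only, the general idea of pacing a stop loop with exponential backoff could be sketched roughly as below. This is not the code from this PR: the names `nextDelay` and `stopWithBackoff`, the `alive` and `signal` callbacks, and the delay values are all hypothetical and stand in for whatever the real stop loop uses.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errStopCancelled is a hypothetical error returned when the caller gives up.
var errStopCancelled = errors.New("stop cancelled before the process exited")

// nextDelay doubles the current delay and caps it at max.
func nextDelay(current, max time.Duration) time.Duration {
	next := current * 2
	if next > max {
		return max
	}
	return next
}

// stopWithBackoff keeps signalling while alive() reports true, sleeping
// between attempts for an exponentially growing (but capped) delay instead
// of retrying as fast as possible. It returns the number of signals sent.
// The done channel lets the caller abort the loop; a nil channel never fires.
func stopWithBackoff(done <-chan struct{}, alive func() bool, signal func() error, initial, max time.Duration) (int, error) {
	delay := initial
	attempts := 0
	for alive() {
		attempts++
		if err := signal(); err != nil {
			return attempts, err
		}
		select {
		case <-done:
			return attempts, errStopCancelled
		case <-time.After(delay):
		}
		delay = nextDelay(delay, max)
	}
	return attempts, nil
}

func main() {
	// Hypothetical process that "exits" after receiving three signals.
	remaining := 3
	alive := func() bool { return remaining > 0 }
	signal := func() error { remaining--; return nil }

	attempts, err := stopWithBackoff(nil, alive, signal, 10*time.Millisecond, 80*time.Millisecond)
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```

The cap on the delay is a design choice in this sketch: it bounds how long the loop can go without re-checking the process, while still avoiding a tight retry loop when the process is stuck.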

Related to:

Which issue(s) this PR fixes:

None

Special notes for your reviewer:

None

Does this PR introduce a user-facing change?

- Add exponential backoff to the container stop loop to fix a busy loop issue when a process running within the container is in an uninterruptible blocking state and becomes unresponsive to signal delivery during container termination.
- Add a new error log line to notify the user that a process (the container init) has been blocked in uninterruptible sleep for some time.

@kwilczynski kwilczynski requested a review from mrunalp as a code owner March 7, 2024 17:55
@openshift-ci openshift-ci bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 7, 2024
@openshift-ci openshift-ci bot requested review from klihub and wgahnagl March 7, 2024 17:56
@kwilczynski
Contributor Author

/assign kwilczynski

@openshift-ci openshift-ci bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/bug Categorizes issue or PR as related to a bug. labels Mar 7, 2024
@kwilczynski kwilczynski changed the title Pace the container stop loop using ticker with different intervals [WIP] Pace the container stop loop using ticker with different intervals Mar 7, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 7, 2024
@kwilczynski kwilczynski force-pushed the feature/add-stop-loop-pacing branch 2 times, most recently from 311afbe to 216bb63 Compare March 7, 2024 18:08
@codecov

codecov bot commented Mar 7, 2024

Codecov Report

Merging #7854 (b8e947a) into main (3643677) will decrease coverage by 0.01%.
The diff coverage is 69.62%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7854      +/-   ##
==========================================
- Coverage   48.95%   48.94%   -0.01%     
==========================================
  Files         151      151              
  Lines       16390    16426      +36     
==========================================
+ Hits         8024     8040      +16     
- Misses       7394     7411      +17     
- Partials      972      975       +3     

@kwilczynski kwilczynski force-pushed the feature/add-stop-loop-pacing branch from 216bb63 to 35c0693 Compare March 7, 2024 18:21
@kwilczynski kwilczynski force-pushed the feature/add-stop-loop-pacing branch from 35c0693 to 0bf9852 Compare March 7, 2024 23:34
@kwilczynski kwilczynski changed the title [WIP] Pace the container stop loop using ticker with different intervals [WIP] Add exponential backoff to the container stop loop Mar 7, 2024
@kwilczynski kwilczynski force-pushed the feature/add-stop-loop-pacing branch 2 times, most recently from 0fa7d34 to 57117c1 Compare March 7, 2024 23:38
@kwilczynski kwilczynski changed the title [WIP] Add exponential backoff to the container stop loop OCPBUGS-28981: Add exponential backoff to the container stop loop Mar 8, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 8, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 8, 2024
@openshift-ci-robot

@kwilczynski: This pull request references Jira Issue OCPBUGS-28981, which is invalid:

  • expected the bug to target only the "4.16.0" version, but multiple target versions were set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@kwilczynski
Contributor Author

/jira refresh

@openshift-ci-robot

@kwilczynski: This pull request references Jira Issue OCPBUGS-28981, which is invalid:

  • expected the bug to target only the "4.16.0" version, but multiple target versions were set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

@kwilczynski
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Mar 8, 2024
@openshift-ci-robot

@kwilczynski: This pull request references Jira Issue OCPBUGS-28981, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

In response to this:

/jira refresh

@kwilczynski kwilczynski force-pushed the feature/add-stop-loop-pacing branch 2 times, most recently from 50bda88 to 2797cc1 Compare March 19, 2024 12:27
@kwilczynski
Contributor Author

@haircommander, added the ProcessState() function as requested, in lieu of modifying Living().
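
As a rough illustration of what a process state check can look like on Linux, the sketch below reads the state field from `/proc/<pid>/stat`, where `D` denotes uninterruptible sleep. It is not the `ProcessState()` function added in this PR, whose implementation is not shown in this conversation; the helper names `parseState` and `processState` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// parseState extracts the state field from the contents of /proc/<pid>/stat.
// The line has the form "pid (comm) state ...". Because comm may itself
// contain spaces and parentheses, the state is located after the LAST ')'.
func parseState(stat string) (string, error) {
	end := strings.LastIndexByte(stat, ')')
	if end < 0 {
		return "", errors.New("malformed stat line: no closing parenthesis")
	}
	fields := strings.Fields(stat[end+1:])
	if len(fields) == 0 {
		return "", errors.New("malformed stat line: missing state field")
	}
	return fields[0], nil
}

// processState reads the state of the given PID from procfs (Linux only).
func processState(pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return "", err
	}
	return parseState(string(data))
}

func main() {
	state, err := processState(os.Getpid())
	if err != nil {
		fmt.Println("could not read process state:", err)
		return
	}
	// "D" is the procfs code for uninterruptible (usually I/O) sleep.
	fmt.Printf("state=%s uninterruptible=%t\n", state, state == "D")
}
```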

@haircommander
Member

/retest
/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 19, 2024
@openshift-ci
Contributor

openshift-ci bot commented Mar 19, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, kwilczynski, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [haircommander,saschagrunert]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kwilczynski
Contributor Author

/retest-required

@kwilczynski
Contributor Author

/retest-required

@kwilczynski
Contributor Author

/retest-required

@kwilczynski
Contributor Author

/retest

@kwilczynski
Contributor Author

/retest-required

@haircommander
Member

/override ci/prow/e2e-gcp-ovn
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 20, 2024
@openshift-ci
Contributor

openshift-ci bot commented Mar 20, 2024

@haircommander: Overrode contexts on behalf of haircommander: ci/prow/e2e-gcp-ovn

In response to this:

/override ci/prow/e2e-gcp-ovn
/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-bot openshift-merge-bot bot merged commit f63de87 into cri-o:main Mar 20, 2024
@openshift-ci-robot

@kwilczynski: Jira Issue OCPBUGS-28981: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-28981 has been moved to the MODIFIED state.

@openshift-cherrypick-robot

@kwilczynski: #7854 failed to apply on top of branch "release-1.28":

Applying: Add exponential backoff to the container stop loop
Using index info to reconstruct a base tree...
M	contrib/test/ci/vars.yml
M	internal/oci/container.go
A	internal/oci/container_freebsd.go
A	internal/oci/container_freebsd_nocgo.go
A	internal/oci/container_linux.go
M	internal/oci/container_test.go
M	internal/oci/runtime_oci.go
M	internal/oci/runtime_oci_test.go
M	test/ctr.bats
Falling back to patching base and 3-way merge...
Auto-merging test/ctr.bats
Auto-merging internal/oci/runtime_oci_test.go
Auto-merging internal/oci/runtime_oci.go
Auto-merging internal/oci/container_test.go
CONFLICT (modify/delete): internal/oci/container_linux.go deleted in HEAD and modified in Add exponential backoff to the container stop loop. Version Add exponential backoff to the container stop loop of internal/oci/container_linux.go left in tree.
CONFLICT (modify/delete): internal/oci/container_freebsd_nocgo.go deleted in HEAD and modified in Add exponential backoff to the container stop loop. Version Add exponential backoff to the container stop loop of internal/oci/container_freebsd_nocgo.go left in tree.
CONFLICT (modify/delete): internal/oci/container_freebsd.go deleted in HEAD and modified in Add exponential backoff to the container stop loop. Version Add exponential backoff to the container stop loop of internal/oci/container_freebsd.go left in tree.
Auto-merging internal/oci/container.go
Auto-merging contrib/test/ci/vars.yml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Add exponential backoff to the container stop loop
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.29
/cherry-pick release-1.28
/cherry-pick release-1.27
/cherry-pick release-1.26

@openshift-cherrypick-robot

@kwilczynski: #7854 failed to apply on top of branch "release-1.27":

Applying: Add exponential backoff to the container stop loop
Using index info to reconstruct a base tree...
M	contrib/test/ci/vars.yml
M	internal/oci/container.go
A	internal/oci/container_freebsd.go
A	internal/oci/container_freebsd_nocgo.go
A	internal/oci/container_linux.go
M	internal/oci/container_test.go
M	internal/oci/runtime_oci.go
M	internal/oci/runtime_oci_test.go
M	test/ctr.bats
Falling back to patching base and 3-way merge...
Auto-merging test/ctr.bats
Auto-merging internal/oci/runtime_oci_test.go
Auto-merging internal/oci/runtime_oci.go
Auto-merging internal/oci/container_test.go
CONFLICT (modify/delete): internal/oci/container_linux.go deleted in HEAD and modified in Add exponential backoff to the container stop loop. Version Add exponential backoff to the container stop loop of internal/oci/container_linux.go left in tree.
CONFLICT (modify/delete): internal/oci/container_freebsd_nocgo.go deleted in HEAD and modified in Add exponential backoff to the container stop loop. Version Add exponential backoff to the container stop loop of internal/oci/container_freebsd_nocgo.go left in tree.
CONFLICT (modify/delete): internal/oci/container_freebsd.go deleted in HEAD and modified in Add exponential backoff to the container stop loop. Version Add exponential backoff to the container stop loop of internal/oci/container_freebsd.go left in tree.
Auto-merging internal/oci/container.go
Auto-merging contrib/test/ci/vars.yml
CONFLICT (content): Merge conflict in contrib/test/ci/vars.yml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Add exponential backoff to the container stop loop
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

@openshift-cherrypick-robot

@kwilczynski: #7854 failed to apply on top of branch "release-1.26":

Applying: Add exponential backoff to the container stop loop
Using index info to reconstruct a base tree...
M	contrib/test/ci/vars.yml
M	internal/oci/container.go
A	internal/oci/container_freebsd.go
A	internal/oci/container_freebsd_nocgo.go
A	internal/oci/container_linux.go
M	internal/oci/container_test.go
M	internal/oci/runtime_oci.go
M	internal/oci/runtime_oci_test.go
M	test/ctr.bats
Falling back to patching base and 3-way merge...
Auto-merging test/ctr.bats
Auto-merging internal/oci/runtime_oci_test.go
Auto-merging internal/oci/runtime_oci.go
CONFLICT (content): Merge conflict in internal/oci/runtime_oci.go
Auto-merging internal/oci/container_test.go
CONFLICT (modify/delete): internal/oci/container_linux.go deleted in HEAD and modified in Add exponential backoff to the container stop loop. Version Add exponential backoff to the container stop loop of internal/oci/container_linux.go left in tree.
CONFLICT (modify/delete): internal/oci/container_freebsd_nocgo.go deleted in HEAD and modified in Add exponential backoff to the container stop loop. Version Add exponential backoff to the container stop loop of internal/oci/container_freebsd_nocgo.go left in tree.
CONFLICT (modify/delete): internal/oci/container_freebsd.go deleted in HEAD and modified in Add exponential backoff to the container stop loop. Version Add exponential backoff to the container stop loop of internal/oci/container_freebsd.go left in tree.
Auto-merging internal/oci/container.go
Auto-merging contrib/test/ci/vars.yml
CONFLICT (content): Merge conflict in contrib/test/ci/vars.yml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Add exponential backoff to the container stop loop
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

@openshift-cherrypick-robot

@kwilczynski: #7854 failed to apply on top of branch "release-1.29":

Applying: Add exponential backoff to the container stop loop
Using index info to reconstruct a base tree...
M	contrib/test/ci/vars.yml
M	internal/oci/container.go
A	internal/oci/container_freebsd.go
A	internal/oci/container_freebsd_nocgo.go
M	internal/oci/container_linux.go
M	internal/oci/container_test.go
M	internal/oci/runtime_oci.go
M	internal/oci/runtime_oci_test.go
M	test/ctr.bats
Falling back to patching base and 3-way merge...
Auto-merging test/ctr.bats
Auto-merging internal/oci/runtime_oci_test.go
Auto-merging internal/oci/runtime_oci.go
Auto-merging internal/oci/container_test.go
CONFLICT (modify/delete): internal/oci/container_freebsd_nocgo.go deleted in HEAD and modified in Add exponential backoff to the container stop loop. Version Add exponential backoff to the container stop loop of internal/oci/container_freebsd_nocgo.go left in tree.
CONFLICT (modify/delete): internal/oci/container_freebsd.go deleted in HEAD and modified in Add exponential backoff to the container stop loop. Version Add exponential backoff to the container stop loop of internal/oci/container_freebsd.go left in tree.
Auto-merging contrib/test/ci/vars.yml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Add exponential backoff to the container stop loop
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

@kwilczynski kwilczynski deleted the feature/add-stop-loop-pacing branch March 20, 2024 16:55
@kwilczynski
Copy link
Contributor Author

OK. Requires manual backport.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes.
