Conversation

@rphillips
Contributor

@rphillips rphillips commented Feb 5, 2024

This fixes the busy loop on a failed kill (D-state process) and reschedules the timer to sometime in the future. Since targetTime has expired, the loop re-enters the kill case statement without waiting. The pod is terminal at this point, so a 10ms pause seems ok.

Does this PR introduce a user-facing change?

Fix busy loop on killing a D-state container

@rphillips rphillips requested a review from mrunalp as a code owner February 5, 2024 23:00
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Feb 5, 2024
@openshift-ci-robot

@rphillips: This pull request references Jira Issue OCPBUGS-28981, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

This fixes the busy loop on a failed kill (D-state process) and reschedules the timer to sometime in the future. Since targetTime has expired, the loop re-enters the kill case statement without waiting. The pod is terminal at this point, so a 10ms pause seems ok.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Feb 5, 2024
@openshift-ci openshift-ci bot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Feb 5, 2024
@openshift-ci
Contributor

openshift-ci bot commented Feb 5, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rphillips
Once this PR has been reviewed and has the lgtm label, please assign giuseppe for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Feb 5, 2024
@rphillips
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Feb 5, 2024
@openshift-ci-robot

@rphillips: This pull request references Jira Issue OCPBUGS-28981, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@codecov

codecov bot commented Feb 5, 2024

Codecov Report

Merging #7736 (eeb0ad8) into main (8510499) will increase coverage by 0.00%.
The diff coverage is 88.46%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7736   +/-   ##
=======================================
  Coverage   47.95%   47.95%           
=======================================
  Files         146      146           
  Lines       16274    16283    +9     
=======================================
+ Hits         7804     7809    +5     
- Misses       7517     7521    +4     
  Partials      953      953           

@rphillips rphillips force-pushed the fix_busy_loop_on_kill branch 5 times, most recently from 4105b1d to d0c93cc Compare February 6, 2024 14:50
@haircommander
Member

haircommander commented Feb 6, 2024

@rphillips
Contributor Author

rphillips commented Feb 6, 2024

@haircommander If I move it to container_linux, then the FreeBSD portion is not implemented.

Updated to mark it as not implemented on BSD.

@rphillips rphillips force-pushed the fix_busy_loop_on_kill branch from d0c93cc to f2f72bb Compare February 6, 2024 15:47
@haircommander
Member

Sorry, I may not have been clear. Queries to /proc are not trivially cheap, and I think we should consolidate reading from /proc/pid/status if possible. We're already reading the start time whenever we check Living(); we could extend the getPidStartTime function to be getPidStartAndState and read both in the same function. We could then treat processes in D state as not running.
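A rough sketch of what such a consolidated helper could look like (the name getPidStartAndState, the /proc/<pid>/stat parsing, and the standalone packaging are illustrative assumptions, not the actual CRI-O code):

package main

import (
	"bytes"
	"fmt"
	"os"
	"strings"
)

// getPidStartAndState reads /proc/<pid>/stat once and returns both the
// start time (field 22) and the process state (field 3), so a single read
// could serve both the Living() check and the D-state detection.
func getPidStartAndState(pid int) (startTime, state string, err error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return "", "", err
	}
	// The comm field can contain spaces and parentheses, so split on the
	// last ')' and parse the remaining space-separated fields.
	idx := bytes.LastIndexByte(data, ')')
	if idx < 0 {
		return "", "", fmt.Errorf("malformed /proc/%d/stat", pid)
	}
	fields := strings.Fields(string(data[idx+1:]))
	// fields[0] is the state (overall field 3); the start time is overall
	// field 22, i.e. fields[19] after the comm field.
	if len(fields) < 20 {
		return "", "", fmt.Errorf("unexpected /proc/%d/stat layout", pid)
	}
	return fields[19], fields[0], nil
}

func main() {
	start, state, err := getPidStartAndState(os.Getpid())
	fmt.Println(start, state, err)
}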

@rphillips rphillips force-pushed the fix_busy_loop_on_kill branch from f2f72bb to 4cdee8c Compare February 6, 2024 17:16
@rphillips rphillips force-pushed the fix_busy_loop_on_kill branch from 4cdee8c to 3096b3d Compare February 6, 2024 17:27
@rphillips rphillips force-pushed the fix_busy_loop_on_kill branch 3 times, most recently from e2ea135 to 5dbbc69 Compare February 6, 2024 17:48
@rphillips rphillips force-pushed the fix_busy_loop_on_kill branch 3 times, most recently from dbd2f48 to eeb0ad8 Compare February 6, 2024 19:15
}

func getPidStatData(file string) (*StatData, error) {
	data, err := os.ReadFile(file)
Contributor


Some ideas for small optimisations that we can do here if we want to.

  1. We can use something that won't call fstat() on the file, which does not work for procfs/sysfs anyway
  2. We can request only a page of data (typically 4 KiB on x86_64) to be read at once

Point (2) is because procfs/sysfs (contrary to debugfs, etc.) only serves a page's worth of data per read.

This can reduce the number of allocations a little bit per:

$ go test -v -bench=. -benchmem 
goos: linux
goarch: amd64
pkg: test-procfs
cpu: AMD Ryzen 9 6900HX with Radeon Graphics        
BenchmarkA
BenchmarkA-16    	  256002	      4549 ns/op	    2160 B/op	       8 allocs/op
BenchmarkB
BenchmarkB-16    	  284290	      4327 ns/op	    1976 B/op	       8 allocs/op
PASS
ok  	test-procfs	3.074s

The fstat() call made by os.ReadFile() will only return a size of 0 (or 4 KiB, as it would for most procfs/sysfs objects), per:

fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0

There is also the potential to use fmt.Fscan() over a bytes buffer in lieu of strings.Split() to further reduce memory allocations.
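A minimal sketch of idea (2), assuming a single page is always enough for the procfs entries being read here (the helper name readProcPage is made up for illustration):

package main

import (
	"fmt"
	"io"
	"os"
)

// readProcPage reads at most one page from a procfs/sysfs file into a
// pre-sized buffer, avoiding the size probing that os.ReadFile performs
// (the size reported by fstat() is 0 for these files anyway).
func readProcPage(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	buf := make([]byte, os.Getpagesize()) // procfs serves at most a page per read
	n, err := f.Read(buf)
	if err != nil && err != io.EOF {
		return nil, err
	}
	return buf[:n], nil
}

func main() {
	data, err := readProcPage("/proc/self/stat")
	fmt.Println(len(data), err)
}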

@rphillips rphillips force-pushed the fix_busy_loop_on_kill branch from eeb0ad8 to 5c111c9 Compare February 7, 2024 01:01
@openshift-ci
Contributor

openshift-ci bot commented Feb 7, 2024

@rphillips: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-ovn 5c111c9 link true /test e2e-gcp-ovn
ci/prow/ci-cgroupv2-integration 5c111c9 link true /test ci-cgroupv2-integration
ci/prow/ci-crun-e2e 5c111c9 link true /test ci-crun-e2e
ci/prow/ci-fedora-integration 5c111c9 link true /test ci-fedora-integration
ci/prow/ci-e2e 5c111c9 link true /test ci-e2e
ci/prow/ci-e2e-evented-pleg 5c111c9 link true /test ci-e2e-evented-pleg

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

log.Warnf(ctx, "Stopping container %v with stop signal timed out. Killing", c.ID())
if _, err := r.runtimeCmd("kill", c.ID(), "KILL"); err != nil {
	log.Errorf(ctx, "Killing container %v failed: %v", c.ID(), err)
	targetTime = time.Now().Add(10 * time.Millisecond)
Contributor

@andreaskaris andreaskaris Feb 7, 2024


Doesn't this still mean that, in the case of a container that can't be killed with -9 due to uninterruptible sleep, we'd be logging every 10 ms? Or is that caught by the change to verify the PID? If I understand things correctly, processes in the D state can be killed with -9 if the kernel has them as TASK_KILLABLE, unless they are in TASK_UNINTERRUPTIBLE without TASK_KILLABLE, in which case they can't be killed with -9.

Contributor


@andreaskaris, assuming that we've already attempted to send the graceful termination signal (SIGTERM) and then sent the SIGKILL signal at least once, the aim of the new changes is to add this 10 ms delay/stagger and also to bail out for any process that is in a state such as "D", "X" or "Z". Otherwise, yes, it would also be a lot of log entries (10 ms is not a lot, relatively).

As for the "D" state, to us it does not matter whether it's killable or uninterruptible; at this point, we are already on the path to termination. There should be at least one pending signal (since we should have attempted to send SIGKILL at least once) awaiting such a process to wake up (when it will be delivered). There is also no easy way to see whether the "D" status has the TASK_KILLABLE or TASK_UNINTERRUPTIBLE flag set (userspace seems to have limited visibility via /proc in that specific case).

// as the start time we saved after creating the container.
// This is the simplest way to verify we are operating on the container
// process, and haven't run into PID wrap.
func (c *Container) verifyPid() error {
Contributor


There is an issue now with this function, since it understands the process state: we might be bailing out on things too early/eagerly, for example:

// The initial container process either doesn't exist, or isn't ours.
if err := c.Living(); err != nil {
	c.state.Finished = time.Now()
	return nil
}

We wouldn't attempt to deliver SIGTERM or SIGKILL should the process be in the undesired state. However, this state might be brief, or the process might still be killable.

What I am saying is that we need to make sure the old flow is preserved/attempted at least once, I suppose.

Member


Oh shoot, you're right. Yeah, we need to have two different functions: one to check for Living pre-SIGKILL, one for LivingOrSleeping after. If we filter the "D" state out for all processes we check for living, the process can flip back to R and we would totally skip the kill.
/hold

Contributor


We can have two functions or pass a simple boolean to Living() and then to verifyPid() to toggle the process status check so it can be skipped on demand, for example:

func (c *Container) verifyPid(checkState bool) error {
	data, err := getPidData(c.state.InitPid)
	if err != nil {
		return err
	}

	if data.StartTime != c.state.InitStartTime {
		return fmt.Errorf(
			"PID %d is running but has start time of %s, whereas the saved start time is %s. PID wrap may have occurred",
			c.state.InitPid, data.StartTime, c.state.InitStartTime,
		)
	}

	// Don't check the process (PID) state.
	if !checkState {
		return nil
	}

	// Skip states which are typically considered as invalid.
	switch data.State {
	case "D", "X", "Z":
		return fmt.Errorf("PID %d has an invalid state: %s", c.state.InitPid, data.State)
	}

	return nil
}

Contributor


Alternatively, verifyPid() could be split into verifyPIDWrap() and verifyPIDState(), and used accordingly.

Contributor


I suppose we always want to check for a PID wrap condition, which might render the idea of having two separate helpers moot.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 7, 2024
@kwilczynski
Contributor

/assign @rphillips

@haircommander
Member

I'm wondering if we should just add an exponential backoff and call it a day. We can't easily communicate to the kubelet that the process is in the "D" state without exiting StopLoopForContainer, which would require it to be reentrant, or risk having other transient failures (runc kill decides to take a while?) prevent us from killing the container. The safest and simplest option is to just try less frequently and wait for the kernel to catch up. What do folks think?
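For illustration, a minimal standalone sketch of that idea follows; the function names, the 10 ms starting interval, and the cap are assumptions rather than a concrete proposal for the stop loop:

package main

import (
	"fmt"
	"time"
)

// killWithBackoff keeps invoking kill() until alive() reports the process is
// gone, doubling the pause between attempts up to maxBackoff so a process
// stuck in the D state does not turn the stop loop into a hot loop.
func killWithBackoff(kill func() error, alive func() bool, maxBackoff time.Duration) {
	backoff := 10 * time.Millisecond
	for alive() {
		_ = kill() // errors are expected while the process is blocked; keep retrying
		time.Sleep(backoff)
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	deadline := time.Now().Add(200 * time.Millisecond)
	killWithBackoff(
		func() error { return nil },                        // stand-in for the runtime "kill" command
		func() bool { return time.Now().Before(deadline) }, // stand-in for Living()
		100*time.Millisecond,
	)
	fmt.Println("process gone; the stop loop can exit")
}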

@kwilczynski
Contributor

kwilczynski commented Feb 20, 2024

[...]
The safest and simplest is to just try less frequently and wait for the kernel to catch up. What do folks think?

Personally, I believe a simple solution would serve us best here. Otherwise, there are a lot of edge cases we might have to address to make the stopping code robust.


To set the scene a little:

  • Signals are reentrant, non-blocking and asynchronous
  • A signal sent to a process that is in uninterruptible sleep without TASK_KILLABLE set will be left pending
  • There is no way for us to correctly ascertain from userspace which type of "D" process state we are dealing with
    • i.e., whether TASK_UNINTERRUPTIBLE has been set without the TASK_KILLABLE flag or not
  • A process can be blocked directly, or become unresponsive should one of its children currently be in a blocking state

While testing the proposed changes I was able to identify a few potential issues:

(A small Linux kernel module used to aid testing can be found here).

  • Testing a single process within a container that was in the blocked state revealed a potential race or ordering issue:
panic: send on closed channel

goroutine 1810 [running]:
github.com/cri-o/cri-o/internal/oci.(*Container).WaitOnStopTimeout(0xc00089c000, {0x2e3beb0, 0xc0005d0150}, 0xa)
	github.com/cri-o/cri-o/internal/oci/container.go:604 +0xc5
github.com/cri-o/cri-o/internal/oci.(*runtimeOCI).StopContainer(0xc000153620, {0x2e3beb0?, 0xc0005d0120?}, 0xc00089c000, 0xa)
	github.com/cri-o/cri-o/internal/oci/runtime_oci.go:816 +0x352
github.com/cri-o/cri-o/internal/oci.(*Runtime).StopContainer(0xc0003cfb90, {0x2e3beb0?, 0xc0005d0090?}, 0xc00089c000, 0xa)
	github.com/cri-o/cri-o/internal/oci/oci.go:309 +0x102
github.com/cri-o/cri-o/server.(*Server).stopContainer(0xc000452808, {0x2e3beb0?, 0xc001069bf0?}, 0xc00089c000, 0xa)
	github.com/cri-o/cri-o/server/container_stop.go:60 +0x40b
github.com/cri-o/cri-o/server.(*Server).stopPodSandbox.func1()
	github.com/cri-o/cri-o/server/sandbox_stop_linux.go:63 +0x38
golang.org/x/sync/errgroup.(*Group).Go.func1()
	golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 1785
	golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x96

Related logs:

time="2024-02-20 02:36:26.182569470Z" level=warning msg="Stopping container 314423af59714f38f65ea8a23072248bccf3bacd43147fdc1db64527d515b791 with stop signal timed out. Killing" file="oci/runtime_oci.go:875"
time="2024-02-20 02:36:26.189833904Z" level=debug msg="Unmounted container 314423af59714f38f65ea8a23072248bccf3bacd43147fdc1db64527d515b791" file="storage/runtime.go:495" id=e026368c-a07e-478e-ac8e-bb9218b5fec3 name=/runtime.v1.RuntimeService/StopContainer
time="2024-02-20 02:36:26.195471499Z" level=info msg="Stopped container 314423af59714f38f65ea8a23072248bccf3bacd43147fdc1db64527d515b791: default/test/test" file="server/container_stop.go:29" id=e026368c-a07e-478e-ac8e-bb9218b5fec3 name=/runtime.v1.RuntimeService/StopContainer
time="2024-02-20 02:36:26.196017909Z" level=debug msg="Request: &StopPodSandboxRequest{PodSandboxId:07327609189e09d497a5647c3befaa3e6f3d64119591ba82d6821919039ec726,}" file="otel-collector/interceptors.go:62" id=08b90184-ad3d-4172-9f2e-9058d0d1b9ca name=/runtime.v1.RuntimeService/StopPodSandbox
time="2024-02-20 02:36:26.196048272Z" level=info msg="Stopping pod sandbox: 07327609189e09d497a5647c3befaa3e6f3d64119591ba82d6821919039ec726" file="server/sandbox_stop.go:18" id=08b90184-ad3d-4172-9f2e-9058d0d1b9ca name=/runtime.v1.RuntimeService/StopPodSandbox

The panic occurs because the SetAsDoneStopping() function closes the stopTimeoutChan channel while there might still be a pending new timeout to be sent through it from the StopContainer() function.

  • The 10-millisecond delay is not set (renewed) as a new timeout

The following block is never run, as sending a signal to a blocked process (albeit a process that exists) does not elicit any error from the kernel (syscall).

if _, err := r.runtimeCmd("kill", c.ID(), "KILL"); err != nil {
	log.Errorf(ctx, "Killing container %v failed: %v", c.ID(), err)
	targetTime = time.Now().Add(10 * time.Millisecond)
}

Thus, when runc or crun are asked to kill the container, their internal code does not fail, and as such, no error is returned, which means that we won't be updating the targetTime variable with the new value.

  • The Living() function had to be taught to skip process state checks

A boolean needs to be passed through Living() to pid() and then to verifyPid(), denoting whether the process state check should be performed. This was to allow an early, or initial, check of process liveness that ignores a potentially momentary "D" state of a given process (the process might only have been sleeping briefly).

  • We can only check the process state of the main, or the "init" process, running within the container

If a single process runs within the container, or that process happens to be the first process, then we can easily access and check its state; should it be "D" and we were going to terminate the process anyway, then the undesired state will be correctly detected and the loop will be terminated early, as intended.

However, if the process inside has spawned one or more child processes, and one of these non-detached children is currently blocked and might remain in such a state for a longer period of time, then the parent will also be unresponsive; as such, sending termination signals to it won't have the desired effect, since the signals will simply remain pending.

This, in turn, means that we won't be able to detect the "D" state of the process, container, etc., while on the path to termination, so the loop will continue to spin, exhibiting the same behaviour as the original code currently does.

This is because we only pass the state.InitPid from the Container type each time we look at the process running within the container. Since the first or "init" process is what brings the container into existence, it does make sense to check only this specific process.

  • Can container storage be unmounted while the process is still running?

When a single process running within the container is due to be terminated and it happens to be in the blocked "D" state, we would terminate the busy loop early. Then, the following code would run:

func (r *runtimeService) StopContainer(ctx context.Context, idOrName string) error {
	ctx, span := log.StartSpan(ctx)
	defer span.End()
	if idOrName == "" {
		return ErrInvalidContainerID
	}
	container, err := r.storageImageServer.GetStore().Container(idOrName)
	if err != nil {
		if errors.Is(err, storage.ErrContainerUnknown) {
			log.Infof(ctx, "Container %s not known, assuming it got already removed", idOrName)
			return nil
		}
		log.Warnf(ctx, "Failed to get container %s: %v", idOrName, err)
		return err
	}
	if _, err := r.storageImageServer.GetStore().Unmount(container.ID, true); err != nil {
		if errors.Is(err, storage.ErrLayerUnknown) {
			log.Infof(ctx, "Layer for container %s not known", container.ID)
			return nil
		}
		log.Warnf(ctx, "Failed to unmount container %s: %v", container.ID, err)
		return err
	}
	log.Debugf(ctx, "Unmounted container %s", container.ID)
	return nil
}

The question here would be: is this safe to run for a process (container) that might still be running for a while?

@haircommander
Member

The question here would be: is this safe to run for a process (container) that might still be running for a while?

Hard to tell without knowing exactly at which stage the OCI runtime fails to kill the container, but my guess is we'd get EBUSY when attempting to unmount the storage. I think retrying with an exponential backoff until the process is gone from the process table is the only safe and sustainable solution.
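For what it's worth, "gone from the process table" can be checked cheaply with signal 0; a minimal sketch follows (note that a zombie still occupies a process-table entry until it is reaped, so a real check might also need to look at the process state):

package main

import (
	"fmt"
	"os"
	"syscall"
)

// pidGone reports whether the kernel no longer knows about the PID: signal 0
// performs only the existence/permission check, and ESRCH means there is no
// such process.
func pidGone(pid int) bool {
	return syscall.Kill(pid, 0) == syscall.ESRCH
}

func main() {
	fmt.Println(pidGone(os.Getpid())) // false: this process clearly still exists
}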

@rphillips rphillips closed this Mar 6, 2024
@openshift-ci-robot

@rphillips: This pull request references Jira Issue OCPBUGS-28981. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.

In response to this:

This fixes the busy loop on a failed kill (D-state process) and reschedules the timer to sometime in the future. Since targetTime has expired, the loop re-enters the kill case statement without waiting. The pod is terminal at this point, so a 10ms pause seems ok.

Does this PR introduce a user-facing change?

Fix busy loop on killing a D-state container

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
