OCPBUGS-28981: fix busyloop when killing container #7736
Conversation
@rphillips: This pull request references Jira Issue OCPBUGS-28981, which is invalid.
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: rphillips. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing

/jira refresh
@rphillips: This pull request references Jira Issue OCPBUGS-28981, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Codecov Report
Additional details and impacted files:

    @@           Coverage Diff           @@
    ##             main    #7736   +/-   ##
    =======================================
      Coverage   47.95%   47.95%
    =======================================
      Files         146      146
      Lines       16274    16283       +9
    =======================================
    + Hits         7804     7809       +5
    - Misses       7517     7521       +4
      Partials      953      953
Force-pushed from 4105b1d to d0c93cc (compare).
I think this would be better added to the code already in https://github.com/cri-o/cri-o/blob/main/internal/oci/container.go#L530-L540 or, more precisely, https://github.com/cri-o/cri-o/blob/main/internal/oci/container_linux.go#L53

@haircommander If I move it to container_linux, then the FreeBSD portion is not implemented. Updated so it is not implemented on BSD.
Force-pushed from d0c93cc to f2f72bb (compare).
Sorry, I may not have been clear. Queries to /proc are not trivially cheap, and I think we should consolidate reading from /proc/pid/status if possible. We're already reading the start time whenever we check Living(); we could extend the getPidStartTime function to be getPidStartAndState and read it in the same function. We could then treat processes in D state as not running.
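For illustration, a consolidated read along those lines could look roughly like this. The parsing details, the single return of two strings, and the use of /proc/<pid>/stat (fields per proc(5)) are assumptions made for this sketch, not the actual cri-o helpers:

```go
// Sketch: fetch a process's start time and state in a single pass over
// /proc/<pid>/stat, so callers such as Living() only hit procfs once.
package main

import (
	"fmt"
	"os"
	"strings"
)

func getPidStartAndState(pid int) (startTime, state string, err error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return "", "", err
	}
	// The comm field may contain spaces and parentheses, so split after the
	// last ')' and index the remaining fields per proc(5).
	end := strings.LastIndexByte(string(data), ')')
	if end < 0 {
		return "", "", fmt.Errorf("malformed stat data for pid %d", pid)
	}
	fields := strings.Fields(string(data[end+1:]))
	// After the comm field: fields[0] is the state (field 3) and fields[19]
	// is the process start time (field 22).
	if len(fields) < 20 {
		return "", "", fmt.Errorf("unexpected stat format for pid %d", pid)
	}
	return fields[19], fields[0], nil
}

func main() {
	start, state, err := getPidStartAndState(os.Getpid())
	if err != nil {
		panic(err)
	}
	fmt.Printf("start=%s state=%s\n", start, state)
}
```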
Force-pushed from f2f72bb to 4cdee8c (compare).
Force-pushed from 4cdee8c to 3096b3d (compare).
Force-pushed from e2ea135 to 5dbbc69 (compare).
Force-pushed from dbd2f48 to eeb0ad8 (compare).
Diff excerpt under review:

    }

    func getPidStatData(file string) (*StatData, error) {
        data, err := os.ReadFile(file)
Some ideas for small optimisations that we can do here if we want to:

1. We can use something that won't call fstat() on the file, which does not work for procfs/sysfs anyway.
2. We can only request a page of data (typically 4 KiB on x86_64) to be read at once.

The (2) is because procfs/sysfs (contrary to debugfs, etc.) only serves a page worth of data. This can reduce the number of allocations a little bit, per:

    $ go test -v -bench=. -benchmem
    goos: linux
    goarch: amd64
    pkg: test-procfs
    cpu: AMD Ryzen 9 6900HX with Radeon Graphics
    BenchmarkA
    BenchmarkA-16    256002    4549 ns/op    2160 B/op    8 allocs/op
    BenchmarkB
    BenchmarkB-16    284290    4327 ns/op    1976 B/op    8 allocs/op
    PASS
    ok      test-procfs    3.074s

The fstat() call made by os.ReadFile() will only report a size of 0 (or 4 KiB, as it would for most procfs/sysfs objects), per:

    fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0

There is also a potential to leverage fmt.Fscan() over a bytes buffer in lieu of strings.Split() to potentially further reduce memory allocation.
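To make (1) and (2) concrete, a minimal sketch of such a read might bypass os.ReadFile() entirely and pull at most one page through a plain open/read. The function name and the fixed 4 KiB buffer size are assumptions for this sketch:

```go
// Sketch: read at most one page from a procfs/sysfs file without depending on
// the size reported by fstat(2), which is typically 0 for such files.
package main

import (
	"fmt"
	"io"
	"os"
)

func readProcPage(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	// One page on x86_64; procfs/sysfs only serve a page worth of data anyway.
	buf := make([]byte, 4096)
	n, err := f.Read(buf)
	if err != nil && err != io.EOF {
		return nil, err
	}
	return buf[:n], nil
}

func main() {
	data, err := readProcPage("/proc/self/stat")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s", data)
}
```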
Signed-off-by: Ryan Phillips <[email protected]>
Force-pushed from eeb0ad8 to 5c111c9 (compare).
@rphillips: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Diff excerpt under review:

    log.Warnf(ctx, "Stopping container %v with stop signal timed out. Killing", c.ID())
    if _, err := r.runtimeCmd("kill", c.ID(), "KILL"); err != nil {
        log.Errorf(ctx, "Killing container %v failed: %v", c.ID(), err)
        targetTime = time.Now().Add(10 * time.Millisecond)
Doesn't this still mean that, in the case of a container that can't be killed with -9 due to uninterruptible sleep, we'd be logging every 10 ms? Or is that caught by the change to verify the PID? If I understand things correctly, processes in D state can be killed with -9 if the kernel has them as TASK_KILLABLE. Unless they are in TASK_UNINTERRUPTIBLE without TASK_KILLABLE, in which case they can't be killed with -9.
@andreaskaris, assuming that we've already attempted to send the graceful termination signal (SIGTERM) and then sent the SIGKILL signal at least once, the aim with the new changes is to add this 10 ms delay/stagger and also bail out for any process which is in a state such as "D", "X" or "Z". Otherwise, yes, it would also be a lot of log entries (10 ms is not a lot, relatively).
As for the "D" state, to us it does not matter whether it's killable or uninterruptible; at this point, we are already on the path to termination. There should be at least one pending signal (since we should have attempted to send SIGKILL at least once) waiting for such a process to wake up, at which point it will be delivered. There is also no easy way to see whether the "D" status has the TASK_KILLABLE or TASK_UNINTERRUPTIBLE flag set (userspace seems to have limited visibility via /proc in that specific case).
Diff excerpt under review:

    // as the start time we saved after creating the container.
    // This is the simplest way to verify we are operating on the container
    // process, and haven't run into PID wrap.
    func (c *Container) verifyPid() error {
There is a thing now with this function since it understands the process state. We might be bailing out on things too early/eagerly, for example:

cri-o/internal/oci/runtime_oci.go, lines 805 to 809 in 5c111c9:

    // The initial container process either doesn't exist, or isn't ours.
    if err := c.Living(); err != nil {
        c.state.Finished = time.Now()
        return nil
    }
We wouldn't attempt to deliver SIGTERM or SIGKILL should the process be in the undesired state. However, this state might be brief, or the process might still be killable.
What I am saying is that we need to make sure the old flow is preserved/attempted at least once, I suppose.
oh shoot you're right. yeah we need to have two different functions. One to check for Living pre-SIGKILL, one for LivingOrSleeping for after. If we filter "D" state out for all processes we check for living, the process can flip back to R and we would totally skip the kill
/hold
We can have two functions or pass a simple boolean to Living() and then to verifyPid() to toggle the process status check so it can be skipped on demand, for example:
    func (c *Container) verifyPid(checkState bool) error {
        data, err := getPidData(c.state.InitPid)
        if err != nil {
            return err
        }
        if data.StartTime != c.state.InitStartTime {
            return fmt.Errorf(
                "PID %d is running but has start time of %s, whereas the saved start time is %s. PID wrap may have occurred",
                c.state.InitPid, data.StartTime, c.state.InitStartTime,
            )
        }
        // Don't check the process (PID) state.
        if !checkState {
            return nil
        }
        // Skip states which are typically considered as invalid.
        switch data.State {
        case "D", "X", "Z":
            return fmt.Errorf("PID %d has an invalid state: %s", c.state.InitPid, data.State)
        }
        return nil
    }
Alternatively, verifyPid() would be split into verifyPIDWrap() and verifyPIDState(), and used accordingly.
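A minimal sketch of that alternative, reusing the hypothetical getPidData() helper and fields from the example above:

```go
// Sketch: split the checks so callers can ask only for what they need.
// verifyPIDWrap() keeps the original "is this still our process" semantics,
// while verifyPIDState() bails out on states such as "D", "X" or "Z".
func (c *Container) verifyPIDWrap() error {
	data, err := getPidData(c.state.InitPid)
	if err != nil {
		return err
	}
	if data.StartTime != c.state.InitStartTime {
		return fmt.Errorf(
			"PID %d is running but has start time of %s, whereas the saved start time is %s. PID wrap may have occurred",
			c.state.InitPid, data.StartTime, c.state.InitStartTime,
		)
	}
	return nil
}

func (c *Container) verifyPIDState() error {
	data, err := getPidData(c.state.InitPid)
	if err != nil {
		return err
	}
	// Skip states which are typically considered as invalid.
	switch data.State {
	case "D", "X", "Z":
		return fmt.Errorf("PID %d has an invalid state: %s", c.state.InitPid, data.State)
	}
	return nil
}
```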
I suppose we always want to check for a PID wrap condition, which might render the idea of having two separate helpers moot.
/assign @rphillips
I'm wondering if we should just add an exponential backoff and call it a day. We can't easily communicate to the kubelet that the process is in "D" state without exiting StopLoopForContainer, which would require it to be reentrant, or potentially cause other transient failures (runc kill decides to take a while?) from killing the container. The safest and simplest approach is to just try less frequently and wait for the kernel to catch up. What do folks think?
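As a rough, illustrative fragment of that idea: it reuses the loop's existing r.runtimeCmd(), c.Living() and log helpers, and the 10 ms/10 s bounds are arbitrary placeholders, not something proposed by this PR:

```go
// Sketch: retry SIGKILL with an exponentially growing pause instead of a
// fixed 10 ms reschedule, until the process is gone from the process table.
backoff := 10 * time.Millisecond
const maxBackoff = 10 * time.Second

for {
	if err := c.Living(); err != nil {
		// The process is gone (or is no longer ours); nothing left to kill.
		break
	}
	if _, err := r.runtimeCmd("kill", c.ID(), "KILL"); err != nil {
		log.Errorf(ctx, "Killing container %v failed: %v", c.ID(), err)
	}
	time.Sleep(backoff)
	backoff *= 2
	if backoff > maxBackoff {
		backoff = maxBackoff
	}
}
```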
Personally, I believe a simple solution might serve us best here. Otherwise, there are a lot of edge cases we might potentially have to address to make the stopping code robust. To set the scene a little:
While testing the proposed changes, I was able to identify a few potential issues (a small Linux kernel module used to aid testing can be found here):
    panic: send on closed channel

    goroutine 1810 [running]:
    github.com/cri-o/cri-o/internal/oci.(*Container).WaitOnStopTimeout(0xc00089c000, {0x2e3beb0, 0xc0005d0150}, 0xa)
            github.com/cri-o/cri-o/internal/oci/container.go:604 +0xc5
    github.com/cri-o/cri-o/internal/oci.(*runtimeOCI).StopContainer(0xc000153620, {0x2e3beb0?, 0xc0005d0120?}, 0xc00089c000, 0xa)
            github.com/cri-o/cri-o/internal/oci/runtime_oci.go:816 +0x352
    github.com/cri-o/cri-o/internal/oci.(*Runtime).StopContainer(0xc0003cfb90, {0x2e3beb0?, 0xc0005d0090?}, 0xc00089c000, 0xa)
            github.com/cri-o/cri-o/internal/oci/oci.go:309 +0x102
    github.com/cri-o/cri-o/server.(*Server).stopContainer(0xc000452808, {0x2e3beb0?, 0xc001069bf0?}, 0xc00089c000, 0xa)
            github.com/cri-o/cri-o/server/container_stop.go:60 +0x40b
    github.com/cri-o/cri-o/server.(*Server).stopPodSandbox.func1()
            github.com/cri-o/cri-o/server/sandbox_stop_linux.go:63 +0x38
    golang.org/x/sync/errgroup.(*Group).Go.func1()
            golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x56
    created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 1785
            golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x96

Related logs:

    time="2024-02-20 02:36:26.182569470Z" level=warning msg="Stopping container 314423af59714f38f65ea8a23072248bccf3bacd43147fdc1db64527d515b791 with stop signal timed out. Killing" file="oci/runtime_oci.go:875"
    time="2024-02-20 02:36:26.189833904Z" level=debug msg="Unmounted container 314423af59714f38f65ea8a23072248bccf3bacd43147fdc1db64527d515b791" file="storage/runtime.go:495" id=e026368c-a07e-478e-ac8e-bb9218b5fec3 name=/runtime.v1.RuntimeService/StopContainer
    time="2024-02-20 02:36:26.195471499Z" level=info msg="Stopped container 314423af59714f38f65ea8a23072248bccf3bacd43147fdc1db64527d515b791: default/test/test" file="server/container_stop.go:29" id=e026368c-a07e-478e-ac8e-bb9218b5fec3 name=/runtime.v1.RuntimeService/StopContainer
    time="2024-02-20 02:36:26.196017909Z" level=debug msg="Request: &StopPodSandboxRequest{PodSandboxId:07327609189e09d497a5647c3befaa3e6f3d64119591ba82d6821919039ec726,}" file="otel-collector/interceptors.go:62" id=08b90184-ad3d-4172-9f2e-9058d0d1b9ca name=/runtime.v1.RuntimeService/StopPodSandbox
    time="2024-02-20 02:36:26.196048272Z" level=info msg="Stopping pod sandbox: 07327609189e09d497a5647c3befaa3e6f3d64119591ba82d6821919039ec726" file="server/sandbox_stop.go:18" id=08b90184-ad3d-4172-9f2e-9058d0d1b9ca name=/runtime.v1.RuntimeService/StopPodSandbox

The panic occurs following the
The following block is never run, as sending a signal to a blocked process (albeit a process that exists) would not solicit any errors from the kernel (syscall):

    if _, err := r.runtimeCmd("kill", c.ID(), "KILL"); err != nil {
        log.Errorf(ctx, "Killing container %v failed: %v", c.ID(), err)
        targetTime = time.Now().Add(10 * time.Millisecond)
    }

Thus, when
A need to pass a boolean through the
If a single process runs within the container, or it so happens that this process is the first process, then we can easily access and check its state: should it be "D" and we were going to terminate the process anyway, the undesired state will be correctly detected and the loop will be terminated early, as intended. However, if the process inside spawned one or more child processes, and one of these non-detached children is currently blocked and might remain in such a state for a longer period of time, then the parent will also be unresponsive; as such, sending termination signals to it wouldn't have the desired effect, since the signals will simply be pending. This, in turn, means that we won't be able to detect the "D" state of the process, container, etc., while on the path to termination, so the loop will continue to spin, exhibiting the same behaviour as the original code currently does. This is because we only pass the
When a single process running within the container is due to be terminated and it happens to be in the blocking "D" state, we would terminate the busy loop early. Then, the following code would run (cri-o/internal/storage/runtime.go, lines 468 to 497 in 87ff09f).
The question here would be: is this safe to run for a process (container) that might still be running for a while?
Hard to tell without knowing exactly at which stage the OCI runtime fails to kill the container, but my guess is we'd get EBUSY on attempting to unmount the storage. I think retrying with an exponential backoff until the process is gone from the process table is the only safe and sustainable solution.
@rphillips: This pull request references Jira Issue OCPBUGS-28981. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
This fixes the busy loop on a failed kill (a D-state process) by rescheduling the timer to some time in the future. Without this, since targetTime has already expired, the loop re-enters the kill case statement without waiting. The pod is terminal at this point, so a 10 ms pause seems OK.
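In other words, the shape of the fix is roughly the following; this is an illustrative condensation of the kill path, not the exact diff:

```go
// After the stop-signal timeout fires, SIGKILL is sent and targetTime is
// pushed a little into the future. Without this, time.Until(targetTime) is
// already <= 0, so the kill case is re-entered immediately in a busy loop.
log.Warnf(ctx, "Stopping container %v with stop signal timed out. Killing", c.ID())
if _, err := r.runtimeCmd("kill", c.ID(), "KILL"); err != nil {
	log.Errorf(ctx, "Killing container %v failed: %v", c.ID(), err)
}
// Reschedule the next kill attempt instead of spinning.
targetTime = time.Now().Add(10 * time.Millisecond)
```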
Does this PR introduce a user-facing change?