Conversation

@HirazawaUi (Contributor) commented May 5, 2025

This is follow-up to PR #4645, I am taking over from @jianghao65536 to continue addressing this issue.

If the runc create process is killed by a SIGKILL signal, the runc init process may leak due to issues like cgroup freezing, and it cannot be cleaned up by runc delete/stop because the container lacks a state.json file. This typically occurs when a higher-level container runtime terminates the runc create process due to context cancellation or timeout.

In PR #4645, the Creating state was added to clean up processes in the STAGE_PARENT/STAGE_CHILD stage within the cgroup. This PR no longer adds the Creating state for the following reasons:

  1. Although runc init STAGE_PARENT/STAGE_CHILD may exist simultaneously when runc create receives a SIGKILL signal, once runc create terminates, STAGE_PARENT/STAGE_CHILD will also terminate (a minimal sketch of this termination chain is shown below):

    • STAGE_PARENT: Directly relies on pipenum to communicate with runc create. When runc create terminates, pipenum is closed, causing STAGE_PARENT to fail when reading/writing to pipenum, triggering bail and termination.
    • STAGE_CHILD: Relies on syncfd to synchronize with STAGE_PARENT. When STAGE_PARENT terminates, syncfd is closed, causing STAGE_CHILD to fail when reading/writing to syncfd, triggering bail and termination.
  2. If the runc create process is terminated during execution, the container may be in one of the following states:

    • paused: If runc create receives a SIGKILL signal while setting up the cgroup, the container will be left in the paused state. At this point the runc init process becomes a zombie process and cannot be killed. However, pausedState.destroy will thaw the cgroup and terminate the runc init process.
    • stopped: If runc create receives a SIGKILL signal during the STAGE_PARENT -> STAGE_CHILD phase, the container will be left in the stopped state. As described above, STAGE_PARENT/STAGE_CHILD will terminate along with runc create, so no processes are left behind. We only need to clean up the remaining cgroup files, and stoppedState.destroy handles this cleanup.

Therefore, based on the above reasons, the existing paused and stopped states are sufficient to handle the abnormal termination of runc create due to a SIGKILL signal.
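
For illustration, here is a minimal, self-contained Go sketch of the termination chain described in point 1. It is not runc code; the file name, the fd number, and the explicit w.Close() standing in for the parent's death are made up for the example. The point is only that a child blocked on a sync pipe sees EOF once the process holding the other end goes away, and exits instead of leaking.

// pipe_bail_sketch.go -- illustrative only; this is not runc source code.
// A parent hands its child one end of a pipe (similar in spirit to runc's
// pipenum/syncfd). When the parent dies, the kernel closes the parent's end,
// the child's blocking read returns EOF, and the child "bails" instead of
// lingering.
package main

import (
        "fmt"
        "io"
        "os"
        "os/exec"
        "time"
)

func main() {
        if len(os.Args) > 1 && os.Args[1] == "child" {
                child()
                return
        }
        parent()
}

func parent() {
        r, w, err := os.Pipe()
        if err != nil {
                panic(err)
        }
        cmd := exec.Command(os.Args[0], "child")
        cmd.ExtraFiles = []*os.File{r} // the child sees the read end as fd 3
        cmd.Stdout = os.Stdout
        if err := cmd.Start(); err != nil {
                panic(err)
        }
        r.Close() // the parent keeps only the write end

        // Simulate the parent being SIGKILLed: its write end goes away.
        time.Sleep(time.Second)
        w.Close()
        _ = cmd.Wait()
}

func child() {
        pipe := os.NewFile(3, "syncpipe")
        buf := make([]byte, 1)
        if _, err := pipe.Read(buf); err == io.EOF {
                // The peer is gone: bail out rather than leak.
                fmt.Println("sync pipe closed by peer, bailing out")
                os.Exit(1)
        }
}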

@HirazawaUi force-pushed the fix-unable-delete branch 2 times, most recently from 11c5aba to 60ae641, on May 5, 2025 13:30
@HirazawaUi (Contributor, Author) commented May 5, 2025

I was unable to add integration tests for this PR without resorting to some hacky methods, but I tested whether this issue was resolved in the kubernetes-sigs/kind repository.

In brief, I discovered this issue while working in the kubernetes/kubernetes repo to propagate kubelet's context to the container runtime. The issue manifested as the test job being unable to tear down after the k/k repo's e2e tests completed, because the leaked runc init process and its corresponding systemd scope prevented systemd from shutting down.

Therefore, I opened a PR in the kubernetes-sigs/kind repo to debug this issue by manually replacing the containerd/runc binaries in the CI environment. After building the code from this PR and replacing the binaries in the CI environment, the test job no longer failed to tear down due to systemd being unable to shut down, since the processes were no longer leaked.

Ref: kubernetes-sigs/kind#3903 (Some job failures occurred due to the instability of the k/k repo e2e tests, but they are unrelated to this issue.)

I also conducted some manual tests targeting the scenarios where the leftover container is in the paused and stopped states.

Paused:

Inject a sleep so we can control where the code is interrupted:

diff --git a/vendor/github.com/opencontainers/cgroups/systemd/v1.go b/vendor/github.com/opencontainers/cgroups/systemd/v1.go
index 8453e9b4..bbe3524c 100644
--- a/vendor/github.com/opencontainers/cgroups/systemd/v1.go
+++ b/vendor/github.com/opencontainers/cgroups/systemd/v1.go
@@ -6,6 +6,7 @@ import (
        "path/filepath"
        "strings"
        "sync"
+       "time"

        systemdDbus "github.com/coreos/go-systemd/v22/dbus"
        "github.com/sirupsen/logrus"
@@ -361,6 +362,7 @@ func (m *LegacyManager) Set(r *cgroups.Resources) error {
                }
        }
        setErr := setUnitProperties(m.dbus, unitName, properties...)
+       time.Sleep(time.Second * 30)
        if needsThaw {
                if err := m.doFreeze(cgroups.Thawed); err != nil {
                        logrus.Infof("thaw container after SetUnitProperties failed: %v", err)

1. Create a container:
./runc --systemd-cgroup create mycontainer

2. Check container processes:
ps -ef | grep runc
root        2944     694  0 15:36 pts/2    00:00:00 ./runc --systemd-cgroup create mycontainer
root        2956    2944  0 15:36 ?        00:00:00 ./runc init
root        2963     688  0 15:36 pts/1    00:00:00 grep runc

3. Kill the runc create process:
kill -9 2944

4. Check if the runc init process is left behind:
ps -ef | grep runc
root        2956       1  0 15:36 ?        00:00:00 ./runc init
root        2965     688  0 15:37 pts/1    00:00:00 grep runc

5. Check the current container state:
./runc list
ID            PID         STATUS      BUNDLE              CREATED                OWNER
mycontainer   2953        paused      /root/mycontainer   0001-01-01T00:00:00Z   root

6. Delete the container:
./runc delete -f mycontainer
writing sync procError: write sync: broken pipe
EOF

7. Verify if the runc init process has been cleaned up:
ps -ef | grep runc
root        3067     688  0 15:39 pts/1    00:00:00 grep runc

Stopped:

Inject a sleep so we can control where the code is interrupted:

diff --git a/libcontainer/process_linux.go b/libcontainer/process_linux.go
index 96e3ca5f..350e3660 100644
--- a/libcontainer/process_linux.go
+++ b/libcontainer/process_linux.go
@@ -613,6 +613,7 @@ func (p *initProcess) start() (retErr error) {
                        return fmt.Errorf("unable to apply cgroup configuration: %w", err)
                }
        }
+         time.Sleep(time.Second * 30)
        if p.intelRdtManager != nil {
                if err := p.intelRdtManager.Apply(p.pid()); err != nil {
                        return fmt.Errorf("unable to apply Intel RDT configuration: %w", err)

1. Create a container:
./runc --systemd-cgroup create mycontainer

2. Check container processes:
ps -ef | grep runc
root        3124     694  0 15:45 pts/2    00:00:00 ./runc --systemd-cgroup create mycontainer
root        3132    3124  0 15:45 pts/2    00:00:00 ./runc init
root        3140     688  0 15:45 pts/1    00:00:00 grep runc

3. Kill the runc create process:
kill -9 3124

4. Check for a leftover runc init process (none remains in this case):
ps -ef | grep runc
root        3142     688  0 15:45 pts/1    00:00:00 grep runc

5. Check the current container state:
./runc list
ID            PID         STATUS      BUNDLE              CREATED                OWNER
mycontainer   0           stopped     /root/mycontainer   0001-01-01T00:00:00Z   root

6. Delete the container:
./runc delete -f mycontainer

@HirazawaUi (Contributor, Author) commented:

/cc @kolyshkin @AkihiroSuda @rata

@kolyshkin (Contributor) commented:

See also: #2575

@rata (Member) commented May 7, 2025

@HirazawaUi thanks! So my comment was spot-on, but you didn't need to remove the assignment?

For testing, I'd like to have something simple and reasonably reliable. Here are some ideas, but we don't need a test if we can't find a reasonable and simple way to test this:

  • I wonder if creating a PID namespace with a low limit can emulate it (only if it's simple; I guess it is?). We can then increase the limit and call runc delete to see that it is deleted correctly.
  • Or maybe we can use fanotify to block some operation and send a SIGKILL at that point?
  • Or, in unit tests, we could override the start() function and create a process via the API that will block there, before the state file is created?

@HirazawaUi (Contributor, Author) commented:

but you didn't need to remove the assignment?

I believe that removing this assignment and delaying it until after updateState is pointless. Regardless of whether it is removed here, the container will enter the stopped state if the creation process is interrupted before the cgroup is frozen, and stoppedState.destroy() can properly clean up the residual files in this scenario.
ref:

if !c.hasInit() {
        return c.state.transition(&stoppedState{c: c})
}

func (c *Container) hasInit() bool {
        if c.initProcess == nil {
                return false
        }
        pid := c.initProcess.pid()
        stat, err := system.Stat(pid)
        if err != nil {
                return false
        }

@HirazawaUi (Contributor, Author) commented:

  • I wonder if creating a PID namespace with a low limit can emulate it (only if it's simple; I guess it is?). We can then increase the limit and call runc delete to see that it is deleted correctly.
  • Or maybe we can use fanotify to block some operation and send a SIGKILL at that point?
  • Or, in unit tests, we could override the start() function and create a process via the API that will block there, before the state file is created?

I will try testing it in the direction of Suggestion 2 (it seems the most effective). If it cannot be implemented, I will promptly provide feedback here :)

@HirazawaUi force-pushed the fix-unable-delete branch 9 times, most recently from 39d801e to a6ebd29, on May 9, 2025 14:16
@HirazawaUi (Contributor, Author) commented:

Test case has been added.

While attempting to use fanotify to monitor open events on state.json and terminate the runc create process upon detecting one, I suddenly realized a blind spot I had never considered: why not simply run runc create and then send it a SIGKILL signal within a very short time frame?

Compared to event monitoring, this approach better matches the scenario we encountered and is completely asynchronous. The only downside seems to be its fragility, but I added numerous device rules to slow down cgroup creation and restricted the test to cgroup v1 only, which reduces the likelihood of errors (in my latest push, all tests passed with no errors).

@rata Do you think this test case sufficiently covers the scenarios for this PR?
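
For reference, here is a rough shell sketch of this kill-shortly-after-create approach. It is illustrative only and is not the actual bats test added in this PR; the bundle path, container name, and sleep duration are made up and would need tuning.

#!/usr/bin/env bash
# Rough sketch: SIGKILL `runc create` shortly after launching it, then make
# sure `runc delete -f` cleans everything up with no leftover runc init.
set -u

bundle=/root/mycontainer        # assumes a prepared bundle with config.json
name=killed-during-create

runc --systemd-cgroup create --bundle "$bundle" "$name" &
create_pid=$!

# Let runc create get as far as cgroup setup, then kill it.
sleep 0.05
kill -9 "$create_pid" 2>/dev/null || true
wait "$create_pid" 2>/dev/null || true

# The container may end up paused or stopped; delete -f must handle both.
runc delete -f "$name"

# Fail if a runc init process was leaked.
if pgrep -f "runc init" > /dev/null; then
        echo "leaked runc init process" >&2
        exit 1
fi
echo "no leftover runc init"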

@HirazawaUi (Contributor, Author) commented:

ping @kolyshkin @AkihiroSuda @rata Could you take another look at this PR? Any feedback would be greatly appreciated.

@kolyshkin (Contributor) left a review comment:

(Sorry, had some pending review comments which I forgot to submit)

Also, you need a proper name/description for the second commit. Currently it just says "add integration test", which is enough in the context of this PR but definitely not enough when looking at the git history.

@HirazawaUi force-pushed the fix-unable-delete branch from a6ebd29 to 1606d12 on May 15, 2025 04:24
@HirazawaUi (Contributor, Author) commented:

Encountered errors during rebase, investigating...

@rata (Member) left a review comment:

LGTM

@rata (Member) commented Jun 19, 2025

@HirazawaUi you need to sign off the commits before we merge. Can you do that?

@HirazawaUi (Contributor, Author) commented:

you need to sign off the commits before we merge. Can you do that?

Thanks for the reminder, signed.

@rata (Member) commented Jun 19, 2025

Oh, @lifubang requested changes, although he wrote that we should feel free to ignore them if we want to merge the test; that is why I wanted to do that. @HirazawaUi, can you remove the test, then? That way we can easily merge now.

You can open another PR with the test, if you want. Although I feel we need to explore more options to have a reliable test (and I'm not sure it's worth it? Maybe it is). Something that might work is using seccomp notify and making it hang in some syscall, but that is also fragile. Maybe we could use a rare syscall (like the access syscall), only when compiled with some build tags, and then compile and run runc like that for the tests. The test needs more thought, definitely :)

@rata requested a review from lifubang on June 19, 2025 13:38
@HirazawaUi (Contributor, Author) commented:

@HirazawaUi Can you remove the test, then? That way we can easily merge now.

Removed. While I'd prefer to keep it, given the considerable effort invested in designing and implementing this testing approach, I respect the consensus to remove it. Perhaps the journey of exploration matters more than the outcome itself.

@HirazawaUi (Contributor, Author) commented:

@lifubang PTAL

@rata (Member) commented Jun 19, 2025

Oh, yeah, I can dismiss the review but let's just wait for @lifubang to take another look...

@lifubang merged commit 94dc2be into opencontainers:main on Jun 20, 2025 (31 checks passed)
@lifubang (Member) commented:

@HirazawaUi Would you like this change to be backported to release-1.3?

@HirazawaUi (Contributor, Author) commented:

Would you like this change to be backported to release-1.3?

I'd be very happy to backport this to the still-maintained older releases; I'll do it tomorrow :)

@lifubang added the backport/1.3-todo label ("A PR in main branch which needs to be backported to release-1.3") on Jun 20, 2025
@lifubang added the backport/1.3-done label ("A PR in main branch which has been backported to release-1.3") and removed the backport/1.3-todo label on Jun 23, 2025
@rata mentioned this pull request on Sep 3, 2025