CRI-O always enforces seccomp filter for privileged sandbox containers #9675

@koct9i

What happened?

The opencontainers spec generator (runtime-tools) fills in a default seccomp filter:
https://github.com/cri-o/cri-o/blob/main/vendor/github.com/opencontainers/runtime-tools/generate/generate.go#L246

https://github.com/cri-o/cri-o/blob/main/vendor/github.com/opencontainers/runtime-tools/generate/seccomp/seccomp_default.go#L33

(Note: this default filter allows neither "close_range" nor "openat2".)

The sandbox-run path leaves this filter unchanged when "privileged_seccomp_profile" is not set:
https://github.com/cri-o/cri-o/blob/main/server/sandbox_run_linux.go#L1087

The container-create path does the same:
https://github.com/cri-o/cri-o/blob/main/server/container_create.go#L1082
So privileged containers must also be affected by this overly strict seccomp filter.
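To check this kind of claim against a generated OCI spec, something like the following could be used. It is a minimal sketch, not CRI-O or runtime-tools code: `missingSyscalls` and `seccompProfile` are hypothetical names, the struct only covers the `defaultAction` and `syscalls` fields of the OCI `linux.seccomp` object, argument conditions are ignored, and the sample profile in `main` is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// seccompProfile mirrors only the parts of an OCI "linux.seccomp" object
// that this sketch inspects; argument conditions are deliberately ignored.
type seccompProfile struct {
	DefaultAction string `json:"defaultAction"`
	Syscalls      []struct {
		Names  []string `json:"names"`
		Action string   `json:"action"`
	} `json:"syscalls"`
}

const actAllow = "SCMP_ACT_ALLOW"

// missingSyscalls (hypothetical helper) returns the entries of wanted that
// the given profile would not allow, preserving their order.
func missingSyscalls(profileJSON []byte, wanted []string) ([]string, error) {
	var p seccompProfile
	if err := json.Unmarshal(profileJSON, &p); err != nil {
		return nil, fmt.Errorf("parse seccomp profile: %w", err)
	}
	// Record the action of every syscall that has an explicit rule.
	// Simplification: if a name appears in several rules, the last one wins.
	actions := make(map[string]string)
	for _, rule := range p.Syscalls {
		for _, name := range rule.Names {
			actions[name] = rule.Action
		}
	}
	var missing []string
	for _, name := range wanted {
		action, ok := actions[name]
		if !ok {
			action = p.DefaultAction // no explicit rule: the default action applies
		}
		if action != actAllow {
			missing = append(missing, name)
		}
	}
	return missing, nil
}

func main() {
	// Made-up miniature profile, not the real runtime-tools default.
	profile := []byte(`{
		"defaultAction": "SCMP_ACT_ERRNO",
		"syscalls": [{"names": ["read", "write", "close"], "action": "SCMP_ACT_ALLOW"}]
	}`)
	missing, err := missingSyscalls(profile, []string{"close", "close_range", "openat2"})
	if err != nil {
		panic(err)
	}
	fmt.Println("not allowed:", missing)
}
```

Feeding it the `linux.seccomp` section of the sandbox's config.json instead of the inline sample would show whether close_range and openat2 made it into the filter.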

I don't see how this could ever have worked, unless seccomp is disabled at compile time.

Probably most software can simply live without the more recent syscalls.


In my case this triggered a bug in "runc" (actually NVIDIA's fork of it), which cannot start a privileged pod sandbox without close_range or openat2 allowed:
opencontainers/runc#5007
Using host namespaces is enough to make a pod sandbox "privileged"; in my case that was the host netns.
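The rule described above can be sketched as follows. This is an illustration of my understanding, not a copy of the CRI-O code: the type and function names are hypothetical, and the assumption that host PID and IPC namespaces count the same way as the host network namespace is mine. The numeric values follow the CRI NamespaceMode enum, where 2 means NODE (host), which is what "network": 2 in pod.json selects.

```go
package main

import "fmt"

// namespaceMode follows the numeric values of the CRI NamespaceMode enum.
type namespaceMode int

const (
	modePod       namespaceMode = 0
	modeContainer namespaceMode = 1
	modeNode      namespaceMode = 2 // host namespace
)

// securityContext is a hypothetical, trimmed-down stand-in for the
// "security_context" object of the sandbox config.
type securityContext struct {
	Privileged bool
	Network    namespaceMode
	PID        namespaceMode
	IPC        namespaceMode
}

// treatedAsPrivileged sketches the rule from the report: an explicit
// privileged flag or any host namespace makes the sandbox "privileged".
// Assumption: PID and IPC are handled like the network namespace.
func treatedAsPrivileged(sc securityContext) bool {
	if sc.Privileged {
		return true
	}
	return sc.Network == modeNode || sc.PID == modeNode || sc.IPC == modeNode
}

func main() {
	// Matches the reproduction: "privileged": false, "network": 2.
	repro := securityContext{Privileged: false, Network: modeNode}
	fmt.Println(treatedAsPrivileged(repro)) // prints "true"
}
```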

It seems only runc version 1.3.3 is affected.
So, I was really lucky to catch this misbehavior in cri-o.

# crictl runp pod.json
E1218 10:02:26.402123   44450 log.go:32] "RunPodSandbox from runtime service failed" err=<
	rpc error: code = Unknown desc = container create failed: time="2025-12-18T10:02:26Z" level=error msg="runc create failed: unable to start container process: error during container init: error closing exec fds: get handle to /proc/thread-self/fd: unsafe procfs detected: openat2 fsmount:fscontext:proc/thread-self/fd/: operation not permitted"
 >
FATA[0000] run pod sandbox: rpc error: code = Unknown desc = container create failed: time="2025-12-18T10:02:26Z" level=error msg="runc create failed: unable to start container process: error during container init: error closing exec fds: get handle to /proc/thread-self/fd: unsafe procfs detected: openat2 fsmount:fscontext:proc/thread-self/fd/: operation not permitted" 
# cat pod.json 
{
    "metadata": {
        "name": "test",
        "namespace": "test-ns",
        "attempt": 1,
        "uid": "test-uid"
    },
    "log_directory": "/tmp/test",
    "linux": {
        "cgroup_parent": "/test/test-pod",
        "security_context": {
            "namespace_options": {
                "network": 2
            },
            "privileged": false
        },
        "resources": {
            "memory_limit_in_bytes": 1073741824,
            "unified": {
                "memory.oom.group": "1"
            }
        }
    }
}

"runc" fails right after applying seccomp: the close_range call is rejected, it goes to a fallback, and then fails completely because openat2 is not allowed either:

fcntl(0, F_DUPFD_CLOEXEC, 0)            = 8
close_range(8, 8, CLOSE_RANGE_CLOEXEC)  = -1 EPERM (Operation not permitted)
close(8)                                = 0
fstatfs(13, {f_type=PROC_SUPER_MAGIC, f_bsize=4096, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={val=[0, 0]}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_NOSUID|ST_NODEV|ST_NOEXEC|ST_RELATIME}) = 0
fstat(13, {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
openat2(13, "thread-self/fd/", {flags=O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH, resolve=RESOLVE_NO_XDEV|RESOLVE_NO_MAGICLINKS|RESOLVE_BENEATH}, 24) = -1 EPERM (Operation not permitted)
close(3)                                = 0
write(4, "{\"type\":\"procError\",\"flags\":0,\"arg\":{\"message\":\"error closing exec fds: get handle to /proc/thread-self/fd: unsafe procfs detected: openat2 fsmount:fscontext:proc/thread-self/fd/: operation not permitted\"}}", 206) = 206

Adding an explicit allow-all "privileged_seccomp_profile" fixes the issue:

{ "defaultAction": "SCMP_ACT_ALLOW" }

Do you have integration tests for privileged containers?
Or a framework for checking the resulting OCI spec for various inputs?
Or anything else that would let me check my assumptions without reinventing the wheel?

What did you expect to happen?

"Unconfined" seccomp should not limit syscalls
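Written as a decision function, the expectation is roughly the following. This is only a sketch of the behavior I would expect, not a patch and not how CRI-O is structured: `seccompFilter` and `expectedFilter` are hypothetical names, a nil result stands for "no seccomp section in the OCI spec", and the non-privileged branch is simplified to "use whatever was generated".

```go
package main

import "fmt"

// seccompFilter is a hypothetical stand-in for the seccomp section of a
// generated OCI spec. A nil pointer means no filter at all (unconfined).
type seccompFilter struct {
	DefaultAction string
}

// expectedFilter sketches the expected outcome:
//   - non-privileged: keep the generated filter (simplified),
//   - privileged with "privileged_seccomp_profile" set: use that profile,
//   - privileged without it: no filter, i.e. unconfined.
func expectedFilter(privileged bool, privilegedProfile, generated *seccompFilter) *seccompFilter {
	switch {
	case !privileged:
		return generated
	case privilegedProfile != nil:
		return privilegedProfile
	default:
		return nil
	}
}

func main() {
	// Illustrative value only; stands for the filter filled in by spec generation.
	generated := &seccompFilter{DefaultAction: "SCMP_ACT_ERRNO"}

	if f := expectedFilter(true, nil, generated); f == nil {
		fmt.Println("privileged, no profile configured: unconfined")
	}
}
```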

How can we reproduce it (as minimally and precisely as possible)?

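Run "crictl runp pod.json" with the pod.json shown above, with no "privileged_seccomp_profile" configured and with runc 1.3.3 (the only version that seems to be affected).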

Anything else we need to know?

No response

CRI-O and Kubernetes version

1.33, 1.34, main

Version:        1.35.0
GitCommit:      d41f1315d89e81423ff429ef7317e622b57dc266
GitCommitDate:  2025-12-17T11:42:48Z
GitTreeState:   clean
BuildDate:      2025-12-18T10:00:46Z
GoVersion:      go1.25.0
Compiler:       gc
Platform:       linux/amd64
Linkmode:       dynamic
BuildTags:
  containers_image_ostree_stub
  seccomp
  selinux
LDFlags:          unknown
SeccompEnabled:   true
AppArmorEnabled:  false

OS version

# cat /etc/issue
Ubuntu 24.04.3 LTS \n \l
# uname -a
Linux computeinstance-e00kaq8cebb49n2zdj 5.15.0-126-generic #136-Ubuntu SMP Wed Nov 6 10:38:22 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Additional environment details (AWS, VirtualBox, physical, etc.)

Cloud VM


Labels

kind/bug: Categorizes issue or PR as related to a bug.
