Set LimitNOFILE=1024:524288 for crio.service #7703

@polarathene

Description

What happened?

I was recently made aware of this configuration line (contributed Oct 2016):

LimitNOFILE=1048576

Quite a bit has changed since then, notably with the systemd v240 release in 2018Q4. Both the Docker and containerd projects have recently removed the line from their configs to rely on the 1024:524288 default that systemd v240 provides (unless the system has been explicitly configured with some other value, which a system administrator may do when they know higher limits are needed).

You can find insights in those PRs, along with a third link to the Envoy project (an example of popular software that presently does not raise its soft limit or document that requirement, but has depended on this implicit config in the environment), where the linked comment details why the soft limit should be 1024 to avoid software incompatibility:


This issue is raised to suggest considering the same change here.

Either:

  • Remove the line, as Docker and containerd have done.
  • Include LimitNOFILE=1024:524288 with a contextual comment.
    • Although it may be better to remove the line, instead relying implicitly on the system default.
    • Admins / users could also use a drop-in service override unit.
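
For illustration, such a drop-in override could look like the following (hypothetical path and filename; an admin would apply it with `systemctl daemon-reload` and a service restart):

```ini
# /etc/systemd/system/crio.service.d/10-limitnofile.conf (hypothetical drop-in)
[Service]
LimitNOFILE=1024:524288
```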

What did you expect to happen?

For LimitNOFILE to have a soft limit of 1024, so that software running in a container operates with the same environment defaults as the host system.

Raising the default soft limit should be done explicitly by the admin, or implicitly by the process that needs it raising the limit for itself (see the Python reproduction below for an example of this).
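
As a minimal sketch of that second option, a process that knows it needs many FDs can raise its own soft limit at startup using only the standard library (this mirrors what the reproduction script below does):

```python
import resource

# Query the soft and hard NOFILE limits this process inherited from its environment.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# A process that genuinely needs more FDs raises its own soft limit, up to the
# hard limit, rather than relying on the service manager starting every process
# with an inflated soft limit.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

print(resource.getrlimit(resource.RLIMIT_NOFILE)[0] == hard)  # True
```

This is the approach the Envoy comment linked above argues software should take, instead of depending on the environment providing a high soft limit implicitly.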

How can we reproduce it (as minimally and precisely as possible)?

Commands

I am not familiar with cri-o, but the equivalent Docker commands demonstrate the difference (which for LimitNOFILE=1048576 can be more subtle; for example, postsrsd would take under 500ms vs 8 minutes):

# Demonstrating the impact on a python process with `LimitNOFILE=1048576`:
$ docker run --rm -it --ulimit "nofile=1048576" --volume './python_close_individual.py:/tmp/test.py' python:3.12-alpine3.19 ash -c 'time python3 /tmp/test.py 1048570 100'
115.63215670500358
real    1m 56.75s
user    0m 4.08s
sys     1m 51.62s

# For this test example, the 1st parameter (number of FDs to open/close per process) is sufficient to alter the iteration behaviour.
# If you lower `--ulimit` instead, the program will fail to open FDs beyond the hard limit.
# Much faster than 2 minutes, only 145ms.
$ docker run --rm -it --ulimit "nofile=1048576" --volume './python_close_individual.py:/tmp/test.py' python:3.12-alpine3.19 ash -c 'time python3 /tmp/test.py 1024 100'
0.1456817659927765
real    0m 0.24s
user    0m 0.18s
sys     0m 0.06s

# Fedora 35 (for comparison to next snippet results)
$ docker run --rm -it --ulimit "nofile=1048576" --volume './python_close_individual.py:/tmp/test.py' fedora:35 bash -c 'dnf install -y python3 && time python3 /tmp/test.py 1048570 100'
114.43512667200412
real    1m55.261s
user    0m3.282s
sys     1m50.847s
# fedora:34 uses a version of Python (3.9) with a less optimized `closerange()` call
# fedora:35 uses Python 3.10 which can use a faster syscall when available (requires glibc 2.34+ in container and host kernel 5.9+)
$ docker run --rm -it --ulimit "nofile=1048576" --volume './python_close_range.py:/tmp/test.py' fedora:35 bash -c 'dnf install -y python3 && time python3 /tmp/test.py'
real    0m0.015s
user    0m0.000s
sys     0m0.014s

$ docker run --rm -it --ulimit "nofile=1048576" --volume './python_close_range.py:/tmp/test.py' fedora:34 bash -c 'dnf install -y python3 && time python3 /tmp/test.py'
real    0m6.268s
user    0m1.950s
sys     0m4.319s

# Alpine as of Jan 2024 does not have compatibility for the better closerange syscall like fedora:35+ does:
$ docker run --rm -it --ulimit "nofile=1048576" --volume './python_close_range.py:/tmp/test.py' python:alpine ash -c 'time python3 /tmp/test.py'
real    0m 6.81s
user    0m 2.55s
sys     0m 4.26s

Sources

python_close_individual.py:

import os, subprocess, sys, timeit
from resource import RLIMIT_NOFILE, getrlimit, setrlimit

# Get the soft and hard limits from the environment, then raise the soft limit to the hard limit
soft, hard = getrlimit(RLIMIT_NOFILE)
setrlimit(RLIMIT_NOFILE, (hard, hard))

# CLI args, number of FDs to open and how many times to run the bench method
num_fds, num_iter = map(int, sys.argv[1:3])

for i in range(num_fds):
    os.open('/dev/null', os.O_RDONLY)

# Spawn a subprocess that inherits the FDs opened (which will close them internally).
# Do this N times to demonstrate the impact:
# https://docs.python.org/3/library/timeit.html
# `subprocess.run()` calls Popen, which by default (`close_fds=True`) closes each FD from 3 upward individually:
# https://docs.python.org/3/library/subprocess.html#popen-constructor
# > If `close_fds` is `true`, all file descriptors except 0, 1 and 2 will be closed before the child process is executed.
print(timeit.timeit(lambda: subprocess.run('/bin/true'), number=num_iter))

python_close_range.py:

import os

# Raise repetition to emulate a more intensive task
num_iter = 100

# Close all FDs from 3 up to the max, a common initialization practice for daemons.
# The faster syscall available with the fedora:35 image runs in constant time,
# avoiding iteration over a potentially large range of FDs, each with an
# individual `close()` call.
for i in range(num_iter):
    os.closerange(3, os.sysconf("SC_OPEN_MAX"))

Reproduction references:

Anything else we need to know?

While containerd has yet to publish a release with this change AFAIK (it should ship in v2.0), AWS eagerly adopted the change and promptly reverted it due to customer feedback: some software failed to request a higher soft limit for itself (some AWS-specific software and Envoy are known examples).

AWS can provide a higher LimitNOFILE configuration if that better suits their users (despite the referenced 1024 soft limit concerns, or the difficult-to-troubleshoot issues with LimitNOFILE=infinity), but that should be a vendor decision, while projects like cri-o actually fix the bug.

LimitNOFILE=1048576 is not as bad as LimitNOFILE=infinity, however:

CRI-O and Kubernetes version

N/A

$ crio --version
# paste output here
$ kubectl version --output=json
# paste output here

OS version

N/A

Test reproduction environment was WSL2 (Ubuntu), but previously was Arch Linux and Fedora.

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

Additional environment details (AWS, VirtualBox, physical, etc.)

Labels

kind/bug: Categorizes issue or PR as related to a bug.
