Description
What happened?
I was recently made aware of this configuration line (contributed Oct 2016), at line 20 of `cri-o/contrib/systemd/crio.service` (commit 91816d7):

LimitNOFILE=1048576
Quite a bit has changed since then, notably with the systemd v240 release in 2018Q4. Both the Docker and containerd projects have recently removed the line from their configs to rely on the 1024:524288 default that systemd v240 provides (unless the system has been configured explicitly to some other value, which a system administrator may do when they know they need higher limits).
You can find insights in those PRs, along with a third link to the Envoy project (as an example of popular software that presently does not raise its soft limit or document that requirement, but has depended upon this implicit config in the environment), where the linked comment details why the soft limit should be 1024 to avoid software incompatibility:
- fix: Normalize `RLIMIT_NOFILE` (`LimitNOFILE`) to sensible defaults (moby/moby#45534)
- Remove `LimitNOFILE` from `containerd.service` (containerd/containerd#8924)
- Support raising the soft limit (envoyproxy/envoy#31502 (comment))
- Set containerd LimitNOFILE to recommended value (awslabs/amazon-eks-ami#1535 (comment))
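The 1024:524288 default mentioned above can be verified on a given host; a quick sketch, assuming systemd v240+ and that your systemd build exposes the soft-limit properties under these names:

# Manager-wide defaults inherited by services that do not set LimitNOFILE themselves:
$ systemctl show --property DefaultLimitNOFILE --property DefaultLimitNOFILESoft
# What the crio unit is currently configured with:
$ systemctl show crio --property LimitNOFILE --property LimitNOFILESoft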
This issue is raised to suggest applying the same change here.
Either:
- Remove the line, as Docker and containerd have done.
- Include `LimitNOFILE=1024:524288` with a contextual comment, although it may be better to remove the line and rely implicitly on the system default.
- Admins / users who need higher limits could also use a drop-in service override unit (see the sketch after this list).
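As a sketch of the drop-in approach (the file name `10-limitnofile.conf` and the 1024:524288 value here are illustrative, not a recommendation):

$ sudo mkdir -p /etc/systemd/system/crio.service.d
$ printf '[Service]\nLimitNOFILE=1024:524288\n' | sudo tee /etc/systemd/system/crio.service.d/10-limitnofile.conf
$ sudo systemctl daemon-reload && sudo systemctl restart crio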
What did you expect to happen?
For LimitNOFILE to have a soft limit of 1024, so that software running in a container operates with the same environment defaults as the host system.
Raising the default soft limit should be done explicitly by the admin, or implicitly by the process that needs it (see the Python reproduction below for an example of this).
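A rough illustration of the latter: a hypothetical program that knows it needs more file descriptors raises its own soft limit at startup rather than relying on an inflated service-manager default (the `wanted` value is an arbitrary example):

import resource

# Query the limits granted by the environment (1024:524288 under the systemd v240+ default),
# then raise only the soft limit to what this particular program needs.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
wanted = 65536  # hypothetical requirement of this example program
if hard == resource.RLIM_INFINITY or hard >= wanted:
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
else:
    # Hard limit is lower than the requirement, take what we can get
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))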
How can we reproduce it (as minimally and precisely as possible)?
Commands
I am not familiar with CRI-O, but the equivalent Docker commands demonstrate the difference (which for LimitNOFILE=1048576 can be more subtle; for example, postsrsd would take <500ms vs 8 minutes):
# Demonstrating the impact on a python process with `LimitNOFILE=1048576`:
$ docker run --rm -it --ulimit "nofile=1048576" --volume './python_close_individual.py:/tmp/test.py' python:3.12-alpine3.19 ash -c 'time python3 /tmp/test.py 1048570 100'
115.63215670500358
real 1m 56.75s
user 0m 4.08s
sys 1m 51.62s
# For this test example, the first parameter (number of FDs to open/close per process) is sufficient to alter the iteration behaviour.
# If you lower `--ulimit` instead, the program will fail to open FDs outside the hard limit.
# Much faster than 2 minutes, only 145ms.
$ docker run --rm -it --ulimit "nofile=1048576" --volume './python_close_individual.py:/tmp/test.py' python:3.12-alpine3.19 ash -c 'time python3 /tmp/test.py 1024 100'
0.1456817659927765
real 0m 0.24s
user 0m 0.18s
sys 0m 0.06s
# Fedora 35 (for comparison to next snippet results)
$ docker run --rm -it --ulimit "nofile=1048576" --volume './python_close_individual.py:/tmp/test.py' fedora:35 bash -c 'dnf install -y python3 && time python3 /tmp/test.py 1048570 100'
114.43512667200412
real 1m55.261s
user 0m3.282s
sys 1m50.847s

# fedora:34 uses a version of Python (3.9) with a less optimized `closerange()` call
# fedora:35 uses Python 3.10, which can use a faster syscall when available (requires glibc 2.34+ in the container and host kernel 5.9+)
$ docker run --rm -it --ulimit "nofile=1048576" --volume './python_close_range.py:/tmp/test.py' fedora:35 bash -c 'dnf install -y python3 && time python3 /tmp/test.py'
real 0m0.015s
user 0m0.000s
sys 0m0.014s
$ docker run --rm -it --ulimit "nofile=1048576" --volume './python_close_range.py:/tmp/test.py' fedora:34 bash -c 'dnf install -y python3 && time python3 /tmp/test.py'
real 0m6.268s
user 0m1.950s
sys 0m4.319s
# Alpine as of Jan 2024 does not have compatibility for the better closerange syscall like fedora:35+ does:
$ docker run --rm -it --ulimit "nofile=1048576" --volume './python_close_range.py:/tmp/test.py' python:alpine ash -c 'time python3 /tmp/test.py'
real 0m 6.81s
user 0m 2.55s
sys 0m 4.26s

Sources
python_close_individual.py:
import os, subprocess, sys, timeit
from resource import *

# Get the soft and hard limits from the environment and raise the soft limit to the hard limit
soft, hard = getrlimit(RLIMIT_NOFILE)
setrlimit(RLIMIT_NOFILE, (hard, hard))

# CLI args: number of FDs to open and how many times to run the bench method
num_fds, num_iter = map(int, sys.argv[1:3])
for i in range(num_fds):
    os.open('/dev/null', os.O_RDONLY)

# Spawn a subprocess that inherits the FDs opened (which will close them internally).
# Do this N times to demonstrate the impact:
# https://docs.python.org/3/library/timeit.html
# `subprocess.run()` calls Popen, which by default (close_fds=True) closes each FD from 3 upwards individually:
# https://docs.python.org/3/library/subprocess.html#popen-constructor
# > If `close_fds` is `true`, all file descriptors except 0, 1 and 2 will be closed before the child process is executed.
print(timeit.timeit(lambda: subprocess.run('/bin/true'), number=num_iter))

python_close_range.py:
import os

# Raise repetition to emulate a more intensive task
num_iter = 100

# Close all FDs after the third up to the max, a common initialization practice for daemons.
# The faster call with the fedora:35 image is constant, avoiding iteration over a potentially
# large range of FDs, each with an individual `close()` call.
for i in range(num_iter):
    os.closerange(3, os.sysconf("SC_OPEN_MAX"))

Reproduction references:
- [BUG] `ENABLE_SRS=1` causing high CPU usage with `postsrsd` (docker-mailserver/docker-mailserver#2722 (comment))
- os.closerange optimization (python/cpython#57997)
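To sanity-check what the proposed default would look like from inside a container (reusing the Python image from the commands above; the printed tuple follows directly from the requested ulimit):

$ docker run --rm --ulimit "nofile=1024:524288" python:3.12-alpine3.19 python3 -c 'import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE))'
(1024, 524288)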
Anything else we need to know?
While containerd has yet to publish a release with this change AFAIK (it should be scheduled for v2.0), AWS eagerly adopted the change and promptly reverted it due to customer feedback, as some software failed to request a higher soft limit for itself (some AWS-specific software and Envoy are known examples).
AWS can provide a higher LimitNOFILE configuration if that better suits their users (despite the referenced 1024 soft limit concerns, or the difficult-to-troubleshoot issues with LimitNOFILE=infinity), but that should be a vendor decision while projects like CRI-O actually fix the bug.
LimitNOFILE=1048576 is not as bad as LimitNOFILE=infinity, however:
- This concern still applies: Support raising the soft limit (envoyproxy/envoy#31502 (comment))
- Software such as MySQL has been known to allocate excessive memory based on this limit; this would be 1,000x less than with `infinity`, but affected deployments would still be allocating 1,000x more than they may need. The Java runtime was also identified as another culprit.
- Software such as PostSRSd, Fail2Ban, and Rsyslog is similarly affected (see the postsrsd reproduction reference above).
- RPM package managers:
  - `yum` (NOTE: PowerDNS had to work around 6-hour image build times, however that was due to a `2^30` limit, not `2^20`)
  - `zypper` (NOTE: with `LimitNOFILE=1048576` operations take 30-60 minutes when they could be much faster)
  - `dnf`
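For anyone triaging this on a running node, the limits the crio process actually received can be read from procfs (a sketch, assuming `pidof` is available and the service is running):

$ grep 'Max open files' /proc/$(pidof crio)/limits
# With the current unit file this should report 1048576 for both the soft and hard limit.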
CRI-O and Kubernetes version
N/A
OS version
N/A
The test reproduction environment was WSL2 (Ubuntu); previously Arch Linux and Fedora were used.