Description
Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST
What is the current behavior?
Currently, CRI-O's behavior upon an unclean shutdown is NOT configurable via the [crio.runtime] options. By default it runs a quick check-and-repair scan (see #8417), which skips the most I/O-intensive checks, LayerDigest and LayerData.
While the current CRI-O implementation provides check-and-repair control, it does not expose the full capabilities of the crio check CLI tool. Extensive testing on corrupted images reveals the following limitations:
- A latently corrupted image (e.g., a binary within a layer overwritten with bad data) is not detected or repaired by the default startup check, because the LayerDigest check is skipped. Application pods then enter CrashLoopBackOff.
- Manual intervention is required to fix this specific issue. In testing, the crio check --repair --force command performed a "surgical" deletion of only the corrupted image, allowing it to be re-pulled.
- There is currently no way to configure this effective --repair --force behavior (e.g. a full check instead of the default quick check) to run automatically at startup. The existing check-and-repair option (quick repair) is not aggressive enough to overcome container locks that may persist in the storage metadata after a crash.
- As expected, the quick repair succeeded at node startup when only metadata corruption was involved, because the quick check-and-repair covers and fixes that type of issue.
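To illustrate why a content overwrite can slip past a quick check, here is a toy model in Python. It is only a sketch: the function names, the layer dictionary and the idea that the quick check compares sizes are assumptions made for illustration, not the actual checks CRI-O or its storage library performs.

```python
import hashlib


def quick_check(layer: dict) -> bool:
    """Toy quick check: inspects metadata only (here, the recorded size)."""
    return len(layer["data"]) == layer["recorded_size"]


def full_check(layer: dict) -> bool:
    """Toy full check: also recomputes the content digest (the expensive part)."""
    digest = "sha256:" + hashlib.sha256(layer["data"]).hexdigest()
    return quick_check(layer) and digest == layer["recorded_digest"]


# A hypothetical layer whose metadata was recorded when the image was pulled.
original = b"#!/bin/sh\necho hello\n"
layer = {
    "data": original,
    "recorded_size": len(original),
    "recorded_digest": "sha256:" + hashlib.sha256(original).hexdigest(),
}

# Latent corruption: the content is overwritten with bad data of the same length.
layer["data"] = b"X" * len(original)

print("quick check passes:", quick_check(layer))
print("full check passes:", full_check(layer))
```

In this model the quick check still passes after the overwrite, while the digest comparison fails, which matches the behavior described in the first bullet.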
What is the desired behavior?
I propose adding new options to crio.conf to allow administrators to configure the more advanced behaviors of crio check for the automatic startup check. This provides greater flexibility for different environments, especially remote/edge nodes, where manual intervention is costly.
The proposed new options under [crio.runtime] would be:
- auto_storage_check (string, default: quick): Either quick, the current default, which skips the most I/O-intensive checks, or full, which runs all available checks.
- auto_storage_repair (boolean, default: true): Whether to attempt to repair any corruption found. I suggest keeping the default true, as it is now. If this option is set to false, neither auto_storage_repair_force nor auto_storage_wipe is taken into account.
- auto_storage_repair_force (boolean, default: false): If auto_storage_repair is true, this option makes the repair attempt more aggressive, equivalent to the --force flag of the crio check subcommand.
- auto_storage_wipe (boolean, default: false): If auto_storage_repair is true, this option enables the automatic wipe of irreparable artifacts, equivalent to the --wipe flag of the crio check subcommand.
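The intended precedence between these options can be sketched as a mapping onto the existing crio check flags. This Python snippet is only an illustration of the proposal: the StorageCheckOptions class and the equivalent_check_args helper are hypothetical names, and nothing here reflects how CRI-O would actually implement the startup check.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StorageCheckOptions:
    """Proposed [crio.runtime] options with the defaults suggested above."""
    auto_storage_check: str = "quick"
    auto_storage_repair: bool = True
    auto_storage_repair_force: bool = False
    auto_storage_wipe: bool = False


def equivalent_check_args(opts: StorageCheckOptions) -> List[str]:
    """Return the `crio check` invocation the startup check would correspond to."""
    if opts.auto_storage_check not in ("quick", "full"):
        raise ValueError("auto_storage_check must be 'quick' or 'full'")
    args = ["crio", "check"]
    if opts.auto_storage_check == "quick":
        args.append("--quick")
    # The force and wipe options only matter when repair is enabled.
    if opts.auto_storage_repair:
        args.append("--repair")
        if opts.auto_storage_repair_force:
            args.append("--force")
        if opts.auto_storage_wipe:
            args.append("--wipe")
    return args


print(equivalent_check_args(StorageCheckOptions()))
print(equivalent_check_args(StorageCheckOptions("full", True, True, False)))
```

With the defaults this corresponds to today's quick check-and-repair, while the second call corresponds to the manual crio check --repair --force workaround.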
This would allow an administrator to adapt the check-and-repair feature to their needs and configure a robust, automated, surgical repair at startup. For example:
[crio.runtime]
# Run a full, deep check on startup after an unclean shutdown.
auto_storage_check = "full"
# If errors are found, attempt to repair them.
auto_storage_repair = true
# Make the repair aggressive enough to fix locked images.
auto_storage_repair_force = true
# (If repair did not succeed) Avoid wiping the whole container storage.
auto_storage_wipe = false
Use Case and Motivation
In Telco/Edge environments with limited bandwidth, a node that fails due to image corruption needs to recover automatically and efficiently.
Current problem: A node with a corrupted image leaves its pods in a failed state (CrashLoopBackOff) and requires manual SSH intervention to run crio check --repair --force (note that --quick is not set), since the current check-and-repair behaviour does not fix LayerDigest corruption.
Desired outcome: With the proposed configuration, CRI-O could automatically perform this surgical repair on reboot. It would delete only the single corrupted image, forcing a re-pull of just that image and showing the application as Running at startup. This minimizes downtime and network traffic, which is critical in these environments.
Anything else we need to know?
This feature request is complementary to the proposal in PR #9242, which introduces the ability to selectively wipe container storage in cases of unresolvable corruption.
Here, that aim can be achieved by setting the auto_storage_wipe option to false. This mirrors the default behavior of the crio check subcommand when the --wipe flag is omitted.
$ crio check --help
OPTIONS:
--age value, -a value Maximum allowed age for unreferenced layers (default: "24h")
--force, -f Remove damaged containers (default: false)
--repair, -r Remove damaged images and layers (default: false)
--quick, -q Perform only quick checks (default: false)
--wipe, -w Wipe storage directory on repair failure (default: false)
--help, -h show help
cc/ @saschagrunert