RFE: Expose Granular crio check Behaviors (force, wipe, quick, repair) in crio.conf for Startup Checks #9496

@alosadagrande

Is this a BUG REPORT or FEATURE REQUEST?:

FEATURE REQUEST

What is the current behavior?

Description

Currently, CRI-O's behavior after an unclean shutdown is not configurable through the [crio.runtime] options. By default it runs a quick check-and-repair scan (see #8417), which skips the most I/O-intensive checks, LayerDigest and LayerData.

While the current CRI-O implementation provides check-and-repair control, it does not expose the full capabilities of the crio check CLI tool. Extensive testing on corrupted images reveals the following limitations:

  1. A latently corrupted image (e.g., a binary within a layer overwritten with bad data) is neither detected nor repaired by the default startup check, because the LayerDigest check is skipped. Application pods then enter CrashLoopBackOff.
  2. Manual intervention is required to fix this specific issue. In testing, crio check --repair --force performed a "surgical" deletion of only the corrupted image, allowing it to be re-pulled.
  3. There is currently no way to configure this effective --repair --force behavior (i.e., a full check instead of the default quick check) to run automatically at startup. The existing check-and-repair behavior (quick repair) is not aggressive enough to overcome container locks that may persist in the storage metadata after a crash.
  4. As expected, the quick repair succeeded at node startup when only metadata corruption was involved, because the quick check-and-repair runs the checks that detect and fix that type of corruption.
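
To illustrate limitation 1, here is a toy model in Go of why a metadata-only check can miss an overwritten file while a digest check catches it. It is only a sketch: the layer type, quickCheck, and fullCheck are hypothetical names, and the "quick" check here is assumed to compare nothing but a recorded size. It does not reproduce the actual checks in CRI-O or its storage library.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// layer is a hypothetical, simplified stand-in for a stored image layer.
type layer struct {
	Size    int64  // size recorded in metadata when the layer was pulled
	Digest  string // digest recorded in metadata, "sha256:<hex>"
	Content []byte // bytes currently on disk
}

// digestOf returns the SHA-256 digest of data in "sha256:<hex>" form.
func digestOf(data []byte) string {
	sum := sha256.Sum256(data)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// quickCheck models a cheap, metadata-only check (assumption: size only).
func quickCheck(l layer) bool {
	return int64(len(l.Content)) == l.Size
}

// fullCheck additionally re-hashes the content, which is the expensive part.
func fullCheck(l layer) bool {
	return quickCheck(l) && digestOf(l.Content) == l.Digest
}
```

In this model, a layer whose content was overwritten with the same number of bytes passes quickCheck but fails fullCheck, which mirrors the latent corruption described above.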

What is the desired behavior?

I propose adding new options to crio.conf to allow administrators to configure the more advanced behaviors of crio check for the automatic startup check. This provides greater flexibility for different environments, especially remote/edge nodes, where manual intervention is costly.

The proposed new options under [crio.runtime] would be:

  • auto_storage_check (string, default: "quick"): Either "quick", the current default, which skips the most I/O-intensive checks, or "full", which runs all available checks.
  • auto_storage_repair (boolean, default: true): Controls whether CRI-O attempts to repair detected corruption. I suggest keeping the default of true, as it is now. If this option is set to false, the auto_storage_repair_force and auto_storage_wipe options are ignored.
  • auto_storage_repair_force (boolean, default: false): If auto_storage_repair is true, this option makes the repair attempt more aggressive, equivalent to the --force flag of the crio check subcommand.
  • auto_storage_wipe (boolean, default: false): If auto_storage_repair is true, this option enables the automatic wipe of irreparable artifacts, equivalent to the --wipe flag of the crio check subcommand.
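
As a rough sketch of how these options and their dependencies could be represented, the Go snippet below models them as a struct with defaults and a normalization step. The type and method names (StorageCheckConfig, DefaultStorageCheckConfig, Normalize) are hypothetical and do not exist in CRI-O. The only rules encoded are the ones stated above: the check mode is "quick" or "full", and force and wipe are ignored when repair is disabled.

```go
package main

import "fmt"

// StorageCheckConfig is a hypothetical struct mirroring the proposed options.
type StorageCheckConfig struct {
	AutoStorageCheck       string // "quick" or "full"
	AutoStorageRepair      bool
	AutoStorageRepairForce bool
	AutoStorageWipe        bool
}

// DefaultStorageCheckConfig returns the proposed defaults:
// quick check, repair enabled, no force, no wipe.
func DefaultStorageCheckConfig() StorageCheckConfig {
	return StorageCheckConfig{AutoStorageCheck: "quick", AutoStorageRepair: true}
}

// Normalize validates the check mode and clears the force and wipe
// options when repair is disabled, since they would have no effect.
func (c StorageCheckConfig) Normalize() (StorageCheckConfig, error) {
	switch c.AutoStorageCheck {
	case "quick", "full":
		// valid mode
	default:
		return c, fmt.Errorf("auto_storage_check: invalid value %q, want \"quick\" or \"full\"", c.AutoStorageCheck)
	}
	if !c.AutoStorageRepair {
		c.AutoStorageRepairForce = false
		c.AutoStorageWipe = false
	}
	return c, nil
}
```

Rejecting unknown modes at load time, rather than silently falling back to "quick", would make a typo in crio.conf visible before the node depends on the deeper check.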

This would allow an administrator to adapt the check-and-repair feature to their needs and configure a robust, automated, surgical repair at startup. For example:

[crio.runtime]
# Run a full, deep check on startup after an unclean shutdown.
auto_storage_check = "full"

# If errors are found, attempt to repair them.
auto_storage_repair = true

# Make the repair aggressive enough to fix locked images.
auto_storage_repair_force = true

# (If repair did not succeed) Avoid wiping the whole container storage.
auto_storage_wipe = false
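
To make the intended equivalence with the CLI concrete, the following Go sketch maps such a configuration to the crio check arguments it is meant to reproduce. The startupCheck type and checkArgs function are hypothetical. Only the flag names (--quick, --repair, --force, --wipe) come from the crio check help output, and the mapping itself is an assumption about how the proposal could behave.

```go
package main

// startupCheck is a hypothetical in-memory form of the proposed options.
type startupCheck struct {
	Mode   string // "quick" or "full"
	Repair bool
	Force  bool
	Wipe   bool
}

// checkArgs returns the `crio check` arguments equivalent to cfg.
// Force and Wipe are ignored unless Repair is enabled.
func checkArgs(cfg startupCheck) []string {
	args := []string{"check"}
	if cfg.Mode == "quick" {
		args = append(args, "--quick")
	}
	if !cfg.Repair {
		return args
	}
	args = append(args, "--repair")
	if cfg.Force {
		args = append(args, "--force")
	}
	if cfg.Wipe {
		args = append(args, "--wipe")
	}
	return args
}
```

Under this mapping, the example configuration above (full check, repair, force, no wipe) corresponds to crio check --repair --force, the command that had to be run manually in testing.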

Use Case and Motivation

In Telco/Edge environments with limited bandwidth, a node that fails due to image corruption needs to recover automatically and efficiently.

Current problem: On a node with a corrupted image, the affected pods enter a failed state (CrashLoopBackOff), and recovery requires manual SSH intervention to run crio check --repair --force (note that --quick is not set), since the current check-and-repair behavior does not fix LayerDigest corruption.

Desired outcome: With the proposed configuration, CRI-O could automatically perform this surgical repair on reboot. It would delete only the single corrupted image, forcing a re-pull of just that image and showing the application as Running at startup. This minimizes downtime and network traffic, which is critical in these environments.

Anything else we need to know?

This feature request is complementary to the proposal in PR #9242, which introduces the ability to selectively wipe container storage in cases of unresolvable corruption.
Here, that aim can be achieved by setting the auto_storage_wipe option to false. This mirrors the default behavior of the crio check subcommand when the --wipe flag is omitted.

$ crio check --help
OPTIONS:
   --age value, -a value  Maximum allowed age for unreferenced layers (default: "24h")
   --force, -f            Remove damaged containers (default: false)
   --repair, -r           Remove damaged images and layers (default: false)
   --quick, -q            Perform only quick checks (default: false)
   --wipe, -w             Wipe storage directory on repair failure (default: false)
   --help, -h             show help

cc/ @saschagrunert
