RFE: Expose Granular `crio check` Behaviors (force, wipe, quick, repair) in crio.conf for Startup Checks

**Is this a BUG REPORT or FEATURE REQUEST?**:

FEATURE REQUEST

**What is the current behavior?**

**Description**

Currently, CRI-O's behavior after an unclean shutdown is not configurable via the `[crio.runtime]` options. By default it runs a quick check-and-repair scan that skips the most I/O-intensive tasks, the LayerDigest and LayerData checks (see https://github.com/cri-o/cri-o/pull/8417).

While the current CRI-O implementation provides check-and-repair control, it does not expose the full capabilities of the `crio check` CLI tool. Extensive testing on corrupted images reveals the following limitations:

1.  A latently corrupted image (e.g., a binary within a layer overwritten with bad data) is not detected or repaired by the default startup check, because the LayerDigest check is skipped. As a result, application pods enter `CrashLoopBackOff`.
2.  Manual intervention is required to fix this specific issue. In testing, the `crio check --repair --force` command performed a "surgical" deletion of only the corrupted image, allowing it to be re-pulled.
3.  There is currently **no way to configure this effective `--repair --force` behavior** (i.e. a full check instead of the default quick check) to run automatically at startup. The existing check-and-repair behavior (quick repair) is not aggressive enough to overcome container locks that may persist in the storage metadata after a crash.
4.  As expected, the quick repair succeeded at node startup when only metadata corruption was involved, because the quick check-and-repair runs the checks that detect and fix that kind of corruption.

**What is the desired behavior?**

I propose adding new options to `crio.conf` to allow administrators to configure the more advanced behaviors of `crio check` for the automatic startup check. This provides greater flexibility for different environments, especially remote/edge nodes, where manual intervention is costly.

The proposed new options under `[crio.runtime]` would be:

* `auto_storage_check` (string, default: `quick`): Either `quick`, which is the current default and skips the most I/O-intensive checks, or `full`, which runs all available checks.
* `auto_storage_repair` (boolean, default: `true`): Controls whether CRI-O attempts to repair any corruption it finds. I suggest defaulting to `true`, as it is now. If this option is set to `false`, the `auto_storage_repair_force` and `auto_storage_wipe` options are ignored.
* `auto_storage_repair_force` (boolean, default: `false`): If `auto_storage_repair` is `true`, this option makes the repair attempt more aggressive, equivalent to the `--force` flag of the `crio check` subcommand.
* `auto_storage_wipe` (boolean, default: `false`): If `auto_storage_repair` is `true`, this option enables the automatic wipe of irreparable artifacts, equivalent to the `--wipe` flag of the `crio check` subcommand.

This would allow an administrator to adapt the check-and-repair feature to their needs and configure a robust, automated, surgical repair at startup. For example:

```toml
[crio.runtime]
# Run a full, deep check on startup after an unclean shutdown.
auto_storage_check = "full"

# If errors are found, attempt to repair them.
auto_storage_repair = true

# Make the repair aggressive enough to fix locked images.
auto_storage_repair_force = true

# (If repair did not succeed) Avoid wiping the whole container storage.
auto_storage_wipe = false
```

**Use Case and Motivation**

In Telco/Edge environments with limited bandwidth, a node that fails due to image corruption needs to recover automatically and efficiently.

Current problem: A node with a corrupted image enters a failed state (`CrashLoopBackOff`) and requires manual SSH intervention to run `crio check --repair --force` (note that `--quick` is not set), because the current check-and-repair behaviour does not fix LayerDigest corruption.

Desired outcome: With the proposed configuration, CRI-O could automatically perform this surgical repair on reboot. It would delete only the single corrupted image, forcing a re-pull of just that image and showing the application as Running at startup. This minimizes downtime and network traffic, which is critical in these environments.

**Anything else we need to know?**

This feature request is complementary to the proposal in PR #9242, which introduces the ability to selectively wipe container storage in cases of unresolvable corruption.
Here, that aim can be achieved by setting the `auto_storage_wipe` option to `false`. This mirrors the default behavior of the `crio check` subcommand when the `--wipe` flag is omitted.

```
$ crio check --help
OPTIONS:
   --age value, -a value  Maximum allowed age for unreferenced layers (default: "24h")
   --force, -f            Remove damaged containers (default: false)
   --repair, -r           Remove damaged images and layers (default: false)
   --quick, -q            Perform only quick checks (default: false)
   --wipe, -w             Wipe storage directory on repair failure (default: false)
   --help, -h             show help
```

cc/ @saschagrunert 
