Skip to content

Conversation

@bitoku
Copy link
Contributor

@bitoku bitoku commented May 15, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds feature to track conmon and logs error and emits a metric when the conmon is stopped.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPNODE-1853

Special notes for your reviewer:

benchmark result is here.

Does this PR introduce a user-facing change?

Added feature to track conmon processes and emit containers_stopped_monitor_count{name="$ctr_name"} metric when it's stopped. 

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 15, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 15, 2025
@bitoku
Copy link
Contributor Author

bitoku commented May 15, 2025

/test all

@openshift-ci openshift-ci bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 15, 2025
@bitoku bitoku force-pushed the conmon-monitor3 branch from 38a1c67 to 78503b5 Compare May 16, 2025 11:09
@bitoku bitoku force-pushed the conmon-monitor3 branch from 78503b5 to 09f6fb9 Compare May 19, 2025 17:13
@codecov
Copy link

codecov bot commented May 19, 2025

Codecov Report

❌ Patch coverage is 75.23810% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.99%. Comparing base (e738af1) to head (1d392a1).
⚠️ Report is 136 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9205      +/-   ##
==========================================
+ Coverage   66.55%   66.99%   +0.44%     
==========================================
  Files         198      198              
  Lines       27164    27298     +134     
==========================================
+ Hits        18078    18288     +210     
+ Misses       7600     7505      -95     
- Partials     1486     1505      +19     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bitoku bitoku force-pushed the conmon-monitor3 branch 6 times, most recently from 3bc2b2b to eb23ec3 Compare May 20, 2025 15:36
@openshift-ci openshift-ci bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 20, 2025
@bitoku bitoku changed the title [WIP] Conmon monitor3 [WIP] Track conmon process May 20, 2025
@bitoku bitoku force-pushed the conmon-monitor3 branch from eb23ec3 to d5199bc Compare May 20, 2025 15:58
@bitoku bitoku marked this pull request as ready for review May 20, 2025 16:03
@bitoku bitoku requested review from fidencio and mrunalp as code owners May 20, 2025 16:03
@openshift-ci openshift-ci bot requested review from hasan4791 and klihub May 20, 2025 16:03
@bitoku
Copy link
Contributor Author

bitoku commented May 20, 2025

/skip

@bitoku bitoku force-pushed the conmon-monitor3 branch from d5199bc to 16676e2 Compare May 20, 2025 17:52
@bitoku
Copy link
Contributor Author

bitoku commented May 21, 2025

/retest-required

@bitoku bitoku force-pushed the conmon-monitor3 branch from 16676e2 to 7c127d3 Compare May 21, 2025 09:57
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 10, 2025
@bitoku
Copy link
Contributor Author

bitoku commented Jun 10, 2025

@cri-o/cri-o-maintainers PTAL

@bitoku bitoku changed the title Track conmon process OCPNODE-3316: Track conmon process Jun 12, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 12, 2025
@openshift-ci-robot
Copy link

@bitoku: This pull request references OCPNODE-3316 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds feature to track conmon and logs error and emits a metric when the conmon is stopped.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPNODE-1853

Special notes for your reviewer:

benchmark result is here.

Does this PR introduce a user-facing change?

Added feature to track conmon processes and emit containers_stopped_monitor_count{name="$ctr_name"} metric when it's stopped. 

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bitoku
Copy link
Contributor Author

bitoku commented Jun 18, 2025

@cri-o/cri-o-maintainers Can you PTAL?

1 similar comment
@bitoku
Copy link
Contributor Author

bitoku commented Jun 24, 2025

@cri-o/cri-o-maintainers Can you PTAL?

return err
}

c.state.ContainerMonitorProcess, err = r.getConmonProcess(c)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be done in SetMonitorProcess?

Copy link
Contributor Author

@bitoku bitoku Jun 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it intentionally to make SetMonitorProcess reusable with conmon-rs.

return c.Living() == nil
}

func (r *runtimePod) ProbeMonitor(ctx context.Context, c *Container) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do save the PID of the monitor by asking it via RPC, we could do a similar thing as runtime oci here if we set the monitor pid to that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but I'd prefer to do it in a separate PR and keep this PR small. (it's not small already though)

@lance5890
Copy link
Contributor

M


if r.IsContainerAlive(c) {
metrics.Instance().MetricContainersStoppedMonitorCountInc(c.Name())
log.Errorf(ctx, "Conmon for container %s is stopped, although the container is running", c.ID())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we restart the conmon process or stop the container ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we can restart conmon.
I'd rather not stop the container, because I think users want to decide what to do to the orphaned containers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can add one more check to see if the container pid of pidfile is alive or not, if the pid of pidfile is dead, should remove the container,the pidfile lies in the following dir

[root@node userdata]# pwd
/var/run/containers/storage/overlay-containers/abb2086e148c22259006ba00dcc360b953edf1277fcbc43d62588d936de0940b/userdata
[root@node userdata]# ls
attach  config.json  conmon-pidfile  ctl  pidfile  run  winsz

I have once encounted into this situation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks but IsContainerAlive already checks container liveness with container pid. I don't understand why we need to use pidfile.

@bitoku
Copy link
Contributor Author

bitoku commented Jun 27, 2025

@cri-o/cri-o-maintainers PTAL?

@haircommander
Copy link
Member

/override ci/prow/ci-e2e-evented-pleg
/approve

LGTM, good work here!

@cri-o/cri-o-maintainers PTAL

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 27, 2025

@haircommander: Overrode contexts on behalf of haircommander: ci/prow/ci-e2e-evented-pleg

In response to this:

/override ci/prow/ci-e2e-evented-pleg
/approve

LGTM, good work here!

@cri-o/cri-o-maintainers PTAL

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 27, 2025
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 30, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 30, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bitoku, haircommander, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [haircommander,saschagrunert]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bitoku
Copy link
Contributor Author

bitoku commented Jun 30, 2025

/retest

@openshift-merge-bot openshift-merge-bot bot merged commit f3374ad into cri-o:main Jun 30, 2025
106 of 108 checks passed
@saschagrunert saschagrunert added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants