[WIP] Add mechanism to monitor conmon #9172
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: bitoku. It still needs approval from an approver in each of the affected files.
Force-pushed 098e0f1 to 9e8689f (Compare).

Review comment on internal/oci/runtime_oci.go (outdated):
```go
}()

// Wait until the conmon process exits
state, err := conmonProcess.Wait()
```
we may want to do some perf scale testing to make sure this scales up with the number of containers. It may be more efficient to open the process and store it and periodically check it rather than have a goroutine open for each conmon
I also think if we catch this situation happening we should emit a metric so we can track over time
Thanks. We don't have a performance-test CI or something, do we?
no but we really should...
This Wait() approach didn't work.
I did some tests and found that Wait() (i.e. waitid) can't wait on a non-child process, and because the actual conmon process is the one detached from the original process, CRI-O can't wait on it.
Polling with signal 0 should work.
I had hoped we could use pidfds for this behavior, as non-parent processes can open them and the kernel will close them when the process dies, so we could asynchronously check whether the fd was closed (I believe this is how it works; it's possible I've got the details fuzzy there).
Thanks. I was trying to achieve it only with high-level libraries, but it should be possible with epoll or some other way to monitor fds. I'll try implementing it and compare the efficiency.
Signed-off-by: Ayato Tokubi <[email protected]>
Codecov Report. Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #9172      +/-   ##
==========================================
- Coverage   47.01%   46.82%   -0.19%
==========================================
  Files         161      161
  Lines       23662    23797     +135
==========================================
+ Hits        11124    11143      +19
- Misses      11438    11552     +114
- Partials     1100     1102       +2
```
@bitoku: The following tests failed. See the full PR test history and your PR dashboard.
#9185: I tested the performance with 5 containers × 100 pods = 500 containers. The script I used is https://github.com/cri-o/cri-o/pull/9185/files#diff-b99687ee53b96d05bae8378f7364c7999d379d47a742edbfab6d2af3227da9db.

In short, I got these results with the ps command. no change: rss 155,695 kb, cpu 3.99 %. details (only accessible with a Red Hat account)

The increase in epoll CPU was mainly the EpollWait syscall. I'm not sure of the cause of the increase in epoll memory; the profile shows LoadContainer used some memory, which didn't show up in the signal polling. (epoll pprof, polling pprof)

IMO the signal-polling approach is simple and cheap enough, but let me know your thoughts on this change. Do we want to monitor conmon at the expense of this consumption?
I did some additional tests with a single-goroutine approach.
details (only accessible with a Red Hat account)
I'd like to go with "single goroutine + signal polling", though we can make the interval longer.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds a mechanism to monitor conmon processes and logs errors when they exit before the container stops.
It monitors conmon by polling with signal 0 periodically.
There's a concern about its performance impact: a goroutine has a minimum stack size of 8KB plus other usage, so if the node runs 1000 containers, I guess CRI-O's memory usage would increase on the order of 10MB to 100MB.
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?