-
Notifications
You must be signed in to change notification settings - Fork 25.5k
Fix suspicious async #139886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix suspicious async #139886
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139886
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes. |
Yep, the blocking behavior of @fduwjj If it is intended for the |
@kwen2501, |
You are right, I thought |
// best effort dump, not waiting for the dump here | ||
std::future<bool> fut = std::async( | ||
std::launch::async, [this]() { return this->dumpDebuggingInfo(); }); | ||
this->dumpDebuggingInfo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting, so your change is equivalent? but we didn't want to block here, so we probably actually want a different change.. we have to make the async object outlive the function?
not sure if making this dump async is critical or not. but it may have been causing problems that we failed to diagnose because of making bad assumptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is for the dump from a pipe which is less used. But yes, this is a good point, let me follow up on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wconstab Even if the future object is placed outside the block, it still blocks when the whole function returns, is that desired?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cyyever we were chatting about stashing Futures we don't want to wait for to a garbage collector at global scope. Something like this:
future = std::async(std::launch::async, func_I_dont_wait_for);
futGC.push_back(std::move(future));
And then at program teardown, we did a check:
if (!futGC.empty()) {
WARN(“There are unfinished tasks associated with this PG but this PG is destructing. Your program may experience undefined behavior.”);
}
Do you think this works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At what time are finished works in futGC poped so that it is not empty? Before checking empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kwen2501 Implements a generic future pool may be a good idea to discover new optimisation opportunities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have some side threads in ProcessGroup that's constantly doing monitoring (on the collectives). Those threads can lend a hand popping finished Futures from futGC.
Implements a generic future pool may be a good idea to discover new optimisation opportunities
Sounds good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sync offline with @kwen2501 If this is only for FR dump. Why can't we always do this:
- Always wait for the async for a short period of time.
- Always throw exceptions when it's timeout or return false.
This is like what we did waitForFutureOrTimeout
dumpDebuggingInfo is called in std::async again some lines later. |
The 2nd call is probably fine as it's meant to be called when things are shutting down and it gives Flight Recorder time to write out diagnostics before the world gets shut down. |
This call is blocking.
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o