Skip to content

Conversation

cyyever
Copy link
Collaborator

@cyyever cyyever commented Nov 6, 2024

Copy link

pytorch-bot bot commented Nov 6, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139886

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Nov 6, 2024
@cyyever cyyever requested a review from kwen2501 November 6, 2024 15:01
@kwen2501
Copy link
Contributor

kwen2501 commented Nov 6, 2024

Thanks @cyyever .
Do you mean that when fut is destroyed in that scope, it will wait for the lambda it wraps to finish anyway?
If so I agree that std::async may not be effective here.
Cc @fduwjj

@kwen2501
Copy link
Contributor

kwen2501 commented Nov 6, 2024

Yep, the blocking behavior of std::future destructor is confirmed (tho as the thread below mentioned, controversial).
https://stackoverflow.com/questions/23455104/why-is-the-destructor-of-a-future-returned-from-stdasync-blocking

@fduwjj If it is intended for the dumpDebuggingInfo call to be non-blocking to the heartbeat monitor thread, do you want to add a waitForFutureOrTimeout call after creating that std::async? Just like you did in the lines below. Thanks!

@c-p-i-o
Copy link
Contributor

c-p-i-o commented Nov 6, 2024

Yep, the blocking behavior of std::future destructor is confirmed (tho as the thread below mentioned, controversial). https://stackoverflow.com/questions/23455104/why-is-the-destructor-of-a-future-returned-from-stdasync-blocking

@fduwjj If it is intended for the dumpDebuggingInfo call to be non-blocking to the heartbeat monitor thread, do you want to add a waitForFutureOrTimeout call after creating that std::async? Just like you did in the lines below. Thanks!

@kwen2501, waitForFutureOrTimeout, as the name suggests, also waits after creating the std::async.

@kwen2501
Copy link
Contributor

kwen2501 commented Nov 6, 2024

You are right, I thought waitForFutureOrTimeout have some abort mechanism (it does), but here we may not want to fire the abort because it is best effort.

// best effort dump, not waiting for the dump here
std::future<bool> fut = std::async(
std::launch::async, [this]() { return this->dumpDebuggingInfo(); });
this->dumpDebuggingInfo();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, so your change is equivalent? but we didn't want to block here, so we probably actually want a different change.. we have to make the async object outlive the function?

not sure if making this dump async is critical or not. but it may have been causing problems that we failed to diagnose because of making bad assumptions.

cc @fduwjj @shuqiangzhang

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is for the dump from a pipe which is less used. But yes, this is a good point, let me follow up on this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wconstab Even if the future object is placed outside the block, it still blocks when the whole function returns, is that desired?

Copy link
Contributor

@kwen2501 kwen2501 Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyyever we were chatting about stashing Futures we don't want to wait for to a garbage collector at global scope. Something like this:

future = std::async(std::launch::async, func_I_dont_wait_for);
futGC.push_back(std::move(future));

And then at program teardown, we did a check:

if (!futGC.empty()) {
    WARN(“There are unfinished tasks associated with this PG but this PG is destructing. Your program may experience undefined behavior.”);
}

Do you think this works?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At what time are finished works in futGC poped so that it is not empty? Before checking empty?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kwen2501 Implements a generic future pool may be a good idea to discover new optimisation opportunities.

Copy link
Contributor

@kwen2501 kwen2501 Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have some side threads in ProcessGroup that's constantly doing monitoring (on the collectives). Those threads can lend a hand popping finished Futures from futGC.

Implements a generic future pool may be a good idea to discover new optimisation opportunities

Sounds good!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sync offline with @kwen2501 If this is only for FR dump. Why can't we always do this:

  1. Always wait for the async for a short period of time.
  2. Always throw exceptions when it's timeout or return false.

This is like what we did waitForFutureOrTimeout

@cyyever
Copy link
Collaborator Author

cyyever commented Nov 6, 2024

dumpDebuggingInfo is called in std::async again some lines later.

@c-p-i-o
Copy link
Contributor

c-p-i-o commented Nov 7, 2024

dumpDebuggingInfo is called in std::async again some lines later.

The 2nd call is probably fine as it's meant to be called when things are shutting down and it gives Flight Recorder time to write out diagnostics before the world gets shut down.
1st call is blocking.

@colesbury colesbury added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Nov 8, 2024
@github-actions github-actions bot added the Stale label Jan 10, 2025
@kwen2501 kwen2501 added no-stale and removed Stale labels Jan 11, 2025
@pytorch pytorch deleted a comment from github-actions bot Oct 10, 2025
@cyyever cyyever closed this Oct 10, 2025
@cyyever cyyever deleted the fix_error branch October 10, 2025 02:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-stale oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (c10d) release notes category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants