Skip to content

Conversation

cyyever
Copy link
Collaborator

@cyyever cyyever commented Nov 6, 2024

Copy link

pytorch-bot bot commented Nov 6, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139886

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Nov 6, 2024
@cyyever cyyever requested a review from kwen2501 November 6, 2024 15:01
@kwen2501
Copy link
Contributor

kwen2501 commented Nov 6, 2024

Thanks @cyyever .
Do you mean that when fut is destroyed in that scope, it will wait for the lambda it wraps to finish anyway?
If so I agree that std::async may not be effective here.
Cc @fduwjj

@kwen2501
Copy link
Contributor

kwen2501 commented Nov 6, 2024

Yep, the blocking behavior of std::future destructor is confirmed (tho as the thread below mentioned, controversial).
https://stackoverflow.com/questions/23455104/why-is-the-destructor-of-a-future-returned-from-stdasync-blocking

@fduwjj If it is intended for the dumpDebuggingInfo call to be non-blocking to the heartbeat monitor thread, do you want to add a waitForFutureOrTimeout call after creating that std::async? Just like you did in the lines below. Thanks!

@c-p-i-o
Copy link
Contributor

c-p-i-o commented Nov 6, 2024

Yep, the blocking behavior of std::future destructor is confirmed (tho as the thread below mentioned, controversial). https://stackoverflow.com/questions/23455104/why-is-the-destructor-of-a-future-returned-from-stdasync-blocking

@fduwjj If it is intended for the dumpDebuggingInfo call to be non-blocking to the heartbeat monitor thread, do you want to add a waitForFutureOrTimeout call after creating that std::async? Just like you did in the lines below. Thanks!

@kwen2501, waitForFutureOrTimeout, as the name suggests, also waits after creating the std::async.

@kwen2501
Copy link
Contributor

kwen2501 commented Nov 6, 2024

You are right, I thought waitForFutureOrTimeout have some abort mechanism (it does), but here we may not want to fire the abort because it is best effort.

// best effort dump, not waiting for the dump here
std::future<bool> fut = std::async(
std::launch::async, [this]() { return this->dumpDebuggingInfo(); });
this->dumpDebuggingInfo();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, so your change is equivalent? but we didn't want to block here, so we probably actually want a different change.. we have to make the async object outlive the function?

not sure if making this dump async is critical or not. but it may have been causing problems that we failed to diagnose because of making bad assumptions.

cc @fduwjj @shuqiangzhang

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.