Remove NCCL_NET_GDR_LEVEL and NCCL_NET_GDR_C2C environment variables. #15161
These variables were a workaround for NCCL versions older than 2.27.x and are no longer needed in this context.
- Removed setting of these variables from scripts/performance/executors.py
Signed-off-by: Alex Filby <[email protected]>
Can we add a conditional check instead of removing these? We could check the NCCL version through torch.
@sanandaraj5597 We can. I removed it outright since the latest NeMo container releases ship an NCCL version new enough that the workaround is not needed.
@sanandaraj5597 Actually, how would that work? The executor script runs at job launch in the local environment, outside the container environment, so we won't know the NCCL version until after the job starts.
Ack on your point about the executor script launch; I agree that's a problem. Why do we want to remove this?
We recently had an internal team run into issues with Nemotron4 and checkpointing when those flags were set while using a NeMo container with NCCL 2.27.x+ (can link the Slack thread if interested). In general I'm leery of leaving unneeded vars around, given the potential for unintended consequences later. I also don't recall right now how big an impact those settings actually had on perf. Is there a way, with the current structure of NeMo, to do a runtime check and inject the settings there?
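For illustration, a runtime check of that kind could look roughly like the sketch below. It is not part of this PR and not an existing NeMo hook: `detect_nccl_version`, `inject_gdr_workaround`, and `FIXED_IN` are hypothetical names, and it assumes the function is called inside the container before the process group is initialized. The only real API used is `torch.cuda.nccl.version()`, which returns a version tuple in recent PyTorch releases. The values to restore are passed in by the caller, since the right ones depend on what the executor set previously.

```python
import os
from typing import Mapping, MutableMapping, Optional, Tuple

# Hypothetical constant: first NCCL release line that, per this PR's
# description, no longer needs the workaround.
FIXED_IN: Tuple[int, int] = (2, 27)


def detect_nccl_version() -> Optional[Tuple[int, ...]]:
    """Return the NCCL version reported by PyTorch, or None if it cannot be read."""
    try:
        import torch  # imported lazily so the helper also loads without torch

        # Recent PyTorch returns a tuple such as (2, 27, 3). Older releases
        # returned an int; tuple() then raises and we report None.
        return tuple(torch.cuda.nccl.version())
    except Exception:
        return None


def inject_gdr_workaround(
    version: Optional[Tuple[int, ...]],
    workaround: Mapping[str, str],
    env: MutableMapping[str, str] = os.environ,
) -> bool:
    """Set the workaround variables only for NCCL older than FIXED_IN.

    Returns True if the variables were applied. An unknown version leaves
    the environment untouched, and values the user already set are kept.
    """
    if version is None or tuple(version[:2]) >= FIXED_IN:
        return False
    for name, value in workaround.items():
        env.setdefault(name, value)
    return True


# Usage inside the job, before NCCL is initialized (placeholder values):
# inject_gdr_workaround(
#     detect_nccl_version(),
#     {"NCCL_NET_GDR_LEVEL": "<previous value>", "NCCL_NET_GDR_C2C": "<previous value>"},
# )
```

This only helps if it runs in every rank before the first NCCL communicator is created, which is why it would have to live in the in-container entry point rather than in the executor script.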
Shouldn't this be an issue NCCL should fix for backward compatibility? Not sure if M-Bridge is the right place to fix this. |