Skip to content

Conversation

@sudostock
Copy link
Contributor

These variables were a workaround for NCCL versions older than 2.27.x and are no longer needed in this context.

  • Removed setting of these variables from scripts/performance/executors.py

These variables were a workaround for NCCL versions older than 2.27.x and are no longer needed in this context.
- Removed setting of these variables from scripts/performance/executors.py

Signed-off-by: Alex Filby <[email protected]>
@sanandaraj5597
Copy link
Collaborator

Can we add conditional checks instead of removing? We can do a torch NCCL version check.

@sudostock
Copy link
Contributor Author

Can we add conditional checks instead of removing? We can do a torch NCCL version check.

@sanandaraj5597 We can, I removed it outright since the latest NeMo container releases have a new enough NCCL version to not be needed.

@sudostock
Copy link
Contributor Author

@sanandaraj5597 Actually how would that work? The executor script is run during job launch on the local environment and outside the container env. We won't know what the NCCL version is until after the job starts.

@sanandaraj5597
Copy link
Collaborator

Ack on your point about executor script launch. I also agree that's a problem. Why do we want to remove this?
I don't want someone in the future to run an older container to debug a regression and this change causing the debug to be harder.

@sudostock
Copy link
Contributor Author

We recently had an internal team run into issues with Nemotron4 and checkpointing when those flags were set when using a Nemo container with NCCL 2.27.x+ (can link slack thread if interested)

In general I'm leery about leaving unneeded vars around given potential unintended consequences later. I also don't recall right now how big of an impact those settings actually were on perf.

Is there a way with the current structure of Nemo to do a run time check and inject the settings there?

@sanandaraj5597
Copy link
Collaborator

sanandaraj5597 commented Dec 10, 2025

We recently had an internal team run into issues with Nemotron4 and checkpointing when those flags were set when using a Nemo container with NCCL 2.27.x+

Shouldn't this be an issue NCCL should fix for backward compatibility? Not sure if M-Bridge is the right place to fix this.
But anyways if this is blocking your work, I'm okay removing this but I would fix the actual problem in NCCL than to workaround by patching M-Bridge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants