Tags: instructlab/training
fix(torchrun): Omit empty arguments and correct nproc_per_node type (#… …661)

* fix(torchrun): Omit empty arguments and correct nproc_per_node type

  The command generation logic is updated to dynamically build the torchrun command, excluding arguments that are empty or None. This prevents them from overriding environment variables, ensuring that torchrun can correctly inherit its configuration. An exception is made for integer arguments where 0 is a valid value.

  Additionally, the nproc_per_node argument type has been changed from int to str to support special values accepted by PyTorch, such as 'auto', 'gpu', and 'cpu'.

  Reference: https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py#L77-L88

  Signed-off-by: Saad Zaher <[email protected]>

* only dynamically add torchrun args & change rdzv_id type to str

  Signed-off-by: Saad Zaher <[email protected]>

* fix smoke tests

  Signed-off-by: Saad Zaher <[email protected]>

* Enable both dtypes str, int for nproc_per_node, rdzv_id

  Signed-off-by: Saad Zaher <[email protected]>

* Use python3.11 style for pydantic model

  Signed-off-by: Saad Zaher <[email protected]>

* add all torchrun args and validate them

  Signed-off-by: Saad Zaher <[email protected]>

* Remove non-required dependencies

  Signed-off-by: Saad Zaher <[email protected]>

* update datatypes only

  Signed-off-by: Saad Zaher <[email protected]>

* replace _ with - when passing torchrun args

  Signed-off-by: Saad Zaher <[email protected]>

* make nproc_per_node to only accept gpu or int

  Signed-off-by: Saad Zaher <[email protected]>

* add master_{addr, port} validate args

  Signed-off-by: Saad Zaher <[email protected]>

* check for not set or empty rdzv endpoint

  Signed-off-by: Saad Zaher <[email protected]>

* fix formatting error

  Signed-off-by: Saad Zaher <[email protected]>

* Update src/instructlab/training/config.py

  Signed-off-by: Saad Zaher <[email protected]>

* Update tests/smoke/test_train.py

  Signed-off-by: Saad Zaher <[email protected]>

* Update src/instructlab/training/main_ds.py

  Signed-off-by: Saad Zaher <[email protected]>

* fixes indentation

  Signed-off-by: Oleg Silkin <[email protected]>

* formatting

* add standalone as the fallback when neither master_addr nor rdzv_endpoint are provided

  Signed-off-by: Oleg Silkin <[email protected]>

* clarify rdzv-backend arg

---------

Signed-off-by: Saad Zaher <[email protected]>
Signed-off-by: Oleg Silkin <[email protected]>
Co-authored-by: Oleg Silkin <[email protected]>
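
The command-building behavior described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from main_ds.py: the function name build_torchrun_command and the plain dict of arguments are hypothetical, and the exact validation in the repository may differ. It shows three ideas from the message: skipping None or empty values so environment variables are not overridden, keeping integer 0, and replacing _ with - in flag names, plus an assumed --standalone fallback when neither master_addr nor rdzv_endpoint is set.

```python
from typing import Mapping, Union

ArgValue = Union[str, int, None]


def build_torchrun_command(args: Mapping[str, ArgValue]) -> list[str]:
    """Build a torchrun argv list, skipping unset options (hypothetical helper)."""
    command = ["torchrun"]
    for name, value in args.items():
        # Skip None and empty strings so torchrun can fall back to
        # its environment variables instead of an empty override.
        if value is None:
            continue
        if isinstance(value, str) and not value.strip():
            continue
        # Integers are always kept: 0 is a valid value (e.g. node_rank=0).
        # Python field names use underscores; the flags are passed with hyphens.
        command.append(f"--{name.replace('_', '-')}={value}")
    # Assumed fallback: with no master address and no rendezvous endpoint,
    # run in single-node standalone mode.
    if not args.get("master_addr") and not args.get("rdzv_endpoint"):
        command.append("--standalone")
    return command


example = build_torchrun_command(
    {"nproc_per_node": "gpu", "node_rank": 0, "rdzv_id": "", "rdzv_endpoint": None}
)
print(example)
```

In this sketch the empty rdzv_id and the None rdzv_endpoint are dropped, node_rank=0 survives, and --standalone is appended because no endpoint or master address was given.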