When running training with Qwen2.5-3B on 4x RTX 4090 GPUs (24 GB each), I run into two issues:
- An out-of-memory (OOM) error, despite using memory-optimization settings (gradient checkpointing, FSDP offloading, and a low gpu_memory_utilization for the rollout)
- The Ray debugger hangs indefinitely when I try to step into remote functions
My script is as follows:
set -x
MODEL_PATH="Qwen/Qwen2.5-3B"
EXPERIMENT_NAME="logic_grpo_countdown_3b"
export HYDRA_FULL_ERROR=1
export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_VISIBLE_DEVICES=0,1,2,3
RAY_DEBUG=legacy python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=data/countdown/train.parquet \
data.val_files=data/countdown/test.parquet \
data.train_batch_size=8 \
data.val_batch_size=8 \
data.max_prompt_length=400 \
data.max_response_length=1024 \
actor_rollout_ref.model.path=$MODEL_PATH \
actor_rollout_ref.actor.optim.lr=3e-7 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size=32 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.grad_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size=160 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
actor_rollout_ref.rollout.n=10 \
actor_rollout_ref.ref.log_prob_micro_batch_size=32 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['wandb'] \
trainer.project_name='GRPO_logic_countdown' \
trainer.experiment_name='Qwen-3B' \
trainer.n_gpus_per_node=4 \
trainer.nnodes=1 \
trainer.default_local_dir='/data/projects/logic_rl/checkpoints/${trainer.project_name}/${trainer.experiment_name}' \
trainer.default_hdfs_dir=null \
trainer.save_freq=100 \
trainer.test_freq=10 \
trainer.total_epochs=5 $@
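For reference, here is my rough arithmetic for what this config produces per training step (assuming, possibly incorrectly, that the micro-batch sizes are global values split evenly across the 4 GPUs):

# Back-of-envelope numbers for the config above; the per-GPU split is my assumption.
train_batch_size = 8        # data.train_batch_size
rollout_n = 10              # actor_rollout_ref.rollout.n
num_gpus = 4                # trainer.n_gpus_per_node
sequences_per_step = train_batch_size * rollout_n    # 80 sampled responses per step
max_seq_len = 400 + 1024                             # max_prompt_length + max_response_length
ppo_micro_batch_size = 32   # actor_rollout_ref.actor.ppo_micro_batch_size
per_gpu_micro = ppo_micro_batch_size // num_gpus     # 8 sequences per GPU per forward/backward
# Note: ppo_mini_batch_size=256 is larger than the 80 sequences generated per step;
# I am not sure how verl handles that, so it may be part of the problem.
print(f"{sequences_per_step} sequences/step, up to {max_seq_len} tokens each, "
      f"{per_gpu_micro} sequences per GPU per actor micro-batch")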
I followed the official Ray debugger documentation and tried to step into the remote function with the debugger's remote command, but it hung indefinitely without any error or log output. The full session is below.
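For context, the breakpoint in the session below was set with roughly this edit in the trainer's fit() method (a sketch of my local change; the line numbers in the transcript refer to my checkout):

# In verl/trainer/ppo/ray_trainer.py, inside fit(), right before the actor update:
breakpoint()  # picked up by the legacy Ray debugger because RAY_DEBUG=legacy is set
actor_output = self.actor_rollout_wg.update_actor(batch)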
Active breakpoints:
index | timestamp | Ray task | filename:lineno
0 | 2025-03-06 12:03:52 | ray::main_task | /home/projects/Logic-RL/verl/trainer/ppo/ray_trainer.py:691
Enter breakpoint index or press enter to refresh: 0
> /home/projects/Logic-RL/verl/trainer/ppo/ray_trainer.py(692)fit()
-> actor_output = self.actor_rollout_wg.update_actor(batch)
(Pdb) s
--Call--
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(38)func()
-> def func(*args, **kwargs):
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(39)func()
-> args, kwargs = dispatch_fn(self, *args, **kwargs)
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(40)func()
-> output = execute_fn(method_name, *args, **kwargs)
(Pdb) s
--Call--
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(329)execute_all()
-> def execute_all(self, method_name: str, *args, **kwargs):
(Pdb) s
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(330)execute_all()
-> return self.execute_all_async(method_name, *args, **kwargs)
(Pdb) s
--Call--
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(335)execute_all_async()
-> def execute_all_async(self, method_name: str, *args, **kwargs):
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(339)execute_all_async()
-> length = len(self._workers)
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(340)execute_all_async()
-> if all(isinstance(arg, list) for arg in args) and all(isinstance(kwarg, list) for kwarg in kwargs.values()):
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(341)execute_all_async()
-> if all(len(arg) == length for arg in args) and all(len(kwarg) == length for kwarg in kwargs.values()):
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(343)execute_all_async()
-> result = []
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(344)execute_all_async()
-> for i in range(length):
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(345)execute_all_async()
-> sliced_args = tuple(arg[i] for arg in args)
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(346)execute_all_async()
-> sliced_kwargs = {k: v[i] for k, v in kwargs.items()}
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(347)execute_all_async()
-> remote_call = getattr(self._workers[i], method_name)
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(348)execute_all_async()
-> result.append(remote_call.remote(*sliced_args, **sliced_kwargs))
(Pdb) remote
Continuing pdb session in different process...
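In case it helps narrow things down, the workaround I was considering is to skip the pdb remote command entirely and place a breakpoint() directly inside the worker-side update_actor implementation, so it registers as its own entry under ray debug. A sketch of what I mean (the class and file names are my guess at the worker layout, not verified):

# Hypothetical sketch only: the actual worker class/file may differ (I assume something
# like an FSDP actor worker under verl/workers/). Instead of stepping across the
# remote call from the driver, the breakpoint lives in the worker process itself.
class ActorRolloutRefWorker:          # hypothetical name for the FSDP actor worker
    def update_actor(self, data):
        breakpoint()                  # shows up as a separate breakpoint in ray debug
        ...                           # original update logic unchanged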
Could you give me some suggestions on how to tune the hyperparameters for 24 GB GPUs, and how to fix the debugger hang?
Thanks!