Commit 0eb50ec

[rollout] fix: resolve agent loop config path in multi-node Ray training (volcengine#4029)
### What does this PR do?

Fixes agent loop configuration file path resolution in multi-node Ray training environments.

**Problem:** When running multi-node training, relative paths to agent loop config files fail on remote worker nodes with `FileNotFoundError` because the working directory differs across nodes.

**Solution:** Updated `resolve_config_path()` to resolve relative paths against the verl package installation location, so that resolution no longer depends on the directory the process was started from.

Related issue: Agent loop config files cannot be loaded in multi-node setups.

### Test

**Testing approach:**

- Tested with a 2-node Ray cluster (4 GPUs total)
- Configuration: `recipe/langgraph_agent/example/agent.yaml`
- Results: Config file successfully resolved on all remote nodes

**Before fix:**

```shell
FileNotFoundError: [Errno 2] No such file or directory: '/dfs/data/recipe/langgraph_agent/example/agent.yaml'
```

**After fix:**

```shell
[DEBUG] Found file at verl base path: /dfs/data/work/verl/recipe/langgraph_agent/example/agent.yaml
Training proceeds successfully
```

### API and Usage Example

**No API changes.** The fix is internal to the `resolve_config_path()` helper function. Users continue to use relative paths in config as before:

```yaml
rollout:
  agent:
    agent_loop_config_path: "recipe/langgraph_agent/example/agent.yaml"
```

The path resolution now works correctly across all nodes.

### Design & Code Changes

__Files Changed:__ `verl/experimental/agent_loop/agent_loop.py` (call site), `verl/experimental/agent_loop/utils.py` (helper)

__Function:__ `resolve_config_path()`

__Key Changes:__

1. Removed hardcoded path fallbacks
2. Added dynamic path resolution using `verl.__file__` to locate project root
3. Improved error messages with `FileNotFoundError`

__Resolution Strategy:__

1. If absolute path → return as-is
2. Try current working directory
3. Try relative to verl package installation (e.g., `/path/to/verl/recipe/...`)
4. Raise clear error if not found

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
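
The failure mode described under **Problem** can be reproduced without Ray. The sketch below is illustrative only: the two temporary directories are hypothetical stand-ins for the working directories of a driver and a remote worker, and `exists_from` is a helper invented for this example. It shows that the same relative path string is found from one working directory and not from the other.

```python
import os
import tempfile

# Same relative path the PR description uses in its example config.
REL_PATH = os.path.join("recipe", "langgraph_agent", "example", "agent.yaml")


def exists_from(cwd: str, rel_path: str) -> bool:
    """Check whether rel_path exists when the process runs with cwd as its working directory."""
    previous = os.getcwd()
    try:
        os.chdir(cwd)
        return os.path.exists(rel_path)
    finally:
        os.chdir(previous)  # always restore the original working directory


with tempfile.TemporaryDirectory() as driver_dir, tempfile.TemporaryDirectory() as worker_dir:
    # Only the stand-in "driver" directory contains the config file.
    target = os.path.join(driver_dir, REL_PATH)
    os.makedirs(os.path.dirname(target))
    with open(target, "w") as f:
        f.write("[]\n")

    found_on_driver = exists_from(driver_dir, REL_PATH)
    found_on_worker = exists_from(worker_dir, REL_PATH)

print(found_on_driver, found_on_worker)
```

Resolving the path against a location that is the same on every node, such as the installed package directory, removes this dependence on the working directory.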
1 parent 6d6ccb0 commit 0eb50ec

2 files changed: +62 -1 lines changed

verl/experimental/agent_loop/agent_loop.py

Lines changed: 3 additions & 1 deletion

    @@ -29,6 +29,7 @@
     from tensordict import TensorDict
     from transformers import AutoProcessor, AutoTokenizer
     
    +from verl.experimental.agent_loop.utils import resolve_config_path
     from verl.experimental.reward import RewardManagerWorker
     from verl.protocol import DataProto
     from verl.single_controller.ray.base import RayWorkerGroup
    @@ -281,7 +282,8 @@ def __init__(
     
             agent_loop_config_path = config.actor_rollout_ref.rollout.agent.agent_loop_config_path
             if agent_loop_config_path:
    -            agent_loop_configs = OmegaConf.load(agent_loop_config_path)
    +            resolved_path = resolve_config_path(agent_loop_config_path)
    +            agent_loop_configs = OmegaConf.load(resolved_path)
                 for agent_loop_config in agent_loop_configs:
                     _agent_loop_registry[agent_loop_config.name] = agent_loop_config
             if self.config.actor_rollout_ref.model.get("custom_chat_template", None) is not None:

verl/experimental/agent_loop/utils.py

Lines changed: 59 additions & 0 deletions

    @@ -12,6 +12,65 @@
     # See the License for the specific language governing permissions and
     # limitations under the License.
     
    +import os
    +
    +
    +def resolve_config_path(config_path: str) -> str:
    +    """Resolve agent loop configuration file path.
    +
    +    In multi-node Ray training, relative paths may not resolve correctly
    +    because the working directory on remote nodes can differ from the driver node.
    +    This function resolves relative paths by checking multiple locations in order:
    +    1. If already absolute, return as-is
    +    2. Try current working directory
    +    3. Try relative to verl package installation (project root)
    +
    +    Args:
    +        config_path: Configuration file path (relative or absolute)
    +
    +    Returns:
    +        Absolute path to the configuration file
    +
    +    Raises:
    +        FileNotFoundError: If the configuration file cannot be found
    +    """
    +    # Return absolute paths unchanged
    +    if os.path.isabs(config_path):
    +        return config_path
    +
    +    # Try current working directory first
    +    cwd = os.path.abspath(os.getcwd())
    +    cwd_path = os.path.abspath(os.path.join(cwd, config_path))
    +    if (cwd_path == cwd or cwd_path.startswith(cwd + os.sep)) and os.path.exists(cwd_path):
    +        return cwd_path
    +
    +    # Try relative to verl project root (where verl package is installed)
    +    try:
    +        import verl
    +
    +        verl_package_dir = os.path.abspath(os.path.dirname(verl.__file__))
    +
    +        # Strategy 1: For development/editable installs.
    +        project_root = os.path.dirname(verl_package_dir)
    +        dev_path = os.path.abspath(os.path.join(project_root, config_path))
    +        if (dev_path == project_root or dev_path.startswith(project_root + os.sep)) and os.path.exists(dev_path):
    +            return dev_path
    +
    +        # Strategy 2: For standard package installations.
    +        install_path = os.path.abspath(os.path.join(verl_package_dir, config_path))
    +        if (install_path == verl_package_dir or install_path.startswith(verl_package_dir + os.sep)) and os.path.exists(
    +            install_path
    +        ):
    +            return install_path
    +    except (ImportError, AttributeError):
    +        pass  # verl not installed or __file__ not available
    +
    +    # File not found - raise clear error
    +    raise FileNotFoundError(
    +        f"Agent loop configuration file not found: {config_path}. Tried current directory and verl project root."
    +    )
    +
    +
     # tokenizer.apply_chat_template is not working properly for gpt-oss model.
     # Because the chat template requires tool call messages to parse tool response messages
     # so we need to format the tool response manually.
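
The checklist notes that no unit test accompanies this change. As a rough illustration of how the lookup order and the containment check in `resolve_config_path()` could be exercised, the sketch below re-implements the same idea with the base directories passed in explicitly. `is_within` and `resolve_against` are hypothetical names introduced only for this example and are not part of verl; the real function derives its bases from `os.getcwd()` and `verl.__file__`.

```python
import os
import tempfile
from typing import Iterable


def is_within(base: str, candidate: str) -> bool:
    """True if candidate is base itself or lies underneath it (both absolute, normalized)."""
    return candidate == base or candidate.startswith(base + os.sep)


def resolve_against(config_path: str, base_dirs: Iterable[str]) -> str:
    """Return the first existing match for config_path under base_dirs, tried in order."""
    if os.path.isabs(config_path):
        return config_path  # absolute paths are returned unchanged
    for base in base_dirs:
        base = os.path.abspath(base)
        candidate = os.path.abspath(os.path.join(base, config_path))
        # Accept only candidates that stay inside the base directory.
        if is_within(base, candidate) and os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(f"Configuration file not found: {config_path}")


with tempfile.TemporaryDirectory() as tmp:
    tmp = os.path.realpath(tmp)
    empty_dir = os.path.join(tmp, "cwd")  # stand-in for a worker's working directory
    root = os.path.join(tmp, "project")   # stand-in for the project root
    os.makedirs(empty_dir)
    os.makedirs(os.path.join(root, "recipe"))
    cfg = os.path.join(root, "recipe", "agent.yaml")
    open(cfg, "w").close()

    # Not found under the first base, found under the second.
    resolved = resolve_against(os.path.join("recipe", "agent.yaml"), [empty_dir, root])

    # A path that climbs out of its base is rejected even though the file exists.
    try:
        resolve_against(os.path.join("..", "project", "recipe", "agent.yaml"), [empty_dir])
        escaped = True
    except FileNotFoundError:
        escaped = False

print(resolved == cfg, escaped)
```

Appending `os.sep` before the prefix comparison is what keeps a sibling directory such as `/a/bc` from being treated as inside `/a/b`. One edge case of this style of check: when the base is the filesystem root, `base + os.sep` becomes `//`, so the prefix test never matches. `os.path.commonpath` is a possible alternative if that case matters.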
