This is the official repository for the paper: "MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale". In the paper, we introduce MedAgentGym, the first publicly available training environment designed to enhance coding-based medical reasoning capabilities in large language model (LLM) agents.
MedAgentGym has been carefully curated in strict accordance with ethical standards, utilizing datasets that are either publicly available or that incorporate rigorous privacy protection and anonymization measures. Table 7 in the Appendix details the specific access requirements for each of the 12 datasets included in MedAgentGym. Researchers seeking access to preprocessed tasks and data files must first obtain and submit all necessary data usage agreements. Access Policy: Only credentialed users who have signed the Data Use Agreement (DUA) are permitted to access these files.
License (for files): PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement: PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required Training: CITI Data or Specimens Only Research.
Please note that the current version excludes the MIMIC-related datasets (MIMIC-III, eICU, TREQS) and the EHRSHOT dataset. Access to data involving MIMIC-III, eICU, and EHRSHOT tasks requires additional approval from PhysioNet and Stanford University. Researchers seeking additional guidance on full access to the preprocessed data can email [email protected] with the subject line "MedAgentGym Preprocessed Data Access".
This repository contains the basic task files train_tasks.jsonl and test_tasks.jsonl; each entry includes the task ID, task description, question, and corresponding ground-truth answer.
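For a quick look at these files, the snippet below is a minimal sketch that loads them with the standard json module; the exact JSON key names are not specified above, so the fields referenced in the comments are assumptions to check against the released schema.

```python
import json

# Read train_tasks.jsonl: one JSON object per line.
with open("train_tasks.jsonl") as f:
    tasks = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(tasks)} training tasks")
# Inspect the keys of the first task; these are expected to cover the task ID,
# task description, question, and ground-truth answer, though the exact key
# names may differ from this illustration.
print(sorted(tasks[0].keys()))
```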
After completing the previous step and obtaining approval for access, applicants will receive a script (download_data.py) to download the entire preprocessed dataset from a private repository. This script will automatically download all datasets into the ./data/ directory. The downloaded datasets should be structured as ./data/biocoder/*. Detailed descriptions of the datasets utilized in this paper are provided below:
Since our benchmark relies on a Docker environment for isolated coding and execution, first build the Docker image. Please execute the following command:
docker buildx build -t ehr_gym:latest .
Alternatively, you can run the prepared script directly:
bash build_docker.sh
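Once the image is built, you can launch a container from it before running experiments. The exact invocation depends on the Dockerfile and your credential setup; the command below is only a hypothetical sketch (the /home/data mount point and the entrypoint path inside the container are assumptions) that mounts the downloaded data and runs the experiment script inside the container:

docker run --rm -it -v "$(pwd)/data:/home/data" ehr_gym:latest bash /home/entrypoint.sh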
Prepare your experiment commands in the entrypoint.sh file. For instance, to run experiments on the Biocoder task using the GPT-4.1-mini model, execute the following command for parallel execution with 5 threads:
python3 /home/main.py --config /home/configs/gpt_4_1_mini/exp-gpt_4_1_mini-biocoder.yaml --async_run --parallel_backend joblib --n_jobs 5
The figure below highlights substantial performance gains from SFT across four OSS backbone LLMs of varying sizes.
The table below compares several post-training methods, revealing that simple SFT over successful trajectories significantly boosts performance on structured coding tasks, demonstrating its effectiveness in capturing structured coding patterns. In addition, DPO is particularly beneficial for optimizing open-ended task performance. Although DPO alone slightly underperforms SFT, combining an initial SFT warm-up with subsequent DPO further improves overall results by leveraging their complementary strengths.
Inference-Time Scaling: The left figure illustrates performance scaling with increased trajectory sampling. Pass@K improves substantially from 17.0% at K = 1 to 45.0% at K = 16, while Best@K advances steadily from 17.0% to 41.7%. The relatively small gap between the two metrics indicates that our trained verifier effectively identifies successful trajectories, highlighting its potential as a reward model for integration into advanced online RL frameworks such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO).
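To make the two metrics concrete, here is a minimal sketch (illustrative only, using hypothetical success flags and verifier scores rather than the repository's actual evaluation code): Pass@K asks whether any of the K sampled trajectories succeeds, while Best@K checks only the single trajectory the verifier ranks highest.

```python
import random

def pass_at_k(successes, k):
    """Pass@K: 1.0 if at least one of the first K trajectories succeeds."""
    return float(any(successes[:k]))

def best_at_k(successes, scores, k):
    """Best@K: success of the single trajectory the verifier scores highest."""
    best_idx = max(range(k), key=lambda i: scores[i])
    return float(successes[best_idx])

# Hypothetical data for one task: 16 sampled trajectories with binary success
# flags and verifier scores in [0, 1]. In practice, both metrics are averaged
# over all tasks in the benchmark.
random.seed(0)
successes = [random.random() < 0.3 for _ in range(16)]
scores = [random.random() for _ in range(16)]

for k in (1, 4, 16):
    print(f"K={k}: Pass@K={pass_at_k(successes, k):.1f}, "
          f"Best@K={best_at_k(successes, scores, k):.1f}")
```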
Training-Time Scaling: The right figure examines agent performance as a function of increased training data volume (25%, 50%, 75%, and 100%) in SFT. We observe consistent performance improvements with greater training data availability, suggesting that additional computational resources dedicated to sampling more trajectories are likely to yield continued performance gains.
@article{xu2025medagentgym,
title={MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale},
author={Xu, Ran and Zhuang, Yuchen and Zhong, Yishan and Yu, Yue and Tang, Xiangru and Wu, Hang and Wang, May D and Ruan, Peifeng and Yang, Donghan and Wang, Tao and Xiao, Guanghua and Yang, Carl and Xie, Yang and Shi, Wenqi},
journal={arXiv preprint arXiv:2506.04405},
year={2025}
}