This repository contains the code for the PRD-MAPPO algorithm presented in the paper "Assigning Credit with Partial Reward Decoupling in Multi-Agent Proximal Policy Optimization". It is configured to run with the following environments: MPE, Pressure Plate (PP), and MA-GYM.

Note: the environments listed above are customised, so use the environment directories provided in this repository rather than the upstream versions.
- To install MPE, PP, or MA-GYM, `cd` into the root directory and type `pip install -e .` (a dependency version-check sketch follows this list)
- Known dependencies for MPE: Python (3.6+), OpenAI gym (0.10.5), torch (1.10.0+cu102), numpy (1.21.5)
- Known dependencies for PP: Python (3.6+), OpenAI gym (0.23.1), torch (1.11.0+cu102), numpy (1.22.3)
- Known dependencies for MA-GYM: Python (3.6+), OpenAI gym (0.19.0), torch (1.11.0+cu102), numpy (1.22.3)
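
If you want to confirm that the installed packages match the versions above, a minimal sketch along these lines can help. It is not part of the repository (the helper name and version table are illustrative) and it assumes Python 3.8+ for `importlib.metadata`:

```python
# Hypothetical helper (not part of this repository): compares installed
# package versions against the known dependencies listed above.
from importlib.metadata import version, PackageNotFoundError

# Known-good versions from the list above, keyed by environment suite.
KNOWN_GOOD = {
    "MPE":    {"gym": "0.10.5", "torch": "1.10.0+cu102", "numpy": "1.21.5"},
    "PP":     {"gym": "0.23.1", "torch": "1.11.0+cu102", "numpy": "1.22.3"},
    "MA-GYM": {"gym": "0.19.0", "torch": "1.11.0+cu102", "numpy": "1.22.3"},
}

def check(env_suite: str) -> None:
    """Print installed vs. expected versions for one environment suite."""
    for pkg, expected in KNOWN_GOOD[env_suite].items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            installed = "not installed"
        marker = "OK" if installed == expected else "MISMATCH"
        print(f"{pkg}: installed={installed}, expected={expected} [{marker}]")

if __name__ == "__main__":
    check("MPE")
```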
The following parameters can be set in the `main.py` file of each environment:
- `iteration`: seed index (default: 0; options: 0, 1, 2, 3, 4)
- `update_type`: policy update algorithm (default: ppo; options: ppo, a2c)
- `attention_type`: transformer attention mechanism for the critic network (default: soft; options: soft, semi-hard)
- `device`: device to run the code on (default: gpu; options: gpu, cpu)
- `grad_clip_critic`: gradient clip for the critic network (default: 10.0 (MPE) or 0.5 (MA-GYM/PP))
- `grad_clip_actor`: gradient clip for the actor network (default: 10.0 (MPE) or 0.5 (MA-GYM/PP))
- `critic_dir`: directory to save critic network models
- `actor_dir`: directory to save actor network models
- `gif_dir`: directory to save gifs
- `policy_eval_dir`: directory to save policy metrics
- `policy_clip`: clip interval on the probability ratio term in the policy loss; the ratio is clipped into the range [1 - policy_clip, 1 + policy_clip] (default: 0.05); see the PPO-update sketch after this list
- `value_clip`: clip interval on the probability ratio term in the value loss; the ratio is clipped into the range [1 - value_clip, 1 + value_clip] (default: 0.05)
- `n_epochs`: number of epochs to train the policy and critic networks (default: 5)
- `env`: environment name
- `value_lr`: critic learning rate (default: 1e-3 (Crossing), 3e-4 (Combat), 7e-4 (Pressure Plate), or 5e-5 (Traffic Junction))
- `policy_lr`: actor learning rate (default: 7e-4 (Crossing), 3e-4 (Combat), 7e-4 (Pressure Plate), or 5e-5 (Traffic Junction))
- `entropy_pen`: entropy penalty (default: 0.0 (Crossing), 8e-3 (Combat), 0.4 (Pressure Plate), or 0.0 (Traffic Junction))
- `gamma`: discount factor (default: 0.99)
- `gae_lambda`: lambda factor for Generalized Advantage Estimation (default: 0.95)
- `lambda`: lambda factor for computing TD-lambda targets (default: 0.95)
- `select_above_threshold`: weight threshold used to identify the relevant set (default: 0.05 (Crossing), 0.2 (Combat), 0.05 (Pressure Plate), or 0.2 (Traffic Junction)); see the relevant-set sketch after this list
- `gif`: enable rendering of gifs
- `gif_checkpoint`: number of episodes after which a gif is rendered (default: 1)
- `load_models`: enable to load critic and actor models
- `model_path_value`: critic model path
- `model_path_policy`: actor model path
- `eval_policy`: enable to capture policy evaluation metrics
- `save_model`: enable to save critic and actor models
- `save_model_checkpoint`: save the models after every `save_model_checkpoint` episodes
- `save_comet_ml_plot`: enable to record data on Comet
- `learn`: enable updating of the critic and actor networks
- `max_episodes`: total number of episodes (default: 80K (Crossing), 120K (Combat), 20K (Pressure Plate), or 20K (Traffic Junction))
- `max_time_steps`: number of timesteps per episode (default: 50 (Crossing), 40 (Combat), 70 (Pressure Plate), or 40 (Traffic Junction))
- `experiment_type`: type of update (default: prd; options: prd, shared (fully cooperative))
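
For intuition on how `gamma`, `gae_lambda`, `policy_clip`, and `entropy_pen` interact, here is a minimal single-agent PPO-style sketch in PyTorch. It is illustrative only and is not the code in `agent.py`; the actual implementation handles multiple agents, masking, and the value-side clipping described above.

```python
# Illustrative single-agent PPO-style update (not the repository's agent.py).
import torch

def gae_advantages(rewards, values, next_value, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over one rollout of length T."""
    T = rewards.shape[0]
    values_ext = torch.cat([values, next_value.view(1)])
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        gae = delta + gamma * gae_lambda * gae
        advantages[t] = gae
    returns = advantages + values   # lambda-weighted value targets
    return advantages, returns

def clipped_policy_loss(new_log_probs, old_log_probs, advantages,
                        entropy, policy_clip=0.05, entropy_pen=0.0):
    """Clipped surrogate: ratio kept in [1 - policy_clip, 1 + policy_clip]."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - policy_clip, 1.0 + policy_clip) * advantages
    return -torch.min(surr1, surr2).mean() - entropy_pen * entropy.mean()
```

During training, `n_epochs` such update passes would typically be made over each batch, with gradient norms clipped to `grad_clip_actor` / `grad_clip_critic` before the optimizer step.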
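
`select_above_threshold` determines which teammates end up in an agent's relevant set when `experiment_type` is `prd`. As a rough conceptual sketch only (the actual selection logic lives in `agent.py`, and the attention weights come from the critic network; details may differ), thresholding a per-agent attention-weight matrix yields a mask of relevant agents:

```python
# Conceptual sketch of PRD-style relevant-set selection (not the repository's
# exact implementation). Assumes a critic attention-weight matrix of shape
# [n_agents, n_agents], where attention_weights[i, j] measures how strongly
# agent i's value estimate attends to agent j.
import torch

def relevant_set_mask(attention_weights: torch.Tensor,
                      select_above_threshold: float = 0.05) -> torch.Tensor:
    # Agent j is in agent i's relevant set if its weight exceeds the
    # threshold; an agent is always relevant to itself.
    mask = (attention_weights > select_above_threshold).float()
    mask = mask + torch.eye(attention_weights.shape[0])
    return mask.clamp(max=1.0)

# Example: with experiment_type="prd", only contributions from relevant
# teammates would feed into an agent's policy update; with
# experiment_type="shared" (fully cooperative) every teammate stays in the set.
weights = torch.tensor([[0.70, 0.25, 0.05],
                        [0.10, 0.80, 0.10],
                        [0.02, 0.03, 0.95]])
print(relevant_set_mask(weights, select_above_threshold=0.05))
```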
- `./Agent MA GYM/MA_Controller/Combat/main.py`: contains code for setting parameters of PRD-MAPPO on the MA-GYM Combat environment
- `./Agent MA GYM/MA_Controller/Traffic_Junc/main.py`: contains code for setting parameters of PRD-MAPPO on the MA-GYM Traffic Junction environment
- `./Agent MPE/MA_Controller/main.py`: contains code for setting parameters of PRD-MAPPO on the MPE Crossing environment
- `./Agent Pressure Plate/MA_Controller/main.py`: contains code for setting parameters of PRD-MAPPO on the PP 4 Person Pressure Plate environment
- `./Agent MA GYM/MA_Controller/Combat/agent.py`, `./Agent MA GYM/MA_Controller/Traffic_Junc/agent.py`, `./Agent Pressure Plate/MA_Controller/agent.py`, `./Agent MPE/MA_Controller/agent.py`: core code for the PRD-MAPPO algorithm
- `./Agent MA GYM/MA_Controller/Combat/multiagent.py`, `./Agent MA GYM/MA_Controller/Traffic_Junc/multiagent.py`, `./Agent Pressure Plate/MA_Controller/multiagent.py`, `./Agent MPE/MA_Controller/multiagent.py`: code that deals with environment and agent interaction
- `./Agent MA GYM/MA_Controller/Combat/model.py`, `./Agent MA GYM/MA_Controller/Traffic_Junc/model.py`, `./Agent Pressure Plate/MA_Controller/model.py`, `./Agent MPE/MA_Controller/model.py`: Policy network, Q network, and replay buffer code for PRD-MAPPO