We introduce VS-Bench, a multimodal benchmark that evaluates the strategic abilities of VLMs in multi-agent environments. VS-Bench comprises ten vision-grounded environments covering cooperative, competitive, and mixed-motive interactions. VLM agents are evaluated along three dimensions: perception, measured by element recognition accuracy; strategic reasoning, measured by next-action prediction accuracy; and decision-making, measured by normalized episode return. Extensive experiments on fifteen leading VLMs show that, although current models exhibit strong perception abilities, a significant gap to optimal performance remains in reasoning and decision-making, with the best-performing model attaining only 46.6% prediction accuracy and 31.4% normalized return.
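As a note on the metrics, a common way to normalize episode returns is min-max scaling between a weak baseline and the optimal return. The minimal sketch below illustrates that idea; the choice of baselines and the exact formula VS-Bench uses are assumptions here, not taken from the paper.

def normalized_return(episode_return, random_return, optimal_return):
    # Illustrative min-max normalization: 0.0 matches a random policy,
    # 1.0 matches optimal play. NOTE: the baselines and formula are
    # assumptions for illustration, not VS-Bench's documented definition.
    return (episode_return - random_return) / (optimal_return - random_return)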
Set up a Conda environment:
conda create -n vs-bench python=3.10 -y
conda activate vs-bench
pip install -r requirements.txt

To run a minimal example, first set the OPENAI_API_KEY environment variable using your own OpenAI API key:
export OPENAI_API_KEY=<your_api_key>

Next, you can run the following command to evaluate the decision-making ability of GPT-4.1 in the Tic-Tac-Toe environment:
python main.py --eval decision-making --exp tic_tac_toe

The results of this experiment, including the episode returns, images of each step in the match, and GPT-4.1's responses, will be saved in the ./results/decision-making directory.
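To inspect the saved results programmatically, a minimal sketch like the following may help. The file names and JSON fields inside ./results/decision-making are assumptions, so adjust the glob pattern and keys to what your run actually produces.

import json
from pathlib import Path

# Hypothetical layout: one JSON record per episode with a "return" field.
# Adjust the pattern and keys to match the actual output of main.py.
results_dir = Path("./results/decision-making")
returns = []
for path in sorted(results_dir.rglob("*.json")):
    with open(path) as f:
        record = json.load(f)
    if isinstance(record, dict) and "return" in record:
        returns.append(record["return"])

if returns:
    print(f"{len(returns)} episodes, mean return = {sum(returns) / len(returns):.3f}")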
Our evaluation considers three dimensions: perception, strategic reasoning, and decision-making.
We provide 400 samples for each environment to evaluate the perception capability of VLMs. You can download the VS-Bench dataset from Hugging Face and place it in the ./data/ directory. Note that the perception folder is specifically used for testing perception.
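To sanity-check the downloaded data, a sketch like the following iterates over the perception samples and counts them per environment. The directory layout under ./data/perception and the per-sample file format are assumptions; adapt them to the actual dataset structure.

from pathlib import Path

# Hypothetical layout: ./data/perception/<env_name>/ holding one file per
# sample. Adjust the path and pattern to the actual dataset structure.
perception_dir = Path("./data/perception")
for env_dir in sorted(p for p in perception_dir.iterdir() if p.is_dir()):
    num_samples = len(list(env_dir.glob("*")))
    print(f"{env_dir.name}: {num_samples} samples")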
Next, run the following command to evaluate perception:
python main.py --eval perception --exp <exp_name>

Replace <exp_name> with one of the environment names provided in the ./configs/env_configs directory.
Similarly, we provide 400 samples for each environment to evaluate the strategic reasoning capability of VLMs. Note that the reasoning and text_reasoning (without visual information) folders of the VS-Bench dataset are specifically used for testing strategic reasoning.
Next, run the following command to evaluate strategic reasoning:
python main.py --eval strategic-reasoning --exp <exp_name>

Replace <exp_name> with one of the environment names provided in the ./configs/env_configs directory.
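Next-action prediction accuracy over these samples could then be summarized with a short script such as the one below. The results path and the record fields (predicted_action, ground_truth) are hypothetical names for illustration; match them to the actual output format.

import json
from pathlib import Path

# Hypothetical record fields; adjust to the actual output format.
def prediction_accuracy(result_files):
    correct = total = 0
    for path in result_files:
        with open(path) as f:
            record = json.load(f)
        correct += record["predicted_action"] == record["ground_truth"]
        total += 1
    return correct / total if total else 0.0

files = sorted(Path("./results/strategic-reasoning").rglob("*.json"))
print(f"Next-action prediction accuracy: {prediction_accuracy(files):.1%}")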
To evaluate decision-making ability, run the following command:
python main.py --eval decision-making --exp <exp_name>

Replace <exp_name> with one of the experiment names provided in the ./configs/exp_configs directory.
The default configuration file for each <exp_name> is located at ./configs/exp_configs/<exp_name>.yaml. Below is the configuration file for Tic-Tac-Toe:
experiment:
  name: default
  seed: 0
  async_mode: true
  num_episodes: 10
  results_dir: results
environment: tic_tac_toe
agents:
  - type: prompt_agent
    params:
      model: gpt-4.1
      visual_obs: true
  - type: mcts_agent

By default, the VLM is set to GPT-4.1. To use a different VLM, change the model parameter in the configuration file. All available VLMs can be found in the ./configs/model_configs/ directory.
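For scripted sweeps over several models, one option is to patch the configuration file before each run. The sketch below uses PyYAML and assumes the config layout shown above; the second model name in the list is a hypothetical placeholder.

import subprocess
import yaml

config_path = "./configs/exp_configs/tic_tac_toe.yaml"
for model in ["gpt-4.1", "another-model"]:  # hypothetical model list
    with open(config_path) as f:
        config = yaml.safe_load(f)
    # Assumes the first agent is the prompt_agent with a params.model key,
    # as in the config shown above.
    config["agents"][0]["params"]["model"] = model
    with open(config_path, "w") as f:
        yaml.safe_dump(config, f)
    subprocess.run(["python", "main.py", "--eval", "decision-making",
                    "--exp", "tic_tac_toe"], check=True)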
We offer two different VLM agent types:
- prompt_agent (the VLM only outputs the action)
- cot_agent (the VLM thinks step by step)
Additionally, to compare VLM performance with traditional algorithms, we provide three baseline agents:
- random_agent
- mcts_agent (for board games)
- cfr_agent (for card games)
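All agents presumably share a common action interface; the sketch below shows what a minimal random baseline might look like. The class name and the act(observation, legal_actions) signature are assumptions for illustration, not the repository's actual API.

import random

class RandomAgent:
    # Illustrative baseline that picks a uniformly random legal action.
    # NOTE: this signature is an assumption; check the repository for
    # the actual agent interface.
    def act(self, observation, legal_actions):
        return random.choice(legal_actions)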
We provide complete scripts for evaluating human-level performance by allowing human players to directly participate in the game.
For single-player experiments, the game can be launched on a single computer. For multi-player settings (two or more players), we recommend using the same number of computers as players. All computers should be connected to a shared directory, with one machine acting as the host and the others as clients.
In addition to the client processes, the host must also launch an extra main process that transmits game information to all clients.
First, set the user_terminal_path to the shared directory where each player will read the latest game state and related information.
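Conceptually, the host and clients communicate through files in this shared directory. The sketch below illustrates the idea with a client loop that polls for the latest state and writes back an action; the file names and formats are assumptions for illustration, not the actual protocol implemented in user.py.

import json
import time
from pathlib import Path

# Hypothetical protocol: the host writes state_<player>.json, and the
# client answers with action_<player>.json. user.py may differ.
shared_dir = Path("/YOUR/SHARE/DIRECTORY")
player = 0

while True:
    state_file = shared_dir / f"state_{player}.json"
    if state_file.exists():
        state = json.loads(state_file.read_text())
        action = input(f"Observation: {state}\nYour action: ")
        (shared_dir / f"action_{player}.json").write_text(json.dumps(action))
        state_file.unlink()  # consume the state so we don't act twice
    time.sleep(0.5)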
Next, configure the corresponding game YAML file to use human agents and synchronous mode. Specifically, set async_mode to false and specify human_agent as the agent type. For example:
experiment:
  name: default
  async_mode: false
  results_dir: results_human
  user_terminal_path: /YOUR/SHARE/DIRECTORY
environment:
  - simple_push:
      num_episodes: 5
      seed: 1
agents:
  - type: "human_agent:0"
  - type: "builtin_agent"

Assume there are two players: player0 and player1.
On one player's machine, open two terminal windows:
In the first terminal, run the decision-making evaluation:
python main.py --eval human-hci --exp human

In the second terminal, run:
python user.py --player 0

On the other player's machine, run:
python user.py --player 1

If you find VS-Bench useful, please cite:

@article{xu2025vs,
title={VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments},
author={Xu, Zelai and Xu, Zhexuan and Yi, Xiangmin and Yuan, Huining and Chen, Xinlei and Wu, Yi and Yu, Chao and Wang, Yu},
journal={coming soon},
year={2025}
}