
VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments

Website | Paper | Dataset

📝 Overview

[Figure: VS-Bench overview]

We introduce VS-Bench, a multimodal benchmark that evaluates VLMs for strategic abilities in multi-agent environments. VS-Bench comprises ten vision-grounded environments that cover cooperative, competitive, and mixed-motive interactions. The performance of VLM agents is evaluated across three dimensions: perception measured by element recognition accuracy; strategic reasoning measured by next-action prediction accuracy; and decision-making measured by normalized episode return. Extensive experiments on fifteen leading VLMs show that, although current models exhibit strong perception abilities, there remains a significant gap to optimal performance in reasoning and decision-making, with the best-performing model attaining 46.6% prediction accuracy and 31.4% normalized return.

📦 Installation

Set up a Conda environment:

conda create -n vs-bench python=3.10 -y
conda activate vs-bench
pip install -r requirements.txt

⚡ Quickstart

To run a minimal example, first set the OPENAI_API_KEY environment variable using your own OpenAI API key:

export OPENAI_API_KEY=<your_api_key>

Next, you can run the following command to evaluate the decision-making ability of GPT-4.1 in the Tic-Tac-Toe environment:

python main.py --eval decision-making --exp tic_tac_toe

The results of this experiment, including the episode returns, images of each step in the match, and GPT-4.1's responses, will be saved in the ./results/decision-making directory.
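If you want to inspect these artifacts programmatically, a minimal sketch like the one below simply lists every file under the results directory; the exact file layout inside ./results/decision-making is an assumption and may differ between environments and runs.

# Minimal sketch for browsing saved results (returns, step images, responses).
# The exact layout under ./results/decision-making is not fixed here and may differ.
import os

results_dir = "./results/decision-making"
for root, _, files in os.walk(results_dir):
    for name in sorted(files):
        print(os.path.join(root, name))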

🚀 Experiments

Our evaluation considers three dimensions: perception, strategic reasoning, and decision-making.

Perception

We provide 400 samples for each environment to evaluate the perception capability of VLMs. You can download the VS-Bench dataset from Hugging Face and place it in the ./data/ directory. Note that the perception folder is specifically used for testing perception.
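As a sketch of the download step, assuming the dataset is hosted under a Hugging Face dataset repository id such as zelaix/VS-Bench (the actual id is listed on the dataset page), huggingface_hub can fetch it into ./data/:

# Sketch: fetch the VS-Bench dataset into ./data/ with huggingface_hub.
# The repo_id below is an assumption; use the id shown on the Hugging Face dataset page.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="zelaix/VS-Bench",  # assumed dataset id
    repo_type="dataset",
    local_dir="./data",
)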

Next, run the following command to evaluate perception:

python main.py --eval perception --exp <exp_name>

Replace <exp_name> with one of the environment names provided in the ./configs/env_configs directory.

Strategic Reasoning

We provide 400 samples for each environment to evaluate the strategic reasoning capability of VLMs. You can download the VS-Bench dataset from Hugging Face and place it in the ./data/ directory. Note that the reasoning and text_reasoning (without visual information) folders are specifically used for testing strategic reasoning.

Next, run the following command to evaluate strategic reasoning:

python main.py --eval strategic-reasoning --exp <exp_name>

Replace <exp_name> with one of the environment names provided in the ./configs/env_configs directory.

Decision-Making

To evaluate decision-making ability, run the following command:

python main.py --eval decision-making --exp <exp_name>

Replace <exp_name> with one of the experiment names provided in the ./configs/exp_configs directory.

The default configuration file for each <exp_name> is located at ./configs/exp_configs/<exp_name>.yaml. Below is the configuration file for Tic-Tac-Toe:

experiment:
  name: default
  seed: 0
  async_mode: true
  num_episodes: 10
  results_dir: results

environment: tic_tac_toe

agents:
  - type: prompt_agent
    params:
      model: gpt-4.1
      visual_obs: true

  - type: mcts_agent

By default, the VLM is set to GPT-4.1. To use a different VLM, change the model parameter in the configuration file. All available VLMs can be found in the ./configs/model_configs/ directory.
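If you prefer to switch models programmatically, a minimal sketch (assuming PyYAML and the config keys shown above) is to rewrite the experiment config before launching main.py; the model name used here is a placeholder for any entry in ./configs/model_configs/:

# Sketch: load the Tic-Tac-Toe experiment config, swap the VLM, and save a new config.
# "gpt-4o" is a placeholder model name; pick any model listed in ./configs/model_configs/.
import yaml

with open("./configs/exp_configs/tic_tac_toe.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["agents"][0]["params"]["model"] = "gpt-4o"  # placeholder model name

with open("./configs/exp_configs/tic_tac_toe_custom.yaml", "w") as f:
    yaml.dump(cfg, f, sort_keys=False)

After saving, pass the new file name as the experiment, e.g. python main.py --eval decision-making --exp tic_tac_toe_custom (assuming main.py resolves <exp_name> to ./configs/exp_configs/<exp_name>.yaml, as the default configs suggest).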

We offer two different VLM agent types:

  • prompt_agent (the VLM directly outputs an action)
  • cot_agent (the VLM reasons step by step before outputting an action)

Additionally, to compare VLM performance with traditional algorithms, we provide three baseline agents:

  • random_agent
  • mcts_agent (for board games)
  • cfr_agent (for card games)
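To compare a VLM agent against these baselines across several environments, one option, sketched below using only the documented CLI, is to loop over experiment configs; experiment names other than tic_tac_toe are placeholders for files in ./configs/exp_configs/.

# Sketch: run the decision-making evaluation for several experiment configs in sequence.
# Replace the placeholder list entries with actual files from ./configs/exp_configs/.
import subprocess

experiments = ["tic_tac_toe"]  # add further experiment names here

for exp in experiments:
    subprocess.run(
        ["python", "main.py", "--eval", "decision-making", "--exp", exp],
        check=True,
    )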

Human Evaluation

We provide complete scripts for evaluating human-level performance by allowing human players to directly participate in the game.
For single-player experiments, the game can be launched on a single computer. For multi-player settings (two or more players), we recommend using the same number of computers as players. All computers should be connected to a shared directory, with one machine acting as the host and the others as clients.
In addition to running its own client process, the host must also launch the main program, which is responsible for transmitting game information to all clients.

YAML Configuration

First, set the user_terminal_path to the shared directory where each player will read the latest game state and related information.
Next, configure the corresponding game YAML file to use human agents and synchronous mode. Specifically, set async_mode to false and specify human_agent as the agent type. For example:

experiment:
  name: default
  async_mode: false
  results_dir: results_human

user_terminal_path: /YOUR/SHARE/DIRECTORY

environment:
  - simple_push:
      num_episodes: 5
      seed: 1
      agents:
        - type: "human_agent:0"        
        - type: "builtin_agent"

Multiplayer Setup

Assume there are two players: player0 and player1.

On one player's machine, open two terminal windows:

In the first terminal, launch the main program for the human evaluation:

python main.py --eval human-hci --exp human

In the second terminal, run:

python user.py --player 0

On the other player's machine, run:

python user.py --player 1

📚 Citation

@article{xu2025vs,
  title={VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments},
  author={Xu, Zelai and Xu, Zhexuan and Yi, Xiangmin and Yuan, Huining and Chen, Xinlei and Wu, Yi and Yu, Chao and Wang, Yu},
  journal={coming soon},
  year={2025}
}
