This repository contains data and code for using the OPTiCAL positional reasoning benchmark for VLMs. Models are asked which shape is furthest in a given direction in each image, and the primary performance metric is accuracy.
To start, get the dataset used in our paper from OSF or generate your own data using the Shape Maker. See the Accessing Data and Shape Maker sections for details.
With the data in hand, label it using the data labeler. See the Labeling Data section for details.
Finally, use the run_models.sh script to run inference on all models, or the call_model.py script to run it on an individual model. See the Running Inference section for details.
Setup has two steps.
First, copy the env file provided with this repository to the hidden name .env. On Linux, run
cp env .env
Second, create a mamba environment for each of the models used in our experiment. For each file in the dependencies directory, run
mamba create -f <env>
replacing <env> with the appropriate yml file.
Users are free to download the data (n=30,000) used in our paper from OSF, to generate novel samples with the Shape Maker script available in this repo, or to use the sample data (n=300) included in the repository.
To use the original data, place it in the data/imgs directory, then continue to the Labeling Data section.
Please adjust the N_IMG and N_IMG_GEN parameters in the env file to the correct size for your dataset.
Note that the data on OSF are zipped using the Linux zip utility. Please unzip the data using unzip before attempting to use it.
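As a quick sanity check that N_IMG matches the data on disk, the following minimal sketch compares the value in the env file with the number of images found. It assumes the env file uses plain KEY=VALUE lines, that the images are PNG files, and that they sit directly inside data/imgs. The helper names (read_env, count_images, check_n_img) are hypothetical and not part of this repository.

```python
from pathlib import Path


def read_env(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments.

    Assumption: the env file uses this plain format.
    """
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def count_images(img_dir, suffix=".png"):
    """Count files with the given suffix directly inside img_dir."""
    return sum(
        1
        for p in Path(img_dir).iterdir()
        if p.is_file() and p.suffix.lower() == suffix
    )


def check_n_img(env_path=".env", img_dir="data/imgs"):
    """Return (configured, found) so that a mismatch is easy to spot."""
    configured = int(read_env(env_path)["N_IMG"])
    found = count_images(img_dir)
    return configured, found
```

Calling check_n_img() from the repository root returns a pair such as (configured, found). If the two numbers differ, adjust N_IMG in the env file.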
As an alternative to downloading the original data, this repository contains a Python script that generates a customizable grid of geometric shapes using matplotlib. Each shape can be assigned a distinct color. The resulting plot is either rendered as a PNG or shown in a GUI window.
- Supports the following shapes: circle, square, triangle, rectangle, pentagon
- Directional wedges (partial circles): upper_wedge, lower_wedge, left_wedge, right_wedge
- Custom colors for each shape
- Easily scalable grid layout
- CLI interface for integration with other tools or batch generation
- Blank spaces using the keyword none
python shape_maker.py \
  --shapes "circle,square,triangle;none,rectangle,pentagon" \
  --colors "red,green,blue;none,orange,purple"
Note that you will not be able to see the plot unless you include plt.show() at the end of the script. Also, the parameter N_IMG_GEN in the env file controls the number of shapes generated.
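To illustrate how the --shapes and --colors strings in the example line up, here is a minimal sketch of one way such a grid specification could be parsed. It assumes, based only on the example command, that ";" separates rows and "," separates cells within a row. The function names are hypothetical, and this is not necessarily how shape_maker.py does it internally.

```python
def parse_grid(spec):
    """Split a spec like 'a,b;c,d' into rows (';') and cells (',')."""
    return [[cell.strip() for cell in row.split(",")] for row in spec.split(";")]


def pair_shapes_with_colors(shapes, colors):
    """Pair each shape with the color at the same grid position.

    Raises ValueError if the two specs do not describe the same layout.
    """
    shape_rows = parse_grid(shapes)
    color_rows = parse_grid(colors)
    if [len(r) for r in shape_rows] != [len(r) for r in color_rows]:
        raise ValueError("shapes and colors must have the same grid layout")
    return [list(zip(s_row, c_row)) for s_row, c_row in zip(shape_rows, color_rows)]


grid = pair_shapes_with_colors(
    "circle,square,triangle;none,rectangle,pentagon",
    "red,green,blue;none,orange,purple",
)
for row in grid:
    print(row)
```

Under this reading, the example describes a grid of two rows and three columns, with a blank cell at the start of the second row.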
When the data change, whether by downloading them from OSF or by generating new samples with the Shape Maker, they must be relabeled. Otherwise, inference results will be scored against labels from the previous data samples and will be meaningless.
To relabel the data, place it in a subdirectory inside data and adjust the IMG_DIR parameter to match the location. Run
python src/relabel_data.py
Once the data are in place and labeled, reproduce the experiment in our paper by running the run_models.sh script.
chmod +x run_models.sh
./run_models.sh
To run inference with an individual model, activate the appropriate mamba environment and run the call_model.py Python script on its own.
mamba activate <model_env>
python src/call_model.py -m <model>
Replace <model> and <model_env> with the entries from the table above that correspond to the desired model's HF tag.
Results are written to files in the data directory named <model>.tsv, <model>_counts.tsv, and <model>_metrics.tsv. These contain, respectively, the raw results, the number of questions for each category of shape and direction, and the metrics (accuracy) for each direction and shape.
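For readers who want to recompute per-category accuracy from the raw results, the sketch below shows the general idea. The column names used here (direction, label, prediction) are hypothetical. Check the header of <model>.tsv and substitute the actual names.

```python
import csv
from collections import defaultdict


def load_tsv(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh, delimiter="\t"))


def accuracy_by(rows, key, label_col="label", pred_col="prediction"):
    """Accuracy (correct / total) for each distinct value of `key`.

    Column names are assumptions and may differ in the real files.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for row in rows:
        group = row[key]
        totals[group] += 1
        if row[pred_col] == row[label_col]:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}
```

For example, accuracy_by(load_tsv("data/<model>.tsv"), "direction") would return a mapping from each direction to its accuracy, assuming those columns exist.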