bombode/VideoRAG

VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos

This is the PyTorch implementation for VideoRAG proposed in this paper:

VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
Xubin Ren*, Lingrui Xu*, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang†

* denotes equal contribution. † denotes corresponding author

In this paper, we propose a retrieval-augmented generation framework specifically designed for processing and understanding extremely long-context videos.

VideoRAG Framework

VideoRAG introduces a dual-channel architecture that combines two components: graph-driven textual knowledge grounding, which models semantic relationships across videos, and hierarchical multimodal context encoding, which preserves spatiotemporal visual patterns. Dynamically constructed knowledge graphs keep semantics coherent across multi-video contexts and allow video input of unbounded length, while adaptive multimodal fusion mechanisms optimize retrieval efficiency.

💻 Efficient Extreme Long-Context Video Processing

  • Uses a single NVIDIA RTX 3090 GPU (24 GB) to comprehend hundreds of hours of video content 💪

🗃️ Structured Video Knowledge Indexing

  • The multi-modal knowledge indexing framework distills hundreds of hours of video into a concise, structured knowledge graph 🗂️

🔍 Multi-Modal Retrieval for Comprehensive Responses

  • The multi-modal retrieval paradigm aligns textual semantics and visual content to identify the most relevant video for comprehensive responses 💬

📚 The Newly Established LongerVideos Benchmark

  • The newly established LongerVideos benchmark features over 160 videos totaling 134+ hours across lectures, documentaries, and entertainment 🎬

Installation

To utilize VideoRAG, please first create a conda environment with the following commands:

conda create --name videorag python=3.11
conda activate videorag

pip install numpy==1.26.4
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2
pip install accelerate==0.30.1
pip install bitsandbytes==0.43.1
pip install moviepy==1.0.3
pip install git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
pip install timm==0.6.7 ftfy regex einops fvcore eva-decord==0.6.1 iopath matplotlib types-regex cartopy
pip install ctranslate2==4.4.0 faster_whisper==1.0.3 neo4j hnswlib xxhash nano-vectordb
pip install transformers==4.37.1
pip install tiktoken openai tenacity

# Install ImageBind using the provided code in this repository, where we have removed the requirements.txt to avoid environment conflicts.
cd ImageBind
pip install .

Then, please download the necessary checkpoints in the repository's root folder for MiniCPM-V, Whisper, and ImageBind as follows:

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install

# minicpm-v
git lfs clone https://huggingface.co/openbmb/MiniCPM-V-2_6-int4

# whisper
git lfs clone https://huggingface.co/Systran/faster-distil-whisper-large-v3

# imagebind
mkdir .checkpoints
cd .checkpoints
wget https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth
cd ../

Your final directory structure after downloading all checkpoints should look like this:

VideoRAG
├── .checkpoints
├── faster-distil-whisper-large-v3
├── ImageBind
├── LICENSE
├── longervideos
├── MiniCPM-V-2_6-int4
├── README.md
├── reproduce
├── videorag
├── VideoRAG_cover.png
└── VideoRAG.png
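
Before moving on, it can help to confirm that the checkpoints landed where the listing above expects them. The following is a minimal sketch, not part of the repository: `missing_entries` is a hypothetical helper, and the required paths are taken from the directory listing and the `wget` command above.

```python
from pathlib import Path

# Entries taken from the directory listing above; the checkpoint file name
# comes from the wget command in the download step.
REQUIRED_DIRS = [
    ".checkpoints",
    "faster-distil-whisper-large-v3",
    "ImageBind",
    "MiniCPM-V-2_6-int4",
    "videorag",
]
REQUIRED_FILES = [".checkpoints/imagebind_huge.pth"]


def missing_entries(root):
    """Return the required paths that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    missing = [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
    missing += [f for f in REQUIRED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    absent = missing_entries(".")
    print("All checkpoints found." if not absent else f"Missing: {absent}")
```

Run it from the repository's root folder; an empty result means every expected folder and the ImageBind checkpoint file are present.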

Quick Start

VideoRAG can extract knowledge from multiple videos and answer queries based on them. Now, try VideoRAG with your own videos 🤗.

Note

Currently, VideoRAG has only been tested in an English environment. To process videos in multiple languages, it is recommended to modify the WhisperModel in asr.py. For more details, please refer to faster-whisper.

First, let VideoRAG extract and index the knowledge from the given videos (a single GPU with 24 GB of memory, such as the RTX 3090, is sufficient):

import os
import logging
import warnings
import multiprocessing

warnings.filterwarnings("ignore")
logging.getLogger("httpx").setLevel(logging.WARNING)

# Please enter your openai key
os.environ["OPENAI_API_KEY"] = ""

from videorag._llm import *
from videorag import VideoRAG, QueryParam


if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')

    # Please enter your video file path in this list; there is no limit on the length.
    # Here is an example; you can use your own videos instead.
    video_paths = [
        'movies/Iron-Man.mp4',
        'movies/Spider-Man.mkv',
    ]
    videorag = VideoRAG(cheap_model_func=gpt_4o_mini_complete, best_model_func=gpt_4o_mini_complete, working_dir="./videorag-workdir")
    videorag.insert_video(video_path_list=video_paths)
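
If your videos sit in one folder, the `video_paths` list can be built automatically instead of typed by hand. This is a minimal sketch under stated assumptions: `collect_video_paths` is a hypothetical helper that is not part of VideoRAG, and the extension set is an assumed list of common container formats.

```python
from pathlib import Path

# Assumed set of container formats; adjust to whatever your setup can decode.
VIDEO_EXTENSIONS = {".mp4", ".mkv", ".avi", ".mov"}


def collect_video_paths(folder):
    """Return sorted paths of video files directly inside `folder` (hypothetical helper)."""
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in VIDEO_EXTENSIONS
    )
```

For the example above, `video_paths = collect_video_paths("movies")` would pick up both `Iron-Man.mp4` and `Spider-Man.mkv`, provided they are stored in `movies/`.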

Then, ask any questions about the videos! Here is an example:

import os
import logging
import warnings
import multiprocessing

warnings.filterwarnings("ignore")
logging.getLogger("httpx").setLevel(logging.WARNING)

# Please enter your openai key
os.environ["OPENAI_API_KEY"] = ""

from videorag._llm import *
from videorag import VideoRAG, QueryParam


if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')

    query = 'What is the relationship between Iron Man and Spider-Man? How do they meet, and how does Iron Man help Spider-Man?'
    param = QueryParam(mode="videorag")
    # If param.wo_reference is False, VideoRAG adds references to video clips in the response.
    param.wo_reference = True

    videorag = VideoRAG(cheap_model_func=gpt_4o_mini_complete, best_model_func=gpt_4o_mini_complete, working_dir="./videorag-workdir")
    videorag.load_caption_model(debug=False)
    response = videorag.query(query=query, param=param)
    print(response)
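
To ask several questions in one run and keep the answers, the query call can be wrapped in a small loop. The sketch below is illustrative only: `answer_queries` is a hypothetical helper, and `ask` stands for any callable that maps a question to a response, so the sketch does not depend on VideoRAG internals.

```python
import json


def answer_queries(ask, queries, out_path):
    """Run each query through `ask` and save query/response pairs as JSON."""
    results = [{"query": q, "response": ask(q)} for q in queries]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    return results
```

With the objects from the example above, a call could look like `answer_queries(lambda q: videorag.query(query=q, param=param), queries, "answers.json")`, where `queries` is your own list of questions.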

Evaluation

LongerVideos

We constructed the LongerVideos benchmark to evaluate the model's performance in comprehending multiple long-context videos and answering open-ended queries. All the videos are open-access videos on YouTube, and we record the URLs of the collections of videos as well as the corresponding queries in the JSON file.

| Video Type | #video list | #video | #query | #avg. queries per list | #overall duration |
|---|---|---|---|---|---|
| Lecture | 12 | 135 | 376 | 31.3 | ~ 64.3 hours |
| Documentary | 5 | 12 | 114 | 22.8 | ~ 28.5 hours |
| Entertainment | 5 | 17 | 112 | 22.4 | ~ 41.9 hours |
| All | 22 | 164 | 602 | 27.4 | ~ 134.6 hours |
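
The "All" row can be checked against the per-type rows with a few lines of arithmetic. This is only a sanity-check sketch; `summarize` is a hypothetical helper and the numbers are copied from the table above. Because the per-type durations are rounded, their sum (134.7) differs slightly from the reported ~134.6 hours.

```python
# Rows copied from the table above:
# (video type, #video lists, #videos, #queries, duration in hours)
ROWS = [
    ("Lecture", 12, 135, 376, 64.3),
    ("Documentary", 5, 12, 114, 28.5),
    ("Entertainment", 5, 17, 112, 41.9),
]


def summarize(rows):
    """Aggregate the per-type rows into the totals reported in the 'All' row."""
    lists = sum(r[1] for r in rows)
    videos = sum(r[2] for r in rows)
    queries = sum(r[3] for r in rows)
    hours = sum(r[4] for r in rows)
    return {
        "lists": lists,
        "videos": videos,
        "queries": queries,
        "avg_queries_per_list": round(queries / lists, 1),
        "hours": round(hours, 1),
    }


if __name__ == "__main__":
    print(summarize(ROWS))
```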

Process LongerVideos with VideoRAG

First, we need to download all the videos in 720p resolution using the yt-dlp library. You can use the following commands to download all the videos in LongerVideos:

cd longervideos
python prepare_videos.py # create collection folders
sh download.sh # downloading videos
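
For a single video outside the provided script, a 720p download can be sketched as follows. This is an assumption-laden illustration, not the content of `download.sh`: `build_download_command` is a hypothetical helper, and the format selector and output template are one reasonable choice among several that yt-dlp accepts.

```python
def build_download_command(url, out_dir):
    """Build a yt-dlp command that caps the video height at 720 pixels."""
    return [
        "yt-dlp",
        # Prefer separate video/audio streams up to 720p, else a combined file.
        "-f", "bestvideo[height<=720]+bestaudio/best[height<=720]",
        # Name the output file after the video title inside out_dir.
        "-o", f"{out_dir}/%(title)s.%(ext)s",
        url,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once yt-dlp is installed; merging separate video and audio streams typically also requires ffmpeg.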

Then, you can run the following example command to process and answer queries for LongerVideos with VideoRAG:

# First, enter your OpenAI key on line 18 of the script
python videorag_experiment.py --collection 4-rag-lecture --cuda 0

Evaluation

We conduct win-rate comparisons with RAG-based baselines and quantitative comparisons with long-context video understanding methods. NaiveRAG, GraphRAG, and LightRAG are implemented with the nano-graphrag library, consistent with our VideoRAG implementation, to ensure a fair comparison.

For this part, we provide the answers from all methods (including VideoRAG) together with the evaluation code so that the experiments can be reproduced. Use the following commands to download the answers:

cd reproduce
wget https://archive.org/download/videorag/all_answers.zip
unzip all_answers

Win-Rate Comparison

We conduct the win-rate comparison with RAG-based baselines. To reproduce the results, please follow these steps:

cd reproduce/winrate_comparison

# First Step: Upload the batch request to OpenAI (remember to enter your key in the file, same for the following steps).
python batch_winrate_eval_upload.py

# Second Step: Download the results. Please enter the batch ID and then the output file ID in the file. Generally, you need to run this twice: first to obtain the output file ID, and then to download it.
python batch_winrate_eval_download.py

# Third Step: Parse the results. Please enter the output file ID in the file.
python batch_winrate_eval_download.py

# Fourth Step: Calculate the results. Please enter the parsed result file name in the file.
python batch_winrate_eval_calculate.py
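
Conceptually, the final calculation step reduces to counting how often each method is judged the winner. The sketch below illustrates that idea only; it is not the logic of `batch_winrate_eval_calculate.py`. `win_rates` is a hypothetical helper, and the input format (one winner label per judged query) is an assumption.

```python
from collections import Counter


def win_rates(judgments):
    """Return each method's share of wins, in percent, over all judged queries."""
    counts = Counter(judgments)
    total = sum(counts.values())
    return {method: round(100 * n / total, 1) for method, n in counts.items()}
```

For instance, with the made-up labels `["method_a", "method_b", "method_a", "method_a"]`, the function returns `{"method_a": 75.0, "method_b": 25.0}`.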

Quantitative Comparison

We conduct a quantitative comparison, which extends the win-rate comparison by assigning a 5-point score to long-context video understanding methods. We use the answers from NaiveRAG as the baseline response for scoring each query. To reproduce the results, please follow these steps:

cd reproduce/quantitative_comparison

# First Step: Upload the batch request to OpenAI (remember to enter your key in the file, same for the following steps).
python batch_winrate_quant_upload.py

# Second Step: Download the results. Please enter the batch ID and then the output file ID in the file. Generally, you need to run this twice: first to obtain the output file ID, and then to download it.
python batch_winrate_quant_download.py

# Third Step: Parse the results. Please enter the output file ID in the file.
python batch_winrate_quant_download.py

# Fourth Step: Calculate the results. Please enter the parsed result file name in the file.
python batch_winrate_quant_calculate.py
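
The quantitative result for a method is, in essence, an average of its per-query ratings. The following sketch shows that aggregation under stated assumptions: `mean_score` is a hypothetical helper, the ratings are assumed to be integers from 1 to 5, and the actual script may parse and aggregate its inputs differently.

```python
from statistics import mean


def mean_score(scores):
    """Average a list of ratings, rejecting values outside the assumed 1-5 scale."""
    for s in scores:
        if not 1 <= s <= 5:
            raise ValueError(f"score {s} is outside the 5-point scale")
    return round(mean(scores), 2)
```

Validating the range first guards against parsing errors in the judged output, which would otherwise distort the average without any warning.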

Citation

If you find this work helpful for your research, please consider citing our paper:

@article{VideoRAG,
  title={VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos},
  author={Ren, Xubin and Xu, Lingrui and Xia, Long and Wang, Shuaiqiang and Yin, Dawei and Huang, Chao},
  journal={arXiv preprint arXiv:2502.01549},
  year={2025}
}

Thank you for your interest in our work!

Acknowledgement

You may refer to the related work that serves as the foundation for our framework and code repository: nano-graphrag and LightRAG. We thank the authors for their wonderful work.
