# VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos

This is the PyTorch implementation for VideoRAG proposed in this paper:

>**VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**
>Xubin Ren\*, Lingrui Xu\*, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang†

\* denotes equal contribution. † denotes corresponding author.

In this paper, we propose a retrieval-augmented generation framework specifically designed for processing and understanding **extremely long-context videos**.

## VideoRAG Framework

*Figure: Overview of the VideoRAG framework.*

VideoRAG introduces a novel dual-channel architecture that combines graph-driven textual knowledge grounding, which models cross-video semantic relationships, with hierarchical multimodal context encoding, which preserves spatiotemporal visual patterns. Together, these channels enable unbounded-length video understanding: dynamically constructed knowledge graphs maintain semantic coherence across multi-video contexts, while adaptive multimodal fusion mechanisms keep retrieval efficient.

## Installation

To utilize VideoRAG, please first create a conda environment with the following commands:

```bash
conda create --name videorag python=3.11
conda activate videorag

pip install numpy==1.26.4
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2
pip install accelerate==0.30.1
pip install bitsandbytes==0.43.1
pip install moviepy==1.0.3
pip install git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
pip install timm==0.6.7 ftfy regex einops fvcore eva-decord==0.6.1 iopath matplotlib types-regex cartopy
pip install ctranslate2==4.4.0 faster_whisper neo4j hnswlib xxhash nano-vectordb
pip install transformers==4.37.1
pip install tiktoken openai tenacity

# Install ImageBind using the code provided in this repository, where we have
# removed the requirements.txt to avoid environment conflicts.
cd ImageBind
pip install .
```

Then, please download the necessary checkpoints into **the repository's root folder** for MiniCPM-V, Whisper, and ImageBind as follows:

```bash
# minicpm-v
git clone https://huggingface.co/openbmb/MiniCPM-V-2_6-int4

# whisper
git clone https://huggingface.co/Systran/faster-distil-whisper-large-v3

# imagebind
mkdir .checkpoints
cd .checkpoints
wget https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth
cd ../
```

## Quick Start

VideoRAG is capable of extracting knowledge from multiple videos and answering queries based on those videos. Now, try VideoRAG with your own videos 🤗.

> [!NOTE]
> Currently, VideoRAG has only been tested in an English environment. To process videos in other languages, it is recommended to modify the `WhisperModel` in [asr.py](VideoRAG/videorag/_videoutil/asr.py); a minimal configuration sketch follows the indexing example below. For more details, please refer to [faster-whisper](https://github.com/systran/faster-whisper).

**First**, let VideoRAG extract and index the knowledge from the given videos (a single GPU with 24GB of memory is sufficient, such as an RTX 3090):

```python
import os
import logging
import warnings
import multiprocessing

warnings.filterwarnings("ignore")
logging.getLogger("httpx").setLevel(logging.WARNING)

# Please enter your openai key
os.environ["OPENAI_API_KEY"] = ""

from videorag._llm import *
from videorag import VideoRAG, QueryParam

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')

    # Please enter your video file paths in this list; there is no limit on its length.
    # Here is an example; you can use your own videos instead.
    video_paths = [
        'movies/Iron-Man.mp4',
        'movies/Spider-Man.mkv',
    ]
    videorag = VideoRAG(cheap_model_func=gpt_4o_mini_complete, best_model_func=gpt_4o_mini_complete, working_dir="./videorag-workdir")
    videorag.insert_video(video_path_list=video_paths)
```
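As referenced in the note above, here is a minimal sketch of what a multilingual transcription setup with `faster-whisper` could look like. The model name, input file, and language code below are illustrative placeholders, not the values used in [asr.py](VideoRAG/videorag/_videoutil/asr.py); adapt them to the actual `WhisperModel` call there.

```python
from faster_whisper import WhisperModel

# Illustrative: swap the distilled English checkpoint for a multilingual one,
# and pass an explicit language code (or omit it to let Whisper auto-detect).
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", language="fr")  # hypothetical input file

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```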
**Then**, ask any questions about the videos! Here is an example:

```python
import os
import logging
import warnings
import multiprocessing

warnings.filterwarnings("ignore")
logging.getLogger("httpx").setLevel(logging.WARNING)

# Please enter your openai key
os.environ["OPENAI_API_KEY"] = ""

from videorag._llm import *
from videorag import VideoRAG, QueryParam

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')

    query = 'What is the relationship between Iron Man and Spider-Man? How do they meet, and how does Iron Man help Spider-Man?'
    param = QueryParam(mode="videorag")
    # if param.wo_reference = False, VideoRAG will add references to video clips in the response
    param.wo_reference = True

    videorag = VideoRAG(cheap_model_func=gpt_4o_mini_complete, best_model_func=gpt_4o_mini_complete, working_dir="./videorag-workdir")
    videorag.load_caption_model(debug=False)
    response = videorag.query(query=query, param=param)
    print(response)
```

## Evaluation

### LongerVideos

We constructed the LongerVideos benchmark to evaluate the model's performance in comprehending multiple long-context videos and answering open-ended queries. All videos are open-access videos on YouTube; we record the URLs of each video collection as well as the corresponding queries in the [JSON](longervideos/dataset.json) file.

| Video Type        | #video list | #video | #query | #avg. queries per list | #overall duration |
|-------------------|------------:|-------:|-------:|-----------------------:|------------------:|
| **Lecture**       | 12          | 135    | 376    | 31.3                   | ~64.3 hours       |
| **Documentary**   | 5           | 12     | 114    | 22.8                   | ~28.5 hours       |
| **Entertainment** | 5           | 17     | 112    | 22.4                   | ~41.9 hours       |
| **All**           | 22          | 164    | 602    | 27.4                   | ~134.6 hours      |

### Process LongerVideos with VideoRAG

First, download all the videos in 720p resolution using the `yt-dlp` library. You can use the following commands to download all the videos in LongerVideos:

```shell
cd longervideos
python prepare_videos.py # create collection folders
sh download.sh # download videos
```

Then, you can run the following example command to process and answer queries for LongerVideos with VideoRAG:

```shell
# Please enter your openai_key in line 18 at first
python videorag_experiment.py --collection 4-rag-lecture --cuda 0
```

### Evaluation

We conduct win-rate comparisons with RAG-based baselines and quantitative comparisons with long-context video understanding methods. **NaiveRAG, GraphRAG, and LightRAG** are implemented with the `nano-graphrag` library, consistent with our VideoRAG implementation, which ensures a fair comparison.

In this part, we directly provide the **answers from all the methods** (including VideoRAG) as well as the evaluation code for experiment reproduction. Please use the following commands to download the answers:

```shell
cd reproduce
wget https://archive.org/download/videorag/all_answers.zip
unzip all_answers
```

#### Win-Rate Comparison

We conduct the win-rate comparison with RAG-based baselines. To reproduce the results, please follow these steps:

```shell
cd reproduce/winrate_comparison

# First Step: Upload the batch request to OpenAI (remember to enter your key in the file; same for the following steps).
python batch_winrate_eval_upload.py

# Second Step: Download the results. Please enter the batch ID and then the output file ID in the file.
# Generally, you need to run this twice: first to obtain the output file ID, and then to download it.
python batch_winrate_eval_download.py

# Third Step: Parse the results. Please enter the output file ID in the file.
python batch_winrate_eval_download.py

# Fourth Step: Calculate the results. Please enter the parsed result file name in the file.
python batch_winrate_eval_calculate.py
```
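For orientation, the upload/download scripts above follow the standard OpenAI Batch API workflow. Below is a minimal sketch of that flow under our reading of the steps; the file names are illustrative placeholders, not the ones the scripts actually use.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the prepared .jsonl batch request file (file name is hypothetical).
batch_file = client.files.create(file=open("batch_requests.jsonl", "rb"), purpose="batch")

# Create the batch job against the chat completions endpoint; note the batch ID.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print("Batch ID:", batch.id)

# Later: poll the batch status; once completed, download the output file.
status = client.batches.retrieve(batch.id)
if status.status == "completed":
    result = client.files.content(status.output_file_id)
    with open("batch_results.jsonl", "wb") as f:  # file name is hypothetical
        f.write(result.read())
```

This split is why the download script is run twice: the first run reports the output file ID once the batch has completed, and the second run fetches the results.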
#### Quantitative Comparison

We conduct a quantitative comparison, which extends the win-rate comparison by assigning a 5-point score to long-context video understanding methods. We use the answers from NaiveRAG as the baseline response for scoring each query. To reproduce the results, please follow these steps:

```shell
cd reproduce/quantitative_comparison

# First Step: Upload the batch request to OpenAI (remember to enter your key in the file; same for the following steps).
python batch_winrate_quant_upload.py

# Second Step: Download the results. Please enter the batch ID and then the output file ID in the file.
# Generally, you need to run this twice: first to obtain the output file ID, and then to download it.
python batch_winrate_quant_download.py

# Third Step: Parse the results. Please enter the output file ID in the file.
python batch_winrate_quant_download.py

# Fourth Step: Calculate the results. Please enter the parsed result file name in the file.
python batch_winrate_quant_calculate.py
```

### Acknowledgement

You may refer to the related works that serve as foundations for our framework and code repository: [nano-graphrag](https://github.com/gusye1234/nano-graphrag) and [LightRAG](https://github.com/HKUDS/LightRAG). Thanks for their wonderful work.

**Thank you for your interest in our work!**