Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

Jiajun Liu1*, Yibing Wang1,2*, Hanghang Ma1†, Xiaoping Wu1†, Xiaoqi Ma1†, Jie Hu1‡
1 Meituan   2 University of Chinese Academy of Sciences
* Joint first authors    † Key contributors    ‡ Project lead & Corresponding author

Abstract

Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data remains challenging, especially for long videos. Due to insufficient access to large-scale, high-quality video data and the excessive compression of visual features, current methods exhibit limitations in effectively processing long videos. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. To confront the issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning. In addition, we design a curriculum training pipeline that gradually increases the resolution and the number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance on a variety of video understanding benchmarks while exhibiting competitive results on others. In particular, on benchmarks specialized for long videos, Kangaroo surpasses some larger models with over 10B parameters as well as proprietary models.

Highlights:
  1. Large-scale Data Curation. We develop a data curation system to generate captions for open-source and internal videos and construct a video instruction tuning dataset covering a variety of tasks.
  2. Long-context Video Input. We extend the maximum number of input frames to 160, with a corresponding sequence length of up to 22k tokens (see the sampling sketch after this list).
  3. Superior Performance. Our model achieves state-of-the-art performance on a variety of comprehensive benchmarks and, on certain benchmarks, outperforms some larger open-source models with over 10B parameters as well as proprietary models.
  4. Bilingual Conversation. Our model supports Chinese, English and mixed bilingual conversations, in both single-round and multi-round paradigms.
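
As a rough illustration of the 160-frame budget above, the sketch below uniformly samples frame indices from a longer video. The uniform strategy and the function itself are our assumptions for illustration, not the paper's exact sampling procedure.

```python
# Hypothetical illustration: uniformly sample at most 160 frame indices from a video.
def sample_frame_indices(total_frames: int, max_frames: int = 160) -> list[int]:
    """Pick at most max_frames indices spread evenly across the video."""
    n = min(total_frames, max_frames)
    return [round(i * (total_frames - 1) / max(n - 1, 1)) for i in range(n)]

print(sample_frame_indices(total_frames=4500)[:5])   # [0, 28, 57, 85, 113]
print(len(sample_frame_indices(total_frames=4500)))  # 160
```

As a back-of-envelope estimate (ours, not an official breakdown), the stated 22k-token sequence length over 160 frames works out to roughly 22,000 / 160 ≈ 137 visual tokens per frame after compression, plus text tokens.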

Model

Architecture of our Kangaroo model.

Dataset Statistics

Data Curation
We select 300M images from LAION, COYO and Wukong for image-text alignment, and 60M videos from WebVid, Youku-mPLUG and internal social-media short videos for video-text alignment, following a category-balancing strategy. Videos with excessive text coverage, significant face coverage, or low optical-flow scores are filtered out.
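
A minimal sketch of the filtering rules above, assuming per-video statistics for text coverage, face coverage and optical flow are already computed. The metric definitions and threshold values are hypothetical placeholders; the paper names the criteria but not the exact cutoffs.

```python
# Hypothetical video filter matching the criteria described above.
from dataclasses import dataclass

@dataclass
class VideoStats:
    text_coverage: float   # fraction of frame area covered by overlaid text
    face_coverage: float   # fraction of frame area covered by faces
    optical_flow: float    # mean optical-flow magnitude (motion score)

def keep_video(stats: VideoStats,
               max_text: float = 0.3, max_face: float = 0.5, min_flow: float = 1.0) -> bool:
    """Drop videos with excessive text, dominant faces, or too little motion."""
    return (stats.text_coverage <= max_text
            and stats.face_coverage <= max_face
            and stats.optical_flow >= min_flow)

# Example: a mostly static video with heavy on-screen text is filtered out.
print(keep_video(VideoStats(text_coverage=0.6, face_coverage=0.1, optical_flow=0.2)))  # False
```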

Automatic Annotation
We develop an automatic video annotation system. The process begins with the extraction of five key frames from each video, followed by the use of three distinct off-the-shelf Multimodal Large Language Models (MLLMs) to generate frame captions. Next, we employ an LLM to synthesize the key-frame captions into a comprehensive video caption. We further curate a subset of 6M pre-training samples and incorporate 900K dense-caption samples from the ShareGPTVideo dataset for the refined pre-training stage.
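
The annotation flow described above could be sketched as follows; the helper `load_key_frames` and the `caption`/`complete` methods are hypothetical stand-ins for the off-the-shelf MLLMs and the synthesizing LLM, not real APIs.

```python
# Hypothetical sketch of the annotation pipeline: five key frames per video,
# frame captions from three MLLMs, then one LLM pass to merge them.
def annotate_video(video_path: str, mllms: list, llm, num_key_frames: int = 5) -> str:
    frames = load_key_frames(video_path, num_key_frames)  # hypothetical frame extractor

    # One caption per (frame, MLLM) pair; three MLLMs give complementary descriptions.
    frame_captions = [
        f"[frame {i}] {mllm.caption(frame)}"
        for i, frame in enumerate(frames)
        for mllm in mllms
    ]

    # The LLM synthesizes the frame-level captions into a single video caption.
    prompt = (
        "Merge the following frame captions into one coherent video caption:\n"
        + "\n".join(frame_captions)
    )
    return llm.complete(prompt)
```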

Instruction Tuning Dataset
To enhance the instruction-following ability of the model, we compile a video instruction tuning dataset comprising 2.24M samples from public and internal sources. The dataset covers short captions, detailed descriptions, multiple-choice and open-ended QA, and single- and multi-round conversations in both Chinese and English.

Distribution of instruction tuning dataset.
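
To illustrate the sample types listed above, here is a hypothetical multi-round, bilingual sample. The field names and layout are our assumptions for illustration; the released data may use a different schema.

```python
# Hypothetical instruction-tuning sample (schema assumed for illustration).
sample = {
    "video": "videos/example_0001.mp4",
    "task": "multi_round_conversation",
    "conversations": [
        {"role": "user", "content": "<video>\nWhat is happening in this video?"},
        {"role": "assistant", "content": "A person is chopping vegetables and adding them to a pan."},
        {"role": "user", "content": "请用中文简要描述这段视频。"},  # "Briefly describe this video in Chinese."
        {"role": "assistant", "content": "视频中一个人正在切蔬菜并将其放入锅中烹饪。"},  # Chinese description of the clip
    ],
}
```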

Model Card

| Vision Encoder | Projector | Patchify Module | LLM |
| --- | --- | --- | --- |
| EVA-CLIP-L | Linear | 3D Depthwise Convolution | Llama3-8B-Instruct |

| Stage | Task | Resolution | Training Data | Trainable Modules |
| --- | --- | --- | --- | --- |
| 1 | Image Pre-training | 224 | 300M | ViT + Projector |
| 2 | Video Pre-training | 224 × 8 frames | 60M | ViT + Projector |
| 3 | Pre-training Refinement | 448 × 16 frames | 6.9M | ViT + Projector + Patchify Module |
| 4 | Instruction Tuning | 448 × (16 to 64 frames) | 2.24M | Full model |
| 5 | Long Video Tuning | 448 × (16 to 64 frames for short videos, 64 to 160 frames for long videos) | 700K | Projector + Patchify Module + LLM |
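
The patchify module in the table is named as a 3D depthwise convolution; below is a minimal sketch of how such a layer could compress per-frame patch features before they reach the LLM. The kernel/stride sizes, tensor shapes, and placement relative to the linear projector are assumptions for illustration, not the paper's exact configuration.

```python
# Hypothetical sketch of a spatio-temporal "patchify" step built from a 3D depthwise convolution.
import torch
import torch.nn as nn

class PatchifyModule(nn.Module):
    def __init__(self, dim: int, stride=(2, 2, 2)):
        super().__init__()
        # groups=dim makes the convolution depthwise: each channel is convolved
        # independently, so the layer only mixes over time and space.
        self.conv = nn.Conv3d(dim, dim, kernel_size=stride, stride=stride, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, frames, height, width, dim] patch features from the vision encoder
        b, t, h, w, d = x.shape
        x = x.permute(0, 4, 1, 2, 3)       # -> [b, dim, t, h, w]
        x = self.conv(x)                   # downsample time and space
        x = x.flatten(2).transpose(1, 2)   # -> [b, tokens, dim] for the LLM
        return x

# Example: 16 frames of 32x32 patch features (448 px / patch size 14) with dim 1024.
feats = torch.randn(1, 16, 32, 32, 1024)
print(PatchifyModule(1024)(feats).shape)  # torch.Size([1, 2048, 1024])
```

With a (2, 2, 2) stride the token count drops by 8×, while the depthwise design keeps channels independent and the parameter cost low.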

Results

Results on Comprehensive Video Understanding Benchmarks.

Results on VideoMME.

Results on Seedbench-Video.

Demo

(Demo videos are from OpenAI Sora)

Citation


        @article{kangaroogroup,
            title={Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input},
            author={Liu, Jiajun and Wang, Yibing and Ma, Hanghang and Wu, Xiaoping and Ma, Xiaoqi and Wei, Xiaoming and Jiao, Jianbin and Wu, Enhua and Hu, Jie},
            journal={arXiv preprint arXiv:2408.15542},
            year={2024}
        }