Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data remains challenging, especially for long videos. Due to insufficient access to large-scale, high-quality video data and the excessive compression of visual features, current methods exhibit limitations in effectively processing long videos. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. Confronted with the issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning. In addition, we design a curriculum training pipeline that gradually increases the resolution and number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance on a variety of video understanding benchmarks while remaining competitive on others. In particular, on benchmarks specialized for long videos, Kangaroo surpasses some larger models with over 10B parameters as well as some proprietary models.
Highlights
Figure: Architecture of our Kangaroo model.
Data Curation
We select 300M images from LAION, COYO, and Wukong for image-text alignment and 60M videos from WebVid, Youku-mPLUG, and internal social-media short videos for video-text alignment, following a category-balancing strategy. Videos with excessive text coverage, significant face coverage, or low optical flow scores are filtered out.
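The optical-flow criterion can be approximated with a per-clip motion score. The sketch below is only an illustration of such a filter, assuming OpenCV's Farneback flow and a hypothetical `min_flow` threshold; the actual thresholds used for curation are not specified here.

```python
# Illustrative sketch (not the authors' code): score a clip by mean Farneback
# optical-flow magnitude so that near-static videos can be filtered out.
import cv2
import numpy as np

def mean_optical_flow(video_path: str, stride: int = 5) -> float:
    """Average optical-flow magnitude between frames sampled every `stride` frames."""
    cap = cv2.VideoCapture(video_path)
    prev_gray, magnitudes = None, []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None:
                flow = cv2.calcOpticalFlowFarneback(
                    prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                magnitudes.append(np.linalg.norm(flow, axis=2).mean())
            prev_gray = gray
        idx += 1
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0

# Hypothetical threshold: drop clips whose average motion is too small.
def keep_video(video_path: str, min_flow: float = 0.5) -> bool:
    return mean_optical_flow(video_path) >= min_flow
```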
Automatic Annotation
We develop an automatic video annotation system. The process begins with the extraction of five key frames from each video, followed by captioning of each frame with three distinct off-the-shelf Multimodal Large Language Models (MLLMs). Next, we employ an LLM to synthesize the key-frame captions into a comprehensive video caption. We further curate a subset of 6M pre-training samples and incorporate 900K dense-caption samples from the ShareGPTVideo dataset for the refined pre-training stage.
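As a rough illustration of this pipeline (the concrete MLLMs, prompts, and merging logic are not detailed here), the sketch below treats the frame captioners and the summarizing LLM as placeholder callables.

```python
# Minimal sketch of the annotation flow described above. caption_fns and
# summarize_fn stand in for the off-the-shelf MLLMs and the LLM; the concrete
# models and prompts are assumptions, so treat them as placeholders.
from typing import Callable, List, Sequence

def annotate_video(
    key_frames: Sequence,                 # five key frames extracted from the video
    caption_fns: List[Callable],          # three frame-level MLLM captioners
    summarize_fn: Callable[[str], str],   # LLM that merges frame captions
) -> str:
    frame_captions = []
    for i, frame in enumerate(key_frames):
        # Each MLLM captions the same frame; captions are collected per frame.
        captions = [fn(frame) for fn in caption_fns]
        frame_captions.append(f"Frame {i + 1}: " + " | ".join(captions))
    # Hypothetical prompt: ask the LLM to fuse per-frame captions into one video caption.
    prompt = (
        "Combine the following key-frame captions into a single coherent "
        "video caption:\n" + "\n".join(frame_captions)
    )
    return summarize_fn(prompt)
```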
Instruction Tuning Dataset
To enhance the instruction-following ability of the model, we compile a video instruction tuning dataset comprising 2.24M samples from public and internal sources. The dataset covers short captions, detailed descriptions, multiple-choice and open-ended QA, and single- and multi-round conversations in both Chinese and English (an illustrative record format is sketched below).
Figure: Distribution of the instruction tuning dataset.
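For illustration only, a single instruction-tuning record might look like the following; the field names, task labels, and file path are hypothetical rather than the released schema.

```python
# Hypothetical example of one instruction-tuning record; the schema below is
# illustrative only and does not reflect the actual dataset format.
sample = {
    "video": "videos/cooking_demo_0421.mp4",   # hypothetical path
    "task": "multi_choice_qa",                 # e.g. caption, detailed_description, open_ended_qa, conversation
    "language": "en",                          # the dataset covers Chinese and English
    "conversations": [
        {"role": "user",
         "content": "What does the person add to the pan first?\n"
                    "A. Onions  B. Garlic  C. Butter  D. Tomatoes"},
        {"role": "assistant", "content": "C. Butter"},
    ],
}
```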
Model Architecture

| Module | Implementation |
|---|---|
| Vision Encoder | EVA-CLIP-L |
| Projector | Linear |
| Patchify Module | 3D Depthwise Convolution |
| LLM | Llama3-8B-Instruct |

Training Pipeline

| Stage | Resolution | Training Data | Trainable Modules |
|---|---|---|---|
| Stage 1: Image Pre-training | 224 | 300M | ViT + Projector |
| Stage 2: Video Pre-training | 224 × 8 frames | 60M | ViT + Projector |
| Stage 3: Pre-training Refinement | 448 × 16 frames | 6.9M | ViT + Projector + Patchify Module |
| Stage 4: Instruction Tuning | 448 × (16 to 64 frames) | 2.24M | Full model |
| Stage 5: Long Video Tuning | 448 × (16 to 64 frames for short videos, 64 to 160 frames for long videos) | 700K | Projector + Patchify Module + LLM |
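The patchify module listed above compresses the spatio-temporal token grid before it reaches the LLM. The PyTorch sketch below shows one plausible form of a 3D depthwise-convolution patchify layer; the kernel and stride sizes are assumptions for illustration rather than the paper's exact configuration.

```python
# Sketch of a patchify module built around a 3D depthwise convolution, as named
# in the architecture table. Kernel/stride sizes are illustrative assumptions.
import torch
import torch.nn as nn

class DepthwisePatchify3D(nn.Module):
    """Compresses a (B, C, T, H, W) visual-feature volume with a depthwise 3D conv."""
    def __init__(self, channels: int, kernel_size=(2, 2, 2), stride=(2, 2, 2)):
        super().__init__()
        # groups=channels makes the convolution depthwise: one filter per channel.
        self.conv = nn.Conv3d(
            channels, channels,
            kernel_size=kernel_size, stride=stride, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) -> compressed (B, C, T', H', W'),
        # then flattened into a shorter token sequence for the LLM.
        x = self.conv(x)
        return x.flatten(2).transpose(1, 2)   # (B, T'*H'*W', C)

# Example: 16 frames of 32x32 patch features with 1024 channels
# are reduced to 8 x 16 x 16 = 2048 tokens.
tokens = DepthwisePatchify3D(1024)(torch.randn(1, 1024, 16, 32, 32))
print(tokens.shape)  # torch.Size([1, 2048, 1024])
```

A depthwise convolution leaves the channel dimension untouched while striding over each spatio-temporal axis, so the token sequence shrinks by roughly 8× at little parameter cost, which is what makes longer frame inputs tractable for the LLM.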
Figure: Results on comprehensive video understanding benchmarks.
Figure: Results on VideoMME.
Figure: Results on Seedbench-Video.
Citation

```bibtex
@article{kangaroogroup,
  title={Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input},
  author={Liu, Jiajun and Wang, Yibing and Ma, Hanghang and Wu, Xiaoping and Ma, Xiaoqi and Wei, Xiaoming and Jiao, Jianbin and Wu, Enhua and Hu, Jie},
  journal={arXiv preprint arXiv:2408.15542},
  year={2024}
}
```