Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data remains challenging, especially for long videos. Due to insufficient access to large-scale, high-quality video data and the excessive compression of visual features, current methods exhibit limitations in effectively processing long videos. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. Confronted with the issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning. In addition, we design a curriculum training pipeline that gradually increases the resolution and number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance on a variety of video understanding benchmarks while remaining competitive on others. In particular, on benchmarks specialized for long videos, Kangaroo surpasses some larger models with over 10B parameters as well as some proprietary models.
Highlights
Figure: Architecture of our Kangaroo model.
Data Curation
We select 300M images from LAION, COYO, and Wukong for image-text alignment and 60M videos from WebVid, Youku-mPLUG, and internal social-media short videos for video-text alignment, following a category-balancing strategy. Videos with excessive text coverage, significant face coverage, or low optical flow scores are filtered out.
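The optical-flow criterion can be approximated with a per-clip motion score. The sketch below is only an illustration of such a filter, assuming OpenCV's Farneback flow and a hypothetical `min_flow` threshold; the actual thresholds used for curation are not specified here.

```python
# Illustrative sketch (not the authors' code): score a clip by mean Farneback
# optical-flow magnitude so that near-static videos can be filtered out.
import cv2
import numpy as np

def mean_optical_flow(video_path: str, stride: int = 5) -> float:
    """Average optical-flow magnitude between frames sampled every `stride` frames."""
    cap = cv2.VideoCapture(video_path)
    prev_gray, magnitudes = None, []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None:
                flow = cv2.calcOpticalFlowFarneback(
                    prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                magnitudes.append(np.linalg.norm(flow, axis=2).mean())
            prev_gray = gray
        idx += 1
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0

# Hypothetical threshold: drop clips whose average motion is too small.
def keep_video(video_path: str, min_flow: float = 0.5) -> bool:
    return mean_optical_flow(video_path) >= min_flow
```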
Automatic Annotation
We develop an automatic video annotation system. The process begins with the extraction of five key frames from each video, followed by captioning of each frame with three distinct off-the-shelf Multimodal Large Language Models (MLLMs). Next, we employ an LLM to synthesize the key-frame captions into a comprehensive video caption. We further curate a subset of 6M pre-training samples and incorporate 900K dense-caption samples from the ShareGPTVideo dataset for the refined pre-training stage.
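As a rough illustration of this pipeline (the concrete MLLMs, prompts, and merging logic are not detailed here), the sketch below treats the frame captioners and the summarizing LLM as placeholder callables.

```python
# Minimal sketch of the annotation flow described above. caption_fns and
# summarize_fn stand in for the off-the-shelf MLLMs and the LLM; the concrete
# models and prompts are assumptions, so treat them as placeholders.
from typing import Callable, List, Sequence

def annotate_video(
    key_frames: Sequence,                 # five key frames extracted from the video
    caption_fns: List[Callable],          # three frame-level MLLM captioners
    summarize_fn: Callable[[str], str],   # LLM that merges frame captions
) -> str:
    frame_captions = []
    for i, frame in enumerate(key_frames):
        # Each MLLM captions the same frame; captions are collected per frame.
        captions = [fn(frame) for fn in caption_fns]
        frame_captions.append(f"Frame {i + 1}: " + " | ".join(captions))
    # Hypothetical prompt: ask the LLM to fuse per-frame captions into one video caption.
    prompt = (
        "Combine the following key-frame captions into a single coherent "
        "video caption:\n" + "\n".join(frame_captions)
    )
    return summarize_fn(prompt)
```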
Instruction Tuning Dataset
To enhance the instruction-following ability of the model, we compile a video instruction tuning dataset comprising 2.24M samples from public and internal sources. The dataset covers short captions, detailed descriptions, multiple-choice and open-ended QA, and single- and multi-round conversations in both Chinese and English (an illustrative record format is sketched below).
Figure: Distribution of the instruction tuning dataset.
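For illustration only, a single instruction-tuning record might look like the following; the field names, task labels, and file path are hypothetical rather than the released schema.

```python
# Hypothetical example of one instruction-tuning record; the schema below is
# illustrative only and does not reflect the actual dataset format.
sample = {
    "video": "videos/cooking_demo_0421.mp4",   # hypothetical path
    "task": "multi_choice_qa",                 # e.g. caption, detailed_description, open_ended_qa, conversation
    "language": "en",                          # the dataset covers Chinese and English
    "conversations": [
        {"role": "user",
         "content": "What does the person add to the pan first?\n"
                    "A. Onions  B. Garlic  C. Butter  D. Tomatoes"},
        {"role": "assistant", "content": "C. Butter"},
    ],
}
```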
Model Architecture

| Module | Implementation |
|---|---|
| Vision Encoder | EVA-CLIP-L |
| Projector | Linear |
| Patchify Module | 3D Depthwise Convolution |
| LLM | Llama3-8B-Instruct |

Training Pipeline

| Stage | Resolution | Training Data | Trainable Modules |
|---|---|---|---|
| Stage 1: Image Pre-training | 224 | 300M | ViT + Projector |
| Stage 2: Video Pre-training | 224 × 8 frames | 60M | ViT + Projector |
| Stage 3: Pre-training Refinement | 448 × 16 frames | 6.9M | ViT + Projector + Patchify Module |
| Stage 4: Instruction Tuning | 448 × (16 to 64 frames) | 2.24M | Full model |
| Stage 5: Long Video Tuning | 448 × (16 to 64 frames for short videos, 64 to 160 frames for long videos) | 700K | Projector + Patchify Module + LLM |
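The patchify module listed above compresses the spatio-temporal token grid before it reaches the LLM. The PyTorch sketch below shows one plausible form of a 3D depthwise-convolution patchify layer; the kernel and stride sizes are assumptions for illustration rather than the paper's exact configuration.

```python
# Sketch of a patchify module built around a 3D depthwise convolution, as named
# in the architecture table. Kernel/stride sizes are illustrative assumptions.
import torch
import torch.nn as nn

class DepthwisePatchify3D(nn.Module):
    """Compresses a (B, C, T, H, W) visual-feature volume with a depthwise 3D conv."""
    def __init__(self, channels: int, kernel_size=(2, 2, 2), stride=(2, 2, 2)):
        super().__init__()
        # groups=channels makes the convolution depthwise: one filter per channel.
        self.conv = nn.Conv3d(
            channels, channels,
            kernel_size=kernel_size, stride=stride, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) -> compressed (B, C, T', H', W'),
        # then flattened into a shorter token sequence for the LLM.
        x = self.conv(x)
        return x.flatten(2).transpose(1, 2)   # (B, T'*H'*W', C)

# Example: 16 frames of 32x32 patch features with 1024 channels
# are reduced to 8 x 16 x 16 = 2048 tokens.
tokens = DepthwisePatchify3D(1024)(torch.randn(1, 1024, 16, 32, 32))
print(tokens.shape)  # torch.Size([1, 2048, 1024])
```

A depthwise convolution leaves the channel dimension untouched while striding over each spatio-temporal axis, so the token sequence shrinks by roughly 8× at little parameter cost, which is what makes longer frame inputs tractable for the LLM.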
Figure: Results on comprehensive video understanding benchmarks.
Figure: Results on VideoMME.
Figure: Results on Seedbench-Video.
Citation

```bibtex
@article{kangaroogroup,
  title={Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input},
  author={Liu, Jiajun and Wang, Yibing and Ma, Hanghang and Wu, Xiaoping and Ma, Xiaoqi and Wei, Xiaoming and Jiao, Jianbin and Wu, Enhua and Hu, Jie},
  journal={arXiv preprint arXiv:2408.15542},
  year={2024}
}
```