Apollo: An Exploration of Video Understanding in Large Multimodal Models

Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, and Xide Xia

Meta GenAI    Stanford University

We investigate the mechanisms that drive video understanding in large multimodal models and provide actionable insights for the community. Our work includes:

  • Systematic exploration of the design space of video-LMMs, uncovering critical factors that drive performance.
  • Investigation of training schedules and data mixtures, providing practical insights for optimizing model performance.
  • Discovery of "Scaling Consistency," enabling efficient design decisions on smaller LMMs that generalize to larger scales.
  • A novel benchmark, ApolloBench, for efficient evaluation.
  • Introducing Apollo, a family of state-of-the-art video-LMMs.

We introduce Apollo, a new family of state-of-the-art video-LMMs. In developing Apollo, we uncover Scaling Consistency, enabling us to reliably make design decisions on smaller models and datasets and dramatically cutting computational costs. Guided by these principles, we train hundreds of model variants, systematically exploring video sampling strategies, token integration, training schedules, and data mixtures. Leveraging these insights, Apollo sets a new state of the art in efficient, high-performance video-language modeling.

Findings

Finding 1: We discover Scaling Consistency, where design decisions can be made on smaller models and datasets and transfer reliably to larger ones.

Finding 8: Progressively unfreezing the different components in different stages leads to superior model training dynamics.
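A minimal sketch of what staged unfreezing can look like in practice is shown below. The split of components across stages (connector first, then the vision encoder, then the LLM) and the submodule names are illustrative assumptions, not necessarily Apollo's exact schedule.

```python
# Illustrative progressive-unfreezing schedule (Finding 8). The per-stage split and
# the submodule names (.connector, .vision_encoder, .llm) are assumptions.
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable


def configure_stage(model, stage: int) -> None:
    set_trainable(model.connector, True)             # trained in every stage
    set_trainable(model.vision_encoder, stage >= 2)  # unfrozen from stage 2 onward
    set_trainable(model.llm, stage >= 3)             # unfrozen only in the final stage
```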

Finding 2: fps sampling is preferable to uniform sampling during model training and inference.
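The difference between the two strategies is easy to see in code. The NumPy sketch below contrasts uniform sampling, which returns a fixed number of frames regardless of duration, with fps sampling, which keeps the time step between sampled frames constant; function names and default values are illustrative, not taken from the Apollo codebase.

```python
# Minimal sketch of the two frame-sampling strategies compared in Finding 2.
# Function and parameter names are illustrative, not from the Apollo codebase.
import numpy as np


def uniform_sample(num_total_frames: int, num_frames: int = 32) -> np.ndarray:
    """Pick a fixed number of frames, evenly spaced, regardless of video length."""
    return np.linspace(0, num_total_frames - 1, num_frames).round().astype(int)


def fps_sample(num_total_frames: int, video_fps: float, target_fps: float = 2.0,
               max_frames: int = 256) -> np.ndarray:
    """Pick frames at a fixed temporal rate, so frame count scales with duration."""
    stride = max(int(round(video_fps / target_fps)), 1)
    indices = np.arange(0, num_total_frames, stride)
    return indices[:max_frames]


# A 10 s clip at 30 fps: uniform sampling always returns 32 frames, while fps
# sampling returns ~20 frames (2 per second), preserving a constant time step.
print(len(uniform_sample(300, 32)), len(fps_sample(300, 30.0, 2.0)))
```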

Finding 3: There is a trade-off between tokens per second (tps) and frames per second (fps), with 8-32 tokens per frame being optimal.
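The trade-off follows from a simple budget: total visual tokens ≈ duration × fps × tokens per frame. The snippet below works through this with an assumed context budget and video length; the numbers are illustrative, not the paper's training configuration.

```python
# Back-of-the-envelope token budget illustrating the tps/fps trade-off in Finding 3.
# The budget and duration below are illustrative assumptions.
duration_s = 120          # 2-minute video
context_budget = 8192     # tokens available for visual content (assumed)

for tokens_per_frame in (8, 16, 32, 64):
    # total visual tokens = duration * fps * tokens_per_frame
    max_fps = context_budget / (duration_s * tokens_per_frame)
    print(f"tokens/frame={tokens_per_frame:>2} -> max sustainable fps ~ {max_fps:.2f}")
```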

Finding 6: Perceiver resampling shows superior performance when reducing the tokens/frame.
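For reference, a Perceiver-style resampler is a small set of learned latent queries that cross-attend to the per-frame patch tokens. The sketch below is a minimal PyTorch version with illustrative hyperparameters; it is not Apollo's implementation.

```python
# Minimal Perceiver-style resampler (Finding 6): learned latent queries cross-attend
# to the per-frame patch tokens, compressing them to `num_latents` tokens per frame.
# Hyperparameters are illustrative.
import torch
import torch.nn as nn


class PerceiverResampler(nn.Module):
    def __init__(self, dim: int = 1024, num_latents: int = 32, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_patches, dim) -> (batch, num_latents, dim)
        queries = self.latents.unsqueeze(0).expand(frame_tokens.size(0), -1, -1)
        attended, _ = self.cross_attn(queries, frame_tokens, frame_tokens)
        latents = self.norm(queries + attended)
        return latents + self.mlp(latents)


# e.g. compress 729 patch tokens per frame down to 32 latents
print(PerceiverResampler()(torch.randn(2, 729, 1024)).shape)  # torch.Size([2, 32, 1024])
```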

Finding 4: SigLIP-SO400M is the best single encoder for video-LMMs.

Finding 5: Combining SigLIP-SO400M with InternVideo2 leads to the best overall performance.
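One straightforward way to combine an image encoder with a video encoder is sketched below: resize one encoder's token sequence to match the other's and concatenate along the channel dimension before the connector. The fusion scheme and the tensor shapes are illustrative assumptions, not necessarily the exact combination used in Apollo.

```python
# Hedged sketch of dual-encoder fusion for Finding 5. Shapes are illustrative.
import torch
import torch.nn.functional as F


def fuse_features(siglip_tokens: torch.Tensor, internvideo_tokens: torch.Tensor) -> torch.Tensor:
    # siglip_tokens:      (batch, n_img_tokens, d_img)
    # internvideo_tokens: (batch, n_vid_tokens, d_vid)
    # Interpolate the video tokens along the sequence axis to the image token count,
    # then concatenate along the channel dimension.
    vid = internvideo_tokens.transpose(1, 2)                # (B, d_vid, n_vid)
    vid = F.interpolate(vid, size=siglip_tokens.size(1),
                        mode="linear", align_corners=False)
    vid = vid.transpose(1, 2)                               # (B, n_img, d_vid)
    return torch.cat([siglip_tokens, vid], dim=-1)          # (B, n_img, d_img + d_vid)


fused = fuse_features(torch.randn(2, 729, 1152), torch.randn(2, 256, 768))
print(fused.shape)  # torch.Size([2, 729, 1920])
```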

Finding 7: Adding tokens (text, learned, etc.) between the video tokens derived from different frames or clips is sufficient for efficient token integration.
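Concretely, "adding tokens between the video tokens" can be as simple as the sketch below, which joins per-clip token groups with a learned separator embedding; a timestamp rendered as text tokens would be handled analogously. Names and shapes are illustrative.

```python
# Minimal token-integration sketch (Finding 7): per-clip visual tokens are joined by
# an extra separator token. Names and shapes are illustrative.
import torch
import torch.nn as nn

dim = 1024
separator = nn.Parameter(torch.randn(1, dim) * 0.02)  # learned "clip boundary" token


def integrate_clips(clip_tokens: list[torch.Tensor]) -> torch.Tensor:
    """Concatenate clip token groups, inserting the separator between clips."""
    pieces = []
    for i, tokens in enumerate(clip_tokens):  # each: (tokens_per_clip, dim)
        if i > 0:
            pieces.append(separator)
        pieces.append(tokens)
    return torch.cat(pieces, dim=0)  # later joined with the text tokens for the LLM


clips = [torch.randn(32, dim) for _ in range(4)]  # 4 clips, 32 tokens each
print(integrate_clips(clips).shape)               # torch.Size([131, 1024]) = 4*32 + 3
```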

Finding 9: Finetuning video encoders on only video data further improves overall performance, especially on reasoning and domain-specific tasks.

Finding 10: Data mixture matters, and including a moderate amount of text data and maintaining a slight video-heavy mix leads to optimal performance.
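For illustration only, a mixture-weighted source sampler could look like the sketch below. The weights are placeholders standing in for "a moderate amount of text and a slight video-heavy mix"; they are not the ratios reported in the paper.

```python
# Illustrative mixture-weighted source sampler (Finding 10). The weights are
# placeholder assumptions, not the paper's reported ratios.
import random

mixture = {"video": 0.50, "image": 0.36, "text": 0.14}


def sample_source() -> str:
    sources, weights = zip(*mixture.items())
    return random.choices(sources, weights=weights, k=1)[0]


print([sample_source() for _ in range(8)])  # e.g. ['video', 'image', 'video', ...]
```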

Results

The first five score columns are existing benchmarks; the last six (OCR through Overall) are ApolloBench categories. "-" marks scores that were not reported.

| Model | TempCompass (mc) | MLVU (m-avg) | PerceptionTest (val) | VideoMME (wo/w sub.) | L-VideoBench (val) | OCR | Egocentric | Spatial | Perception | Reasoning | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | | |
| GPT-4V (OpenAI, 2023) | - | 49.2 | - | 59.9/63.3 | 61.3 | 65.7 | 55.0 | 70.8 | 41.0 | 44.7 | 58.7 |
| GPT-4o (OpenAI, 2024) | 70.9 | 64.6 | - | 71.9/77.2 | 66.7 | 76.0 | 69.2 | 90.1 | 82.0 | 83.1 | 79.8 |
| Gemini-1.5-Flash (Team et al., 2023) | - | - | - | 70.3/75.0 | 61.6 | - | - | - | - | - | - |
| Gemini-1.5-Pro (Team et al., 2023) | 69.3 | - | - | 75.0/81.3 | 64.0 | 74.5 | 77.1 | 79.5 | 85.1 | 88.1 | 80.6 |
| Claude-3.5-Sonnet (Anthropic, 2024) | - | 36.5 | - | 60.0/62.9 | - | - | - | - | - | - | - |
| **Open-weight** | | | | | | | | | | | |
| Qwen2VL-2B (Wang et al., 2024a) | 60.6 | 59.5 | 53.9 | 55.6/60.4 | 48.5 | 29.0 | 29.0 | 47.0 | 50.0 | 46.0 | 40.2 |
| Qwen2VL-7B (Wang et al., 2024a) | 68.5 | 65.5 | 62.3 | 63.3/69.0 | 55.6 | 57.4 | 67.5 | 63.7 | 71.2 | 67.9 | 66.0 |
| Qwen2VL-72B (Wang et al., 2024a) | - | - | 68.0 | 71.2/77.8 | - | - | - | - | - | - | - |
| Aria 8x3.5B (Li et al., 2024b) | 69.9 | - | 53.9 | 67.6/72.1 | 64.2 | - | - | - | - | - | - |
| Pixtral-12B (Agrawal et al., 2024) | - | - | - | 40.7/47.5 | 44.9 | - | - | - | - | - | - |
| **Open-source** | | | | | | | | | | | |
| LLaVA-OV-0.5B (Li et al., 2024a) | 53.2 | 50.3 | 49.2 | 44.0/43.5 | 45.8 | 38.0 | 27.0 | 28.0 | 20.0 | 38.0 | 30.0 |
| VILA1.5 3B (Lin et al., 2024) | 56.1 | 44.4 | 49.1 | 42.2/44.2 | 42.9 | 31.7 | 33.0 | 29.3 | 38.0 | 44.7 | 36.1 |
| InternVL2-2B (Li et al., 2024a) | 53.4 | 48.2 | 49.6 | 30.8/- | 44.8 | 40.8 | 46.3 | 34.3 | 44.7 | 45.3 | 42.1 |
| Phi-3.5-Vision-4.2B (Abdin et al., 2024) | - | - | - | 50.8/- | - | - | - | - | - | - | - |
| LongVU 3.2B (Shen et al., 2024) | - | 55.9 | - | 51.5/- | - | - | - | - | - | - | - |
| Apollo-1.5B | 60.8 | 63.3 | 61.0 | 53.0/54.6 | 54.1 | 49.0 | 63.3 | 50.0 | 66.5 | 57.4 | 57.0 |
| LongVA-7B (Zhang et al., 2024e) | - | 56.3 | - | 52.6/54.3 | - | 32.4 | 43.1 | 41.0 | 37.7 | 51.1 | 41.5 |
| XComposer-8B (Zhang et al., 2024d) | - | 37.3 | 34.4 | 55.8/58.8 | - | 50.7 | 42.0 | 54.7 | 54.7 | 40.5 | 48.6 |
| Kangaroo-8B (Liu et al., 2024b) | 61.3 | 61.0 | - | 56.0/57.6 | 54.2 | - | - | - | - | - | - |
| Video-XL 7B (Shu et al., 2024) | - | 64.9 | - | 55.5/61.0 | 49.5 | - | - | - | - | - | - |
| Oryx 7B (Liu et al., 2024d) | - | 67.5 | - | 50.3/55.3 | 55.5 | - | - | - | - | - | - |
| Apollo-3B | 62.5 | 68.7 | 65.0 | 58.4/60.6 | 55.1 | 49.6 | 68.6 | 59.3 | 67.0 | 68.4 | 62.7 |
| InternVL2-8B (Chen et al., 2024b) | 65.3 | 50.8 | 57.4 | 54.0/56.9 | 51.8 | 50.0 | 48.4 | 54.3 | 57.7 | 51.8 | 52.8 |
| LLaVA-OV-7B (Li et al., 2024a) | 64.8 | 64.7 | 57.1 | 58.2/61.5 | 56.4 | 56.0 | 69.1 | 69.0 | 63.3 | 63.2 | 64.0 |
| LongVU 7B (Shen et al., 2024) | - | 65.4 | - | 60.6/- | - | - | - | - | - | - | - |
| LLaVA-N-Video-32B (Zhang et al., 2024f) | - | 39.3 | 59.4 | 60.2/63.0 | 50.5 | - | - | - | - | - | - |
| Oryx 34B (Liu et al., 2024d) | - | 70.8 | - | 53.9/58.0 | 62.2 | - | - | - | - | - | - |
| VILA-1.5-40B (Lin et al., 2024) | - | 56.7 | 54.0 | 60.1/61.1 | - | - | - | - | - | - | - |
| InternVL2-34B (Chen et al., 2024b) | - | 59.9 | - | 61.2/62.4 | - | - | - | - | - | - | - |
| Apollo-7B | 64.9 | 70.9 | 67.3 | 61.3/63.3 | 58.5 | 51.6 | 68.4 | 67.5 | 69.8 | 71.2 | 66.3 |

Website under construction, more coming soon...

Citation

If you find this useful, please consider citing our work:

@article{apollo,
    title={Apollo: An Exploration of Video Understanding in Large Multimodal Models},
    author={Orr Zohar and Xiaohan Wang and Yann Dubois and Nikhil Mehta and Tong Xiao and Philippe Hansen-Estruch and Licheng Yu and Xiaofang Wang and Felix Juefei-Xu and Ning Zhang and Serena Yeung-Levy and Xide Xia},
    journal={arXiv preprint arXiv:2412.10360},
    year={2024}
}