We investigate the mechanisms that drive video understanding in large multimodal models and provide actionable insights for the community. Our work includes:
- **Apollo**, a new family of state-of-the-art video-LMMs.
- **Scaling Consistency**, uncovered while developing Apollo: design decisions made on smaller models and datasets transfer reliably to larger scales, dramatically cutting computational costs.
- A systematic study, guided by these principles, of hundreds of model variants exploring video sampling strategies (illustrated in the sketch below), token integration, training schedules, and data mixtures.

Leveraging these insights, Apollo sets a new benchmark in efficient, high-performance video-language modeling.
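As a concrete illustration of what "video sampling strategies" covers, here is a minimal sketch of the two most common options: uniform sampling, which spends a fixed frame budget regardless of duration, and fps sampling, which keeps a fixed temporal rate so longer videos yield more frames. This is an illustrative example written with OpenCV, not Apollo's preprocessing code; the video path, the 32-frame budget, and the 2 fps target are placeholder values.

```python
# Illustrative sketch of two frame-sampling strategies (not Apollo's actual pipeline).
import cv2  # pip install opencv-python


def uniform_sample_indices(num_frames: int, num_samples: int) -> list[int]:
    """Spread `num_samples` frame indices evenly across the whole video."""
    if num_frames <= num_samples:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def fps_sample_indices(num_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Sample frames at a fixed temporal rate, so the frame count grows with duration."""
    stride = max(int(round(native_fps / target_fps)), 1)
    return list(range(0, num_frames, stride))


def read_frames(video_path: str, indices: list[int]):
    """Decode only the requested frame indices (seek-based; can be slow for some codecs)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames


if __name__ == "__main__":
    path = "example.mp4"  # placeholder path, not a file shipped with Apollo
    cap = cv2.VideoCapture(path)
    num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    cap.release()

    uniform = uniform_sample_indices(num_frames, num_samples=32)            # fixed frame budget
    fps_based = fps_sample_indices(num_frames, native_fps, target_fps=2.0)  # fixed temporal rate
    print(f"uniform: {len(uniform)} frames | fps-based: {len(fps_based)} frames")
```

The trade-off is that uniform sampling caps the per-video frame count but thins out temporal coverage on long clips, whereas fps sampling preserves temporal density at the cost of a variable, and potentially large, number of frames.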
Comparison with state-of-the-art video-LMMs. The first five score columns are existing benchmarks (TempCompass, MLVU, PerceptionTest, VideoMME, L-VideoBench); the last six columns (OCR through Overall) are per-category and overall accuracy on our ApolloBench.

Model | TempCompass (mc) | MLVU (m-avg) | PerceptionTest (val) | VideoMME (wo/w sub.) | L-VideoBench (val) | OCR | Egocentric | Spatial | Perception | Reasoning | Overall |
---|---|---|---|---|---|---|---|---|---|---|---|
**Proprietary** | | | | | | | | | | | |
GPT-4V (OpenAI, 2023) | - | 49.2 | - | 59.9/63.3 | 61.3 | 65.7 | 55.0 | 70.8 | 41.0 | 44.7 | 58.7 |
GPT-4o (OpenAI, 2024) | 70.9 | 64.6 | - | 71.9/77.2 | 66.7 | 76.0 | 69.2 | 90.1 | 82.0 | 83.1 | 79.8 |
Gemini-1.5-Flash (Team et al., 2023) | - | - | - | 70.3/75.0 | 61.6 | - | - | - | - | - | - |
Gemini-1.5-Pro (Team et al., 2023) | 69.3 | - | - | 75.0/81.3 | 64.0 | 74.5 | 77.1 | 79.5 | 85.1 | 88.1 | 80.6 |
Claude-3.5-Sonnet (Anthropic, 2024) | - | 36.5 | - | 60.0/62.9 | - | - | - | - | - | - | - |
**Open-weight** | | | | | | | | | | | |
Qwen2VL-2B (Wang et al., 2024a) | 60.6 | 59.5 | 53.9 | 55.6/60.4 | 48.5 | 29.0 | 29.0 | 47.0 | 50.0 | 46.0 | 40.2 |
Qwen2VL-7B (Wang et al., 2024a) | 68.5 | 65.5 | 62.3 | 63.3/69.0 | 55.6 | 57.4 | 67.5 | 63.7 | 71.2 | 67.9 | 66.0 |
Qwen2VL-72B (Wang et al., 2024a) | - | - | 68.0 | 71.2/77.8 | - | - | - | - | - | - | - |
Aria 8x3.5B (Li et al., 2024b) | 69.9 | - | 53.9 | 67.6/72.1 | 64.2 | - | - | - | - | - | - |
Pixtral-12B (Agrawal et al., 2024) | - | - | - | 40.7/47.5 | 44.9 | - | - | - | - | - | - |
**Open-source** | | | | | | | | | | | |
LLaVA-OV-0.5B (Li et al., 2024a) | 53.2 | 50.3 | 49.2 | 44.0/43.5 | 45.8 | 38.0 | 27.0 | 28.0 | 20.0 | 38.0 | 30.0 |
VILA1.5 3B (Lin et al., 2024) | 56.1 | 44.4 | 49.1 | 42.2/44.2 | 42.9 | 31.7 | 33.0 | 29.3 | 38.0 | 44.7 | 36.1 |
InternVL2-2B (Chen et al., 2024b) | 53.4 | 48.2 | 49.6 | 30.8/- | 44.8 | 40.8 | 46.3 | 34.3 | 44.7 | 45.3 | 42.1 |
Phi-3.5-Vision-4.2B (Abdin et al., 2024) | - | - | - | 50.8/- | - | - | - | - | - | - | - |
LongVU 3.2B (Shen et al., 2024) | - | 55.9 | - | 51.5/- | - | - | - | - | - | - | - |
Apollo-1.5B | 60.8 | 63.3 | 61.0 | 53.0/54.6 | 54.1 | 49.0 | 63.3 | 50.0 | 66.5 | 57.4 | 57.0 |
LongVA-7B (Zhang et al., 2024e) | - | 56.3 | - | 52.6/54.3 | - | 32.4 | 43.1 | 41.0 | 37.7 | 51.1 | 41.5 |
XComposer-8B (Zhang et al., 2024d) | - | 37.3 | 34.4 | 55.8/58.8 | - | 50.7 | 42.0 | 54.7 | 54.7 | 40.5 | 48.6 |
Kangaroo-8B (Liu et al., 2024b) | 61.3 | 61.0 | - | 56.0/57.6 | 54.2 | - | - | - | - | - | - |
Video-XL 7B (Shu et al., 2024) | - | 64.9 | - | 55.5/61.0 | 49.5 | - | - | - | - | - | - |
Oryx 7B (Liu et al., 2024d) | - | 67.5 | - | 50.3/55.3 | 55.5 | - | - | - | - | - | - |
Apollo-3B | 62.5 | 68.7 | 65.0 | 58.4/60.6 | 55.1 | 49.6 | 68.6 | 59.3 | 67.0 | 68.4 | 62.7 |
InternVL2-8B (Chen et al., 2024b) | 65.3 | 50.8 | 57.4 | 54.0/56.9 | 51.8 | 50.0 | 48.4 | 54.3 | 57.7 | 51.8 | 52.8 |
LLaVA-OV-7B (Li et al., 2024a) | 64.8 | 64.7 | 57.1 | 58.2/61.5 | 56.4 | 56.0 | 69.1 | 69.0 | 63.3 | 63.2 | 64.0 |
LongVU 7B (Shen et al., 2024) | - | 65.4 | - | 60.6/- | - | - | - | - | - | - | - |
LLaVA-N-Video-32B (Zhang et al., 2024f) | - | 39.3 | 59.4 | 60.2/63.0 | 50.5 | - | - | - | - | - | - |
Oryx 34B (Liu et al., 2024d) | - | 70.8 | - | 53.9/58.0 | 62.2 | - | - | - | - | - | - |
VILA-1.5-40B (Lin et al., 2024) | - | 56.7 | 54.0 | 60.1/61.1 | - | - | - | - | - | - | - |
InternVL2-34B (Chen et al., 2024b) | - | 59.9 | - | 61.2/62.4 | - | - | - | - | - | - | - |
Apollo-7B | 64.9 | 70.9 | 67.3 | 61.3/63.3 | 58.5 | 51.6 | 68.4 | 67.5 | 69.8 | 71.2 | 66.3 |
Website under construction, more coming soon...
If you find this useful, please consider citing our work:
```bibtex
@article{apollo,
  title={Apollo: An Exploration of Video Understanding in Large Multimodal Models},
  author={Orr Zohar and Xiaohan Wang and Yann Dubois and Nikhil Mehta and Tong Xiao and Philippe Hansen-Estruch and Licheng Yu and Xiaofang Wang and Felix Juefei-Xu and Ning Zhang and Serena Yeung-Levy and Xide Xia},
  journal={arXiv preprint arXiv:2412.10360},
  year={2024}
}
```