Apollo: An Exploration of Video Understanding in Large Multimodal Models

Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, and Xide Xia

Meta GenAI    Stanford University

We investigate the mechanisms that drive video understanding in large multimodal models and provide actionable insights for the community. Our work includes:

  • Systematic exploration of the design space of video-LMMs, uncovering critical factors that drive performance.
  • Investigation of training schedules and data mixtures, providing practical insights for optimizing model performance.
  • Discovery of "Scaling Consistency," enabling efficient design decisions on smaller LMMs that generalize to larger scales.
  • A novel benchmark, ApolloBench, for efficient evaluation.
  • Introducing Apollo, a family of state-of-the-art video-LMMs.

We introduce Apollo, a new family of state-of-the-art video-LMMs. In developing Apollo, we uncover Scaling Consistency, enabling us to reliably make design decisions on smaller models and datasets and dramatically cutting computational costs. Guided by these principles, we train hundreds of model variants, systematically exploring video sampling strategies, token integration, training schedules, and data mixtures. Leveraging these insights, Apollo sets a new state of the art in efficient, high-performance video-language modeling.

Findings

Finding 1: We discover Scaling Consistency, where design decisions can be made on smaller models and datasets and transfer reliably to larger ones.

Finding 8: Progressively unfreezing the different components in different stages leads to superior model training dynamics.
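A minimal sketch of what staged unfreezing can look like in practice is shown below. The split of components across stages (connector first, then the vision encoder, then the LLM) and the submodule names are illustrative assumptions, not necessarily Apollo's exact schedule.

```python
# Illustrative progressive-unfreezing schedule (Finding 8). The per-stage split and
# the submodule names (.connector, .vision_encoder, .llm) are assumptions.
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable


def configure_stage(model, stage: int) -> None:
    set_trainable(model.connector, True)             # trained in every stage
    set_trainable(model.vision_encoder, stage >= 2)  # unfrozen from stage 2 onward
    set_trainable(model.llm, stage >= 3)             # unfrozen only in the final stage
```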

Finding 2: fps sampling is preferable to uniform sampling during model training and inference.
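The difference between the two strategies is easy to see in code. The NumPy sketch below contrasts uniform sampling, which returns a fixed number of frames regardless of duration, with fps sampling, which keeps the time step between sampled frames constant; function names and default values are illustrative, not taken from the Apollo codebase.

```python
# Minimal sketch of the two frame-sampling strategies compared in Finding 2.
# Function and parameter names are illustrative, not from the Apollo codebase.
import numpy as np


def uniform_sample(num_total_frames: int, num_frames: int = 32) -> np.ndarray:
    """Pick a fixed number of frames, evenly spaced, regardless of video length."""
    return np.linspace(0, num_total_frames - 1, num_frames).round().astype(int)


def fps_sample(num_total_frames: int, video_fps: float, target_fps: float = 2.0,
               max_frames: int = 256) -> np.ndarray:
    """Pick frames at a fixed temporal rate, so frame count scales with duration."""
    stride = max(int(round(video_fps / target_fps)), 1)
    indices = np.arange(0, num_total_frames, stride)
    return indices[:max_frames]


# A 10 s clip at 30 fps: uniform sampling always returns 32 frames, while fps
# sampling returns ~20 frames (2 per second), preserving a constant time step.
print(len(uniform_sample(300, 32)), len(fps_sample(300, 30.0, 2.0)))
```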

Finding 3: There is a trade-off between tokens per second (tps) and frames per second (fps), with 8-32 tokens per frame being optimal.
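The trade-off follows from a simple budget: total visual tokens ≈ duration × fps × tokens per frame. The snippet below works through this with an assumed context budget and video length; the numbers are illustrative, not the paper's training configuration.

```python
# Back-of-the-envelope token budget illustrating the tps/fps trade-off in Finding 3.
# The budget and duration below are illustrative assumptions.
duration_s = 120          # 2-minute video
context_budget = 8192     # tokens available for visual content (assumed)

for tokens_per_frame in (8, 16, 32, 64):
    # total visual tokens = duration * fps * tokens_per_frame
    max_fps = context_budget / (duration_s * tokens_per_frame)
    print(f"tokens/frame={tokens_per_frame:>2} -> max sustainable fps ~ {max_fps:.2f}")
```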

Finding 6: Perceiver resampling shows superior performance when reducing the tokens/frame.
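For reference, a Perceiver-style resampler is a small set of learned latent queries that cross-attend to the per-frame patch tokens. The sketch below is a minimal PyTorch version with illustrative hyperparameters; it is not Apollo's implementation.

```python
# Minimal Perceiver-style resampler (Finding 6): learned latent queries cross-attend
# to the per-frame patch tokens, compressing them to `num_latents` tokens per frame.
# Hyperparameters are illustrative.
import torch
import torch.nn as nn


class PerceiverResampler(nn.Module):
    def __init__(self, dim: int = 1024, num_latents: int = 32, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_patches, dim) -> (batch, num_latents, dim)
        queries = self.latents.unsqueeze(0).expand(frame_tokens.size(0), -1, -1)
        attended, _ = self.cross_attn(queries, frame_tokens, frame_tokens)
        latents = self.norm(queries + attended)
        return latents + self.mlp(latents)


# e.g. compress 729 patch tokens per frame down to 32 latents
print(PerceiverResampler()(torch.randn(2, 729, 1024)).shape)  # torch.Size([2, 32, 1024])
```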

Finding 4: SigLIP-SO400M is the best single encoder for video-LMMs.

Finding 5: Combining SigLIP-SO400M with InternVideo2 leads to the best overall performance.
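One straightforward way to combine an image encoder with a video encoder is sketched below: resize one encoder's token sequence to match the other's and concatenate along the channel dimension before the connector. The fusion scheme and the tensor shapes are illustrative assumptions, not necessarily the exact combination used in Apollo.

```python
# Hedged sketch of dual-encoder fusion for Finding 5. Shapes are illustrative.
import torch
import torch.nn.functional as F


def fuse_features(siglip_tokens: torch.Tensor, internvideo_tokens: torch.Tensor) -> torch.Tensor:
    # siglip_tokens:      (batch, n_img_tokens, d_img)
    # internvideo_tokens: (batch, n_vid_tokens, d_vid)
    # Interpolate the video tokens along the sequence axis to the image token count,
    # then concatenate along the channel dimension.
    vid = internvideo_tokens.transpose(1, 2)                # (B, d_vid, n_vid)
    vid = F.interpolate(vid, size=siglip_tokens.size(1),
                        mode="linear", align_corners=False)
    vid = vid.transpose(1, 2)                               # (B, n_img, d_vid)
    return torch.cat([siglip_tokens, vid], dim=-1)          # (B, n_img, d_img + d_vid)


fused = fuse_features(torch.randn(2, 729, 1152), torch.randn(2, 256, 768))
print(fused.shape)  # torch.Size([2, 729, 1920])
```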

Finding 7: Adding tokens (text, learned, etc.) between the video tokens derived from different frames or clips is sufficient for efficient token integration.
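Concretely, "adding tokens between the video tokens" can be as simple as the sketch below, which joins per-clip token groups with a learned separator embedding; a timestamp rendered as text tokens would be handled analogously. Names and shapes are illustrative.

```python
# Minimal token-integration sketch (Finding 7): per-clip visual tokens are joined by
# an extra separator token. Names and shapes are illustrative.
import torch
import torch.nn as nn

dim = 1024
separator = nn.Parameter(torch.randn(1, dim) * 0.02)  # learned "clip boundary" token


def integrate_clips(clip_tokens: list[torch.Tensor]) -> torch.Tensor:
    """Concatenate clip token groups, inserting the separator between clips."""
    pieces = []
    for i, tokens in enumerate(clip_tokens):  # each: (tokens_per_clip, dim)
        if i > 0:
            pieces.append(separator)
        pieces.append(tokens)
    return torch.cat(pieces, dim=0)  # later joined with the text tokens for the LLM


clips = [torch.randn(32, dim) for _ in range(4)]  # 4 clips, 32 tokens each
print(integrate_clips(clips).shape)               # torch.Size([131, 1024]) = 4*32 + 3
```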

Finding 9: Finetuning video encoders on only video data further improves overall performance, especially on reasoning and domain-specific tasks.

Finding 10: Data mixture matters, and including a moderate amount of text data and maintaining a slight video-heavy mix leads to optimal performance.
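For illustration only, a mixture-weighted source sampler could look like the sketch below. The weights are placeholders standing in for "a moderate amount of text and a slight video-heavy mix"; they are not the ratios reported in the paper.

```python
# Illustrative mixture-weighted source sampler (Finding 10). The weights are
# placeholder assumptions, not the paper's reported ratios.
import random

mixture = {"video": 0.50, "image": 0.36, "text": 0.14}


def sample_source() -> str:
    sources, weights = zip(*mixture.items())
    return random.choices(sources, weights=weights, k=1)[0]


print([sample_source() for _ in range(8)])  # e.g. ['video', 'image', 'video', ...]
```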

Results

The first five score columns are existing benchmarks; the last six (OCR through Overall) are ApolloBench categories. "-" marks scores that were not reported.

| Model | TempCompass (mc) | MLVU (m-avg) | PerceptionTest (val) | VideoMME (wo/w sub.) | L-VideoBench (val) | OCR | Egocentric | Spatial | Perception | Reasoning | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | | |
| GPT-4V (OpenAI, 2023) | - | 49.2 | - | 59.9/63.3 | 61.3 | 65.7 | 55.0 | 70.8 | 41.0 | 44.7 | 58.7 |
| GPT-4o (OpenAI, 2024) | 70.9 | 64.6 | - | 71.9/77.2 | 66.7 | 76.0 | 69.2 | 90.1 | 82.0 | 83.1 | 79.8 |
| Gemini-1.5-Flash (Team et al., 2023) | - | - | - | 70.3/75.0 | 61.6 | - | - | - | - | - | - |
| Gemini-1.5-Pro (Team et al., 2023) | 69.3 | - | - | 75.0/81.3 | 64.0 | 74.5 | 77.1 | 79.5 | 85.1 | 88.1 | 80.6 |
| Claude-3.5-Sonnet (Anthropic, 2024) | - | 36.5 | - | 60.0/62.9 | - | - | - | - | - | - | - |
| **Open-weight** | | | | | | | | | | | |
| Qwen2VL-2B (Wang et al., 2024a) | 60.6 | 59.5 | 53.9 | 55.6/60.4 | 48.5 | 29.0 | 29.0 | 47.0 | 50.0 | 46.0 | 40.2 |
| Qwen2VL-7B (Wang et al., 2024a) | 68.5 | 65.5 | 62.3 | 63.3/69.0 | 55.6 | 57.4 | 67.5 | 63.7 | 71.2 | 67.9 | 66.0 |
| Qwen2VL-72B (Wang et al., 2024a) | - | - | 68.0 | 71.2/77.8 | - | - | - | - | - | - | - |
| Aria 8x3.5B (Li et al., 2024b) | 69.9 | - | 53.9 | 67.6/72.1 | 64.2 | - | - | - | - | - | - |
| Pixtral-12B (Agrawal et al., 2024) | - | - | - | 40.7/47.5 | 44.9 | - | - | - | - | - | - |
| **Open-source** | | | | | | | | | | | |
| LLaVA-OV-0.5B (Li et al., 2024a) | 53.2 | 50.3 | 49.2 | 44.0/43.5 | 45.8 | 38.0 | 27.0 | 28.0 | 20.0 | 38.0 | 30.0 |
| VILA1.5 3B (Lin et al., 2024) | 56.1 | 44.4 | 49.1 | 42.2/44.2 | 42.9 | 31.7 | 33.0 | 29.3 | 38.0 | 44.7 | 36.1 |
| InternVL2-2B (Li et al., 2024a) | 53.4 | 48.2 | 49.6 | 30.8/- | 44.8 | 40.8 | 46.3 | 34.3 | 44.7 | 45.3 | 42.1 |
| Phi-3.5-Vision-4.2B (Abdin et al., 2024) | - | - | - | 50.8/- | - | - | - | - | - | - | - |
| LongVU 3.2B (Shen et al., 2024) | - | 55.9 | - | 51.5/- | - | - | - | - | - | - | - |
| Apollo-1.5B | 60.8 | 63.3 | 61.0 | 53.0/54.6 | 54.1 | 49.0 | 63.3 | 50.0 | 66.5 | 57.4 | 57.0 |
| LongVA-7B (Zhang et al., 2024e) | - | 56.3 | - | 52.6/54.3 | - | 32.4 | 43.1 | 41.0 | 37.7 | 51.1 | 41.5 |
| XComposer-8B (Zhang et al., 2024d) | - | 37.3 | 34.4 | 55.8/58.8 | - | 50.7 | 42.0 | 54.7 | 54.7 | 40.5 | 48.6 |
| Kangaroo-8B (Liu et al., 2024b) | 61.3 | 61.0 | - | 56.0/57.6 | 54.2 | - | - | - | - | - | - |
| Video-XL 7B (Shu et al., 2024) | - | 64.9 | - | 55.5/61.0 | 49.5 | - | - | - | - | - | - |
| Oryx 7B (Liu et al., 2024d) | - | 67.5 | - | 50.3/55.3 | 55.5 | - | - | - | - | - | - |
| Apollo-3B | 62.5 | 68.7 | 65.0 | 58.4/60.6 | 55.1 | 49.6 | 68.6 | 59.3 | 67.0 | 68.4 | 62.7 |
| InternVL2-8B (Chen et al., 2024b) | 65.3 | 50.8 | 57.4 | 54.0/56.9 | 51.8 | 50.0 | 48.4 | 54.3 | 57.7 | 51.8 | 52.8 |
| LLaVA-OV-7B (Li et al., 2024a) | 64.8 | 64.7 | 57.1 | 58.2/61.5 | 56.4 | 56.0 | 69.1 | 69.0 | 63.3 | 63.2 | 64.0 |
| LongVU 7B (Shen et al., 2024) | - | 65.4 | - | 60.6/- | - | - | - | - | - | - | - |
| LLaVA-N-Video-32B (Zhang et al., 2024f) | - | 39.3 | 59.4 | 60.2/63.0 | 50.5 | - | - | - | - | - | - |
| Oryx 34B (Liu et al., 2024d) | - | 70.8 | - | 53.9/58.0 | 62.2 | - | - | - | - | - | - |
| VILA-1.5-40B (Lin et al., 2024) | - | 56.7 | 54.0 | 60.1/61.1 | - | - | - | - | - | - | - |
| InternVL2-34B (Chen et al., 2024b) | - | 59.9 | - | 61.2/62.4 | - | - | - | - | - | - | - |
| Apollo-7B | 64.9 | 70.9 | 67.3 | 61.3/63.3 | 58.5 | 51.6 | 68.4 | 67.5 | 69.8 | 71.2 | 66.3 |

Website under construction, more coming soon...

Citation

If you find this useful, please consider citing our work:

@article{apollo,
    title={Apollo: An Exploration of Video Understanding in Large Multimodal Models},
    author={Orr Zohar and Xiaohan Wang and Yann Dubois and Nikhil Mehta and Tong Xiao and Philippe Hansen-Estruch and Licheng Yu and Xiaofang Wang and Felix Juefei-Xu and Ning Zhang and Serena Yeung-Levy and Xide Xia},
    journal={arXiv preprint arXiv:2412.10360},
    year={2024}
}