Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video-LLM architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. By adding temporal attention at the visual encoding stage, the model better captures the progression of actions and the relationships between frames before visual tokens are passed to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models on video question answering tasks, particularly action recognition, improving accuracy on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs.
Current Video-LLMs perform reasonably well on spatially grounded questions (e.g., "What color is the ball?"). However, they struggle when a query requires precise comprehension of temporal progression within the video. Many benchmarks fail to capture this, as their questions can often be answered from a single frame.
To address this, we propose STAVEQ2, a Video-LLM that injects Stacked Temporal Attention (STA) modules directly into the vision encoder. This design forces the model to learn spatiotemporal features at the visual encoding stage, before passing visual tokens to the large language model (LLM).
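For concreteness, here is a minimal sketch of what a stacked temporal attention block could look like in PyTorch, assuming the vision encoder produces per-frame patch tokens of shape `[B, T, N, D]` (batch, frames, patches per frame, hidden dim). The module names, depth, and hyperparameters are illustrative assumptions, not the exact configuration used in STAVEQ2.

```python
# Minimal sketch of a Stacked Temporal Attention (STA) block (assumed design,
# not the paper's exact implementation). Each layer attends across the frame
# axis independently for every spatial patch position.
import torch
import torch.nn as nn


class TemporalAttentionLayer(nn.Module):
    """Pre-norm self-attention over the frame axis, per spatial patch position."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, T, N, D] -> attend over T independently for each of the N patches
        b, t, n, d = x.shape
        seq = x.permute(0, 2, 1, 3).reshape(b * n, t, d)    # [B*N, T, D]
        h = self.norm(seq)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        seq = seq + attn_out                                 # residual over time
        seq = seq + self.mlp(seq)
        return seq.reshape(b, n, t, d).permute(0, 2, 1, 3)   # back to [B, T, N, D]


class StackedTemporalAttention(nn.Module):
    """A stack of temporal attention layers inserted inside/after the spatial encoder."""

    def __init__(self, dim: int, depth: int = 2, num_heads: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            [TemporalAttentionLayer(dim, num_heads) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    # 2 videos, 8 frames, 196 patches per frame, 1024-dim ViT features
    frame_tokens = torch.randn(2, 8, 196, 1024)
    sta = StackedTemporalAttention(dim=1024, depth=2)
    out = sta(frame_tokens)   # [2, 8, 196, 1024], now temporally mixed
    print(out.shape)
```

In this sketch, attending only along the frame axis keeps the added cost linear in the number of patches while letting each spatial location track how its content evolves over time, so the tokens handed to the LLM already encode temporal structure.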
We compare STAVEQ2 with Qwen2-VL and Qwen2.5-VL, as well as other state-of-the-art models.
Our STA-enhanced models (STAVEQ2) consistently outperform their baselines (Qwen2-VL) and other SOTA models, including GPT-4o, on diverse video understanding benchmarks.
| Model | VITATECS Comp. | VITATECS Dir. | VITATECS Int. | VITATECS Loc. | VITATECS Seq. | VITATECS Type | MVBench | Video-MME (w/o subs / w subs) |
|---|---|---|---|---|---|---|---|---|
| Qwen2-VL 2B | 80.8 | 82.1 | 69.6 | 76.1 | 72.2 | 85.9 | 63.2 | 55.6 / 60.4 |
| STAVEQ2 2B (Ours) | 81.3 | 83.0 | 70.1 | 76.9 | 72.9 | 86.6 | 65.1 | 56.2 / 61.3 |
| ST-LLM 7B | – | – | – | – | – | – | 54.9 | – |
| TG-Vid 7B | – | – | – | – | – | – | 56.4 | – |
| LLaVA-OneVision 7B | – | – | – | – | – | – | 56.7 | 58.2 / – |
| VideoRoPE | 81.1 | 81.8 | 60.9 | 79.4 | 80.7 | 85.8 | 57.3 | 61.6 / – |
| VideoRoPE + STA (Ours) | 81.9 | 82.9 | 61.8 | 79.9 | 81.3 | 86.3 | 59.2 | 62.5 / – |
| Qwen2-VL 7B | 88.9 | 86.6 | 78.2 | 80.6 | 82.8 | 88.8 | 67.0 | 63.3 / 69.0 |
| Qwen2.5-VL 7B | 86.1 | 80.0 | 73.0 | 77.3 | 78.8 | 88.2 | 69.6 | 65.1 / 71.6 |
| STAVEQ2 7B (Ours) | 89.8 | 87.6 | 78.7 | 80.9 | 83.9 | 88.9 | 70.1 | 66.8 / 71.8 |
| STAVEQ2.5 7B (Ours) | 88.0 | 82.1 | 74.2 | 77.9 | 79.7 | 88.9 | 70.3 | 66.2 / 72.5 |
| InternVideo2.5-Chat 8B | 91.3 | 88.7 | 82.0 | 84.8 | 84.7 | 91.0 | 75.7 | 65.1 / – |
| IV2.5-Chat 8B + STA (Ours) | 91.6 | 89.7 | 82.7 | 85.6 | 85.8 | 91.3 | 76.8 | 65.9 / – |
| LLaVA-OneVision 72B | – | – | – | – | – | – | 59.4 | 66.2 / 69.5 |
| VideoLLaMA2 72B | – | – | – | – | – | – | 62.0 | 61.4 / 63.1 |
| LLaVA-Video 72B | – | – | – | – | – | – | – | 70.6 / 76.9 |
| Qwen2-VL 72B | 89.8 | 87.8 | 77.9 | 85.3 | 84.8 | 90.4 | 73.6 | 71.2 / 77.8 |
| Qwen2.5-VL 72B | 92.1 | 88.9 | 81.9 | 87.1 | 89.4 | 91.8 | 70.4 | 73.3 / 79.1 |
| STAVEQ2 72B (Ours) | 92.8 | 90.1 | 82.3 | 87.9 | 90.3 | 92.8 | 74.5 | 73.9 / 79.9 |
| STAVEQ2.5 72B (Ours) | 93.1 | 90.9 | 82.1 | 88.0 | 90.8 | 93.3 | 72.4 | 74.2 / 79.8 |
| GPT-4o | – | – | – | – | – | – | – | 71.9 / 77.2 |
We also applied STA to InternVideo2, a vision-only foundation model. STA raised the 1B model's accuracy from 77.1% to 78.0%, surpassing the much larger 6B baseline (77.5%). This demonstrates that the gain comes from enhanced temporal modeling rather than parameter scaling (see the integration sketch after the table).
| Model | Accuracy (%) |
|---|---|
| InternVideo2 1B | 77.1 |
| InternVideo2 6B | 77.5 |
| InternVideo2 1B + STA (Ours) | 78.0 |
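To illustrate how STA can be attached to an existing per-frame vision backbone such as InternVideo2, the sketch below reuses the `StackedTemporalAttention` module from the earlier snippet. `FrameBackbone`, the pooling, and the classification head are hypothetical stand-ins for illustration; the actual InternVideo2 interface, feature shapes, and training setup may differ.

```python
# Hedged sketch: wrapping a generic per-frame encoder with the STA stack
# defined in the earlier snippet. Names and shapes here are assumptions.
import torch
import torch.nn as nn


class FrameBackbone(nn.Module):
    """Hypothetical stand-in for a per-frame vision encoder (not the real InternVideo2 API)."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # toy patch embedding

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: [B*T, 3, H, W] -> patch tokens [B*T, N, D]
        x = self.proj(frames)
        return x.flatten(2).transpose(1, 2)


class STAEnhancedEncoder(nn.Module):
    """Per-frame backbone followed by stacked temporal attention and a pooled head."""

    def __init__(self, backbone: nn.Module, dim: int, num_classes: int, depth: int = 2):
        super().__init__()
        self.backbone = backbone
        self.sta = StackedTemporalAttention(dim, depth)  # reused from the sketch above
        self.head = nn.Linear(dim, num_classes)          # e.g., an action-recognition head

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: [B, T, C, H, W]
        b, t = video.shape[:2]
        feats = self.backbone(video.flatten(0, 1))        # [B*T, N, D], frames encoded independently
        feats = feats.reshape(b, t, *feats.shape[1:])     # [B, T, N, D]
        feats = self.sta(feats)                           # temporal mixing across frames
        return self.head(feats.mean(dim=(1, 2)))          # pool over frames and patches


# usage: model = STAEnhancedEncoder(FrameBackbone(dim=1024), dim=1024, num_classes=400)
#        logits = model(torch.randn(2, 8, 3, 224, 224))   # -> [2, 400]
```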
@inproceedings{rasekh2025enhancing,
author = {Rasekh, Ali and Bagheri Soula, Erfan and Daliran, Omid and Gottschalk, Simon and Fayyaz, Mohsen},
title = {Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders},
booktitle = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS)},
year = {2025},
month = oct,
url = {https://www.microsoft.com/en-us/research/publication/enhancing-temporal-understanding-in-video-llms-through-stacked-temporal-attention-in-vision-encoders/}
}