Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders

¹L3S Research Center • ²Independent Researcher • ³Microsoft
*Denotes equal contribution.

News

  • (Nov 2025) 🎉 Paper accepted to 7th International Workshop on Large Scale Holistic Video Understanding.
  • (Sep 2025) 🎉 Paper accepted to NeurIPS 2025. Paper and code links are now available.

Abstract

Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video-LLM architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. Incorporating temporal attention into the vision encoder enables the model to better capture the progression of actions and the relationships between frames before visual tokens are passed to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models on video question answering tasks, particularly in action recognition. We improve results on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs.

Observation: Video-LLMs are Temporally Naive

Current Video-LLMs perform reasonably well on spatially grounded questions (e.g., "What color is the ball?"). However, they struggle when the query requires precise comprehension of temporal progression within the video. Many benchmarks fail to capture this, as their questions can often be answered from a single frame.

[Figure: side-by-side comparison of Qwen2-VL and STAVEQ2 on temporal tasks (biking and pulling actions).]
Figure 1: Motivation. Baseline models such as Qwen2-VL can identify simple actions (e.g., "Biking") but fail on temporally complex tasks. As shown on the right, STAVEQ2 correctly identifies the direction of the pulling action ("right to left") where the baseline fails.

Methodology: Stacked Temporal Attention

To address this, we propose STAVEQ2, a Video-LLM that injects Stacked Temporal Attention (STA) modules directly into the vision encoder. This design forces the model to learn spatiotemporal features at the visual encoding stage, before passing tokens to the Large Language Model.

[Figure: STAVEQ2 architecture diagram showing the vision encoder with STA blocks, the projector, and the LLM.]
Figure 2: The STAVEQ2 architecture. STA blocks are inserted into the vision encoder.
  • Each block processes spatial information (within-frame) first, then temporal information (across-frames).
  • Temporal attention uses 4x fewer heads than spatial attention, improving performance with minimal compute overhead.
  • We use 1D Rotary Position Embeddings (RoPE) to encode temporal structure across frames.
  • Two-Stage Training:
    • Stage 1: We freeze the base model and train only the new STA blocks (initialized to output zero) with a linear warmup.
    • Stage 2: We add LoRA adapters and jointly train the STA blocks and adapters to align the new spatiotemporal features with the LLM.
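For concreteness, below is a minimal PyTorch sketch of the block design described above: spatial attention within each frame, then temporal attention across frames with roughly 4x fewer heads, 1D RoPE over the frame index, and a zero-initialized gate so each STA block starts out as an identity mapping (in line with the Stage-1 recipe). The class and variable names (STABlock, TemporalAttention, rope_1d, gate) and the dimensions are illustrative choices for this sketch, not our released implementation.

# Illustrative sketch of a Stacked Temporal Attention (STA) block (PyTorch).
# Names and dimensions are hypothetical; this is not the released code.

import torch
import torch.nn as nn


def rope_1d(x: torch.Tensor) -> torch.Tensor:
    """Apply 1D rotary position embeddings along the time axis.
    x: (batch, heads, time, head_dim) with even head_dim."""
    b, h, t, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (torch.arange(0, half, device=x.device, dtype=torch.float32) / half))
    angles = torch.arange(t, device=x.device, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # (time, half), broadcast over batch/heads
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class TemporalAttention(nn.Module):
    """Self-attention across frames for each spatial location."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * patches, frames, dim) -- attention runs over the frame axis.
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.heads, self.head_dim).transpose(1, 2)
        q, k = rope_1d(q), rope_1d(k)              # 1D RoPE over the frame index
        out = nn.functional.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, T, D))


class STABlock(nn.Module):
    """Spatial attention within each frame, then temporal attention across frames.
    The temporal path uses ~4x fewer heads and a zero-initialized gate, so the
    block behaves as an identity at the start of training."""
    def __init__(self, dim: int = 768, spatial_heads: int = 12):
        super().__init__()
        self.spatial_norm = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, spatial_heads, batch_first=True)
        self.temporal_norm = nn.LayerNorm(dim)
        self.temporal_attn = TemporalAttention(dim, heads=max(1, spatial_heads // 4))
        self.gate = nn.Parameter(torch.zeros(1))   # zero-init: block starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim)
        B, T, P, D = x.shape
        # Spatial attention: attend over patches within each frame.
        s = self.spatial_norm(x).reshape(B * T, P, D)
        s, _ = self.spatial_attn(s, s, s)
        x = x + s.reshape(B, T, P, D)
        # Temporal attention: attend over frames at each spatial location.
        t = self.temporal_norm(x).permute(0, 2, 1, 3).reshape(B * P, T, D)
        t = self.temporal_attn(t).reshape(B, P, T, D).permute(0, 2, 1, 3)
        return x + self.gate * t


# Example: 2 frames, a 4x4 patch grid, 768-dim tokens.
tokens = torch.randn(1, 2, 16, 768)
print(STABlock()(tokens).shape)                    # torch.Size([1, 2, 16, 768])

The zero-initialized gate leaves the pretrained vision encoder's behavior unchanged at the start of Stage 1, so the new STA blocks can be warmed up without destabilizing the frozen backbone.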

Evaluation Results

We compare STAVEQ2 against the Qwen2-VL and Qwen2.5-VL baselines, as well as other state-of-the-art models.


Comparison on General Video Benchmarks

Our STA-enhanced models (STAVEQ2 and STAVEQ2.5) consistently outperform their baselines (Qwen2-VL and Qwen2.5-VL) and other SOTA models, including GPT-4o, across diverse video understanding benchmarks.

| Model | VITATECS Comp. | Dir. | Int. | Loc. | Seq. | Type | MVBench | Video-MME (w/o / w subs) |
| Qwen2-VL 2B | 80.8 | 82.1 | 69.6 | 76.1 | 72.2 | 85.9 | 63.2 | 55.6 / 60.4 |
| STAVEQ2 2B (Ours) | 81.3 | 83.0 | 70.1 | 76.9 | 72.9 | 86.6 | 65.1 | 56.2 / 61.3 |
| ST-LLM 7B | – | – | – | – | – | – | 54.9 | – |
| TG-Vid 7B | – | – | – | – | – | – | 56.4 | – |
| LLaVA-OneVision 7B | – | – | – | – | – | – | 56.7 | 58.2 / – |
| VideoRoPE | 81.1 | 81.8 | 60.9 | 79.4 | 80.7 | 85.8 | 57.3 | 61.6 / – |
| VideoRoPE + STA (Ours) | 81.9 | 82.9 | 61.8 | 79.9 | 81.3 | 86.3 | 59.2 | 62.5 / – |
| Qwen2-VL 7B | 88.9 | 86.6 | 78.2 | 80.6 | 82.8 | 88.8 | 67.0 | 63.3 / 69.0 |
| Qwen2.5-VL 7B | 86.1 | 80.0 | 73.0 | 77.3 | 78.8 | 88.2 | 69.6 | 65.1 / 71.6 |
| STAVEQ2 7B (Ours) | 89.8 | 87.6 | 78.7 | 80.9 | 83.9 | 88.9 | 70.1 | 66.8 / 71.8 |
| STAVEQ2.5 7B (Ours) | 88.0 | 82.1 | 74.2 | 77.9 | 79.7 | 88.9 | 70.3 | 66.2 / 72.5 |
| InternVideo2.5-Chat 8B | 91.3 | 88.7 | 82.0 | 84.8 | 84.7 | 91.0 | 75.7 | 65.1 / – |
| IV2.5-Chat 8B + STA (Ours) | 91.6 | 89.7 | 82.7 | 85.6 | 85.8 | 91.3 | 76.8 | 65.9 / – |
| LLaVA-OneVision 72B | – | – | – | – | – | – | 59.4 | 66.2 / 69.5 |
| VideoLLaMA2 72B | – | – | – | – | – | – | 62.0 | 61.4 / 63.1 |
| LLaVA-Video 72B | – | – | – | – | – | – | – | 70.6 / 76.9 |
| Qwen2-VL 72B | 89.8 | 87.8 | 77.9 | 85.3 | 84.8 | 90.4 | 73.6 | 71.2 / 77.8 |
| Qwen2.5-VL 72B | 92.1 | 88.9 | 81.9 | 87.1 | 89.4 | 91.8 | 70.4 | 73.3 / 79.1 |
| STAVEQ2 72B (Ours) | 92.8 | 90.1 | 82.3 | 87.9 | 90.3 | 92.8 | 74.5 | 73.9 / 79.9 |
| STAVEQ2.5 72B (Ours) | 93.1 | 90.9 | 82.1 | 88.0 | 90.8 | 93.3 | 72.4 | 74.2 / 79.8 |
| GPT-4o | – | – | – | – | – | – | – | 71.9 / 77.2 |

SOTA on SSv2 Action Recognition

We applied STA to InternVideo2, a vision-only video foundation model, on Something-Something V2 (SSv2) action recognition. STA boosts the 1B model to 78.0% accuracy, surpassing the much larger 6B baseline. This indicates that the gain comes from enhanced temporal modeling rather than parameter scaling.

| Model | Accuracy (%) |
| InternVideo2 1B | 77.1 |
| InternVideo2 6B | 77.5 |
| InternVideo2 1B + STA (Ours) | 78.0 |

BibTeX

@inproceedings{rasekh2025enhancing,
    author    = {Rasekh, Ali and Bagheri Soula, Erfan and Daliran, Omid and Gottschalk, Simon and Fayyaz, Mohsen},
    title     = {Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders},
    booktitle = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS)},
    year      = {2025},
    month     = oct,
    url       = {https://www.microsoft.com/en-us/research/publication/enhancing-temporal-understanding-in-video-llms-through-stacked-temporal-attention-in-vision-encoders/}
}