Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders

¹L3S Research Center • ²Independent Researcher • ³Microsoft
*Denotes equal contribution.

News

  • (Nov 2025) 🎉 Paper accepted to 7th International Workshop on Large Scale Holistic Video Understanding.
  • (Sep 2025) 🎉 Paper accepted to NeurIPS 2025. Paper and code links are now available.

Abstract

Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video-LLM architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. Incorporating temporal attention into the vision encoder enables the model to better capture the progression of actions and the relationships between frames before visual tokens are passed to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models on video question answering tasks, particularly in action recognition. We improve results on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs.

Observation: Video-LLMs are Temporally Naive

Current Video-LLMs perform reasonably well on spatially grounded questions (e.g., "What color is the ball?"). However, they struggle when the query requires precise comprehension of temporal progression within the video. Many benchmarks fail to capture this, as their questions can often be answered from a single frame.

[Figure: side-by-side comparison of Qwen2-VL and STAVEQ2 on temporal tasks (biking and pulling actions).]
Figure 1: Motivation. Baseline models such as Qwen2-VL can identify simple actions (e.g., "Biking") but fail on temporally complex tasks. As shown on the right, STAVEQ2 correctly identifies the direction of the pulling action ("right to left") where the baseline fails.

Methodology: Stacked Temporal Attention

To address this, we propose STAVEQ2, a Video-LLM that injects Stacked Temporal Attention (STA) modules directly into the vision encoder. This design forces the model to learn spatiotemporal features at the visual encoding stage, before passing tokens to the Large Language Model.

[Figure: STAVEQ2 architecture diagram showing the vision encoder with STA blocks, the projector, and the LLM.]
Figure 2: The STAVEQ2 architecture. STA blocks are inserted into the vision encoder.
  • Each block processes spatial information (within-frame) first, then temporal information (across-frames).
  • Temporal attention uses 4x fewer heads than spatial attention, improving performance with minimal compute overhead.
  • We use 1D Rotary Position Embeddings (RoPE) to encode temporal structure across frames.
  • Two-Stage Training:
    • Stage 1: We freeze the base model and train only the new STA blocks (initialized to output zero) with a linear warmup.
    • Stage 2: We add LoRA adapters and jointly train the STA blocks and adapters to align the new spatiotemporal features with the LLM.
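For concreteness, below is a minimal PyTorch sketch of the block design described above: spatial attention within each frame, then temporal attention across frames with roughly 4x fewer heads, 1D RoPE over the frame index, and a zero-initialized gate so each STA block starts out as an identity mapping (in line with the Stage-1 recipe). The class and variable names (STABlock, TemporalAttention, rope_1d, gate) and the dimensions are illustrative choices for this sketch, not our released implementation.

# Illustrative sketch of a Stacked Temporal Attention (STA) block (PyTorch).
# Names and dimensions are hypothetical; this is not the released code.

import torch
import torch.nn as nn


def rope_1d(x: torch.Tensor) -> torch.Tensor:
    """Apply 1D rotary position embeddings along the time axis.
    x: (batch, heads, time, head_dim) with even head_dim."""
    b, h, t, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (torch.arange(0, half, device=x.device, dtype=torch.float32) / half))
    angles = torch.arange(t, device=x.device, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # (time, half), broadcast over batch/heads
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class TemporalAttention(nn.Module):
    """Self-attention across frames for each spatial location."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * patches, frames, dim) -- attention runs over the frame axis.
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.heads, self.head_dim).transpose(1, 2)
        q, k = rope_1d(q), rope_1d(k)              # 1D RoPE over the frame index
        out = nn.functional.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, T, D))


class STABlock(nn.Module):
    """Spatial attention within each frame, then temporal attention across frames.
    The temporal path uses ~4x fewer heads and a zero-initialized gate, so the
    block behaves as an identity at the start of training."""
    def __init__(self, dim: int = 768, spatial_heads: int = 12):
        super().__init__()
        self.spatial_norm = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, spatial_heads, batch_first=True)
        self.temporal_norm = nn.LayerNorm(dim)
        self.temporal_attn = TemporalAttention(dim, heads=max(1, spatial_heads // 4))
        self.gate = nn.Parameter(torch.zeros(1))   # zero-init: block starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim)
        B, T, P, D = x.shape
        # Spatial attention: attend over patches within each frame.
        s = self.spatial_norm(x).reshape(B * T, P, D)
        s, _ = self.spatial_attn(s, s, s)
        x = x + s.reshape(B, T, P, D)
        # Temporal attention: attend over frames at each spatial location.
        t = self.temporal_norm(x).permute(0, 2, 1, 3).reshape(B * P, T, D)
        t = self.temporal_attn(t).reshape(B, P, T, D).permute(0, 2, 1, 3)
        return x + self.gate * t


# Example: 2 frames, a 4x4 patch grid, 768-dim tokens.
tokens = torch.randn(1, 2, 16, 768)
print(STABlock()(tokens).shape)                    # torch.Size([1, 2, 16, 768])

The zero-initialized gate leaves the pretrained vision encoder's behavior unchanged at the start of Stage 1, so the new STA blocks can be warmed up without destabilizing the frozen backbone.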

Evaluation Results

We compare STAVEQ2 against the Qwen2-VL and Qwen2.5-VL baselines, as well as other state-of-the-art models.


Comparison on General Video Benchmarks

Our STA-enhanced models (STAVEQ2 and STAVEQ2.5) consistently outperform their baselines (Qwen2-VL and Qwen2.5-VL) and other SOTA models, including GPT-4o, across diverse video understanding benchmarks.

| Model | VITATECS Comp. | Dir. | Int. | Loc. | Seq. | Type | MVBench | Video-MME (w/o / w subs) |
| Qwen2-VL 2B | 80.8 | 82.1 | 69.6 | 76.1 | 72.2 | 85.9 | 63.2 | 55.6 / 60.4 |
| STAVEQ2 2B (Ours) | 81.3 | 83.0 | 70.1 | 76.9 | 72.9 | 86.6 | 65.1 | 56.2 / 61.3 |
| ST-LLM 7B | – | – | – | – | – | – | 54.9 | – |
| TG-Vid 7B | – | – | – | – | – | – | 56.4 | – |
| LLaVA-OneVision 7B | – | – | – | – | – | – | 56.7 | 58.2 / – |
| VideoRoPE | 81.1 | 81.8 | 60.9 | 79.4 | 80.7 | 85.8 | 57.3 | 61.6 / – |
| VideoRoPE + STA (Ours) | 81.9 | 82.9 | 61.8 | 79.9 | 81.3 | 86.3 | 59.2 | 62.5 / – |
| Qwen2-VL 7B | 88.9 | 86.6 | 78.2 | 80.6 | 82.8 | 88.8 | 67.0 | 63.3 / 69.0 |
| Qwen2.5-VL 7B | 86.1 | 80.0 | 73.0 | 77.3 | 78.8 | 88.2 | 69.6 | 65.1 / 71.6 |
| STAVEQ2 7B (Ours) | 89.8 | 87.6 | 78.7 | 80.9 | 83.9 | 88.9 | 70.1 | 66.8 / 71.8 |
| STAVEQ2.5 7B (Ours) | 88.0 | 82.1 | 74.2 | 77.9 | 79.7 | 88.9 | 70.3 | 66.2 / 72.5 |
| InternVideo2.5-Chat 8B | 91.3 | 88.7 | 82.0 | 84.8 | 84.7 | 91.0 | 75.7 | 65.1 / – |
| IV2.5-Chat 8B + STA (Ours) | 91.6 | 89.7 | 82.7 | 85.6 | 85.8 | 91.3 | 76.8 | 65.9 / – |
| LLaVA-OneVision 72B | – | – | – | – | – | – | 59.4 | 66.2 / 69.5 |
| VideoLLaMA2 72B | – | – | – | – | – | – | 62.0 | 61.4 / 63.1 |
| LLaVA-Video 72B | – | – | – | – | – | – | – | 70.6 / 76.9 |
| Qwen2-VL 72B | 89.8 | 87.8 | 77.9 | 85.3 | 84.8 | 90.4 | 73.6 | 71.2 / 77.8 |
| Qwen2.5-VL 72B | 92.1 | 88.9 | 81.9 | 87.1 | 89.4 | 91.8 | 70.4 | 73.3 / 79.1 |
| STAVEQ2 72B (Ours) | 92.8 | 90.1 | 82.3 | 87.9 | 90.3 | 92.8 | 74.5 | 73.9 / 79.9 |
| STAVEQ2.5 72B (Ours) | 93.1 | 90.9 | 82.1 | 88.0 | 90.8 | 93.3 | 72.4 | 74.2 / 79.8 |
| GPT-4o | – | – | – | – | – | – | – | 71.9 / 77.2 |

SOTA on SSv2 Action Recognition

We applied STA to InternVideo2, a vision-only video foundation model, on Something-Something V2 (SSv2) action recognition. STA boosts the 1B model to 78.0% accuracy, surpassing the much larger 6B baseline. This indicates that the gain comes from enhanced temporal modeling rather than parameter scaling.

| Model | Accuracy (%) |
| InternVideo2 1B | 77.1 |
| InternVideo2 6B | 77.5 |
| InternVideo2 1B + STA (Ours) | 78.0 |

BibTeX

@inproceedings{rasekh2025enhancing,
    author    = {Rasekh, Ali and Bagheri Soula, Erfan and Daliran, Omid and Gottschalk, Simon and Fayyaz, Mohsen},
    title     = {Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders},
    booktitle = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS)},
    year      = {2025},
    month     = oct,
    url       = {https://www.microsoft.com/en-us/research/publication/enhancing-temporal-understanding-in-video-llms-through-stacked-temporal-attention-in-vision-encoders/}
}