CIRCUITPROBE: Tracing Visual Temporal Evidence Flow in Video Language Models

Yiming Zhang*,1,2, Zhuokai Zhao*,3, Chengzhang Yu*4, Kun Wang†,5, Zhendong Chu6, Qiankun Li5, Zihan Chen1,2, Yang Liu5, Zenghui Ding†,1, Yining Sun1, Qingsong Wen6
*Equal contribution   †Corresponding authors
1HFIPS, Chinese Academy of Sciences, 2University of Science and Technology of China, 3University of Chicago, 4South China University of Technology, 5Nanyang Technological University, 6Squirrel AI
Teaser: Concept Evolution

CircuitProbe reveals the internal dynamics of Video-LLMs. The visualization shows how the model's decoded concept shifts from "sitting" to "standing" frame-by-frame, tracking when and where visual temporal evidence is consolidated.

Abstract

Autoregressive large vision–language models (LVLMs) interface video and language by projecting video features into the LLM's embedding space. However, it remains unclear where temporal evidence is represented and how it causally influences decoding.

To address this gap, we present CircuitProbe, a circuit-level analysis framework that dissects the end-to-end video-language pathway through two stages:

  • Visual Auditing: Localizes object semantics within the projected video-token sequence and reveals their causal necessity via targeted ablations.
  • Semantic Tracing: Uses logit-lens probing to track the layer-wise emergence of object and temporal concepts.

Based on the analysis, we design a targeted Surgical Intervention: identifying temporally specialized attention heads and selectively amplifying them within the critical layer interval. This yields consistent improvements (up to 2.4% absolute) on the temporal-heavy TempCompass benchmark without retraining.

Insights: Dissecting the Model via Circuits

Exploring Circuit 1 (Visual Auditing) and Circuit 2 (Semantic Tracing)

Q1: Where does task-critical visual information reside?

Visual Auditing Results

Finding: Visual semantics are strongly localized to object-aligned tokens.

Ablating object tokens causes a massive performance drop (e.g., -92.6%), whereas ablating an equal number of random or register tokens has minimal impact. This confirms that critical information is spatially sparse.
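The ablation itself is straightforward: zero out (or mask) a chosen subset of projected video tokens before they enter the LLM and re-measure task accuracy. A minimal sketch of the token-ablation step, with a hypothetical `ablate_tokens` helper and toy data standing in for the projected video-token sequence:

```python
import numpy as np

def ablate_tokens(video_tokens: np.ndarray, indices: list[int]) -> np.ndarray:
    """Return a copy of the projected video-token sequence with the
    selected token positions zeroed out (a simple causal ablation)."""
    ablated = video_tokens.copy()
    ablated[indices] = 0.0
    return ablated

# Toy sequence: 8 video tokens with hidden size 4.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))

# Hypothetical object-aligned positions; in practice these come from
# localizing object semantics within the projected token sequence.
object_idx = [2, 5]
ablated = ablate_tokens(tokens, object_idx)
```

Comparing downstream accuracy with object-aligned, random, and register positions ablated (equal counts each time) isolates the causal contribution of the object tokens.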

Q3: When do semantics become language-aligned?

Semantic Tracing Results

Finding: The "Consolidation Interval".

We find a sharp phase transition in mid-to-late layers. Before this interval, visual features are processed but not readable by the language head. This defines the critical window for our surgical intervention.
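Logit-lens probing decodes an intermediate hidden state with the model's own output head: normalize the hidden state as the final layer would, then project it through the unembedding matrix and read off the top token. A minimal sketch with random stand-in weights (`W_U`, `gamma`, and the RMSNorm variant are assumptions, not the exact architecture of any specific Video-LLM):

```python
import numpy as np

def logit_lens(hidden: np.ndarray, W_U: np.ndarray, gamma: np.ndarray) -> np.ndarray:
    """Decode an intermediate hidden state: RMS-normalize with gain
    `gamma`, then project through the unembedding matrix `W_U`."""
    rms = np.sqrt(np.mean(hidden ** 2, axis=-1, keepdims=True) + 1e-6)
    return (hidden / rms * gamma) @ W_U

rng = np.random.default_rng(0)
d, vocab = 16, 100                       # toy hidden size and vocabulary
W_U = rng.standard_normal((d, vocab))    # stand-in unembedding matrix
gamma = np.ones(d)                       # stand-in final-norm gain

# Hidden states for one token position across 4 layers; decoding each
# layer reveals when the position becomes readable by the language head.
hiddens = rng.standard_normal((4, d))
top_tokens = [int(np.argmax(logit_lens(h, W_U, gamma))) for h in hiddens]
```

Plotting the decoded token (or the probability mass on the correct concept) against layer depth is what exposes the sharp phase transition described above.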

Application: Surgical Intervention

Method Overview

Our framework consists of two analytic probes and one surgical intervention. Visual Auditing (left) identifies task-critical tokens via causal ablation. Semantic Tracing (right) tracks when these tokens become language-aligned.

We model attention-head selection with a routing score $m^{(l,h)}$ and temporal dispersion metrics; only heads that satisfy these criteria are amplified, and only during the critical consolidation interval.
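The intervention itself reduces to scaling the outputs of the selected heads by a factor $\lambda$, gated on layer depth. A minimal sketch, assuming per-head outputs are available as an array and the layer window and head list come from the selection step (function and argument names are illustrative):

```python
import numpy as np

def amplify_heads(head_out: np.ndarray, selected: list[int],
                  layer: int, window: tuple[int, int], lam: float = 1.5) -> np.ndarray:
    """Scale the outputs of `selected` attention heads by `lam`, but
    only when `layer` falls inside the consolidation `window`."""
    lo, hi = window
    if not (lo <= layer <= hi):
        return head_out            # outside the window: leave untouched
    out = head_out.copy()
    for h in selected:
        out[h] *= lam
    return out

rng = np.random.default_rng(0)
head_out = rng.standard_normal((8, 16))   # 8 heads, per-head output dim 16

# Inside the (hypothetical) consolidation window: heads 1 and 4 are boosted.
boosted = amplify_heads(head_out, selected=[1, 4], layer=20, window=(18, 24), lam=2.0)
# Outside the window: the output passes through unchanged.
untouched = amplify_heads(head_out, selected=[1, 4], layer=5, window=(18, 24), lam=2.0)
```

In practice this would be applied via a forward hook on each attention module, leaving all model parameters untouched.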

Results

By amplifying the identified Temporal Attention Heads only within the consolidation interval, we can correct temporal hallucinations and ordering errors.

Camera Motion Result

We evaluate the proposed inference-time intervention on the TempCompass benchmark under 16- and 32-frame settings. As shown in the table, amplifying the identified temporal heads improves performance on TempCompass across all evaluated LVLMs, with gains reaching up to +2.4% absolute accuracy. The improvements hold under both frame settings, indicating that the intervention benefits genuine temporal reasoning rather than exploiting a specific input length or configuration.

Apple Drying Result

Sensitivity analysis of the head amplification factor $\lambda$ across layer windows, with the optimal value of $\lambda$ further illustrated in the figure. Notably, these gains are achieved without modifying model parameters or retraining, demonstrating that temporal reasoning capacity already exists within the model and can be selectively strengthened at inference time.

Qualitative Example

Qualitative comparison on Visual Dynamics (Camera Motion).

BibTeX

@article{circuitprobe2025,
  title={CIRCUITPROBE: Tracing Visual Temporal Evidence Flow in Video Language Models},
  author={Zhang, Yiming and Zhao, Zhuokai and Yu, Chengzhang and Chu, Zhendong and Wang, Kun and Li, Qiankun and Chen, Zihan and Liu, Yang and Ding, Zenghui and Sun, Yining and Wen, Qingsong},
  journal={arXiv preprint},
  year={2025}
}