CIRCUITPROBE: Tracing Visual Temporal Evidence Flow in Video Language Models

Yiming Zhang*,1,2, Zhuokai Zhao*,3, Chengzhang Yu*4, Kun Wang†,5, Zhendong Chu6, Qiankun Li5, Zihan Chen1,2, Yang Liu5, Zenghui Ding†,1, Yining Sun1, Qingsong Wen6
*Equal contribution   †Corresponding authors
1HFIPS, Chinese Academy of Sciences, 2University of Science and Technology of China, 3University of Chicago, 4South China University of Technology, 5Nanyang Technological University, 6Squirrel AI
Teaser: Concept Evolution

CircuitProbe reveals the internal dynamics of Video-LLMs. The visualization shows how the model's decoded concept shifts from "sitting" to "standing" frame-by-frame, tracking when and where visual temporal evidence is consolidated.

Abstract

Autoregressive large vision–language models (LVLMs) interface video and language by projecting video features into the LLM's embedding space. However, it remains unclear where temporal evidence is represented and how it causally influences decoding.

To address this gap, we present CircuitProbe, a circuit-level analysis framework that dissects the end-to-end video-language pathway through two stages:

  • Visual Auditing: Localizes object semantics within the projected video-token sequence and reveals their causal necessity via targeted ablations.
  • Semantic Tracing: Uses logit-lens probing to track the layer-wise emergence of object and temporal concepts.

Based on the analysis, we design a targeted Surgical Intervention: identifying temporally specialized attention heads and selectively amplifying them within the critical layer interval. This yields consistent improvements (up to 2.4% absolute) on the temporal-heavy TempCompass benchmark without retraining.

Insights: Dissecting the Model via Circuits

Exploring Circuit 1 (Visual Auditing) and Circuit 2 (Semantic Tracing)

Q1: Where does task-critical visual information reside?

Visual Auditing Results

Finding: Visual semantics are strongly localized to object-aligned tokens.

Ablating object tokens causes a massive performance drop (e.g., -92.6%), whereas ablating an equal number of random or register tokens has minimal impact. This confirms that critical information is spatially sparse.
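The ablation itself is straightforward: zero out (or mask) a chosen subset of projected video tokens before they enter the LLM and re-measure task accuracy. A minimal sketch of the token-ablation step, with a hypothetical `ablate_tokens` helper and toy data standing in for the projected video-token sequence:

```python
import numpy as np

def ablate_tokens(video_tokens: np.ndarray, indices: list[int]) -> np.ndarray:
    """Return a copy of the projected video-token sequence with the
    selected token positions zeroed out (a simple causal ablation)."""
    ablated = video_tokens.copy()
    ablated[indices] = 0.0
    return ablated

# Toy sequence: 8 video tokens with hidden size 4.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))

# Hypothetical object-aligned positions; in practice these come from
# localizing object semantics within the projected token sequence.
object_idx = [2, 5]
ablated = ablate_tokens(tokens, object_idx)
```

Comparing downstream accuracy with object-aligned, random, and register positions ablated (equal counts each time) isolates the causal contribution of the object tokens.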

Q3: When do semantics become language-aligned?

Semantic Tracing Results

Finding: The "Consolidation Interval".

We find a sharp phase transition in mid-to-late layers. Before this interval, visual features are processed but not readable by the language head. This defines the critical window for our surgical intervention.
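Logit-lens probing decodes an intermediate hidden state with the model's own output head: normalize the hidden state as the final layer would, then project it through the unembedding matrix and read off the top token. A minimal sketch with random stand-in weights (`W_U`, `gamma`, and the RMSNorm variant are assumptions, not the exact architecture of any specific Video-LLM):

```python
import numpy as np

def logit_lens(hidden: np.ndarray, W_U: np.ndarray, gamma: np.ndarray) -> np.ndarray:
    """Decode an intermediate hidden state: RMS-normalize with gain
    `gamma`, then project through the unembedding matrix `W_U`."""
    rms = np.sqrt(np.mean(hidden ** 2, axis=-1, keepdims=True) + 1e-6)
    return (hidden / rms * gamma) @ W_U

rng = np.random.default_rng(0)
d, vocab = 16, 100                       # toy hidden size and vocabulary
W_U = rng.standard_normal((d, vocab))    # stand-in unembedding matrix
gamma = np.ones(d)                       # stand-in final-norm gain

# Hidden states for one token position across 4 layers; decoding each
# layer reveals when the position becomes readable by the language head.
hiddens = rng.standard_normal((4, d))
top_tokens = [int(np.argmax(logit_lens(h, W_U, gamma))) for h in hiddens]
```

Plotting the decoded token (or the probability mass on the correct concept) against layer depth is what exposes the sharp phase transition described above.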

Application: Surgical Intervention

Method Overview

Our framework consists of two analytic probes and one surgical intervention. Visual Auditing (left) identifies task-critical tokens via causal ablation. Semantic Tracing (right) tracks when these tokens become language-aligned.

We model attention-head selection with a routing score $m^{(l,h)}$ and temporal dispersion metrics; only heads that satisfy these criteria are amplified, and only during the critical consolidation interval.
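The intervention itself reduces to scaling the outputs of the selected heads by a factor $\lambda$, gated on layer depth. A minimal sketch, assuming per-head outputs are available as an array and the layer window and head list come from the selection step (function and argument names are illustrative):

```python
import numpy as np

def amplify_heads(head_out: np.ndarray, selected: list[int],
                  layer: int, window: tuple[int, int], lam: float = 1.5) -> np.ndarray:
    """Scale the outputs of `selected` attention heads by `lam`, but
    only when `layer` falls inside the consolidation `window`."""
    lo, hi = window
    if not (lo <= layer <= hi):
        return head_out            # outside the window: leave untouched
    out = head_out.copy()
    for h in selected:
        out[h] *= lam
    return out

rng = np.random.default_rng(0)
head_out = rng.standard_normal((8, 16))   # 8 heads, per-head output dim 16

# Inside the (hypothetical) consolidation window: heads 1 and 4 are boosted.
boosted = amplify_heads(head_out, selected=[1, 4], layer=20, window=(18, 24), lam=2.0)
# Outside the window: the output passes through unchanged.
untouched = amplify_heads(head_out, selected=[1, 4], layer=5, window=(18, 24), lam=2.0)
```

In practice this would be applied via a forward hook on each attention module, leaving all model parameters untouched.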

Results

By amplifying the identified Temporal Attention Heads only within the consolidation interval, we can correct temporal hallucinations and ordering errors.

Camera Motion Result

We evaluate the proposed inference-time intervention on the TempCompass benchmark under 16- and 32-frame settings. As shown in the table, amplifying the identified temporal heads improves performance on TempCompass across all evaluated LVLMs, with gains reaching up to +2.4% absolute accuracy. The improvements hold under both frame settings, indicating that the intervention benefits genuine temporal reasoning rather than exploiting a specific input length or configuration.

Apple Drying Result

Sensitivity analysis of the head amplification factor $\lambda$ across layer windows, with the optimal value of $\lambda$ further illustrated in the figure. Notably, these gains are achieved without modifying model parameters or retraining, demonstrating that temporal reasoning capacity already exists within the model and can be selectively strengthened at inference time.

Qualitative Example

Qualitative comparison on Visual Dynamics (Camera Motion).

BibTeX

@article{circuitprobe2025,
  title={CIRCUITPROBE: Tracing Visual Temporal Evidence Flow in Video Language Models},
  author={Zhang, Yiming and Zhao, Zhuokai and Yu, Chengzhang and Chu, Zhendong and Wang, Kun and Li, Qiankun and Chen, Zihan and Liu, Yang and Ding, Zenghui and Sun, Yining and Wen, Qingsong},
  journal={arXiv preprint},
  year={2025}
}