Causally Steered Diffusion for Automated Video Counterfactual Generation

1National & Kapodistrian University of Athens, Greece, 2Archimedes/Athena RC, Greece
3The University of Edinburgh, UK, 4Technical University of Denmark
5Pioneer Centre for AI, Denmark
6The University of Essex, UK, 7Monzo Bank, UK

Abstract

Adapting text-to-image (T2I) latent diffusion models (LDMs) to video editing has shown strong visual fidelity and controllability, but challenges remain in maintaining causal relationships inherent to the video data generating process. Edits affecting causally dependent attributes often generate unrealistic or misleading outcomes if these relationships are ignored. In this work, we introduce a causally faithful framework for counterfactual video generation, formulated as an Out-of-Distribution (OOD) prediction problem. We embed prior causal knowledge by encoding the relationships specified in a causal graph into text prompts and guide the generation process by optimizing these prompts using a vision-language model (VLM)-based textual loss. This loss encourages the latent space of the LDMs to capture OOD variations in the form of counterfactuals, effectively steering generation toward causally meaningful alternatives. The proposed framework, dubbed CSVC, is agnostic to the underlying video editing system and does not require access to its internal mechanisms or fine-tuning. We evaluate our approach using standard video quality metrics and counterfactual-specific criteria, such as causal effectiveness and minimality. Experimental results show that CSVC generates causally faithful video counterfactuals within the LDM distribution via prompt-based causal steering, achieving state-of-the-art causal effectiveness without compromising temporal consistency or visual quality on real-world facial videos. Due to its compatibility with any black-box video editing system, our framework has significant potential to generate realistic 'what if' hypothetical video scenarios in diverse areas such as digital media and healthcare.

Method

We approach video editing as the generation of video counterfactuals, where the goal is to modify specific attributes of a factual (source) video (e.g., transforming a woman into a man with a beard). To generate plausible and semantically meaningful video counterfactuals, we introduce a novel framework that integrates an assumed prior causal graph with a vision-language model (VLM) loss through textual differentiation (Yuksekgonul et al.), enabling optimized, prompt-driven causal steering. Both the VLM and the LDM are treated as black boxes, allowing us to focus on their interaction without requiring access to, or manipulation of, their internal workings. Figure 1 depicts how counterfactual estimates improve as the prompt is made causally consistent through textual-gradient optimization.

VLM method diagram
\(\textbf{CSVC at a glance:}\) The initial counterfactual prompts (e.g., She is young) are generated using GPT-4 by providing the causal graph and the factual prompts (e.g., She is old) and leveraging in-context learning. The video editing system operates as a black-box (frozen) counterfactual generator, and the (black-box) VLM as an evaluator of the generated counterfactuals. The VLM takes as input a generated counterfactual frame, the evaluation instruction, and the target counterfactual prompt \( \mathcal{P} \), and outputs textual feedback used to compute a ‘textual gradient’ \( \frac{\partial \mathcal{L}}{\partial \mathcal{P}} \), which guides the optimization of \( \mathcal{P} \) by focusing on the unsuccessful interventions.
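For concreteness, below is a minimal Python sketch (not the released code) of the prompt-initialization step described in the caption: the causal graph is serialized as text and passed, together with the factual prompt and the requested intervention, to an LLM such as GPT-4. The `CAUSAL_GRAPH`, `graph_to_text`, and `query_llm` names are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of turning a causal graph and
# a factual prompt into an initial counterfactual prompt via in-context learning.
# `query_llm` is a hypothetical callable wrapping any chat LLM (e.g., GPT-4).

CAUSAL_GRAPH = {          # toy example graph: parent attribute -> affected children
    "gender": ["beard"],
    "age": ["gray_hair"],
}

def graph_to_text(graph: dict) -> str:
    """Serialize the causal edges as plain text for the LLM context."""
    return "\n".join(f"{parent} -> {child}"
                     for parent, children in graph.items()
                     for child in children)

def initial_counterfactual_prompt(factual_prompt: str,
                                  intervention: str,
                                  query_llm) -> str:
    """Ask the LLM for a counterfactual prompt consistent with the graph."""
    instruction = (
        "Causal graph:\n" + graph_to_text(CAUSAL_GRAPH) + "\n\n"
        f"Factual prompt: {factual_prompt}\n"
        f"Intervention: {intervention}\n"
        "Write a counterfactual prompt that applies the intervention and "
        "respects all downstream effects implied by the causal graph."
    )
    return query_llm(instruction)

# Example call (hypothetical):
# initial_counterfactual_prompt("She is old.", "make the person young", query_llm)
```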

The proposed framework is illustrated in Figure 2. Given a generated counterfactual video frame, the counterfactual prompt, and an evaluation instruction containing the target interventions, we implement our proposed “multimodal loss” using a VLM:

\[ \mathcal{L} = \textit{VLM}\left(\mathcal{V}'_{\mathit{frame}},\ \textit{evaluation instruction},\ \mathcal{P}\right) \]
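The sketch below illustrates one way this multimodal loss could be implemented, assuming a generic `query_vlm` callable that accepts an image and a text instruction; the function name and signature are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of the VLM-based "multimodal loss": the VLM receives a
# generated counterfactual frame, an evaluation instruction listing the target
# interventions, and the current prompt P, and returns textual feedback.
# `query_vlm` is a hypothetical callable wrapping any vision-language model.

def textual_loss(counterfactual_frame, evaluation_instruction: str,
                 prompt: str, query_vlm) -> str:
    """Return the VLM's criticism of the frame w.r.t. the target interventions."""
    return query_vlm(
        image=counterfactual_frame,
        text=(
            f"{evaluation_instruction}\n"
            f"Counterfactual prompt: {prompt}\n"
            "Point out which target interventions are NOT visible in the frame."
        ),
    )
```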

To optimize \(\mathcal{P}\), we employ Textual Gradient Descent (TGD), which directly updates the prompt:

\[ \begin{aligned} \mathcal{P}' &= \text{TGD.step} \left(\mathcal{P}, \frac{\partial \mathcal{L}}{\partial \mathcal{P}} \right) \\ &\triangleq \textit{LLM}\Big( \textit{Below are the criticisms on } \{ \mathcal{P} \}: \left\{ \frac{\partial \mathcal{L}}{\partial \mathcal{P}} \right\} \\ &\quad \textit{Incorporate the criticisms, and produce a new prompt.} \Big) \end{aligned} \]

where \(\frac{\partial \mathcal{L}}{\partial \mathcal{P}}\) denotes the “textual gradients,” passed through an LLM at each TGD update to generate a new prompt incorporating the VLM criticisms.
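Putting the pieces together, the following sketch shows one possible TGD loop under the same assumptions: a black-box `edit_video(video, prompt)` editor, the `textual_loss` helper from the previous sketch, and generic `query_llm`/`query_vlm` callables. All helper names and the iteration count are hypothetical.

```python
# Sketch (assumed, not the released code) of prompt optimization via Textual
# Gradient Descent: the VLM criticisms are folded back into the prompt by an LLM.

def tgd_step(prompt: str, textual_gradient: str, query_llm) -> str:
    """One TGD update: incorporate the criticisms and produce a new prompt."""
    return query_llm(
        f"Below are the criticisms on {{{prompt}}}:\n{textual_gradient}\n"
        "Incorporate the criticisms, and produce a new prompt."
    )

def causal_steering(video, prompt, evaluation_instruction,
                    edit_video, query_vlm, query_llm, num_iters=3):
    """Iteratively steer the prompt toward a causally faithful counterfactual."""
    for _ in range(num_iters):
        edited = edit_video(video, prompt)            # frozen black-box editor
        frame = edited[0]                             # evaluate a sampled frame
        grad = textual_loss(frame, evaluation_instruction, prompt, query_vlm)
        prompt = tgd_step(prompt, grad, query_llm)    # TGD update of P
    return edit_video(video, prompt), prompt
```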

CSVC Results

FLATTEN

TokenFlow

Tune-A-Video

Progressive Counterfactual Transformation with VLM Causal Steering

Progressive transformation GIF

Progressive counterfactual transformation of an old woman into a young woman through causal prompt steering.

BibTeX

@article{spyrou2025causally,
  title={Causally Steered Diffusion for Automated Video Counterfactual Generation},
  author={Spyrou, Nikos and Vlontzos, Athanasios and Pegios, Paraskevas and Melistas, Thomas and Gkouti, Nefeli and Panagakis, Yannis and Papanastasiou, Giorgos and Tsaftaris, Sotirios A},
  journal={arXiv preprint arXiv:2506.14404},
  year={2025}
}