Causally Steered Diffusion for Automated Video Counterfactual Generation

1National & Kapodistrian University of Athens, Greece, 2Archimedes/Athena RC, Greece
3The University of Edinburgh, UK, 4Technical University of Denmark
5Pioneer Centre for AI, Denmark
6The University of Essex, UK, 7Monzo Bank, UK

Abstract

Adapting text-to-image (T2I) latent diffusion models for video editing has shown strong visual fidelity and controllability, but challenges remain in maintaining causal relationships in video content. Edits affecting causally dependent attributes risk generating unrealistic or misleading outcomes if these relationships are ignored. In this work, we propose a causally faithful framework for counterfactual video generation, guided by a vision-language model (VLM). Our method is agnostic to the underlying video editing system and does not require access to its internal mechanisms or finetuning. Instead, we guide the generation by optimizing text prompts based on an assumed causal graph, addressing the challenge of latent space control in LDMs. We evaluate our approach using standard video quality metrics and counterfactual-specific criteria, such as causal effectiveness and minimality. Our results demonstrate that causally faithful video counterfactuals can be effectively generated within the learned distribution of LDMs through prompt-based causal steering. Because it is compatible with any black-box video editing system, our method holds significant potential for generating realistic “what-if” video scenarios in diverse areas such as healthcare and digital media.

Method

We approach video editing as the generation of video counterfactuals, where the goal is to modify specific attributes of a factual (source) video (e.g., transforming a woman into a man with a beard). To generate plausible and semantically meaningful video counterfactuals, we introduce a novel framework that integrates an assumed prior causal graph with a vision-language model (VLM) loss using textual differentiation (Yuksekgonul et al.) for optimized prompt-driven causal steering. Both the VLM and the LDM are treated as black boxes, allowing us to focus on their interaction without explicit manipulation of, or particular knowledge about, their internal workings. Figure 1 depicts how counterfactual estimations improve with a causally consistent prompt obtained via textual-gradient optimization.

\(\textbf{VLM causal steering at a glance:}\) The video editing system operates as a black-box (frozen) counterfactual generator, while the (black-box) VLM acts as an evaluator of the generated counterfactuals. The VLM receives as input a generated counterfactual frame, the evaluation instruction, and the target counterfactual prompt \(\mathcal{P}\), and returns textual feedback, which is used to compute a “textual gradient” \(\frac{\partial \mathcal{L}}{\partial \mathcal{P}}\) and optimize \(\mathcal{P}\).

The proposed framework is illustrated in Figure 2. Given a generated counterfactual video frame, the counterfactual prompt, and an evaluation instruction containing the target interventions, we implement our proposed “multimodal loss” using a VLM:

\[ \mathcal{L} = \textit{VLM}(\mathcal{V}'_{\mathit{frame}}, \textit{evaluation instruction}, \mathcal{P}) \]
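In implementation terms, this “loss” is simply the VLM's textual critique of the generated frame with respect to the target prompt. A minimal sketch is given below; `query_vlm` is a hypothetical wrapper around any black-box multimodal chat API, and the instruction text is illustrative, not taken from the paper:

```python
# The evaluation instruction asks the VLM to critique how well the frame
# realizes the target intervention while keeping other attributes fixed.
# This wording is an illustrative assumption, not the paper's exact prompt.
EVAL_INSTRUCTION = (
    "You are shown a video frame generated from the prompt below. "
    "Criticize how faithfully the frame realizes the target intervention "
    "while leaving all other attributes of the source video unchanged."
)

def multimodal_loss(query_vlm, frame, prompt):
    """Compute the 'multimodal loss': the VLM's textual critique of a
    generated counterfactual frame against the counterfactual prompt P."""
    return query_vlm(image=frame, instruction=EVAL_INSTRUCTION, prompt=prompt)
```

Because the loss is free-form text rather than a scalar, it can express *which* attribute failed (e.g., “the beard is missing”), which is exactly the signal the prompt update below consumes.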

To optimize \(\mathcal{P}\), we employ Textual Gradient Descent (TGD), which directly updates the prompt:

\[ \begin{aligned} \mathcal{P}' &= \text{TGD.step} \left(\mathcal{P}, \frac{\partial \mathcal{L}}{\partial \mathcal{P}} \right) \\ &\triangleq \textit{LLM}\Big( \textit{Below are the criticisms on } \{ \mathcal{P} \}: \left\{ \frac{\partial \mathcal{L}}{\partial \mathcal{P}} \right\} \\ &\quad \textit{Incorporate the criticisms, and produce a new prompt.} \Big) \end{aligned} \]

where \(\frac{\partial \mathcal{L}}{\partial \mathcal{P}}\) denotes the “textual gradients,” passed through an LLM at each TGD update to generate a new prompt incorporating the VLM criticisms.

VLM Causal Steering Results

Results are shown for three black-box video editing systems: FLATTEN, TokenFlow, and Tune-A-Video.

Progressive Counterfactual Transformation with VLM Causal Steering


Progressive counterfactual transformation of an old woman into a young woman through causal prompt steering.

BibTeX

@article{spyrou2025causally,
  title={Causally Steered Diffusion for Automated Video Counterfactual Generation},
  author={Spyrou, Nikos and Vlontzos, Athanasios and Pegios, Paraskevas and Melistas, Thomas and Gkouti, Nefeli and Panagakis, Yannis and Papanastasiou, Giorgos and Tsaftaris, Sotirios A},
  journal={arXiv preprint arXiv:2506.14404},
  year={2025}
}