1 National & Kapodistrian University of Athens, Greece; 2 Archimedes/Athena RC, Greece; 3 The University of Edinburgh, UK; 4 Technical University of Denmark; 5 Pioneer Centre for AI, Denmark; 6 The University of Essex, UK; 7 Monzo Bank, UK
Adapting text-to-image (T2I) latent diffusion models for video editing has shown strong visual fidelity and controllability, but challenges remain in maintaining causal relationships in video content. Edits affecting causally dependent attributes risk generating unrealistic or misleading outcomes if these relationships are ignored. In this work, we propose a causally faithful framework for counterfactual video generation, guided by a vision-language model (VLM). Our method is agnostic to the underlying video editing system and does not require access to its internal mechanisms or finetuning. Instead, we guide the generation by optimizing text prompts based on an assumed causal graph, addressing the challenge of latent space control in LDMs. We evaluate our approach using standard video quality metrics and counterfactual-specific criteria, such as causal effectiveness and minimality. Our results demonstrate that causally faithful video counterfactuals can be effectively generated within the learned distribution of LDMs through prompt-based causal steering. Compatible with any black-box video editing system, our method holds significant potential for generating realistic “what-if” video scenarios in diverse areas such as healthcare and digital media.
Intervention on age: Transforming an old woman to young.
Intervention on gender: Transforming a woman into a man with a beard.
Method
We approach video editing as the generation of video counterfactuals,
where the goal is to modify specific attributes of a factual (source) video
(e.g., transforming a woman into a man with a beard). To generate plausible and semantically meaningful video counterfactuals,
we introduce a novel framework that integrates an assumed prior causal graph with a vision-language model (VLM) loss
using textual differentiation (Yuksekgonul et al.)
for optimized prompt-driven causal steering.
Both the VLM and the LDM are treated as black boxes,
allowing us to focus on their interaction without the need for explicit manipulation or
particular knowledge of their internal workings.
Figure 1 depicts how counterfactual estimates improve with a
causally consistent prompt optimized via textual differentiation.
\(\textbf{VLM causal steering at a glance:}\) The video editing system operates as a black-box (frozen) counterfactual generator and the (black-box) VLM as an evaluator of the generated counterfactuals. The VLM receives as input a generated counterfactual frame, the evaluation instruction, and the target counterfactual prompt \(\mathcal{P}\), and returns textual feedback, which is used to compute a “textual gradient” \(\frac{\partial \mathcal{L}}{\partial \mathcal{P}}\) and optimize \(\mathcal{P}\).
The proposed framework is illustrated in Figure 2.
Given a generated counterfactual video frame, the counterfactual prompt, and an evaluation instruction containing the target interventions,
we implement our proposed “multimodal loss” using a VLM:
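One plausible form of this loss (a sketch in the notation used above; the symbols \(x'\) for the generated counterfactual frame and \(I\) for the evaluation instruction are our own shorthand):

\[
\mathcal{L}(\mathcal{P}) \triangleq \textit{VLM}\big(x',\, I,\, \mathcal{P}\big),
\]

where the VLM's textual feedback acts as the loss signal from which the “textual gradient” \(\frac{\partial \mathcal{L}}{\partial \mathcal{P}}\) is derived.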
To optimize \(\mathcal{P}\), we employ Textual Gradient Descent (TGD), which directly updates the prompt:
\[
\begin{aligned}
\mathcal{P}' &= \text{TGD.step} \left(\mathcal{P}, \frac{\partial \mathcal{L}}{\partial \mathcal{P}} \right) \\
&\triangleq \textit{LLM}\Big( \textit{Below are the criticisms on } \{ \mathcal{P} \}:
\left\{ \frac{\partial \mathcal{L}}{\partial \mathcal{P}} \right\} \\
&\quad \textit{Incorporate the criticisms, and produce a new prompt.} \Big)
\end{aligned}
\]
where \(\frac{\partial \mathcal{L}}{\partial \mathcal{P}}\) denotes the “textual gradients,” passed through an LLM at each TGD update to generate a new prompt incorporating the VLM criticisms.
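The optimization loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `edit_video`, `vlm_feedback`, and `llm` are hypothetical callables standing in for the black-box video editor, the VLM evaluator, and the LLM that performs the TGD update.

```python
def tgd_step(prompt, textual_gradient, llm):
    # One Textual Gradient Descent update: an LLM folds the VLM criticisms
    # (the "textual gradient") into a revised prompt.
    return llm(
        f"Below are the criticisms on {{{prompt}}}: {{{textual_gradient}}}\n"
        "Incorporate the criticisms, and produce a new prompt."
    )


def causal_steer(prompt, instruction, edit_video, vlm_feedback, llm, num_iters=3):
    # Alternate black-box counterfactual generation and VLM evaluation
    # to optimize the counterfactual prompt P.
    for _ in range(num_iters):
        frame = edit_video(prompt)                       # frozen video editor
        grad = vlm_feedback(frame, instruction, prompt)  # textual gradient dL/dP
        prompt = tgd_step(prompt, grad, llm)             # TGD.step(P, dL/dP)
    return prompt
```

Because all three components are passed in as callables, the same loop works with any video editing system and any VLM/LLM backend, mirroring the black-box assumption of the framework.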
Textual Gradient \(\frac{\partial \mathcal{L}}{\partial \mathcal{P}}\) — Age Intervention of Figure 1 (old to young)
The variable "A woman is young" lacks specificity and clarity, which likely contributed to the misalignment between the generated image and the intended attribute of youthfulness. The term "young" is highly subjective and can be interpreted variably depending on cultural, personal, and contextual factors. This ambiguity in the prompt may have led the Text-to-Image model to generate an image that does not meet the counterfactual criteria specified.
**Feedback and Criticism:**
1. **Specificity in Age Description**: The term "young" should be replaced with a more precise age or age range to guide the model more effectively. For instance, specifying "a woman in her early twenties" or "a woman aged 20-25 years" would provide a clearer target for the model, reducing ambiguity and increasing the likelihood of generating an image that aligns with the youthful attribute desired.
2. **Inclusion of Contextual Details**: While the prompt should remain focused on the age attribute, adding minimal contextual details that align with youthful appearances could enhance the effectiveness of the prompt. For example, mentioning "a youthful demeanor" or "fresh-faced" could cue the model to focus on features typically associated with younger individuals.
3. **Avoiding Overly Broad Descriptors**: The use of broad descriptors like "young" without additional qualifiers can lead to a wide range of interpretations. It's beneficial to avoid these where precision is required for generating specific outcomes.
4. **Testing and Iteration**: After adjusting the prompt, it would be advisable to test the new prompt with the model to see if the changes yield a more accurate representation. This iterative approach can help fine-tune the prompt based on actual outcomes.
By addressing these points, the prompt "A woman is young" can be optimized to more effectively communicate the desired attribute of youthfulness to the Text-to-Image model, thereby improving the alignment of the generated image with the counterfactual target.
Textual Gradient \(\frac{\partial \mathcal{L}}{\partial \mathcal{P}}\) — Gender Intervention of Figure 2 (woman to man)
The variable "He is young, he has a beard" needs to be optimized to better align with the specified attributes of being a young man with a beard. The current prompt lacks specificity and clarity, which may have contributed to the generation of an image that does not meet the counterfactual target attributes.
**Feedback and Criticism:**
1. **Specificity and Clarity**: The current prompt is somewhat vague and does not explicitly mention the gender, which is crucial for the intervention. The phrase "he has a beard" implies masculinity, but it seems that the model did not pick up on this cue effectively. To improve, the prompt should explicitly state the gender to avoid ambiguity. For example, incorporating the word "man" or "male" could guide the model more effectively.
2. **Descriptive Language**: The prompt could benefit from more descriptive language to emphasize the attributes. Instead of just saying "young," it might be helpful to describe what "young" typically implies in this context, such as "a youthful appearance" or "looks in his twenties." This could help the model in generating features that are commonly associated with younger individuals.
3. **Attribute Focus**: The prompt should focus more on the attributes that need intervention. Since "beard" is a key attribute but was completely missed, the prompt could emphasize this feature more strongly. For instance, describing the beard in more detail like "sporting a well-groomed beard" could make it a focal point for the generation process.
4. **Use of Active Language**: The prompt could use more active language to make the descriptions more dynamic and engaging, which might help in better capturing the desired attributes. For example, "A young man with a striking beard" adds character and emphasis.
5. **Avoiding Misinterpretation**: To ensure that the attributes are not misinterpreted, the prompt could include a brief explanation that aligns with common perceptions or stereotypes, ensuring that the text-to-image model has a clear and direct reference to work from.
By addressing these points, the prompt can be optimized to more effectively communicate the desired attributes to the text-to-image model, thereby improving the likelihood of generating an image that aligns with the counterfactual target attributes.
VLM Causal Steering Results
FLATTEN
Intervention on beard
Intervention on age
Intervention on beard
Intervention on age
TokenFlow
Intervention on age
Intervention on gender
Intervention on age
Tune-A-Video
Intervention on beard
Intervention on age
Intervention on beard
Progressive Counterfactual Transformation with VLM Causal Steering
Progressive counterfactual transformation of an old woman into a young one through causal prompt steering.
BibTeX
@article{spyrou2025causally,
title={Causally Steered Diffusion for Automated Video Counterfactual Generation},
author={Spyrou, Nikos and Vlontzos, Athanasios and Pegios, Paraskevas and Melistas, Thomas and Gkouti, Nefeli and Panagakis, Yannis and Papanastasiou, Giorgos and Tsaftaris, Sotirios A},
journal={arXiv preprint arXiv:2506.14404},
year={2025}
}