Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals

¹Brown University, ²Cornell University

TL;DR: We train a video model that lets users define goals via explicit force vectors; the model then generates a video of the actions that realize that force.


Demo gallery (videos on the project page):
- Goal Force Prompting: one scene, multiple outcomes via varying goal force vectors
- The video generation model generalizes goal force conditioning: to different settings and materials, to tool-object interaction, to human-object interaction, and to (non-human)-object interaction
- Probabilistic task completion
- Visual planning with physical constraints


Overview

Recent advancements in video generation have enabled the development of "world models" capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks.

To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives—such as elastic collisions and falling dominoes—teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, including tool manipulation and multi-object causal chains.

Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics-aware planning without reliance on external engines.



Interacting with an Image Using Goal Force Prompts

A user can interact with an image by specifying a goal force vector (location, angle, magnitude) directly on the image. Given this goal force prompt, the video generator then synthesizes a video of the resulting scene. In particular, no physics simulator or 3D assets are required at inference time.
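The paper's exact conditioning interface is not reproduced on this page; as a rough illustration, the minimal Python sketch below rasterizes a goal force vector (location, angle, magnitude) into an image-sized conditioning map, assuming that pixel intensity encodes magnitude. Both goal_force_map and generate_video are hypothetical names, not a released API.

import math
import numpy as np

def goal_force_map(h, w, x, y, angle_deg, magnitude, max_force=10.0):
    """Rasterize a force vector (location, angle, magnitude) into an HxW map."""
    cond = np.zeros((h, w), dtype=np.float32)
    length = 40.0 * magnitude / max_force          # arrow length in pixels (assumption)
    dx = math.cos(math.radians(angle_deg))
    dy = -math.sin(math.radians(angle_deg))        # image y-axis points down
    for t in np.linspace(0.0, 1.0, 64):            # draw the arrow shaft
        px = int(round(x + t * length * dx))
        py = int(round(y + t * length * dy))
        if 0 <= px < w and 0 <= py < h:
            cond[py, px] = magnitude / max_force   # intensity encodes magnitude
    return cond

# Example: a rightward 6 N goal force applied at pixel (128, 96) of a 256x256 frame
cond = goal_force_map(256, 256, x=128, y=96, angle_deg=0.0, magnitude=6.0)
# video = generate_video(first_frame, cond, text_prompt="push the red block")  # hypothetical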




Goal Force Prompting Generates Diverse Probabilistic Plans

Sampling multiple seeds yields diverse ways of accomplishing the same task.
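A minimal sketch of what seed-based sampling could look like, assuming a diffusion-style sampler whose randomness is controlled by a torch.Generator; pipe, first_frame, and goal_force are hypothetical placeholders, not the released code.

import torch

# Each seed fixes the initial diffusion noise, so different seeds yield
# different (but equally valid) plans for the same goal force.
for seed in range(4):
    generator = torch.Generator().manual_seed(seed)
    latents = torch.randn(1, 4, 16, 64, 64, generator=generator)  # shape is illustrative
    # video = pipe(image=first_frame, cond=goal_force, latents=latents)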



Goal Force Prompting Generates Physically Plausible Plans

When multiple objects in the scene could plausibly initiate the goal force, the model chooses the one whose use does not violate physical laws.



Comparison to Prior Methods



Synthetic Training Dataset


Our synthetic training dataset consists of three parts: dominoes (generated with Blender), balls (generated with Blender), and a flower (generated with PhysDreamer). Because the domino and ball scenes involve collision chain reactions, they are used to teach the model both the direct force (top, red arrow) and the goal force (bottom, green arrow).
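To give a flavor of how such clips could be produced, here is a minimal Blender Python (bpy) sketch of a domino chain reaction driven by rigid-body physics. All counts, sizes, and spacings are illustrative assumptions, not the authors' actual data-generation script.

import bpy

NUM_DOMINOES = 8
SPACING = 0.6  # meters between dominoes (assumption)

# Fresh scene with a rigid-body world so physics is simulated
bpy.ops.wm.read_factory_settings(use_empty=True)
scene = bpy.context.scene
bpy.ops.rigidbody.world_add()

# Ground plane (passive collider)
bpy.ops.mesh.primitive_plane_add(size=20)
bpy.ops.rigidbody.object_add(type='PASSIVE')

# Upright dominoes (active rigid bodies)
dominoes = []
for i in range(NUM_DOMINOES):
    bpy.ops.mesh.primitive_cube_add(location=(i * SPACING, 0.0, 0.5))
    bpy.context.object.scale = (0.05, 0.2, 0.5)  # thin, tall block
    bpy.ops.rigidbody.object_add(type='ACTIVE')
    dominoes.append(bpy.context.object)

# Tilt the first domino past its tipping point so gravity starts the chain,
# standing in for the direct force applied in the dataset (assumption)
dominoes[0].rotation_euler[1] = 0.35  # radians

scene.frame_end = 120                  # ~5 seconds at 24 fps
bpy.ops.render.render(animation=True)  # render the clip to the default output path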



Ablation: Importance of Specificity of Text Prompt


We find that for situations involving tool use, it is important to specify high-level information such as which tool to use and how it is used (4th and 5th columns). If the tool is not specified in the text prompt, the model may find another way to accomplish the goal force (1st through 3rd columns).
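Purely as an illustration of this ablation, the snippet below pairs the same goal force with prompts of different specificity; the prompt wording and generate_video are hypothetical.

goal_force = dict(x=128, y=96, angle_deg=0.0, magnitude=6.0)

vague_prompt = "knock over the bottle"  # model picks its own strategy
specific_prompt = "use the hammer to knock over the bottle, swinging it in from the left"

# video_a = generate_video(frame, goal_force, vague_prompt)     # may not use the hammer
# video_b = generate_video(frame, goal_force, specific_prompt)  # tool use as intended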



Limitations



Computational Resources

Our model is trained on only 12k examples for two days on four A100 GPUs, making these techniques broadly accessible for future research.

BibTeX

@misc{gillman2026goalforceteachingvideo,
      title={Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals}, 
      author={Nate Gillman and Yinghua Zhou and Zitian Tang and Evan Luo and Arjan Chakravarthy and Daksh Aggarwal and Michael Freeman and Charles Herrmann and Chen Sun},
      year={2026},
      eprint={2601.05848},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.05848}, 
}