
Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

Wan-Move is a simple and scalable motion-control framework for video generation. It produces high-quality 5-second 480p videos with precise point-level control through latent trajectory guidance. Accepted at NeurIPS 2025.

What is Wan-Move?

Wan-Move represents a new approach to motion-controllable video generation. Developed by researchers from Tongyi Lab at Alibaba Group, Tsinghua University, HKU, and CUHK, this framework introduces a method for generating videos where each element's movement can be precisely controlled through point trajectories. The research was accepted at NeurIPS 2025, demonstrating its significance in the field of video generation.

Traditional video generation models produce content based on text prompts, but controlling the specific motion of objects within those videos has been challenging. Wan-Move addresses this by introducing latent trajectory guidance, a technique that represents motion conditions by propagating features from the first frame along defined trajectories. This allows users to specify exactly how objects should move across the video frames.

The framework builds on the Wan2.1 image-to-video foundation and implements motion control as a minimal extension. This design choice means that existing infrastructure and trained models can be adapted without requiring complete architectural changes or additional motion-specific modules. The result is a practical system that generates 5-second videos at 480p resolution with motion quality comparable to commercial solutions.

What sets Wan-Move apart is its point-level control mechanism. Instead of vague descriptions like "move left" or "rotate," users can define dense point trajectories that specify the exact path each element should follow. This fine-grained control enables applications ranging from single-object animation to complex multi-object choreography, camera movements, motion transfer between videos, and 3D rotations.

Overview of Wan-Move

Model Name: Wan-Move-14B-480P
Category: Motion-Controllable Video Generation
Parameters: 14 billion
Video Duration: 5 seconds
Resolution: 832×480
Base Model: Wan-I2V-14B
Control Method: Dense point trajectories
Conference: NeurIPS 2025

Understanding Latent Trajectory Guidance

The core innovation in Wan-Move is latent trajectory guidance. This technique addresses a fundamental challenge in motion-controlled video generation: how to convey motion information to the model in a way that integrates naturally with existing architectures. The solution is surprisingly simple yet effective.

In Wan-Move, motion is represented by taking features from the first frame of the video and propagating them along user-defined trajectories. Think of it as marking points on objects in the first frame and then telling the model where those points should appear in subsequent frames. The model learns to generate video content that respects these trajectory constraints while maintaining visual quality and coherence.

This approach has several advantages. First, it requires no changes to the underlying video generation architecture. The trajectory information is provided as an additional input condition that guides the generation process. Second, it allows for precise control at the point level, meaning individual parts of objects can be controlled independently. Third, it scales naturally to multiple objects, each following its own trajectory.

The training pipeline involves pairing video data with trajectory annotations. The model learns to associate the trajectory patterns with the corresponding motion in the video. Once trained, the system can generate new videos where the motion follows user-specified trajectories, producing results that match the intended movement while maintaining high visual quality.
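To make the propagation idea concrete, here is a minimal, self-contained sketch, not the released implementation: features sampled from the first-frame latent at each trajectory's starting point are copied into the latent cells that the trajectory visits in later frames, producing a conditioning tensor. The tensor shapes, the latent stride, and the function name below are assumptions made for illustration only.

```python
import torch

def build_trajectory_condition(first_frame_latent, trajectories, visibility, stride=8):
    """Illustrative sketch of latent trajectory guidance (all shapes assumed).

    first_frame_latent: (C, H, W) latent features of the first frame.
    trajectories:       (T, N, 2) pixel-space (x, y) positions of N tracked points.
    visibility:         (T, N) boolean tensor, True where a point is visible.
    Returns a (T, C, H, W) conditioning tensor: each visible point carries its
    first-frame feature to the latent cell it occupies in frame t.
    """
    C, H, W = first_frame_latent.shape
    T, N, _ = trajectories.shape
    cond = torch.zeros(T, C, H, W)

    # Sample each tracked point's feature from the first-frame latent grid.
    x0 = (trajectories[0, :, 0] / stride).long().clamp(0, W - 1)
    y0 = (trajectories[0, :, 1] / stride).long().clamp(0, H - 1)
    point_feats = first_frame_latent[:, y0, x0]              # (C, N)

    # Propagate each feature along its trajectory into later frames.
    for t in range(T):
        xt = (trajectories[t, :, 0] / stride).long().clamp(0, W - 1)
        yt = (trajectories[t, :, 1] / stride).long().clamp(0, H - 1)
        for n in range(N):
            if visibility[t, n]:
                cond[t, :, yt[n], xt[n]] = point_feats[:, n]
    return cond
```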

MoveBench: A New Evaluation Standard

Alongside Wan-Move, the research team introduced MoveBench, a carefully curated benchmark for evaluating motion-controllable video generation systems. This benchmark addresses the lack of standardized evaluation methods in the field by providing a diverse set of test cases with high-quality trajectory annotations.

MoveBench includes samples across diverse content categories, featuring both single-object and multi-object scenarios. Each sample includes a reference image, trajectory annotations, visibility masks, and corresponding text descriptions in both English and Chinese. The benchmark is designed to test various aspects of motion control, from simple single-object movements to complex multi-object interactions.
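The repository documents the actual file layout; the snippet below is only a hypothetical illustration of how such a sample (reference image, trajectory array, visibility mask, bilingual captions) might be loaded, with all file names and array shapes assumed for the example.

```python
import json
import numpy as np
from PIL import Image

def load_sample(sample_dir):
    """Load one MoveBench-style sample.

    File names and shapes here are assumptions, not the benchmark's actual
    layout; check the repository for the real format.
    """
    image = Image.open(f"{sample_dir}/reference.png").convert("RGB")
    tracks = np.load(f"{sample_dir}/track.npy")            # e.g. (T, N, 2) x,y points
    visibility = np.load(f"{sample_dir}/visibility.npy")   # e.g. (T, N) booleans
    with open(f"{sample_dir}/caption.json", encoding="utf-8") as f:
        captions = json.load(f)                            # e.g. {"en": ..., "zh": ...}
    return image, tracks, visibility, captions
```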

The construction pipeline for MoveBench involved careful curation of video content, extraction of trajectory data, and annotation of visibility information. The result is a benchmark that can reliably evaluate whether generated videos match the intended motion patterns. Researchers can use MoveBench to compare different approaches and track progress in the field of motion-controllable video generation.

Core Capabilities

Key Features of Wan-Move

High-Quality 5-Second Videos

Through scaled training, Wan-Move generates 5-second videos at 480p resolution with state-of-the-art motion controllability. User studies show that its performance is on par with commercial systems like Kling 1.5 Pro's Motion Brush feature.

Latent Trajectory Guidance

The core technique represents motion conditions by propagating first-frame features along trajectories. This method integrates into off-the-shelf image-to-video models without architecture changes or extra motion modules.

Fine-Grained Point-Level Control

Object motions are defined using dense point trajectories, providing precise control over how each element moves. This fine-grained, point-level control allows for detailed choreography of scene elements.

MoveBench Benchmark

A dedicated motion-control benchmark with a large set of samples, diverse content categories, longer video durations, and high-quality trajectory annotations for standardized evaluation.

14B Parameter Foundation

Built on the Wan-I2V-14B foundation model, Wan-Move extends existing capabilities with minimal modifications. Users familiar with Wan2.1 can reuse their setup with low migration cost.

Multi-GPU Support

Supports FSDP and xDiT USP acceleration for faster inference. Includes options for model offloading and CPU execution to reduce GPU memory usage when needed.

Motion Control Applications

🎯 Single-Object Motion Control

Guide the movement of individual objects within a scene. Define the path an object should follow, and Wan-Move generates video where that object moves along the specified trajectory while maintaining natural appearance and interaction with the environment.

🎪 Multi-Object Motion Control

Choreograph multiple objects simultaneously, each following independent trajectories. This capability enables complex scenes where different elements move in coordinated or independent patterns, creating dynamic compositions.

📹 Camera Control

Simulate camera movements such as panning, dollying in and out, and linear displacement. Control the viewpoint to create professional-looking camera work without physically moving a camera.

🔄 Motion Transfer

Extract motion patterns from one video and apply them to different content. This allows for reusing successful motion templates across different scenes and subjects.

🌐 3D Rotation

Generate videos that show objects rotating in three-dimensional space. This is particularly useful for product demonstrations, architectural visualization, and any application requiring 360-degree views.

🎬 Content Creation

Create animated content for marketing, education, entertainment, and social media. The precise motion control enables professional-quality animations without traditional animation software or expertise.

Technical Implementation

Wan-Move is implemented as a minimal extension on top of the Wan2.1 codebase. This design philosophy offers practical benefits for researchers and developers. If you have previously worked with Wan2.1, most of your existing setup remains usable. The model requires PyTorch 2.4.0 or later and can be installed using standard Python package management tools.

The model weights are available through both Hugging Face and ModelScope platforms. The Wan-Move-14B-480P checkpoint contains the trained parameters for generating 5-second videos at 480p resolution. Download tools are provided through both platforms' command-line interfaces, making it straightforward to obtain the necessary files.
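For example, the checkpoint can also be fetched programmatically with the huggingface_hub Python API. The repository id used below is an assumption for illustration; use the id listed on the official model card.

```python
from huggingface_hub import snapshot_download

# Download the Wan-Move-14B-480P checkpoint.
# The repo_id below is illustrative; substitute the id from the official model card.
ckpt_dir = snapshot_download(
    repo_id="Wan-AI/Wan-Move-14B-480P",
    local_dir="./Wan-Move-14B-480P",
)
print("Checkpoint downloaded to", ckpt_dir)
```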

For inference, Wan-Move supports both single-GPU and multi-GPU configurations. Single-GPU inference is suitable for generating individual videos or small batches. For larger-scale evaluation or production use, multi-GPU inference with FSDP (Fully Sharded Data Parallel) provides significant speedup. The system also includes options to reduce memory usage through model offloading and CPU execution of certain components.

Trajectory data is provided in NumPy array format, with separate files for trajectory coordinates and visibility masks. The trajectory file contains the x,y coordinates for each tracked point across all frames. The visibility mask indicates when points are occluded or leave the frame. This straightforward format makes it easy to create custom trajectory data or integrate with motion tracking tools.
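As a hypothetical example of this format, the snippet below lays a grid of points over the first frame, shifts every point uniformly to the right (a simple pan-like motion, in the spirit of the camera-control use case described earlier), marks points that leave the frame as invisible, and saves both arrays. The array shapes (frames × points × 2 and frames × points), the frame count, and the file names are assumptions; consult the repository's examples for the exact convention.

```python
import numpy as np

# Illustrative only: array shapes, frame count, and file names are assumptions,
# not the repository's guaranteed convention.
num_frames, width, height = 81, 832, 480      # 5-second clip at 832x480 (frame count assumed)

# Place a sparse grid of tracked points on the first frame.
xs, ys = np.meshgrid(np.linspace(50, width - 50, 16),
                     np.linspace(50, height - 50, 9))
points = np.stack([xs.ravel(), ys.ravel()], axis=-1)        # (N, 2) x,y in pixels

# Shift every point uniformly to the right over time (simple pan-like motion).
shift = np.linspace(0, 200, num_frames)[:, None, None]      # total 200 px drift
tracks = points[None] + shift * np.array([1.0, 0.0])        # (T, N, 2)

# Mark points as invisible once they leave the frame.
visibility = (
    (tracks[..., 0] >= 0) & (tracks[..., 0] < width)
    & (tracks[..., 1] >= 0) & (tracks[..., 1] < height)
)                                                            # (T, N) booleans

np.save("track.npy", tracks.astype(np.float32))
np.save("visibility.npy", visibility)
```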

Performance and Comparisons

User studies comparing Wan-Move with both academic methods and commercial solutions show that Wan-Move achieves competitive motion controllability. Qualitative comparisons demonstrate that Wan-Move produces videos with accurate motion that follows the specified trajectories while maintaining high visual quality and temporal consistency.

Compared to other academic approaches, Wan-Move offers the advantage of its simple integration method. The latent trajectory guidance technique requires no specialized architecture components, making it easier to implement and adapt. The point-level control also provides more precision than methods that rely on region-based or text-based motion descriptions.

When compared to commercial solutions like Kling 1.5 Pro, Wan-Move demonstrates similar motion accuracy while being open for research and development. This openness allows researchers to understand how the system works, adapt it for specific needs, and build upon the foundation. The inclusion of MoveBench also provides a standardized way to measure progress in the field.

Getting Started with Wan-Move

Multiple options are available for exploring and using Wan-Move, from quick evaluation with MoveBench to full local installation for development work.

📦 Quick Start

Clone the repository, install dependencies, and download the model weights. Example code is provided for running inference on sample images with predefined trajectories.

📊 MoveBench Evaluation

Download the MoveBench dataset and run evaluation scripts to test the model on standardized benchmarks. Supports both English and Chinese language options.

🎨 Custom Trajectories

Create your own trajectory data to control object motion in your videos. The NumPy array format makes it straightforward to define custom motion paths.

🔧 Multi-GPU Scaling

For larger workloads, use the multi-GPU inference capabilities with FSDP support. Memory optimization options are available for systems with limited VRAM.

Research and Development

Wan-Move was developed through collaboration between Tongyi Lab at Alibaba Group, Tsinghua University, the University of Hong Kong, and the Chinese University of Hong Kong. The research team includes Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Bin Xia, Dingdong Wang, Hongwei Yi, Xihui Liu, Hengshuang Zhao, Yu Liu, Yingya Zhang, and Yujiu Yang.

The paper describing Wan-Move was accepted at NeurIPS 2025, one of the premier conferences in machine learning and artificial intelligence. The acceptance reflects the significance of the work and its contribution to the field of video generation. The paper is available on arXiv for those interested in the technical details and experimental results.

The release includes not just the model weights but also the code, evaluation benchmark, and documentation needed for others to reproduce and build upon the work. This comprehensive release supports the research community in advancing motion-controllable video generation technology.

Future Development

The research team has indicated plans for future releases, including a Gradio demo interface that will make the technology more accessible to users without programming experience. This demo will allow users to upload images, define trajectories through an interactive interface, and generate videos with controlled motion.

The current release focuses on 480p resolution and 5-second duration. Future work may explore higher resolutions, longer videos, and additional control mechanisms. The modular design of Wan-Move makes it well-suited for such extensions, as new capabilities can be added without requiring complete redesign of the system.

As the field of video generation continues to advance, motion control will likely become a standard feature in video generation systems. Wan-Move provides a foundation for this capability through its simple yet effective latent trajectory guidance approach. The open release of code and models encourages community involvement in advancing these technologies.

Common Questions

Frequently Asked Questions

Find answers to common questions about Wan-Move's capabilities, requirements, and usage.