World models are essential for autonomous robotic planning. However, the substantial computational overhead of existing dense Transformer-based models significantly hinders real-time deployment. To address this efficiency-performance bottleneck, we introduce DDP-WM, a novel world model centered on the principle of Disentangled Dynamics Prediction (DDP). We hypothesize that latent state evolution in observed scenes is heterogeneous and can be decomposed into sparse primary dynamics driven by physical interactions and secondary context-driven background updates. DDP-WM realizes this decomposition through an architecture that integrates efficient historical processing with dynamic localization to isolate primary dynamics. By employing a cross-attention mechanism for background updates, the framework optimizes resource allocation and provides a smooth optimization landscape for planners. Extensive experiments demonstrate that DDP-WM achieves significant gains in both efficiency and performance across diverse tasks, including navigation, precise tabletop manipulation, and complex deformable and multi-body interactions. On the challenging Push-T task, DDP-WM achieves an approximately 9x inference speedup and improves the MPC success rate from 90% to 98% compared with a state-of-the-art dense model.
Modern dense world models (e.g., DINO-WM) apply the same costly self-attention to all parts of an image, whether it's a moving object or a static wall. This creates a massive efficiency bottleneck. Our design is motivated by two key insights into the nature of physical dynamics in the feature space of pre-trained models like DINOv2:
(1) Computational Redundancy: We analyzed the internal features of a dense Transformer-based world model. As shown by the PCA visualization above, the features for background regions (most of the image) remain almost unchanged across multiple, computationally expensive layers. This confirms that vast amounts of computation are wasted on static parts of the scene.
(2) Inherent Sparsity: The root cause of this redundancy is that physical dynamics are inherently sparse. By visualizing the difference between feature maps of two consecutive frames, we see that significant changes (non-green areas) occur in only a tiny fraction of the scene. This confirms that the core dynamics are localized and sparse.
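One simple way to exploit this sparsity is to threshold per-patch feature differences between consecutive frames. The sketch below is a hypothetical localization step with synthetic features standing in for real DINOv2 activations, not the paper's actual module; `keep_ratio` is an assumed budget:

```python
import numpy as np

def dynamic_mask(feat_t, feat_t1, keep_ratio=0.1):
    """Mark the top-k patches with the largest frame-to-frame feature change.

    feat_t, feat_t1: (N, D) patch features of two consecutive frames.
    Returns a boolean mask over the N patches (True = "primary dynamics").
    keep_ratio is an illustrative budget, not a value from the paper.
    """
    delta = np.linalg.norm(feat_t1 - feat_t, axis=-1)  # per-patch change
    k = max(1, int(keep_ratio * len(delta)))
    thresh = np.partition(delta, -k)[-k]               # k-th largest change
    return delta >= thresh

# Toy illustration: a 14x14 ViT-style patch grid where only 5 patches move.
rng = np.random.default_rng(1)
N, D = 196, 64
f0 = rng.normal(size=(N, D))
f1 = f0 + 0.01 * rng.normal(size=(N, D))   # static background: tiny drift
f1[:5] += rng.normal(size=(5, D))          # 5 patches actually move
mask = dynamic_mask(f0, f1, keep_ratio=0.05)
print(mask[:5].all(), mask.sum())          # all moving patches are selected
```

With a 5% budget, the heavy predictor would only ever see a handful of tokens per frame, which is where the efficiency gains come from.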
Based on these insights, we introduce DDP-WM, a world model built on the principle of Disentangled Dynamics Prediction (DDP). The core idea is to decompose scene evolution into two distinct streams and allocate computational resources accordingly. Our architecture systematically implements this idea through several specialized modules:
This intelligent allocation of computation is the key to achieving a massive leap in both efficiency and planning performance.
The DDP-WM Framework Overview. A four-stage process: (1) historical information is fused, (2) dynamic regions are localized, (3) a powerful predictor models the sparse primary dynamics, and (4) an efficient Low-Rank Correction Module (LRM) updates the background, providing a smooth landscape for planning.
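Stages (3) and (4) amount to routing tokens by the dynamic-region mask: the expensive predictor sees only the sparse primary tokens, while everything else receives a cheap update. A minimal sketch with toy stand-in modules (not the paper's actual predictor or LRM):

```python
import numpy as np

def ddp_step(tokens, mask, heavy_predictor, light_update):
    """One hypothetical DDP-style prediction step.

    tokens: (N, D) latent patch tokens for the current frame.
    mask:   boolean (N,) — True for sparse "primary dynamics" patches.
    The heavy predictor runs only on masked tokens; the rest receive a
    cheap background update (standing in for the paper's LRM).
    """
    out = tokens.copy()
    out[mask] = heavy_predictor(tokens[mask])   # expensive, but few tokens
    out[~mask] = light_update(tokens[~mask])    # cheap, many tokens
    return out

# Toy stand-ins for the two streams (purely illustrative):
heavy = lambda x: x + 1.0        # pretend Transformer predictor
light = lambda x: x * 0.99       # pretend low-rank correction
toks = np.zeros((8, 4)); toks[:, 0] = np.arange(8)
m = np.zeros(8, dtype=bool); m[:2] = True
nxt = ddp_step(toks, m, heavy, light)
```

The cost of the heavy branch scales with the number of masked tokens rather than the full token count, which is the point of the disentanglement.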
Why does DDP-WM succeed in closed-loop planning where naive sparse models fail? The answer lies in the optimization landscape provided to the planner. Naive "copy-paste" for the background creates a rugged, noisy cost surface full of local minima. Our Low-Rank Correction Module (LRM) ensures feature-space consistency, resulting in a smooth, funnel-shaped landscape that enables the planner to efficiently find the optimal solution.
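The key property of such a correction module is that it perturbs background features only along a few learned directions, rather than copying them verbatim or rewriting them densely. A minimal rank-r residual update, with illustrative dimensions (the paper's actual LRM design, which also uses cross-attention, may differ):

```python
import numpy as np

rng = np.random.default_rng(2)
D, r = 64, 4    # feature dim and low rank (r << D); both are illustrative

# Low-rank correction: delta = (x @ U) @ V touches only r directions.
U = rng.normal(size=(D, r)) / np.sqrt(D)
V = rng.normal(size=(r, D)) / np.sqrt(r)

def lrm_update(bg_tokens):
    """Cheap background update: identity plus a rank-r residual.

    Cost is O(N * D * r) rather than O(N * D^2) for a dense layer, and the
    output stays close to the input in feature space, which keeps the
    planner's cost surface smooth.
    """
    return bg_tokens + (bg_tokens @ U) @ V

x = rng.normal(size=(10, D))
delta = lrm_update(x) - x
# Rank check: the residual lives in an r-dimensional subspace.
print(np.linalg.matrix_rank(delta))   # 4
```

A naive copy-paste corresponds to `delta = 0` everywhere, which makes the predicted future insensitive to small action changes in background-adjacent regions and produces the rugged landscape shown above.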
Naive Sparse Model: A rugged, noisy landscape traps the optimizer.
Our DDP-WM (w/ LRM): A smooth landscape provides a clear path to the global minimum.
This smooth landscape is possible because our LRM successfully learns the true, inherent low-rank structure of background dynamics. A PCA analysis shows the updates generated by our LRM almost perfectly match the low-dimensional nature of the ground-truth dynamics.
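The kind of PCA analysis described here is easy to reproduce on synthetic data: if background updates truly have low-rank structure, a handful of principal components should capture nearly all of their variance. A sketch with assumed, illustrative numbers (rank 3, 128-dim features):

```python
import numpy as np

def explained_variance(updates, k):
    """Fraction of variance captured by the top-k principal components."""
    u = updates - updates.mean(axis=0)
    s = np.linalg.svd(u, compute_uv=False)
    return (s[:k] ** 2).sum() / (s ** 2).sum()

# Synthetic background updates with a true rank-3 structure plus noise,
# mimicking the PCA analysis described above (numbers are illustrative).
rng = np.random.default_rng(3)
basis = rng.normal(size=(3, 128))
coeffs = rng.normal(size=(500, 3))
updates = coeffs @ basis + 0.01 * rng.normal(size=(500, 128))
print(round(explained_variance(updates, 3), 3))  # ≈ 1.0
```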
DDP-WM matches or surpasses the SOTA dense baseline on nearly all tasks, with significant gains on complex manipulation (SR = success rate, higher is better; CD = Chamfer Distance, lower is better).
| Model | PointMaze (SR ↑) | Push-T (SR ↑) | Wall (SR ↑) | Rope (CD ↓) | Granular (CD ↓) |
|---|---|---|---|---|---|
| DINO-WM | 98% | 90% | 98% | 0.41 | 0.26 |
| DDP-WM (Ours) | 100% | 98% | 96% | 0.31 | 0.24 |
Our model provides significant speedups in the full MPC loop, enabling higher-frequency control.
| Task / MPC Iterations | DINO-WM | DDP-WM (Ours) | Speedup |
|---|---|---|---|
| PointMaze / 10 | 39 s | 5.5 s | 7.1x |
| Push-T / 30 | 120 s | 16 s | 7.5x |
| Wall / 10 | 12 s | 4.2 s | 2.9x |
| Rope / 10 | 12 s | 4.3 s | 2.8x |
| Granular / 30 | 35 s | 13 s | 2.7x |
Qualitative comparisons of 5-step open-loop rollouts highlight DDP-WM's ability to generate high-fidelity predictions. While the dense baseline (DINO-WM) is powerful, its predictions can sometimes degrade over the horizon, accumulating blurriness or visual artifacts (e.g., rigid objects appearing to distort). In contrast, DDP-WM shows a greater tendency to produce sharp and physically coherent rollouts that better preserve key details. Each animation below shows results from DINO-WM (top row), DDP-WM (middle row), and the ground truth (bottom row).
Sample 1
Sample 2
Sample 3
Sample 1
Sample 2
Sample 3
Sample 1
Sample 2
Sample 3
Below are side-by-side comparisons of MPC planning on the Push-T task. DDP-WM not only achieves a higher success rate (98% vs. 90%) but also demonstrates more precise and direct control. In cases where the dense baseline fails, DDP-WM often finds a successful trajectory.
DDP-WM successfully solves trajectories where DINO-WM fails.
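For context, the MPC loop being compared here can be sketched as sampling-based planning (CEM-style) over the learned world model: sample action sequences, roll each out through the model, and refit the sampling distribution to the lowest-cost plans. The hyperparameters and toy dynamics below are illustrative, not the paper's setup:

```python
import numpy as np

def mpc_plan(world_model, cost_fn, state, horizon=5, samples=256, iters=3):
    """Minimal sampling-based MPC (CEM-style) over a latent world model.

    world_model(state, action) -> next latent state; cost_fn(state) -> cost.
    All names and hyperparameters here are illustrative.
    """
    rng = np.random.default_rng(0)
    mu, sigma = np.zeros((horizon, 2)), np.ones((horizon, 2))
    for _ in range(iters):
        acts = mu + sigma * rng.normal(size=(samples, horizon, 2))
        costs = np.empty(samples)
        for i in range(samples):
            s = state
            for t in range(horizon):
                s = world_model(s, acts[i, t])   # roll out the world model
            costs[i] = cost_fn(s)
        elite = acts[np.argsort(costs)[:16]]     # keep the best 16 plans
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu[0]    # execute only the first action (receding horizon)

# Toy check: dynamics = position + action, cost = distance to the origin.
wm = lambda s, a: s + a
cost = lambda s: float(np.linalg.norm(s))
a0 = mpc_plan(wm, cost, np.array([3.0, 0.0]))   # first action points home
```

In a loop like this the world model is queried `samples x horizon` times per iteration, so a per-step inference speedup multiplies directly into the wall-clock numbers reported in the table above, and a smoother cost surface lets fewer samples find good elites.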
@misc{yin2026ddpwmdisentangleddynamicsprediction,
  title={DDP-WM: Disentangled Dynamics Prediction for Efficient World Models},
  author={Shicheng Yin and Kaixuan Yin and Weixing Chen and Yang Liu and Guanbin Li and Liang Lin},
  year={2026},
  eprint={2602.01780},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.01780},
}