Bridge-WA: Predicting Where and How the World Changes for Robotic Action

Abstract

General-purpose vision-language-action models benefit from large vision-language priors, but effective manipulation also requires anticipating action-relevant scene changes. Existing world-action models often rely on large generative world models or dense future rollouts, which are expensive and spend capacity on visual details weakly coupled to control.

Bridge-WA distills a frozen future-change teacher into three compact priors: future tokens for intended outcomes, change maps for intervention support, and motion-flow maps for local transition direction. A BridgeWA block conditions the action transformer on these priors through multi-source attention memories and spatial-temporal biases, while the teacher model is removed at inference.

By focusing action generation on where and how the scene will change, Bridge-WA suppresses nuisance appearance factors such as background, lighting, and distractors, improving task success, progress, and robustness across simulation and real-robot evaluations.

Framework

Compact World-Change Priors for Action Generation

Bridge-WA represents the action-relevant future with three compact world-change representations: future tokens for the intended outcome, change maps for where the scene should change, and motion-flow maps for how the change should move. A lightweight predictor estimates these representations from the current robot context, and BridgeWA conditions the action transformer on them for world-aware action generation. The training-time future model and offline cache are removed at deployment.

World Teacher Pretraining and BridgeWA Block

The world teacher is first pretrained as a 5B robot-conditioned future-change model on BridgeData V2 trajectories. It maps the current visual observation, language instruction, and robot state into a predictive representation of the future scene, providing supervision that is aligned with manipulation-induced change rather than generic video appearance. After this stage, the teacher is frozen and used only to construct offline caches.
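The offline caching step can be pictured with a minimal sketch. It assumes the frozen teacher is a callable that returns the three priors for one sample. The names `build_offline_cache`, `stub_teacher`, and the sample identifiers are hypothetical and are not taken from the Bridge-WA code.

```python
def build_offline_cache(samples, teacher):
    """Run the frozen teacher once per sample and store its three outputs."""
    cache = {}
    for sample_id, context in samples:
        if sample_id in cache:
            continue  # already cached; the teacher is not run again
        future_tokens, change_map, flow_map = teacher(context)
        cache[sample_id] = {
            "future_tokens": future_tokens,
            "change_map": change_map,
            "flow_map": flow_map,
        }
    return cache


calls = []  # records every teacher invocation, for illustration only


def stub_teacher(context):
    """Hypothetical stand-in for the frozen teacher; returns toy priors."""
    calls.append(context)
    future_tokens = [0.1, 0.2]
    change_map = [[0.0, 1.0]]
    flow_map = [[(0.0, 0.0), (1.0, 0.0)]]
    return future_tokens, change_map, flow_map


samples = [("ep0_t0", "ctx-a"), ("ep0_t1", "ctx-b"), ("ep0_t0", "ctx-a")]
cache = build_offline_cache(samples, stub_teacher)
```

Because the cache is keyed by sample, the expensive teacher runs once per training sample and never appears in the deployed model.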

The BridgeWA block turns the cached future tokens, change maps, and future-change-flow maps into policy-readable conditioning. Future tokens provide global outcome context; change maps identify the spatial support of task-relevant intervention; and flow maps encode local look-ahead transition direction. The action transformer keeps policy-centered queries while reading these structured memories through attention, enabling world-aware action decoding without deploying the large teacher at inference time.
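One way to picture this conditioning is single-query dot-product attention in which a change map adds a bias to the attention logits of spatial memory tokens. The sketch below is an illustration under assumptions, not the BridgeWA block itself: the log-probability bias, the function names, and the toy values are hypothetical, and only a spatial bias is shown.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def change_bias(change_map, eps=1e-6):
    """Assumed bias form: log of the predicted change probability per cell."""
    return [math.log(c + eps) for c in change_map]


def biased_attention(query, keys, values, bias):
    """Scaled dot-product attention for one query with an additive per-key bias."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [
        scale * sum(q * k for q, k in zip(query, key)) + b
        for key, b in zip(keys, bias)
    ]
    weights = softmax(logits)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


# Memory = one global future token followed by three spatial tokens.
query = [1.0, 0.0]                      # policy-centered query
keys = [[0.5, 0.5]] + [[0.2, 0.1]] * 3  # identical spatial keys isolate the bias
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.5], [0.0, 0.25]]
change_map = [0.9, 0.1, 0.1]            # flattened toy change map
bias = [0.0] + change_bias(change_map)  # the future token is left unbiased

out, weights = biased_attention(query, keys, values, bias)
```

With identical spatial keys, the attention weight on each spatial token is proportional to its change probability, so the query reads mostly from the cell where change is expected.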

Training-time world teacher. A frozen future-change model produces cached supervision targets that summarize outcome, spatial change, and local motion.

Lightweight prior predictor. The deployable model learns to recover future tokens, change maps, and flow maps from the current robot context.

World-aware action decoding. BridgeWA injects predicted priors into the action transformer as memory tokens and attention guidance.
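The lightweight prior predictor above is trained to recover the cached teacher targets. The page does not state the training objective, so the sketch below assumes one plausible choice: mean squared error on future tokens, binary cross-entropy on change maps, and L1 on flow maps, combined with hypothetical weights. All names and values are illustrative.

```python
import math


def mse(pred, target):
    """Mean squared error over flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce(pred, target, eps=1e-7):
    """Binary cross-entropy; predictions are clamped away from 0 and 1."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def l1(pred, target):
    """Mean absolute error over flat lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def prior_loss(pred, target, w_token=1.0, w_change=1.0, w_flow=1.0):
    """Assumed distillation loss: weighted sum of the three per-prior terms."""
    return (
        w_token * mse(pred["future_tokens"], target["future_tokens"])
        + w_change * bce(pred["change_map"], target["change_map"])
        + w_flow * l1(pred["flow_map"], target["flow_map"])
    )


# Toy cached teacher target and two candidate predictions (flattened).
cached = {"future_tokens": [0.5, -0.5], "change_map": [1.0, 0.0, 0.0], "flow_map": [1.0, 0.0]}
good = {"future_tokens": [0.5, -0.5], "change_map": [0.99, 0.01, 0.01], "flow_map": [1.0, 0.0]}
poor = {"future_tokens": [0.0, 0.0], "change_map": [0.5, 0.5, 0.5], "flow_map": [0.0, 0.0]}

loss_good = prior_loss(good, cached)
loss_poor = prior_loss(poor, cached)
```

A prediction close to the cached target yields a much smaller loss than an uninformative one, which is the signal the predictor needs in order to stand in for the teacher at deployment.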

World-Prior Visualization on Dobot Rollouts

The rollout shows a real-robot execution alongside the predicted future, change-map, and motion-flow views used by Bridge-WA. The numbered cursors mark the strongest localized world-prior responses, and the arrows indicate the look-ahead motion direction estimated from future-change-flow.

1 Primary anchor

The orange cursor marks the strongest task-relevant future-change region over the look-ahead horizon.

2 Secondary anchor

The blue cursor marks a second stable future-change region.

In short, a cursor marks where a key change is expected, and the nearby arrow shows how that change is expected to move. The arrows are derived from future-change-flow and serve as look-ahead motion-direction cues for the scene, robot, or manipulated object.
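The cursor-and-arrow overlay described above could be produced roughly as follows. The page does not say how the cursors are selected, so this sketch assumes a simple rule: take the strongest change-map cells that are at least a minimum distance apart, then read the flow vector at each one. The function name, the distance rule, and the toy maps are hypothetical.

```python
def top_anchors(change_map, flow_map, k=2, min_dist=2):
    """Pick the k strongest change cells that are at least min_dist apart.

    Distance is Chebyshev distance in grid cells. Each anchor is returned
    as (row, col, flow_vector), where flow_vector gives the arrow direction.
    """
    cells = sorted(
        ((v, r, c) for r, row in enumerate(change_map) for c, v in enumerate(row)),
        reverse=True,
    )
    anchors = []
    for _, r, c in cells:
        far_enough = all(
            max(abs(r - ar), abs(c - ac)) >= min_dist for ar, ac, _ in anchors
        )
        if far_enough:
            anchors.append((r, c, flow_map[r][c]))
        if len(anchors) == k:
            break
    return anchors


# Toy 4x4 change map with two separated peaks and one neighbor of the first peak.
change_map = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.8, 0.0],
    [0.0, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7],
]
# Toy flow map of (dx, dy) vectors; only the two peaks move.
flow_map = [[(0.0, 0.0)] * 4 for _ in range(4)]
flow_map[1][1] = (1.0, 0.0)
flow_map[3][3] = (0.0, -1.0)

anchors = top_anchors(change_map, flow_map)
primary, secondary = anchors
```

Here the primary anchor lands on the strongest cell, the adjacent 0.8 cell is suppressed as part of the same region, and the secondary anchor lands on the separate 0.7 peak.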
