A lightweight world-action framework that turns future-change prediction into compact priors for robust robotic manipulation.
General-purpose vision-language-action models benefit from large vision-language priors, but effective manipulation also requires anticipating action-relevant scene changes. Existing world-action models often rely on large generative world models or dense future rollouts, which are expensive and spend capacity on visual details weakly coupled to control.
Bridge-WA distills a frozen future-change teacher into three compact priors: future tokens for intended outcomes, change maps for intervention support, and motion-flow maps for local transition direction. A BridgeWA block conditions the action transformer on these priors through multi-source attention memories and spatial-temporal biases, while the teacher model is removed at inference.
Bridge-WA represents the action-relevant future with three compact world-change representations: future tokens for the intended outcome, change maps for where the scene should change, and motion-flow maps for how the change should move. A lightweight predictor estimates these representations from the current robot context, and the BridgeWA block conditions the action transformer on them for world-aware action generation. The training-time future model and its offline cache are removed at deployment.
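One way to picture how the lightweight predictor could be supervised by the teacher's outputs is a weighted regression loss over the three representations. The sketch below is illustrative only: the mean-squared-error objective, the loss weights, the dictionary keys, and the array shapes are all assumptions, since the text above does not specify the training objective.

```python
import numpy as np

# Hypothetical names for the three priors; shapes below are illustrative.
PRIOR_KEYS = ("future_tokens", "change_map", "flow_map")


def distillation_loss(pred, target, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-prior mean-squared errors (an assumed objective)."""
    terms = {k: float(np.mean((pred[k] - target[k]) ** 2)) for k in PRIOR_KEYS}
    total = sum(w * terms[k] for w, k in zip(weights, PRIOR_KEYS))
    return total, terms


rng = np.random.default_rng(0)
# Stand-ins for teacher targets read from an offline cache.
target = {
    "future_tokens": rng.normal(size=(8, 16)),   # (tokens, feature dim)
    "change_map": rng.uniform(size=(14, 14)),    # (height, width)
    "flow_map": rng.normal(size=(14, 14, 2)),    # (height, width, dx/dy)
}
# Stand-in for predictor outputs: targets shifted by a constant error.
pred = {k: v + 0.1 for k, v in target.items()}

total, terms = distillation_loss(pred, target)
```

Keeping the three terms separate makes it easy to monitor which prior the predictor fits least well and to reweight it if needed.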
The world teacher is first pretrained as a 5B robot-conditioned future-change model on BridgeData V2 trajectories. It maps the current visual observation, language instruction, and robot state into a predictive representation of the future scene, providing supervision that is aligned with manipulation-induced change rather than generic video appearance. After this stage, the teacher is frozen and used only to construct offline caches.
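The offline caching step can be sketched as a single pass of the frozen teacher over the training trajectories, with the outputs stored by trajectory and timestep. Everything named here is hypothetical: the `stub_teacher` function stands in for the real model, and the field names and cache key layout are assumptions made for illustration.

```python
def stub_teacher(image, instruction, state):
    """Placeholder for the frozen teacher; returns dummy priors."""
    return {
        "future_tokens": [0.0] * 4,
        "change_map": [[0.0, 0.0], [0.0, 0.0]],
        "flow_map": [[(0.0, 0.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]],
    }


def build_offline_cache(trajectories, teacher):
    """Run the teacher once per step and store outputs keyed by (trajectory id, step)."""
    cache = {}
    for traj_id, steps in trajectories.items():
        for t, step in enumerate(steps):
            cache[(traj_id, t)] = teacher(
                step["image"], step["instruction"], step["state"]
            )
    return cache


# Two toy trajectories with placeholder observations.
trajectories = {
    "traj_a": [{"image": None, "instruction": "pick up the cup", "state": [0.0]}] * 3,
    "traj_b": [{"image": None, "instruction": "open the drawer", "state": [0.0]}] * 2,
}
cache = build_offline_cache(trajectories, stub_teacher)
```

Because the teacher runs only in this pass, policy training can read the stored priors directly and never needs the large model in memory.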
The BridgeWA block turns the cached future tokens, change maps, and motion-flow maps into policy-readable conditioning. Future tokens provide global outcome context, change maps identify the spatial support of task-relevant intervention, and motion-flow maps encode the local look-ahead transition direction. The action transformer keeps policy-centered queries while reading these structured memories through attention, which enables world-aware action decoding without deploying the large teacher at inference time.
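The conditioning described above can be illustrated with a minimal single-head attention sketch in which policy queries read a concatenation of several memories, and an additive bias on the attention logits steers reading toward selected entries. This is a simplification under stated assumptions, not the BridgeWA block itself: there are no learned projections, the memories serve as both keys and values, and the per-entry bias (for example, derived from change-map magnitude) is a hypothetical stand-in for the spatial-temporal biases.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_source_attention(queries, memories, biases=None):
    """Policy queries attend over concatenated memories with additive logit biases.

    queries:  (Q, D) array of policy tokens.
    memories: dict mapping source name -> (N_i, D) array.
    biases:   optional dict mapping source name -> (N_i,) additive logits.
    """
    biases = biases or {}
    names = list(memories)
    keys = np.concatenate([memories[n] for n in names], axis=0)
    bias = np.concatenate(
        [biases.get(n, np.zeros(memories[n].shape[0])) for n in names]
    )
    logits = queries @ keys.T / np.sqrt(queries.shape[-1]) + bias[None, :]
    weights = softmax(logits, axis=-1)
    # Residual update keeps the queries policy-centered.
    return queries + weights @ keys, weights


rng = np.random.default_rng(0)
queries = rng.normal(size=(2, 8))
memories = {
    "future_tokens": rng.normal(size=(3, 8)),
    "change_map": rng.normal(size=(4, 8)),   # assumed patch embeddings
    "flow_map": rng.normal(size=(4, 8)),     # assumed patch embeddings
}
# Hypothetical bias: favor the third change-map patch.
biases = {"change_map": np.array([0.0, 0.0, 5.0, 0.0])}
out, weights = multi_source_attention(queries, memories, biases)
```

Concatenating the sources into one memory lets a single softmax trade off global outcome context against spatial cues, while the bias term adjusts that trade-off without changing the queries.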
@article{bai2026bridgewa,
title = {Bridge-WA: Predicting Where and How the World Changes for Robotic Action},
author = {Bai, Yongjie and Wang, Hanting and Dai, Mingtong and Zhong, Qijun and Liu, Yang and Lin, Liang},
year = {2026}
}