Physical Autoregressive Model for Robotic Manipulation without Action Pretraining

Zijian Song1, Sihan Qin1, Tianshui Chen4,5, Liang Lin1,2,3,4, Guangrun Wang1,2,4*
1Sun Yat-sen University     2Guangdong Key Laboratory of Big Data Analysis and Processing
3Peng Cheng Laboratory     4X-Era AI Lab     5Guangdong University of Technology

*corresponding author
image

The illustration of Physical Autoregression. Our autoregressive process operates over a sequence of physical tokens (marked in red), each combining the visual world state (marked in orange) and the embodiment state (marked in black). The autoregressive process runs in sync with the environment: at each step, the predicted token is decoded into an image and an action, which interact with the environment to update its state (marked in blue), while the resulting observations and proprioceptive states are encoded back into the context.
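To make the closed loop in the figure concrete, below is a minimal sketch of the rollout it describes. All names here (predict_next_token, decode, encode_observation, and the env interface) are hypothetical placeholders for exposition, not PAR's actual API.

# Minimal sketch of the closed-loop physical autoregression illustrated above.
# All names (predict_next_token, decode, encode_observation, env) are
# hypothetical placeholders, not PAR's actual interface.
def physical_autoregressive_rollout(model, env, horizon=50):
    obs, proprio = env.reset()                       # initial world state + robot state
    context = [model.encode_observation(obs, proprio)]

    for _ in range(horizon):
        # Predict the next physical token from the running context.
        physical_token = model.predict_next_token(context)

        # Decode it into a predicted frame and an action.
        frame, action = model.decode(physical_token)

        # Execute the action; the environment returns the *real* next state.
        obs, proprio, done = env.step(action)

        # Encode the real observation back into the context (closed loop).
        context.append(model.encode_observation(obs, proprio))
        if done:
            break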

Abstract

The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining.

image

Method

The autoregressive formulation, which predicts each step conditioned on everything that came before, provides an ideal foundation for representing the physical world. Based on this insight, this work presents the Physical Autoregressive Model (PAR). Specifically, frame tokens and action tokens are combined into physical tokens, which effectively represent the joint evolution of robotic manipulation and environmental feedback. To mitigate the scarcity of human demonstrations, we transfer world knowledge from video pretraining (NOVA; Deng et al., 2024) into our PAR model, facilitating a seamless transition from understanding visual dynamics to capturing physical dynamics.
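As a rough illustration of what "combining frame tokens and action tokens into physical tokens" could look like, the sketch below projects per-frame visual tokens and the corresponding action into a shared embedding space and interleaves them into one sequence. The dimensions and the specific interleaving scheme are assumptions for exposition, not the exact PAR implementation.

import torch
import torch.nn as nn

# Illustrative sketch: per-step frame tokens and an action token are projected
# into a shared embedding space and interleaved into one "physical" sequence.
# Dimensions and the interleaving scheme are assumptions for exposition.
class PhysicalTokenizer(nn.Module):
    def __init__(self, frame_dim=768, action_dim=7, embed_dim=1024):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, embed_dim)    # visual world state
        self.action_proj = nn.Linear(action_dim, embed_dim)  # embodiment state

    def forward(self, frame_tokens, actions):
        # frame_tokens: (B, T, N, frame_dim)  N patch tokens per frame
        # actions:      (B, T, action_dim)    one action per step
        f = self.frame_proj(frame_tokens)                    # (B, T, N, D)
        a = self.action_proj(actions).unsqueeze(2)           # (B, T, 1, D)
        physical = torch.cat([f, a], dim=2)                  # (B, T, N + 1, D)
        return physical.flatten(1, 2)                        # (B, T * (N + 1), D)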

Most existing methods rely on discrete tokens to represent visual and action signals. This introduces quantization errors that accumulate over long-horizon prediction and produce substantial trajectory drift. In this work, we instead represent both frames and actions as continuous signals. Specifically, we leverage a Diffusion Transformer (DiT) trained with a diffusion loss to model the arbitrary distributions of the continuous frame and action tokens. This design not only improves smoothness and coherence, but also facilitates deeper mutual interaction between the continuous spaces of vision and action.
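The sketch below shows the general shape of a diffusion loss over continuous tokens: a small denoiser (here a toy MLP stand-in for the DiT de-tokenizer) learns to predict the noise added to a ground-truth frame or action token, conditioned on the autoregressive backbone's output. The noise schedule, network sizes, and conditioning scheme are illustrative assumptions, not PAR's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of a diffusion loss over continuous tokens. The denoiser is a
# toy MLP stand-in for the DiT de-tokenizer; schedule and sizes are assumptions.
class TokenDenoiser(nn.Module):
    def __init__(self, token_dim, cond_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(token_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, x_t, z, t):
        # x_t: noisy token, z: AR-backbone condition, t: normalized timestep
        return self.net(torch.cat([x_t, z, t], dim=-1))

def diffusion_loss(denoiser, x0, z, num_steps=1000):
    # x0: (B, token_dim) clean continuous token (frame patch or action)
    # z:  (B, cond_dim)  conditioning vector from the AR backbone
    t = torch.randint(1, num_steps + 1, (x0.size(0), 1), device=x0.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t / num_steps) ** 2   # cosine schedule
    noise = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    pred = denoiser(x_t, z, t.float() / num_steps)
    return F.mse_loss(pred, noise)                               # noise-prediction objective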

Experiment

We use the ManiSkill benchmark and follow prior works in evaluating single-task scenarios, focusing on three manipulation tasks: PushCube, PickCube, and StackCube. We render 1K demonstrations per task as training data. Our method achieves the second-best success rate, surpassing ICRT (Fu et al., 2025) and trailing only RDT (Liu et al., 2024). It is worth noting that RDT requires extensive action pretraining, while our method does not.
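For reference, a hedged sketch of a single-task success-rate evaluation on ManiSkill with the gymnasium API is shown below. The environment ID, observation/control modes, registration import path, and the random action placeholder (standing in for PAR's predicted action) are assumptions; the actual evaluation setup may differ.

import gymnasium as gym
import mani_skill.envs  # registers ManiSkill environments (assumed import path)

# Hedged sketch of a single-task evaluation rollout on ManiSkill.
# Env ID, modes, and the action placeholder are assumptions for illustration.
env = gym.make("PushCube-v1", obs_mode="rgb", control_mode="pd_ee_delta_pose")

successes, episodes = 0, 20
for _ in range(episodes):
    obs, info = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # placeholder; PAR's predicted action goes here
        obs, reward, terminated, truncated, info = env.step(action)
        done = bool(terminated) or bool(truncated)
    successes += int(info.get("success", False))

print(f"success rate: {successes / episodes:.2%}")
env.close()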

The ablation study compares PAR-Full against two ablated variants: (1) PAR-NoAR, which removes the autoregressive architecture, and (2) PAR-Discrete, which replaces the generative de-tokenizer with a discriminative one. PAR-Full consistently outperforms both variants by a large margin, confirming the effectiveness of our designs.

image

image

Visualization

We visualize the predicted video sequences alongside the corresponding execution videos. The results reveal a clear alignment between the predicted visual trajectories and actual robot actions. This consistency highlights not only the model’s fine-grained scene understanding from video pretraining, but also its effectiveness in transferring that understanding into action planning.

The visualization of the PickCube task. The left side shows the video generation result from PAR's frame tokens; the right side shows the actual execution video of PAR's predicted actions.
The visualization of the PushCube task. The left side shows the video generation result from PAR's frame tokens; the right side shows the actual execution video of PAR's predicted actions.
The visualization of the StackCube task. The left side shows the video generation result from PAR's frame tokens; the right side shows the actual execution video of PAR's predicted actions.

BibTeX

@article{song2025physical,
  title={Physical Autoregressive Model for Robotic Manipulation without Action Pretraining},
  author={Song, Zijian and Qin, Sihan and Chen, Tianshui and Lin, Liang and Wang, Guangrun},
  journal={arXiv preprint arXiv:2508.09822},
  year={2025}
}