Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation

*Equal Contribution, Corresponding Author
1Sun Yat-sen University
2Pengcheng Laboratory
3Nanyang Technological University
4Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
5X-Era AI Lab

Abstract

Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-aware Virtual View Exploration (TVVE), a framework designed to overcome these challenges by integrating virtual view exploration with task-specific representation learning. TVVE employs an efficient exploration policy, accelerated by a novel pseudo-environment, to acquire informative views.

Furthermore, we introduce a Task-aware Mixture-of-Experts (TaskMoE) visual encoder to disentangle features across different tasks, boosting both representation fidelity and task generalization.

By learning to see the world in a task-aware way, TVVE generates more complete and discriminative visual representations, demonstrating significantly enhanced action prediction across a wide array of manipulation challenges. To further validate the robustness and generalization capability of TVVE under out-of-distribution (OOD) settings, we construct a challenging benchmark, RLBench-OG, covering various visual perturbations and camera pose variations. Extensive experiments on RLBench and RLBench-OG show that our TVVE achieves superior performance over state-of-the-art approaches. In real-robot experiments, TVVE demonstrates exceptional performance and generalizes robustly in multiple OOD settings, including visual disturbances and unseen instructions.

Motivation & Approach

The Challenge: Fixed-View & Dense-Encoder Limitations

Traditional robotic systems rely on static camera viewpoints and shared visual encoders, which often lead to:

  • Occlusion issues: Critical objects or end-effectors hidden from view
  • Incomplete scene understanding: Limited spatial awareness
  • Task interference: Shared encoders struggle with diverse tasks
  • Reduced generalization: Poor performance on novel scenarios

Our Solution: Task-Aware Virtual View Exploration

TVVE introduces two key innovations:

  • Multi-Viewpoint Exploration Policy (MVEP): Selects better viewpoints to maximize information gain
  • Task-aware Mixture-of-Experts (TaskMoE): Dynamically routes features to specialized experts based on task requirements
Fixed View vs. TVVE Comparison

Fig. 1: Motivation Illustration. Observations captured from fixed cameras often miss parts of the target objects. For example, the front view only captures the cupboard (highlighted with a red circle), while the left and right shoulder views only show the sugar (already grasped by the end-effector and highlighted with green circles). These incomplete observations may lead to failed operations. In contrast, our proposed TVVE is designed to dynamically explore and re-render informative viewpoints that maximize coverage of target-relevant information, thereby improving the reliability of manipulation outcomes.

Problem: Fixed Viewpoints
  • Target objects often occluded
  • End-effector visibility compromised
  • Incomplete scene understanding
  • Limited spatial awareness
Solution: MVEP
  • Dynamic virtual multi-view re-rendering
  • Optimized camera positioning
  • Enhanced 3D perception
  • Occlusion minimization
Solution: TaskMoE
  • Task-specific feature extraction
  • Dynamic expert routing
  • Parameter sharing for similar tasks
  • Improved multi-task generalization

Methodology

TVVE Framework Overview
TVVE Framework Diagram

Fig. 2: Overview of our Task-Aware Virtual View Exploration (TVVE) framework. The framework takes four RGB-D images from fixed viewpoints as input, converts them into point clouds, and aggregates them into a global point cloud in the world coordinate system. The pipeline then splits into two branches. One branch (orange) performs Coarse Grounding to predict the approximate position of the end-effector; the global point cloud is re-centered at this predicted position, then scaled and cropped to retain the important region. The other branch (green) feeds the global point cloud into MVEP to predict the optimal camera parameters for the observation viewpoint. Using these parameters, a 2D image is rendered from the point cloud processed by the orange branch. This rendered image is fed into Fine Grounding to predict the final robot action, including the end-effector position, rotation, gripper status, and collision state.

Our proposed Task-Aware Virtual View Exploration (TVVE) framework aims to identify optimal viewpoints for accurate and robust robotic manipulation, guided by task-specific visual feature extraction. TVVE takes as input a language instruction, the current visual observations from RGB-D cameras, and the current gripper state. To enable the model to explore observations from arbitrary viewpoints, we first reconstruct a 3D point cloud of the scene from the input RGB-D images. To narrow the search space and enable task-aligned view selection, we leverage the coarse prediction stage from RVT-2 to generate an area of interest. However, unlike RVT-2, which extracts visual features for all tasks using a shared Multi-View Transformer (MVT), we introduce a Task-Aware Mixture-of-Experts (TaskMoE) module before the MVT to route instructions to specialized expert encoders. This design enables more precise and task-aligned visual feature extraction, which benefits both viewpoint selection and action prediction. Starting from the identified area of interest, we employ the Multi-Viewpoint Exploration Policy (MVEP) network to search for optimal camera poses that maximize the visibility of the target object and the end-effector. The selected viewpoints are then re-rendered into image observations and processed by another TaskMoE-based MVT before being passed to the action prediction model. For action prediction, we also upgrade the autoregressive action sequence model proposed in ARP with our TaskMoE to achieve more task-specific feature extraction and action prediction.
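To make the geometric steps concrete, the sketch below implements the non-learned part of this pipeline in NumPy: cropping the global point cloud around the coarsely predicted end-effector position and re-rendering it from a virtual camera. The function names, image size, and the simple z-buffer renderer are our own illustrative assumptions; the learned components (TaskMoE encoders, MVEP, and the action heads) are treated as given and omitted here.

```python
import numpy as np

def crop_around(points, colors, center, radius):
    """Keep only points near the coarsely predicted end-effector position (hypothetical helper)."""
    mask = np.linalg.norm(points - center, axis=1) < radius
    return points[mask], colors[mask]

def render_virtual_view(points, colors, K, T_wc, hw=(128, 128)):
    """Project a colored point cloud into a virtual pinhole camera.

    points: (N, 3) world-frame XYZ; colors: (N, 3) RGB in [0, 1];
    K: (3, 3) intrinsics; T_wc: (4, 4) world-to-camera extrinsics (e.g. a pose proposed by MVEP).
    Returns an (H, W, 3) image rendered with a simple z-buffer.
    """
    H, W = hw
    # Transform points into the virtual camera frame.
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)   # (N, 4)
    pts_cam = (T_wc @ pts_h.T).T[:, :3]
    front = pts_cam[:, 2] > 1e-6
    pts_cam, colors = pts_cam[front], colors[front]

    # Pinhole projection to pixel coordinates.
    uv = (K @ pts_cam.T).T
    u = (uv[:, 0] / uv[:, 2]).astype(int)
    v = (uv[:, 1] / uv[:, 2]).astype(int)
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, z, c = u[ok], v[ok], pts_cam[ok, 2], colors[ok]

    image = np.zeros((H, W, 3))
    depth = np.full((H, W), np.inf)
    for ui, vi, zi, ci in zip(u, v, z, c):
        if zi < depth[vi, ui]:        # z-buffer: keep the closest point per pixel
            depth[vi, ui] = zi
            image[vi, ui] = ci
    return image
```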

TaskMoE Architecture

Key Innovations

  • Dynamic expert routing guided by fused instruction and scene cues
  • Decoupled gating strategy (\( N_G \) gates for \( N_J \) tasks, \( N_G < N_J \))
  • Parameter sharing for semantically similar tasks
  • Top-k expert activation for efficiency
TaskMoE Architecture

Fig. 3: TaskMoE pipeline

To address the inherent heterogeneity of complex manipulation tasks in multi-task learning, where different tasks often require substantially distinct visual representations and action policies, we introduce a Task-aware Mixture-of-Experts (TaskMoE) module, as illustrated in Fig. 3. Our TaskMoE introduces two key innovations. First, instead of relying solely on task identifiers for expert selection, we incorporate richer instruction- and scene-related cues to guide expert routing more effectively, which is crucial for accurate multi-task robotic manipulation. Specifically, as shown in Fig. 3, we design a cross-modality module that employs a cross-attention mechanism to model the interaction between instruction and visual information. The resulting context-aware features are then fused with the task identifier via a Feature-wise Linear Modulation (FiLM) layer, enabling more adaptive and task-sensitive expert selection.
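A minimal PyTorch sketch of this gating-input construction is shown below: instruction tokens attend to visual tokens via cross-attention, and the pooled context is modulated by a task-identifier embedding through FiLM. The dimensions, the mean pooling, and the layer choices are assumptions made for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class RoutingFeature(nn.Module):
    """Illustrative sketch: build the TaskMoE gating input from instruction,
    scene, and task-identifier cues (names and sizes are hypothetical)."""

    def __init__(self, d_model=256, n_heads=4, n_tasks=18):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.task_embed = nn.Embedding(n_tasks, d_model)
        # FiLM: the task embedding predicts a per-channel scale and shift.
        self.film = nn.Linear(d_model, 2 * d_model)

    def forward(self, instr_tokens, visual_tokens, task_id):
        # instr_tokens: (B, L_i, D), visual_tokens: (B, L_v, D), task_id: (B,)
        ctx, _ = self.cross_attn(instr_tokens, visual_tokens, visual_tokens)
        ctx = ctx.mean(dim=1)                          # (B, D) context-aware summary
        gamma, beta = self.film(self.task_embed(task_id)).chunk(2, dim=-1)
        return gamma * ctx + beta                      # (B, D) input to the routing gate
```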

Second, to improve the scalability and generalization of TaskMoE, we decouple the number of routing gates from the total number of tasks. Concretely, we allocate \( N_G \) gates for all \( N_J \) tasks, where \( N_G < N_J \). This design not only accommodates task diversity but also promotes parameter sharing among tasks with similar visual or semantic characteristics. For example, as illustrated in Fig. 3, Task 1 and Task 2 (both involving opening a drawer) are routed through the same gate but directed to different experts based on their specific operation requirements. In contrast, Task 3, which is semantically dissimilar, is routed through a different gate. This setup encourages the discovery of latent task clusters and provides the capacity to generalize to unseen tasks that share structural similarities with seen ones, thereby enhancing the transferability and robustness of TaskMoE. Notably, all tasks share a common pool of \( N_E \) experts, and for each input, only the top-\( k \) experts (based on gating scores) are activated to guide task-specific visual feature extraction.
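The sketch below illustrates this decoupled routing under stated assumptions: the modulo task-to-gate assignment, the two-layer MLP experts, and the per-sample loop (kept for readability; a real implementation would batch samples by gate) are our own choices for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskMoELayer(nn.Module):
    """Decoupled gating sketch: N_J tasks share N_G gates (N_G < N_J) and a
    common pool of N_E experts; only the top-k experts fire per input."""

    def __init__(self, d_model=256, n_tasks=18, n_gates=6, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Hypothetical fixed mapping from task id to one of the N_G gates.
        self.register_buffer("task_to_gate", torch.arange(n_tasks) % n_gates)
        self.gates = nn.ModuleList([nn.Linear(d_model, n_experts) for _ in range(n_gates)])
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x, routing_feat, task_id):
        # x: (B, D) features to transform, routing_feat: (B, D) gating input, task_id: (B,)
        outs = []
        for b in range(x.shape[0]):
            gate = self.gates[int(self.task_to_gate[task_id[b]])]
            scores = F.softmax(gate(routing_feat[b]), dim=-1)       # (N_E,) expert scores
            top_w, top_i = scores.topk(self.top_k)
            top_w = top_w / top_w.sum()                             # renormalize top-k weights
            outs.append(sum(w * self.experts[int(i)](x[b]) for w, i in zip(top_w, top_i)))
        return torch.stack(outs)
```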

Real-Robot Experiment

To evaluate the generalization of TVVE across different robotic platforms, we conducted experiments on both the Franka Research 3 and Dobot Nova 2 robots. The Franka used a single front-mounted third-person camera, while the Dobot utilized three depth cameras at different positions. Furthermore, to assess performance across a broader range of tasks, we extended testing on the Franka platform to include ten additional tasks, such as articulated and deformable object manipulation. Experimental results show that our method achieves high success rates, highlighting its adaptability and effectiveness across diverse tasks.

Conclusion

TVVE establishes that dynamic view planning and task-aware representation learning significantly advance robotic manipulation capabilities. The MVEP module's viewpoint optimization effectively overcomes occlusion limitations in fixed-viewpoint systems, while TaskMoE's specialized feature extraction mitigates multi-task interference. These innovations collectively enhance performance across diverse manipulation challenges and enable meaningful generalization to novel tasks.

BibTeX

@article{bai2025learning,
  title={Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation},
  author={Bai, Yongjie and Wang, Zhouxia and Liu, Yang and Luo, Kaijun and Wen, Yifan and Dai, Mingtong and Chen, Weixing and Chen, Ziliang and Liu, Lingbo and Li, Guanbin and Lin, Liang},
  journal={arXiv preprint arXiv:2508.05186},
  year={2025}
}