Learning to See and Act: Task-Aware View Planning for Robotic Manipulation

*Equal Contribution, Corresponding Author
1School of Computer Science and Engineering, Sun Yat-sen University
2Pengcheng Laboratory
3College of Computing and Data Science, Nanyang Technological University
4Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

Abstract

Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-Aware View Planning (TAVP), a framework designed to overcome these challenges by integrating active view planning with task-specific representation learning.

TAVP employs an efficient exploration policy, accelerated by a novel pseudo-environment, to actively acquire informative views. Furthermore, we introduce a Mixture-of-Experts (MoE) visual encoder to disentangle features across different tasks, boosting both representation fidelity and task generalization.

By learning to see the world in a task-aware way, TAVP generates more complete and discriminative visual representations, demonstrating significantly enhanced action prediction across a wide array of manipulation challenges. Extensive experiments on RLBench tasks show that our proposed TAVP model achieves superior performance over state-of-the-art fixed-view approaches.

Motivation & Approach

The Challenge: Fixed-View & Dense-Encoder Limitations

Traditional robotic systems rely on static camera viewpoints and shared visual encoders, which often lead to:

  • Occlusion issues: Critical objects or end-effectors hidden from view
  • Incomplete scene understanding: Limited spatial awareness
  • Task interference: Shared encoders struggle with diverse tasks
  • Reduced generalization: Poor performance on novel scenarios

Our Solution: Task-Aware View Planning

TAVP introduces two key innovations:

  • Multi-Viewpoint Exploration Policy (MVEP): Actively selects optimal viewpoints to maximize information gain
  • Task-aware Mixture-of-Experts (TaskMoE): Dynamically routes features to specialized experts based on task requirements

Fixed View vs. TAVP Comparison

Fig. 1: Motivation Illustration. Observations captured from fixed cameras often miss parts of the target objects. For example, the front view only captures the cupboard (highlighted with a red circle), while the left and right shoulder views only show the sugar (already grasped by the end-effector and highlighted with green circles). These incomplete observations may lead to failed operations. In contrast, our proposed TAVP is designed to dynamically explore and re-render informative viewpoints that maximize coverage of target-relevant information, thereby improving the reliability of manipulation outcomes.

Problem: Fixed Viewpoints
  • Target objects often occluded
  • End-effector visibility compromised
  • Incomplete scene understanding
  • Limited spatial awareness
Solution: MVEP
  • Active viewpoint exploration
  • Dynamic multi-view re-rendering
  • Optimized camera positioning
  • Enhanced 3D perception
  • Occlusion minimization
Solution: TaskMoE
  • Task-specific feature extraction
  • Dynamic expert routing
  • Parameter sharing for similar tasks
  • Reduced task interference
  • Improved multi-task generalization

Methodology

TAVP Framework Overview
TAVP Framework Diagram

Fig. 2: Overview of our Task-Aware View Planning (TAVP) framework. The framework takes four RGB-D images from fixed viewpoints as input. It first converts them into point clouds and aggregates these into a global point cloud in the world coordinate system. The pipeline then splits into two branches. One branch (orange) performs Coarse Grounding to predict the approximate position of the end-effector, shifts the center of the global point cloud to this predicted position, and then scales and crops it to retain the important point cloud region. The other branch (green) receives the global point cloud and passes it through MVEP to predict the optimal camera parameters for the observation viewpoint. Using these parameters, a 2D image is rendered from the point cloud processed by the orange branch. This rendered image is fed into Fine Grounding to predict the final robot action, including the end-effector position, rotation, gripper status, and collision state.
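
To make the aggregation step in the caption concrete, here is a minimal NumPy sketch of back-projecting the fixed RGB-D views into one world-frame point cloud. It assumes pinhole intrinsics K and camera-to-world extrinsics T per view; function and variable names are illustrative and not taken from the TAVP code.

import numpy as np

def backproject_rgbd(depth, rgb, K, T_cam2world):
    """Lift one RGB-D image to a colored point cloud in world coordinates.
    depth: (H, W) in meters, rgb: (H, W, 3), K: (3, 3), T_cam2world: (4, 4)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel grids, shape (H, W)
    z = depth.reshape(-1)
    valid = z > 0                                    # drop missing depth readings
    # Pinhole back-projection: pixel -> camera frame.
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]   # (N, 4) homogeneous
    pts_world = (T_cam2world @ pts_cam.T).T[:, :3]                  # camera -> world frame
    return pts_world, rgb.reshape(-1, 3)[valid]

def aggregate_views(views):
    """Merge per-camera clouds into one global cloud; `views` is a list of dicts
    with keys "depth", "rgb", "K", "T"."""
    pts, cols = zip(*(backproject_rgbd(v["depth"], v["rgb"], v["K"], v["T"]) for v in views))
    return np.concatenate(pts), np.concatenate(cols)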

Our proposed Task-Aware View Planning (TAVP) framework aims to identify optimal viewpoints for accurate and robust robotic manipulation, guided by task-specific visual feature extraction. TAVP takes as input a language instruction, the current visual observations from RGB-D cameras, and the current gripper state. To enable the model to explore optimal observations from arbitrary viewpoints, we first reconstruct a 3D point cloud of the scene from the input RGB-D images. To narrow the search space and enable task-aligned view selection, we leverage the coarse prediction stage from RVT-2 to generate an area of interest. However, unlike RVT-2, which extracts visual features for all tasks using a shared Multi-View Transformer (MVT), we introduce a Task-Aware Mixture-of-Experts (TaskMoE) module before the MVT to route instructions to specialized expert encoders. This design enables more precise and task-aligned visual feature extraction, which benefits both viewpoint selection and action prediction. Starting from the identified area of interest, we employ the Multi-Viewpoint Exploration Policy (MVEP) network to search for optimal camera poses that maximize the visibility of the target object and the end-effector. The selected viewpoints are then re-rendered into image observations and processed by another TaskMoE-based MVT before being passed to the action prediction model. Finally, we upgrade the autoregressive action sequence model proposed in ARP with TaskMoE so that feature extraction and action prediction are likewise task-specific.
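
The paragraph above can be summarized as a PyTorch-style sketch of the two-branch data flow. Every submodule (coarse grounding, MVEP, renderer, TaskMoE-based MVT, fine grounding) is passed in as a placeholder for the corresponding component in the text, and details such as the crop radius are assumptions rather than the released implementation.

import torch
import torch.nn as nn

class TAVPPipeline(nn.Module):
    """Illustrative wiring of the TAVP branches; all submodules are placeholders."""

    def __init__(self, coarse_grounding, mvep, renderer, task_moe_mvt, fine_grounding):
        super().__init__()
        self.coarse_grounding = coarse_grounding   # predicts a rough end-effector position
        self.mvep = mvep                           # Multi-Viewpoint Exploration Policy
        self.renderer = renderer                   # point cloud + camera pose -> 2D image
        self.task_moe_mvt = task_moe_mvt           # TaskMoE-routed Multi-View Transformer
        self.fine_grounding = fine_grounding       # final action head

    @staticmethod
    def crop_roi(points, colors, center, radius=0.3):
        # Recenter the global cloud on the coarse prediction and keep a local region
        # (uniform rescaling is omitted here for brevity).
        mask = torch.linalg.norm(points - center, dim=-1) < radius
        return points[mask] - center, colors[mask]

    def forward(self, points, colors, instruction, gripper_state):
        # Orange branch: coarse grounding -> region of interest around the end-effector.
        center = self.coarse_grounding(points, colors, instruction, gripper_state)
        roi_pts, roi_cols = self.crop_roi(points, colors, center)

        # Green branch: MVEP proposes camera poses that maximize coverage of the
        # target object and the end-effector.
        cam_poses = self.mvep(points, colors, instruction)

        # Re-render the cropped cloud from the proposed viewpoints, then run the
        # TaskMoE-based MVT and fine grounding to obtain the final action.
        images = [self.renderer(roi_pts, roi_cols, pose) for pose in cam_poses]
        feats = self.task_moe_mvt(images, instruction)
        return self.fine_grounding(feats, gripper_state)   # position, rotation, gripper, collision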

TaskMoE Architecture

Key Innovations

  • Dynamic expert routing guided by fused instruction and scene cues
  • Decoupled gating strategy (\( N_G \) gates for \( N_J \) tasks, with \( N_G < N_J \))
  • Parameter sharing for semantically similar tasks
  • Top-k expert activation for efficiency

TaskMoE Architecture

Fig. 3: TaskMoE pipeline

To address the inherent heterogeneity of complex manipulation tasks in multi-task learning, where different tasks often require substantially distinct visual representations and action policies, we introduce a task-aware Mixture-of-Experts module (TaskMoE), as illustrated in Fig. 3. Our TaskMoE introduces two key innovations. First, instead of relying solely on task identifiers for expert selection, we incorporate richer instruction- and scene-related cues to guide expert routing more effectively, which is crucial for accurate multi-task robotic manipulation. Specifically, as shown in Fig. 3, we design a cross-modality module that employs a cross-attention mechanism to model the interaction between instruction and visual information. The resulting context-aware features are then fused with the task identifier via a Feature-wise Linear Modulation (FiLM) layer, enabling more adaptive and task-sensitive expert selection.
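
A rough PyTorch sketch of this routing signal is given below, assuming instruction and visual token features of a shared width; the layer sizes, mean pooling, and module layout are illustrative choices, not the paper's exact configuration.

import torch
import torch.nn as nn

class RoutingFeature(nn.Module):
    """Builds the context used for expert routing: instruction tokens attend to
    scene tokens, and the result is modulated by a task-ID embedding via FiLM."""

    def __init__(self, d_model=256, n_tasks=18, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.task_embed = nn.Embedding(n_tasks, d_model)
        self.film = nn.Linear(d_model, 2 * d_model)   # predicts FiLM scale and shift

    def forward(self, instr_tokens, visual_tokens, task_id):
        # instr_tokens: (B, L_i, D), visual_tokens: (B, L_v, D), task_id: (B,)
        ctx, _ = self.cross_attn(query=instr_tokens, key=visual_tokens, value=visual_tokens)
        ctx = ctx.mean(dim=1)                         # (B, D) pooled cross-modal context
        gamma, beta = self.film(self.task_embed(task_id)).chunk(2, dim=-1)
        return gamma * ctx + beta                     # (B, D) routing feature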

Second, to improve the scalability and generalization of TaskMoE, we decouple the number of routing gates from the total number of tasks. Concretely, we allocate \( N_G \) gates for all \( N_J \) tasks, where \( N_G < N_J \). This design not only accommodates task diversity but also promotes parameter sharing among tasks with similar visual or semantic characteristics. For example, as illustrated in Fig. 3, Task 1 and Task 2 (both involving opening a drawer) are routed through the same gate but directed to different experts based on their specific operation requirements. In contrast, Task 3, which is semantically dissimilar, is routed through a different gate. This setup encourages the discovery of latent task clusters and provides the capacity to generalize to unseen tasks that share structural similarities with seen ones, thereby enhancing the transferability and robustness of TaskMoE. Notably, all tasks share a common pool of \( N_E \) experts, and for each input, only the top-\( k \) experts (based on gating scores) are activated to guide task-specific visual feature extraction.
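
Continuing the sketch above, a minimal version of the decoupled gating could look as follows: a fixed task-to-gate table maps the \( N_J \) tasks onto \( N_G \) gates, every gate scores the shared pool of \( N_E \) experts, and only the top-\( k \) experts are mixed per sample. The table, the per-sample loop, and all hyperparameters are simplifications for illustration.

import torch
import torch.nn as nn

class TaskMoE(nn.Module):
    """N_G gates shared across N_J tasks route each input to the top-k of N_E experts."""

    def __init__(self, d_model=256, n_tasks=18, n_gates=6, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])
        self.gates = nn.ModuleList([nn.Linear(d_model, n_experts) for _ in range(n_gates)])
        # Fixed task -> gate table with N_G < N_J, so similar tasks can share a gate.
        self.register_buffer("task_to_gate", torch.arange(n_tasks) % n_gates)
        self.top_k = top_k

    def forward(self, x, routing_feat, task_id):
        # x: (B, D) visual feature, routing_feat: (B, D) from RoutingFeature, task_id: (B,)
        outputs = []
        for xb, rb, tb in zip(x, routing_feat, task_id):
            gate = self.gates[int(self.task_to_gate[tb])]
            scores = gate(rb).softmax(dim=-1)            # one score per expert in the shared pool
            top_scores, top_idx = scores.topk(self.top_k)
            weights = top_scores / top_scores.sum()      # renormalize over the active experts
            outputs.append(sum(w * self.experts[int(i)](xb) for w, i in zip(weights, top_idx)))
        return torch.stack(outputs)

In this simplified setup, accommodating a new task only requires assigning it a row in the task-to-gate table, so it can reuse gates and experts learned for structurally similar tasks, in line with the transferability argument above.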

Experimental Results

Simulation Task Demonstrations

Close the red jar

Put the ring on the black spoke

Screw in the silver light bulb

Take the steak off the grill

Open the top drawer

Place 2 cups on the cup holder

Put the star in the shape sorter

Stack the wine bottle to the left of the rack

Push the maroon button

Put the coffee in the cupboard

Put the item in the top drawer

Put the money away in the safe on the top shelf

Use the stick to drag the cube onto the navy target

Slide the block to pink target

Stack 2 maroon blocks

Stack the other cups on top of the maroon cup

Sweep dirt to the short dustpan

Turn right tap

Conclusion

TAVP establishes that dynamic view planning and task-aware representation learning significantly advance robotic manipulation capabilities. The MVEP module's viewpoint optimization effectively overcomes occlusion limitations in fixed-viewpoint systems, while TaskMoE's specialized feature extraction mitigates multi-task interference. These innovations collectively enhance performance across diverse manipulation challenges and enable meaningful generalization to novel tasks.

BibTeX

@article{bai2025learning,
  title={Learning to See and Act: Task-Aware View Planning for Robotic Manipulation},
  author={Bai, Yongjie and Wang, Zhouxia and Liu, Yang and Chen, Weixing and Chen, Ziliang and Dai, Mingtong and Zheng, Yongsen and Liu, Lingbo and Li, Guanbin and Lin, Liang},
  journal={arXiv preprint arXiv:2508.05186},
  year={2025}
}