Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-Aware Virtual View Exploration (TVVE), a framework designed to overcome these challenges by integrating virtual view exploration with task-specific representation learning. TVVE employs an efficient exploration policy, accelerated by a novel pseudo-environment, to acquire informative views.
By learning to see the world in a task-aware way, TVVE generates more complete and discriminative visual representations, yielding significantly more accurate action prediction across a wide array of manipulation tasks. To further validate the robustness and generalization capability of TVVE under out-of-distribution (OOD) settings, we construct a challenging benchmark, RLBench-OG, covering various visual perturbations and camera pose variations. Extensive experiments on RLBench and RLBench-OG show that TVVE outperforms state-of-the-art approaches. In real-robot experiments, TVVE achieves high success rates and generalizes robustly in multiple OOD settings, including visual disturbances and unseen instructions.
Fig. 2: Overview of our Task-Aware Virtual View Exploration (TVVE) framework. The framework takes four RGB-D images from fixed viewpoints as input. It first converts them into point clouds and aggregates them into a global point cloud in the world coordinate system. The pipeline then splits into two branches. One branch (orange) performs Coarse Grounding to predict the approximate position of the end-effector; it then recenters the global point cloud on this predicted position and applies scaling and cropping to retain the important region. The other branch (green) feeds the global point cloud to MVEP, which predicts the optimal camera parameters for the observation viewpoint. Using these parameters, a 2D image is rendered from the point cloud processed by the orange branch. This rendered image is fed into Fine Grounding to predict the final robot action, including the end-effector position, rotation, gripper status, and collision state.
Our proposed Task-Aware Virtual View Exploration (TVVE) framework aims to identify optimal viewpoints for accurate and robust robotic manipulation, guided by task-specific visual feature extraction. TVVE takes as input a language instruction, the current visual observations from RGB-D cameras, and the current gripper state. To enable the model to explore observations from arbitrary viewpoints, we first reconstruct a 3D point cloud of the scene from the input RGB-D images. To narrow the search space and enable task-aligned view selection, we leverage the coarse prediction stage from RVT-2 to generate an area of interest. However, unlike RVT-2, which extracts visual features for all tasks with a shared Multi-View Transformer (MVT), we introduce a Task-Aware Mixture-of-Experts (TaskMoE) module before the MVT to route instructions to specialized expert encoders. This design enables more precise and task-aligned visual feature extraction, which benefits both viewpoint selection and action prediction. Starting from the identified area of interest, we employ a Multi-Viewpoint Exploration Policy (MVEP) network to search for camera poses that maximize the visibility of the target object and the end-effector. The scene is then re-rendered from the selected viewpoints into image observations and processed by another TaskMoE-based MVT before being passed to the action prediction model. For action prediction, we also upgrade the autoregressive action sequence model proposed in ARP with TaskMoE to achieve more task-specific feature extraction and action prediction.
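As a concrete reference, the minimal PyTorch sketch below outlines this inference flow under our assumptions; the module and helper names (fuse_point_cloud, crop_and_scale, the renderer, and the grounding heads) are illustrative placeholders rather than the released implementation.

```python
# A minimal sketch of the TVVE inference flow described above (PyTorch).
# All module and helper names are hypothetical placeholders.
import torch.nn as nn


def fuse_point_cloud(rgbd_views):
    """Placeholder: back-project each fixed-view RGB-D frame and merge the
    resulting point clouds in the world coordinate system."""
    ...


def crop_and_scale(point_cloud, center):
    """Placeholder: recenter the cloud on the coarse prediction, then scale
    and crop to keep the region of interest."""
    ...


class TVVE(nn.Module):
    def __init__(self, coarse, mvep, renderer, task_moe_mvt, fine):
        super().__init__()
        self.coarse = coarse              # coarse grounding stage (RVT-2-style)
        self.mvep = mvep                  # multi-viewpoint exploration policy
        self.renderer = renderer          # point-cloud -> virtual-view renderer
        self.task_moe_mvt = task_moe_mvt  # TaskMoE-routed Multi-View Transformer
        self.fine = fine                  # fine grounding / action prediction head

    def forward(self, rgbd_views, instruction, gripper_state):
        # 1) Reconstruct a global 3D point cloud from the fixed RGB-D views.
        global_pcd = fuse_point_cloud(rgbd_views)
        # 2) Coarse grounding narrows the search space to an area of interest.
        coarse_pos = self.coarse(global_pcd, instruction)
        local_pcd = crop_and_scale(global_pcd, center=coarse_pos)
        # 3) The exploration policy proposes camera poses that keep the target
        #    object and the end-effector visible.
        cam_poses = self.mvep(global_pcd, instruction)
        # 4) Re-render the scene from the selected virtual viewpoints.
        virtual_views = self.renderer(local_pcd, cam_poses)
        # 5) Task-aware feature extraction, then action prediction
        #    (position, rotation, gripper state, collision flag).
        feats = self.task_moe_mvt(virtual_views, instruction)
        return self.fine(feats, gripper_state)
```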
Fig. 3: The pipeline of our TaskMoE module.
To address the inherent heterogeneity of complex manipulation tasks in multi-task learning, where different tasks often require substantially distinct visual representations and action policies, we introduce a Task-Aware Mixture-of-Experts (TaskMoE) module, as illustrated in Fig. 3. Our TaskMoE introduces two key innovations. First, instead of relying solely on task identifiers for expert selection, we incorporate richer instruction- and scene-related cues to guide expert routing more effectively, which is crucial for accurate multi-task robotic manipulation. Specifically, as shown in Fig. 3, we design a cross-modality module that employs a cross-attention mechanism to model the interaction between instruction and visual information. The resulting context-aware features are then fused with the task identifier via a Feature-wise Linear Modulation (FiLM) layer, enabling more adaptive and task-sensitive expert selection.
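A minimal sketch of this routing signal, assuming standard multi-head cross-attention and a linear FiLM layer; the layer sizes and module names below are our own illustrative choices, not the exact design:

```python
# Illustrative routing module: instruction tokens attend to visual tokens,
# the pooled context is FiLM-modulated by the task embedding, and a linear
# gate produces per-sample expert logits.
import torch.nn as nn


class TaskMoERouter(nn.Module):
    def __init__(self, dim, num_heads=4, num_experts=8):
        super().__init__()
        # Cross-modality module: instruction queries, visual keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # FiLM: the task identifier embedding produces a scale and a shift.
        self.film = nn.Linear(dim, 2 * dim)
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, instr_tokens, visual_tokens, task_embed):
        # instr_tokens: (B, L_i, dim), visual_tokens: (B, L_v, dim),
        # task_embed: (B, dim).
        ctx, _ = self.cross_attn(instr_tokens, visual_tokens, visual_tokens)
        ctx = ctx.mean(dim=1)                       # pool to one routing feature
        gamma, beta = self.film(task_embed).chunk(2, dim=-1)
        ctx = gamma * ctx + beta                    # FiLM-style modulation
        return self.gate(ctx)                       # expert logits, (B, num_experts)
```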
Second, to improve the scalability and generalization of TaskMoE, we decouple the number of routing gates from the total number of tasks. Concretely, we allocate \( N_G \) gates for all \( N_J \) tasks, where \( N_G < N_J \). This design not only accommodates task diversity but also promotes parameter sharing among tasks with similar visual or semantic characteristics. For example, as illustrated in Fig. 3, Task 1 and Task 2 (both involving opening a drawer) are routed through the same gate but directed to different experts according to their specific operation requirements, whereas Task 3, which is semantically dissimilar, is routed through a different gate. This setup encourages the discovery of latent task clusters and provides the capacity to generalize to unseen tasks that share structural similarities with seen ones, thereby enhancing the transferability and robustness of TaskMoE. Notably, all tasks share a common pool of \( N_E \) experts, and for each input, only the top-\( k \) experts (ranked by gating scores) are activated to guide task-specific visual feature extraction.
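The decoupled gating scheme can be sketched as follows; the fixed modulo task-to-gate assignment and the two-layer expert MLPs are placeholder choices standing in for the learned task grouping and the actual expert architecture:

```python
# Sketch of decoupled gating: N_J tasks map onto N_G < N_J gates, which all
# route into a shared pool of N_E experts with top-k activation.
import torch
import torch.nn as nn


class DecoupledGateMoE(nn.Module):
    def __init__(self, dim, n_tasks=18, n_gates=6, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Placeholder modulo assignment from task id to gate id; in practice
        # this grouping would reflect task similarity.
        self.register_buffer("task_to_gate", torch.arange(n_tasks) % n_gates)
        # One linear gate per group, stored as a single weight tensor.
        self.gate_w = nn.Parameter(torch.randn(n_gates, dim, n_experts) * dim ** -0.5)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x, task_id):
        # x: (B, dim) routing features; task_id: (B,) integer task indices.
        w = self.gate_w[self.task_to_gate[task_id]]           # (B, dim, n_experts)
        logits = torch.einsum("bd,bde->be", x, w)             # (B, n_experts)
        scores, experts = logits.softmax(-1).topk(self.top_k, dim=-1)
        scores = scores / scores.sum(-1, keepdim=True)        # renormalize top-k
        out = torch.zeros_like(x)
        for k in range(self.top_k):                           # mix the k chosen experts
            for e, expert in enumerate(self.experts):
                mask = experts[:, k] == e
                if mask.any():
                    out[mask] += scores[mask, k : k + 1] * expert(x[mask])
        return out
```

In this sketch, semantically similar tasks share a gate (and its gating weights) while each input can still be dispatched to different experts, matching the behavior described for Task 1 and Task 2 above.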
To evaluate the generalization of TVVE across different robotic platforms, we conducted experiments on both the Franka Research 3 and Dobot Nova 2 robots. The Franka used a single front-mounted third-person camera, while the Dobot utilized three depth cameras at different positions. Furthermore, to assess performance across a broader range of tasks, we extended testing on the Franka platform to include ten additional tasks, such as articulated and deformable object manipulation. Experimental results show that our method achieves high success rates, highlighting its adaptability and effectiveness across diverse tasks.
We trained and deployed our TVVE on real-world manipulation tasks, and conducted analyses of both successful and failed cases. Below are the demonstration videos.
We analyzed the failure cases encountered during the real-world evaluation of our TVVE. Below are the demonstration videos.
We conducted robustness testing on the "Grape Picking" task, covering Unseen Instance, Unseen Background, Unseen Object, Heavy Occlusion, and Illumination Variation.
RLBench-OG is derived from RLBench and is used to validate the robustness of our TVVE under occluded scenarios and its generalization capability in complex environments.
RLBench-OG consists of two distinct suites: the Occlusion Suite and the Generalization Suite, with variants built across 10 tasks for evaluation. In the Occlusion Suite, we design two experimental configurations: 1) models are both trained and tested directly under the occluded task configurations; 2) models are trained in the original RLBench task settings and then evaluated in a zero-shot manner under occluded conditions. For the Generalization Suite, models are trained in the original RLBench environment and evaluated in a zero-shot manner across various generalization settings. Each task is configured with 50 training episodes. The success rate, evaluated on 25 episodes per task, is reported as mean ± standard deviation over three independent runs. To simulate occlusion, we introduce occluders or rotate the manipulated objects so that they are occluded from the frontal viewpoint. For generalization, we introduce perturbations by: 1) altering scene lighting, 2) changing the color and texture of the tabletop, 3) modifying the color and texture of the background, 4) adding distractors, and 5) adjusting the pose of the observation camera.
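The evaluation protocol above (25 evaluation episodes per task, three independent runs, mean ± standard deviation) can be summarized with the short sketch below; the suite and perturbation labels follow the text, while the success counts in the example are placeholders:

```python
# Hedged sketch of the RLBench-OG reporting protocol: per-task success rates
# over 25 evaluation episodes, aggregated across three runs as mean ± std.
import statistics

SUITES = {
    "occlusion": ["trained-on-occluded", "zero-shot-occluded"],
    "generalization": [
        "lighting", "table-color-texture", "background-color-texture",
        "distractors", "camera-pose",
    ],
}
EPISODES_PER_EVAL = 25
NUM_RUNS = 3


def report(success_counts):
    """success_counts: successful episodes per run, e.g. [19, 21, 20]."""
    assert len(success_counts) == NUM_RUNS
    rates = [100.0 * c / EPISODES_PER_EVAL for c in success_counts]
    return f"{statistics.mean(rates):.1f} ± {statistics.stdev(rates):.1f}"


print(report([19, 21, 20]))  # -> "80.0 ± 4.0" (placeholder counts)
```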
TVVE is evaluated on RLBench-OG, which stresses both distribution shifts and occlusions. The following clips summarize successful adaptations and challenging failure cases across the Generalization and Occlusion suites.
Close the red jar
Put the ring on the black spoke
Screw in the silver light bulb
Take the steak off the grill
Open the top drawer
Place 2 cups on the cup holder
Put the star in the shape sorter
Stack the wine bottle to the left of the rack
Push the maroon button
Put the coffee in the cupboard
Put the item in the top drawer
Put the money away in the safe on the top shelf
Use the stick to drag the cube onto the navy target
Slide the block to pink target
Stack 2 maroon blocks
Stack the other cups on top of the maroon cup
Sweep dirt to the short dustpan
Turn right tap
TVVE demonstrates that dynamic view planning and task-aware representation learning can significantly advance robotic manipulation. The MVEP module's viewpoint optimization effectively overcomes the occlusion limitations of fixed-viewpoint systems, while TaskMoE's specialized feature extraction mitigates multi-task interference. Together, these components enhance performance across diverse manipulation challenges and enable meaningful generalization to novel tasks.