Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-Aware Virtual View Exploration (TVVE), a framework designed to overcome these challenges by integrating virtual view exploration with task-specific representation learning. TVVE employs an efficient exploration policy, accelerated by a novel pseudo-environment, to acquire informative views.
By learning to see the world in a task-aware way, TVVE generates more complete and discriminative visual representations, yielding significantly more accurate action prediction across a wide array of manipulation tasks. To further validate the robustness and generalization capability of TVVE under out-of-distribution (OOD) settings, we construct a challenging benchmark, RLBench-OG, covering various visual perturbations and camera pose variations. Extensive experiments on RLBench and RLBench-OG show that TVVE outperforms state-of-the-art approaches. In real-robot experiments, TVVE achieves high success rates and generalizes robustly in multiple OOD settings, including visual disturbances and unseen instructions.
Fig. 2: Overview of our Task-Aware Virtual View Exploration (TVVE) framework. The framework takes four RGB-D images from fixed viewpoints as input. It first converts them into point clouds and aggregates them into a global point cloud in the world coordinate system. The pipeline then splits into two branches. One branch (orange) performs Coarse Grounding to predict the approximate position of the end-effector; it then recenters the global point cloud on this predicted position and applies scaling and cropping to retain the important region. The other branch (green) feeds the global point cloud to MVEP, which predicts the optimal camera parameters for the observation viewpoint. Using these parameters, a 2D image is rendered from the point cloud processed by the orange branch. This rendered image is fed into Fine Grounding to predict the final robot action, including the end-effector position, rotation, gripper status, and collision state.
Our proposed Task-Aware Virtual View Exploration (TVVE) framework aims to identify optimal viewpoints for accurate and robust robotic manipulation, guided by task-specific visual feature extraction. TVVE takes as input a language instruction, the current visual observations from RGB-D cameras, and the current gripper state. To enable the model to explore observations from arbitrary viewpoints, we first reconstruct a 3D point cloud of the scene from the input RGB-D images. To narrow the search space and enable task-aligned view selection, we leverage the coarse prediction stage from RVT-2 to generate an area of interest. However, unlike RVT-2, which extracts visual features for all tasks with a shared Multi-View Transformer (MVT), we introduce a Task-Aware Mixture-of-Experts (TaskMoE) module before the MVT to route instructions to specialized expert encoders. This design enables more precise and task-aligned visual feature extraction, which benefits both viewpoint selection and action prediction. Starting from the identified area of interest, we employ a Multi-Viewpoint Exploration Policy (MVEP) network to search for camera poses that maximize the visibility of the target object and the end-effector. The scene is then re-rendered from the selected viewpoints into image observations and processed by another TaskMoE-based MVT before being passed to the action prediction model. For action prediction, we also upgrade the autoregressive action sequence model proposed in ARP with TaskMoE to achieve more task-specific feature extraction and action prediction.
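As a concrete reference, the minimal PyTorch sketch below outlines this inference flow under our assumptions; the module and helper names (fuse_point_cloud, crop_and_scale, the renderer, and the grounding heads) are illustrative placeholders rather than the released implementation.

```python
# A minimal sketch of the TVVE inference flow described above (PyTorch).
# All module and helper names are hypothetical placeholders.
import torch.nn as nn


def fuse_point_cloud(rgbd_views):
    """Placeholder: back-project each fixed-view RGB-D frame and merge the
    resulting point clouds in the world coordinate system."""
    ...


def crop_and_scale(point_cloud, center):
    """Placeholder: recenter the cloud on the coarse prediction, then scale
    and crop to keep the region of interest."""
    ...


class TVVE(nn.Module):
    def __init__(self, coarse, mvep, renderer, task_moe_mvt, fine):
        super().__init__()
        self.coarse = coarse              # coarse grounding stage (RVT-2-style)
        self.mvep = mvep                  # multi-viewpoint exploration policy
        self.renderer = renderer          # point-cloud -> virtual-view renderer
        self.task_moe_mvt = task_moe_mvt  # TaskMoE-routed Multi-View Transformer
        self.fine = fine                  # fine grounding / action prediction head

    def forward(self, rgbd_views, instruction, gripper_state):
        # 1) Reconstruct a global 3D point cloud from the fixed RGB-D views.
        global_pcd = fuse_point_cloud(rgbd_views)
        # 2) Coarse grounding narrows the search space to an area of interest.
        coarse_pos = self.coarse(global_pcd, instruction)
        local_pcd = crop_and_scale(global_pcd, center=coarse_pos)
        # 3) The exploration policy proposes camera poses that keep the target
        #    object and the end-effector visible.
        cam_poses = self.mvep(global_pcd, instruction)
        # 4) Re-render the scene from the selected virtual viewpoints.
        virtual_views = self.renderer(local_pcd, cam_poses)
        # 5) Task-aware feature extraction, then action prediction
        #    (position, rotation, gripper state, collision flag).
        feats = self.task_moe_mvt(virtual_views, instruction)
        return self.fine(feats, gripper_state)
```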
Fig. 3: The pipeline of our TaskMoE module.
To address the inherent heterogeneity of complex manipulation tasks in multi-task learning, where different tasks often require substantially distinct visual representations and action policies, we introduce a Task-Aware Mixture-of-Experts (TaskMoE) module, as illustrated in Fig. 3. Our TaskMoE introduces two key innovations. First, instead of relying solely on task identifiers for expert selection, we incorporate richer instruction- and scene-related cues to guide expert routing more effectively, which is crucial for accurate multi-task robotic manipulation. Specifically, as shown in Fig. 3, we design a cross-modality module that employs a cross-attention mechanism to model the interaction between instruction and visual information. The resulting context-aware features are then fused with the task identifier via a Feature-wise Linear Modulation (FiLM) layer, enabling more adaptive and task-sensitive expert selection.
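A minimal sketch of this routing signal, assuming standard multi-head cross-attention and a linear FiLM layer; the layer sizes and module names below are our own illustrative choices, not the exact design:

```python
# Illustrative routing module: instruction tokens attend to visual tokens,
# the pooled context is FiLM-modulated by the task embedding, and a linear
# gate produces per-sample expert logits.
import torch.nn as nn


class TaskMoERouter(nn.Module):
    def __init__(self, dim, num_heads=4, num_experts=8):
        super().__init__()
        # Cross-modality module: instruction queries, visual keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # FiLM: the task identifier embedding produces a scale and a shift.
        self.film = nn.Linear(dim, 2 * dim)
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, instr_tokens, visual_tokens, task_embed):
        # instr_tokens: (B, L_i, dim), visual_tokens: (B, L_v, dim),
        # task_embed: (B, dim).
        ctx, _ = self.cross_attn(instr_tokens, visual_tokens, visual_tokens)
        ctx = ctx.mean(dim=1)                       # pool to one routing feature
        gamma, beta = self.film(task_embed).chunk(2, dim=-1)
        ctx = gamma * ctx + beta                    # FiLM-style modulation
        return self.gate(ctx)                       # expert logits, (B, num_experts)
```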
Second, to improve the scalability and generalization of TaskMoE, we decouple the number of routing gates from the total number of tasks. Concretely, we allocate \( N_G \) gates for all \( N_J \) tasks, where \( N_G < N_J \). This design not only accommodates task diversity but also promotes parameter sharing among tasks with similar visual or semantic characteristics. For example, as illustrated in Fig. 3, Task 1 and Task 2 (both involving opening a drawer) are routed through the same gate but directed to different experts according to their specific operation requirements, whereas Task 3, which is semantically dissimilar, is routed through a different gate. This setup encourages the discovery of latent task clusters and provides the capacity to generalize to unseen tasks that share structural similarities with seen ones, thereby enhancing the transferability and robustness of TaskMoE. Notably, all tasks share a common pool of \( N_E \) experts, and for each input, only the top-\( k \) experts (ranked by gating scores) are activated to guide task-specific visual feature extraction.
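The decoupled gating scheme can be sketched as follows; the fixed modulo task-to-gate assignment and the two-layer expert MLPs are placeholder choices standing in for the learned task grouping and the actual expert architecture:

```python
# Sketch of decoupled gating: N_J tasks map onto N_G < N_J gates, which all
# route into a shared pool of N_E experts with top-k activation.
import torch
import torch.nn as nn


class DecoupledGateMoE(nn.Module):
    def __init__(self, dim, n_tasks=18, n_gates=6, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Placeholder modulo assignment from task id to gate id; in practice
        # this grouping would reflect task similarity.
        self.register_buffer("task_to_gate", torch.arange(n_tasks) % n_gates)
        # One linear gate per group, stored as a single weight tensor.
        self.gate_w = nn.Parameter(torch.randn(n_gates, dim, n_experts) * dim ** -0.5)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x, task_id):
        # x: (B, dim) routing features; task_id: (B,) integer task indices.
        w = self.gate_w[self.task_to_gate[task_id]]           # (B, dim, n_experts)
        logits = torch.einsum("bd,bde->be", x, w)             # (B, n_experts)
        scores, experts = logits.softmax(-1).topk(self.top_k, dim=-1)
        scores = scores / scores.sum(-1, keepdim=True)        # renormalize top-k
        out = torch.zeros_like(x)
        for k in range(self.top_k):                           # mix the k chosen experts
            for e, expert in enumerate(self.experts):
                mask = experts[:, k] == e
                if mask.any():
                    out[mask] += scores[mask, k : k + 1] * expert(x[mask])
        return out
```

In this sketch, semantically similar tasks share a gate (and its gating weights) while each input can still be dispatched to different experts, matching the behavior described for Task 1 and Task 2 above.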
To evaluate the generalization of TVVE across different robotic platforms, we conducted experiments on both the Franka Research 3 and Dobot Nova 2 robots. The Franka used a single front-mounted third-person camera, while the Dobot utilized three depth cameras at different positions. Furthermore, to assess performance across a broader range of tasks, we extended testing on the Franka platform to include ten additional tasks, such as articulated and deformable object manipulation. Experimental results show that our method achieves high success rates, highlighting its adaptability and effectiveness across diverse tasks.
We trained and deployed our TVVE on real-world manipulation tasks, and conducted analyses of both successful and failed cases. Below are the demonstration videos.
We analyzed the failure cases encountered during the real-world evaluation of our TVVE. Below are the demonstration videos.
We conducted robustness testing on the "Grape Picking" task, covering Unseen Instance, Unseen Background, Unseen Object, Heavy Occlusion, and Illumination Variation.
RLBench-OG is derived from RLBench and is used to validate the robustness of our TVVE under occluded scenarios and its generalization capability in complex environments.
RLBench-OG consists of two distinct suites: the Occlusion Suite and the Generalization Suite, with variants built across 10 tasks for evaluation. In the Occlusion Suite, we design two experimental configurations: 1) models are both trained and tested directly under the occluded task configurations; 2) models are trained in the original RLBench task settings and then evaluated in a zero-shot manner under occluded conditions. For the Generalization Suite, models are trained in the original RLBench environment and evaluated in a zero-shot manner across various generalization settings. Each task is configured with 50 training episodes. The success rate, evaluated on 25 episodes per task, is reported as mean ± standard deviation over three independent runs. To simulate occlusion, we introduce occluders or rotate the manipulated objects so that they are occluded from the frontal viewpoint. For generalization, we introduce perturbations by: 1) altering scene lighting, 2) changing the color and texture of the tabletop, 3) modifying the color and texture of the background, 4) adding distractors, and 5) adjusting the pose of the observation camera.
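The evaluation protocol above (25 evaluation episodes per task, three independent runs, mean ± standard deviation) can be summarized with the short sketch below; the suite and perturbation labels follow the text, while the success counts in the example are placeholders:

```python
# Hedged sketch of the RLBench-OG reporting protocol: per-task success rates
# over 25 evaluation episodes, aggregated across three runs as mean ± std.
import statistics

SUITES = {
    "occlusion": ["trained-on-occluded", "zero-shot-occluded"],
    "generalization": [
        "lighting", "table-color-texture", "background-color-texture",
        "distractors", "camera-pose",
    ],
}
EPISODES_PER_EVAL = 25
NUM_RUNS = 3


def report(success_counts):
    """success_counts: successful episodes per run, e.g. [19, 21, 20]."""
    assert len(success_counts) == NUM_RUNS
    rates = [100.0 * c / EPISODES_PER_EVAL for c in success_counts]
    return f"{statistics.mean(rates):.1f} ± {statistics.stdev(rates):.1f}"


print(report([19, 21, 20]))  # -> "80.0 ± 4.0" (placeholder counts)
```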
TVVE is evaluated on RLBench-OG, which stresses both distribution shifts and occlusions. The following clips summarize successful adaptations and challenging failure cases across the Generalization and Occlusion suites.
Close the red jar
Put the ring on the black spoke
Screw in the silver light bulb
Take the steak off the grill
Open the top drawer
Place 2 cups on the cup holder
Put the star in the shape sorter
Stack the wine bottle to the left of the rack
Push the maroon button
Put the coffee in the cupboard
Put the item in the top drawer
Put the money away in the safe on the top shelf
Use the stick to drag the cube onto the navy target
Slide the block to pink target
Stack 2 maroon blocks
Stack the other cups on top of the maroon cup
Sweep dirt to the short dustpan
Turn right tap
TVVE demonstrates that dynamic view planning and task-aware representation learning can significantly advance robotic manipulation. The MVEP module's viewpoint optimization effectively overcomes the occlusion limitations of fixed-viewpoint systems, while TaskMoE's specialized feature extraction mitigates multi-task interference. Together, these components enhance performance across diverse manipulation challenges and enable meaningful generalization to novel tasks.