Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method

Xinshuai Song1*, Weixing Chen1*, Yang Liu1†, Vincent Chan2, Guanbin Li1,3, Liang Lin1,3
1Sun Yat-sen University 2Independent Researcher 3Pengcheng Laboratory
*Equal Contribution
†Corresponding Author

Abstract

Existing Vision-Language Navigation (VLN) methods primarily focus on single-stage navigation, limiting their effectiveness in multi-stage and long-horizon tasks within complex and dynamic environments. To address these limitations, we propose a novel VLN task, named Long-Horizon Vision-Language Navigation (LH-VLN), which emphasizes long-term planning and decision consistency across consecutive subtasks. To support LH-VLN, we develop an automated data generation platform, NavGen, which constructs datasets with complex task structures and improves data utility through a bidirectional, multi-granularity generation approach. To accurately evaluate complex tasks, we construct the Long-Horizon Planning and Reasoning in VLN (LHPR-VLN) benchmark, consisting of 3,260 tasks with an average of 150 task steps, serving as the first dataset specifically designed for long-horizon vision-language navigation. Furthermore, we propose Independent Success Rate (ISR), Conditional Success Rate (CSR), and CSR weighted by Ground Truth (CGT) metrics to provide fine-grained assessments of task completion. To improve model adaptability in complex tasks, we propose a novel Multi-Granularity Dynamic Memory (MGDM) module that integrates short-term memory blurring with long-term memory retrieval to enable flexible navigation in dynamic environments. Our platform, benchmark, and method supply LH-VLN with a robust data generation pipeline, a comprehensive model evaluation dataset, reasonable metrics, and a novel VLN model, establishing a foundational framework for advancing LH-VLN.
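For intuition, the sketch below shows one way fine-grained multi-stage metrics of this kind can be computed in Python. The success flags, the conditioning rule, and the step-count weighting here are illustrative assumptions for a sketch, not the paper's exact definitions of ISR, CSR, and CGT.

```python
from typing import List

def isr(success: List[bool]) -> float:
    """Independent Success Rate: fraction of subtasks completed,
    each judged on its own (illustrative definition)."""
    return sum(success) / len(success)

def csr(success: List[bool]) -> float:
    """Conditional Success Rate: a subtask only earns credit if
    every preceding subtask also succeeded (illustrative definition)."""
    credited = sum(1 for i, ok in enumerate(success) if ok and all(success[:i]))
    return credited / len(success)

def cgt(success: List[bool], gt_steps: List[int]) -> float:
    """CSR weighted by Ground Truth: like CSR, but each subtask's
    credit is weighted by its ground-truth step count, so longer
    subtasks contribute more (illustrative weighting)."""
    credited = sum(steps for i, (ok, steps) in enumerate(zip(success, gt_steps))
                   if ok and all(success[:i]))
    return credited / sum(gt_steps)

# Example: three subtasks, the second fails mid-sequence.
flags = [True, False, True]
print(isr(flags))                 # 0.667: two of three succeed independently
print(csr(flags))                 # 0.333: the third gets no credit after a failure
print(cgt(flags, [40, 60, 50]))   # 0.267: only the 40-step subtask counts
```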

Overview

Framework Overview. Unlike existing vision-language navigation, object loco-navigation, and demand-driven navigation benchmarks, LH-VLN divides navigation into multiple subtasks that the agent must complete sequentially within the scene. Our data generation framework provides a general LH-VLN task generation pipeline and yields the newly built LHPR-VLN benchmark for multi-stage navigation tasks. Our navigation model, built on chain-of-thought (CoT) feedback and an adaptive memory design, achieves efficient navigation by utilizing CoT prompts together with dynamic long-term and short-term memories.
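To make the multi-stage structure concrete, here is a minimal sketch of how an LH-VLN task could be represented as an ordered list of subtasks within one scene. The field names are hypothetical, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubTask:
    """One navigation stage, e.g. 'find the towel in the bathroom'.
    Field names are hypothetical, not the benchmark's schema."""
    instruction: str
    target_object: str
    target_room: str

@dataclass
class LHVLNTask:
    """A long-horizon task: subtasks must be completed in order within
    a single scene, so later stages depend on earlier ones."""
    scene_id: str
    instruction: str                           # overall natural-language goal
    subtasks: List[SubTask] = field(default_factory=list)

task = LHVLNTask(
    scene_id="scene_0001",
    instruction="Bring the towel from the bathroom to the box in the living room.",
    subtasks=[
        SubTask("Find the towel in the bathroom.", "towel", "bathroom"),
        SubTask("Find the box in the living room.", "box", "living room"),
    ],
)
```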

NavGen: An Automated Data Generation Platform

NavGen data generation platform. Forward generation produces complex LH-VLN tasks and their corresponding subtasks by prompting GPT-4 with sampled assets. The sampled assets are deployed in the simulator to build the simulation environment, and trajectory data is then generated from the navigation model or expert decisions. In backward generation, the trajectory of each subtask is split into action-label pairs by a trajectory-splitting algorithm according to the trajectory type; these pairs are then fed into GPT-4 to generate step-by-step tasks.
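The two generation directions can be summarized as a runnable sketch. Every helper below (asset sampler, GPT-4 prompts, trajectory splitter) is a hypothetical stub standing in for the real platform component, so the control flow is the point, not the stub bodies.

```python
def sample_assets(pool):
    return pool[:2]  # stub: sample scene/object assets from an asset pool

def prompt_gpt4_for_task(assets):
    """Stub for the GPT-4 call that writes a complex task plus subtasks."""
    task = f"Visit the {assets[0]}, then the {assets[1]}."
    return task, [f"Find the {a}." for a in assets]

def rollout(subtask):
    return ["move_forward", "turn_left", "stop"]  # stub expert/model trajectory

def split_trajectory(traj):
    return [(a, f"step {i}") for i, a in enumerate(traj)]  # action-label pairs

def prompt_gpt4_for_steps(subtask, pairs):
    """Stub for the GPT-4 call that turns action-label pairs into steps."""
    return [f"{subtask} [{label}] -> {action}" for action, label in pairs]

def forward_generation(pool):
    """Forward: sampled assets -> GPT-4 task/subtasks -> simulator rollouts."""
    assets = sample_assets(pool)
    task, subtasks = prompt_gpt4_for_task(assets)
    trajectories = [rollout(st) for st in subtasks]
    return task, subtasks, trajectories

def backward_generation(subtasks, trajectories):
    """Backward: split each trajectory into action-label pairs, then have
    the LLM rewrite them as fine-grained step-by-step tasks."""
    steps = []
    for st, traj in zip(subtasks, trajectories):
        steps.extend(prompt_gpt4_for_steps(st, split_trajectory(traj)))
    return steps

task, subtasks, trajs = forward_generation(["towel", "box"])
print(backward_generation(subtasks, trajs))
```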

Model

The framework of the Multi-Granularity Dynamic Memory (MGDM) model. The CoT feedback module receives task instructions and, based on historical observations from the corresponding memory, generates a chain of thought and constructs language prompts. The short-term memory module minimizes the entropy of the confidence vector, using pooling operations to forget and blur the memory sequence. The long-term memory module retrieves and matches data from the dataset to weight the decisions of the LLM, ultimately determining the action the agent executes.
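As a rough illustration of the short-term memory idea, the sketch below average-pools adjacent memory frames whenever pooling lowers the entropy of the confidence vector, keeping the sequence only as detailed as the agent's confidence warrants. The pooling rule and the max-pooling of confidences are assumptions made for this sketch, not the paper's exact mechanism.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a normalized confidence vector."""
    p = p / p.sum()
    return -(p * np.log(p + 1e-12)).sum()

def blur_short_term_memory(memory, confidence, window=2):
    """Illustrative memory blurring: average-pool adjacent memory frames
    if doing so reduces the entropy of the confidence vector; otherwise
    keep the full-resolution sequence. Assumes len(memory) % window == 0."""
    pooled_mem = memory.reshape(-1, window, memory.shape[-1]).mean(axis=1)
    pooled_conf = confidence.reshape(-1, window).max(axis=1)
    if entropy(pooled_conf) < entropy(confidence):
        return pooled_mem, pooled_conf   # forget fine detail: blurred memory
    return memory, confidence            # confidence already peaked: keep detail

mem = np.random.randn(8, 16)             # 8 frames of 16-d memory features
conf = np.array([.1, .1, .1, .1, .1, .1, .2, .2])
new_mem, new_conf = blur_short_term_memory(mem, conf)
print(new_mem.shape, new_conf)           # (4, 16) once blurring is triggered
```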

Experiments

Visualization of a successful long-horizon navigation by our MGDM. Aligned landmarks are highlighted with colored bounding boxes in the images and with matching colors in the instruction. In the first navigation segment, the agent looks for a towel in the bathroom; although it does not enter the bathroom, it successfully finds both the bathroom and the towel and gets close enough to the towel for the subtask to be marked as successful. In the next phase, the agent successfully finds the box in the living room.

BibTeX

@article{song2024towards,
  title={Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method},
  author={Song, Xinshuai and Chen, Weixing and Liu, Yang and Chan, Vincent and Li, Guanbin and Lin, Liang},
  journal={arXiv preprint arXiv:2412.09082},
  year={2024}
}