NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks

Zhihao Luo1,2, Wentao Yan1, Jingyu Gong1, Min Wang3, Zhizhong Zhang1, Xuhong Wang1,2*, Yuan Xie1, Xin Tan1,2*
1East China Normal University, 2Shanghai AI Laboratory, 3SenseTime Research
*Corresponding Authors

Comparison of NaviMaster and existing agents. Previous methods train separate models for GUI and embodied navigation; NaviMaster is a single unified learning framework.

Abstract

Graphical User Interface (GUI) navigation and embodied navigation have both advanced rapidly, yet the two domains have largely evolved in isolation, with disparate datasets and training paradigms. In this paper, we observe that both tasks can be formulated as Markov Decision Processes (MDPs), suggesting a foundational principle for their unification. Hence, we present NaviMaster, the first unified agent that seamlessly integrates GUI navigation and embodied navigation within a single framework. Specifically, NaviMaster (i) proposes a visual-target trajectory collection pipeline that generates trajectories for both GUI and embodied tasks under one formulation; (ii) employs a unified reinforcement learning framework on the mixed data for better generalization; and (iii) designs a novel distance-aware reward for efficient learning from the trajectories. Through extensive experiments on out-of-domain benchmarks, NaviMaster outperforms state-of-the-art agents in GUI navigation, spatial affordance prediction, and embodied navigation. Ablation studies further confirm the efficacy of our unified training strategy, data-mixing strategy, and reward design.
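To make the shared formulation concrete, the sketch below writes both tasks as one MDP. The state and action decompositions (observation plus instruction; action type plus visual target) follow the abstract, but the symbols are our own illustration rather than notation taken from the paper.

```latex
% Both tasks as a single MDP M = (S, A, P, R, gamma):
%   s_t = (o_t, g): the current visual observation o_t (a GUI screenshot
%         or an egocentric frame) together with the instruction g
%   a_t = (c_t, p_t): an action type c_t grounded to a visual target
%         p_t = (x_t, y_t) on o_t
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
s_t = (o_t,\, g), \qquad a_t = (c_t,\, p_t), \quad p_t = (x_t, y_t)
```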

Visual-Target Trajectories Collection


Visual-Target Trajectory Collection consists of three parts. First, we unify the GUI and embodied action spaces by introducing a visual target at each step. Next, we initialize trajectories from existing datasets or scenes. Finally, we generate a first-person reasoning thought with GPT-4o, yielding our visual-target trajectories.
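As a concrete (hypothetical) data layout, each trajectory can be stored as an instruction plus a list of per-step records, where every action carries its visual target. The field names below are our own sketch, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Step:
    """One step of a visual-target trajectory."""
    observation: str    # path to a screenshot (GUI) or egocentric frame (embodied)
    thought: str        # first-person reasoning generated with GPT-4o
    action_type: str    # e.g. "click" / "type" (GUI) or "move_to" (embodied)
    visual_target: Tuple[float, float]  # (x, y) coordinates on the observation

@dataclass
class Trajectory:
    instruction: str    # the natural-language task
    steps: List[Step]
```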

Unified RL Training Framework


Overview of the unified reinforcement learning framework. The MLLM policy is optimized with GRPO using format, action-type, and dense grounding rewards.
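The caption names three reward terms and GRPO as the optimizer; the sketch below shows one plausible way to compose them. The exponential distance decay and the `sample` field names are our assumptions, while the group-relative advantage follows GRPO's standard definition.

```python
import numpy as np

def distance_aware_reward(pred_xy, target_xy, sigma=100.0):
    """Dense grounding reward that decays with the pixel distance between
    the predicted visual target and the ground truth. The exponential form
    and the sigma scale are illustrative assumptions, not the paper's
    exact formula."""
    dist = np.linalg.norm(np.asarray(pred_xy, dtype=float)
                          - np.asarray(target_xy, dtype=float))
    return float(np.exp(-dist / sigma))

def total_reward(sample):
    """Sum of the three reward terms named in the caption. `sample` is a
    hypothetical dict holding one parsed rollout and its labels."""
    r_format = 1.0 if sample["well_formed"] else 0.0                   # output follows the required format
    r_type = 1.0 if sample["pred_type"] == sample["gt_type"] else 0.0  # correct action type
    r_ground = distance_aware_reward(sample["pred_xy"], sample["gt_xy"])
    return r_format + r_type + r_ground

def grpo_advantages(group_rewards, eps=1e-6):
    """GRPO's group-relative advantage: each rollout's reward is normalized
    by the mean and std of its sampling group, so no value network is needed."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

For example, `grpo_advantages([total_reward(s) for s in group])` scores a group of rollouts sampled for the same prompt.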


Results

Results on GUI tasks. A red background indicates that the data source appears in the corresponding model's training set, while a green background indicates that the test dataset is out-of-domain (OOD) for the model. Bold highlights the best result in the OOD setting; underline marks the second-best.
Results on spatial affordance prediction. These results demonstrate that NaviMaster's fine-grained visual-spatial alignment significantly enhances performance in both object-level and free-space referring.
Results on embodied navigation. Since we are the first to train an agent model that generalizes under the VLMNav setting, there are no prior navigation models trained under VLMNav for direct comparison. NaviMaster achieves the highest Success Rate and SPL, a substantial improvement over the base model.