NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks

Zhihao Luo1,2, Wentao Yan1, Jingyu Gong1, Min Wang3, Zhizhong Zhang1, Xuhong Wang1,2*, Yuan Xie1, Xin Tan1,2*
1East China Normal University, 2Shanghai AI Laboratory, 3SenseTime Research
*Corresponding Authors

Comparison of NaviMaster and existing agents. Previous methods train separate models for GUI and embodied navigation; NaviMaster is a single unified learning framework.

Abstract

Graphical User Interface (GUI) navigation and embodied navigation have both advanced rapidly, yet the two domains have largely evolved in isolation, with disparate datasets and training paradigms. In this paper, we observe that both tasks can be formulated as Markov Decision Processes (MDPs), suggesting a foundational principle for their unification. Hence, we present NaviMaster, the first unified agent that seamlessly integrates GUI navigation and embodied navigation within a single framework. Specifically, NaviMaster (i) proposes a visual-target trajectory collection pipeline that generates trajectories for both GUI and embodied tasks under one formulation; (ii) employs a unified reinforcement learning framework on the mixed data for better generalization; and (iii) designs a novel distance-aware reward for efficient learning from the trajectories. Through extensive experiments on out-of-domain benchmarks, NaviMaster outperforms state-of-the-art agents in GUI navigation, spatial affordance prediction, and embodied navigation. Ablation studies further confirm the efficacy of our unified training strategy, data-mixing strategy, and reward design.
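To make the shared formulation concrete, the sketch below writes both tasks as one MDP. The state and action decompositions (observation plus instruction; action type plus visual target) follow the abstract, but the symbols are our own illustration rather than notation taken from the paper.

```latex
% Both tasks as a single MDP M = (S, A, P, R, gamma):
%   s_t = (o_t, g): the current visual observation o_t (a GUI screenshot
%         or an egocentric frame) together with the instruction g
%   a_t = (c_t, p_t): an action type c_t grounded to a visual target
%         p_t = (x_t, y_t) on o_t
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
s_t = (o_t,\, g), \qquad a_t = (c_t,\, p_t), \quad p_t = (x_t, y_t)
```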

Visual-Target Trajectories Collection


Visual-Target Trajectory Collection consists of three parts. First, we unify the GUI and embodied action spaces by introducing a visual target at each step. Next, we initialize trajectories from existing datasets or scenes. Finally, we generate a first-person reasoning thought with GPT-4o, yielding our visual-target trajectories.
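As a concrete (hypothetical) data layout, each trajectory can be stored as an instruction plus a list of per-step records, where every action carries its visual target. The field names below are our own sketch, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Step:
    """One step of a visual-target trajectory."""
    observation: str    # path to a screenshot (GUI) or egocentric frame (embodied)
    thought: str        # first-person reasoning generated with GPT-4o
    action_type: str    # e.g. "click" / "type" (GUI) or "move_to" (embodied)
    visual_target: Tuple[float, float]  # (x, y) coordinates on the observation

@dataclass
class Trajectory:
    instruction: str    # the natural-language task
    steps: List[Step]
```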

Unified RL Training Framework


Overview of the unified reinforcement learning framework. The MLLM policy is optimized with GRPO using format, action-type, and dense grounding rewards.
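The caption names three reward terms and GRPO as the optimizer; the sketch below shows one plausible way to compose them. The exponential distance decay and the `sample` field names are our assumptions, while the group-relative advantage follows GRPO's standard definition.

```python
import numpy as np

def distance_aware_reward(pred_xy, target_xy, sigma=100.0):
    """Dense grounding reward that decays with the pixel distance between
    the predicted visual target and the ground truth. The exponential form
    and the sigma scale are illustrative assumptions, not the paper's
    exact formula."""
    dist = np.linalg.norm(np.asarray(pred_xy, dtype=float)
                          - np.asarray(target_xy, dtype=float))
    return float(np.exp(-dist / sigma))

def total_reward(sample):
    """Sum of the three reward terms named in the caption. `sample` is a
    hypothetical dict holding one parsed rollout and its labels."""
    r_format = 1.0 if sample["well_formed"] else 0.0                   # output follows the required format
    r_type = 1.0 if sample["pred_type"] == sample["gt_type"] else 0.0  # correct action type
    r_ground = distance_aware_reward(sample["pred_xy"], sample["gt_xy"])
    return r_format + r_type + r_ground

def grpo_advantages(group_rewards, eps=1e-6):
    """GRPO's group-relative advantage: each rollout's reward is normalized
    by the mean and std of its sampling group, so no value network is needed."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

For example, `grpo_advantages([total_reward(s) for s in group])` scores a group of rollouts sampled for the same prompt.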


Results

Results on GUI tasks. A red background indicates that the data source appears in the corresponding model's training set, while a green background indicates that the test dataset is out-of-domain (OOD) for the model. Bold highlights the best result in the OOD setting; underline marks the second-best.
Results on spatial affordance prediction. These results demonstrate that NaviMaster's fine-grained visual-spatial alignment significantly enhances performance in both object-level and free-space referring.
Results on embodied navigation. Since we are the first to train an agent model that generalizes under the VLMNav setting, there are no prior navigation models trained under VLMNav for direct comparison. NaviMaster achieves the highest Success Rate and SPL, a substantial improvement over the base model.