VLIA teaser

TL;DR: We propose a novel learning-from-human framework that explicitly models intention to capture the causal structure of manipulation behavior.

Abstract

Embodied foundation models have achieved significant breakthroughs in robotic manipulation, but they rely heavily on large amounts of robot demonstrations. Although recent works have explored leveraging human data to alleviate this dependency, effectively extracting transferable knowledge remains a significant challenge due to the inherent human-robot embodiment gap. To address this, we argue that the intention underlying human actions can serve as a powerful intermediate representation to bridge this gap. In this paper, we introduce VLIA, a novel framework that explicitly learns and transfers human intention to facilitate robotic manipulation. Specifically, we model intention through gaze, as it naturally precedes physical actions and serves as an observable proxy for human intent. VLIA is first pretrained on a large-scale egocentric human dataset to capture human intention and its synergy with action, then finetuned on a small set of robot and human data. During inference, the model adopts a Chain-of-Thought reasoning paradigm, sequentially predicting intention before executing the action. Extensive evaluations, including simulations and real-world experiments, long-horizon and fine-grained tasks, as well as few-shot learning and robustness assessments, demonstrate that our method outperforms existing baselines, exhibits exceptional generalization, and achieves state-of-the-art performance.

Dataset

Dataset examples

We use a large-scale egocentric human dataset for pretraining, which contains hand and gaze annotations with validity masks, unified coordinates, and diverse backgrounds, actions, and objects. Long videos are segmented into shorter clips, yielding more than 150M frames.
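The clip segmentation above can be sketched as follows. The fixed clip length and the non-overlapping policy are illustrative assumptions; the actual preprocessing used to build the 150M-frame corpus may differ.

```python
def segment_clips(num_frames, clip_len):
    """Split a long video into consecutive, non-overlapping clips of
    clip_len frames, dropping a trailing remainder shorter than clip_len.
    Returns a list of (start, end) frame-index pairs (end exclusive)."""
    return [
        (start, start + clip_len)
        for start in range(0, num_frames - clip_len + 1, clip_len)
    ]

# e.g. a 10-frame video with 4-frame clips yields two clips: [0, 4) and [4, 8)
clips = segment_clips(num_frames=10, clip_len=4)
```

Per-frame validity masks for the hand and gaze annotations can then be used to filter out clips whose annotations are unreliable.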

Pipeline

Pipeline overview

Our model receives a task description, an egocentric observation, and the human or robot state as inputs. It first predicts discrete intention tokens, followed by continuous action generation via an intention–action reasoning chain. By explicitly modeling intention as an intermediate representation, the framework bridges high-level task understanding and low-level control. We instantiate intention as gaze, parameterized as 2D image coordinates.
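The two-stage intention-to-action reasoning chain can be sketched as below. The class and method names are hypothetical placeholders, not the actual VLIA architecture; the toy policy only illustrates the control flow of predicting a gaze intention first and conditioning action generation on it.

```python
class IntentionActionPolicy:
    """Toy stand-in for an intention-conditioned policy (names are
    illustrative, not VLIA's real modules)."""

    def predict_intention(self, task, image, state):
        # Stage 1: in VLIA, discrete intention tokens are decoded into
        # 2D gaze coordinates; here we return the image center as a
        # placeholder gaze point (x, y).
        return (image["width"] // 2, image["height"] // 2)

    def predict_action(self, task, image, state, gaze_xy):
        # Stage 2: continuous action generation conditioned on the
        # intention. Placeholder: step the end effector toward the gaze.
        x, y = gaze_xy
        return [x - state["x"], y - state["y"]]


def infer_step(policy, task, image, state):
    gaze_xy = policy.predict_intention(task, image, state)          # intention first
    action = policy.predict_action(task, image, state, gaze_xy)     # then action
    return gaze_xy, action


gaze, action = infer_step(
    IntentionActionPolicy(),
    "Put the bottle on the plate.",
    {"height": 480, "width": 640},
    {"x": 100, "y": 200},
)
```

The key design choice mirrored here is that the action head never sees the task in isolation: it is always conditioned on the explicitly predicted intention, which is what makes the intermediate representation transferable across embodiments.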

Real-World Demonstrations

To evaluate cross-embodiment transfer and robustness in real-world deployment, we conduct extensive real-robot experiments covering gripper-based manipulation, dexterous manipulation, long-horizon tasks, and fine-grained operations. The green cross indicates the predicted intention.

Tighten the screw.

Stack the blocks.

Put the remote in the drawer.

Type "VLIA" on the keyboard.

Put the bottle on the plate.

Generalization

Put the fruit on the plate.

Object OOD

Background OOD

Quantitative Results

Real experiment success rate

Quantitative comparison between our method and baseline methods on the real-robot experiment.

Ablation Study

| Method | ID | OOD-Position | OOD-Object | OOD-Scene |
|---|---|---|---|---|
| ACT | 16/20 | 1/10 | 4/10 | 0/10 |
| DP | 13/20 | 2/10 | 3/10 | 0/10 |
| π0.5 | 17/20 | 3/10 | 4/10 | 3/10 |
| ours w/o CoT | 16/20 | 5/10 | 6/10 | 5/10 |
| ours (robot only) | 14/20 | 2/10 | 4/10 | 0/10 |
| ours (robot+human finetune) | 13/20 | 1/10 | 2/10 | 0/10 |
| ours (robot+human pretrain) | 17/20 | 3/10 | 5/10 | 2/10 |
| ours | 19/20 | 6/10 | 8/10 | 6/10 |

Quantitative evaluation of generalization performance.

Experiments

Success rates per task and method (lfa, dp, hrdt, π0.5, ours), under the ID, OOD-Distractors, and OOD-Lighting conditions.

ID:

| Task | lfa | dp | hrdt | π0.5 | ours |
|---|---|---|---|---|---|
| cube transfer | 87 | 75 | 89 | 94 | 100 |
| hook package | 23 | 7 | 13 | 20 | 32 |
| peg insertion | 10 | 2 | 17 | 15 | 18 |
| pour test tube | 41 | 23 | 24 | 34 | 39 |
| slot insertion | 43 | 32 | 42 | 54 | 60 |
| thread needle | 56 | 30 | 46 | 33 | 43 |
| average | 43 | 28 | 39 | 41 | 49 |

OOD-Distractors:

| Task | lfa | dp | hrdt | π0.5 | ours |
|---|---|---|---|---|---|
| cube transfer | 12 | 9 | 28 | 25 | 35 |
| hook package | 8 | 3 | 0 | 11 | 14 |
| peg insertion | 7 | 0 | 2 | 6 | 9 |
| pour test tube | 11 | 7 | 14 | 24 | 28 |
| slot insertion | 23 | 12 | 19 | 47 | 56 |
| thread needle | 21 | 9 | 23 | 19 | 23 |
| average | 14 | 7 | 14 | 22 | 28 |

OOD-Lighting:

| Task | lfa | dp | hrdt | π0.5 | ours |
|---|---|---|---|---|---|
| cube transfer | 0 | 0 | 12 | 32 | 36 |
| hook package | 0 | 0 | 0 | 10 | 19 |
| peg insertion | 0 | 0 | 0 | 7 | 16 |
| pour test tube | 0 | 0 | 3 | 25 | 22 |
| slot insertion | 0 | 0 | 11 | 44 | 50 |
| thread needle | 0 | 0 | 12 | 20 | 21 |
| average | 0 | 0 | 6 | 23 | 27 |

Quantitative comparison between our method and baseline methods on AV-ALOHA benchmark.

Cube transfer

Hook package

Peg insertion

Pour test tube

Slot insertion

Thread needle

BibTeX