VLIA teaser

TL;DR: We propose a novel learning-from-human framework that explicitly models intention to capture the causal structure of manipulation behavior.

Abstract

Embodied foundation models have achieved significant breakthroughs in robotic manipulation, but they rely heavily on large amounts of robot demonstrations. Although recent works have explored leveraging human data to alleviate this dependency, effectively extracting transferable knowledge remains a significant challenge due to the inherent human-robot embodiment gap. To address this, we argue that the intention underlying human actions can serve as a powerful intermediate representation to bridge this gap. In this paper, we introduce VLIA, a novel framework that explicitly learns and transfers human intention to facilitate robotic manipulation. Specifically, we model intention through gaze, as it naturally precedes physical actions and serves as an observable proxy for human intent. VLIA is first pretrained on a large-scale egocentric human dataset to capture human intention and its synergy with action, then finetuned on a small set of robot and human data. During inference, the model adopts a Chain-of-Thought reasoning paradigm, sequentially predicting intention before executing the action. Extensive evaluations, including simulations and real-world experiments, long-horizon and fine-grained tasks, as well as few-shot learning and robustness assessments, demonstrate that our method outperforms existing baselines, exhibits exceptional generalization, and achieves state-of-the-art performance.

Dataset

Dataset examples

We use a large-scale egocentric human dataset for pretraining, which contains hand and gaze annotations with validity masks, unified coordinates, and diverse backgrounds, actions, and objects. Long videos are segmented into shorter clips, yielding more than 150M frames.
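The clip segmentation above can be sketched as follows. The fixed clip length and the non-overlapping policy are illustrative assumptions; the actual preprocessing used to build the 150M-frame corpus may differ.

```python
def segment_clips(num_frames, clip_len):
    """Split a long video into consecutive, non-overlapping clips of
    clip_len frames, dropping a trailing remainder shorter than clip_len.
    Returns a list of (start, end) frame-index pairs (end exclusive)."""
    return [
        (start, start + clip_len)
        for start in range(0, num_frames - clip_len + 1, clip_len)
    ]

# e.g. a 10-frame video with 4-frame clips yields two clips: [0, 4) and [4, 8)
clips = segment_clips(num_frames=10, clip_len=4)
```

Per-frame validity masks for the hand and gaze annotations can then be used to filter out clips whose annotations are unreliable.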

Pipeline

Pipeline overview

Our model receives a task description, an egocentric observation, and the human or robot state as inputs. It first predicts discrete intention tokens, followed by continuous action generation via an intention–action reasoning chain. By explicitly modeling intention as an intermediate representation, the framework bridges high-level task understanding and low-level control. We instantiate intention as gaze, parameterized as 2D image coordinates.
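The two-stage intention-to-action reasoning chain can be sketched as below. The class and method names are hypothetical placeholders, not the actual VLIA architecture; the toy policy only illustrates the control flow of predicting a gaze intention first and conditioning action generation on it.

```python
class IntentionActionPolicy:
    """Toy stand-in for an intention-conditioned policy (names are
    illustrative, not VLIA's real modules)."""

    def predict_intention(self, task, image, state):
        # Stage 1: in VLIA, discrete intention tokens are decoded into
        # 2D gaze coordinates; here we return the image center as a
        # placeholder gaze point (x, y).
        return (image["width"] // 2, image["height"] // 2)

    def predict_action(self, task, image, state, gaze_xy):
        # Stage 2: continuous action generation conditioned on the
        # intention. Placeholder: step the end effector toward the gaze.
        x, y = gaze_xy
        return [x - state["x"], y - state["y"]]


def infer_step(policy, task, image, state):
    gaze_xy = policy.predict_intention(task, image, state)          # intention first
    action = policy.predict_action(task, image, state, gaze_xy)     # then action
    return gaze_xy, action


gaze, action = infer_step(
    IntentionActionPolicy(),
    "Put the bottle on the plate.",
    {"height": 480, "width": 640},
    {"x": 100, "y": 200},
)
```

The key design choice mirrored here is that the action head never sees the task in isolation: it is always conditioned on the explicitly predicted intention, which is what makes the intermediate representation transferable across embodiments.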

Real-World Demonstrations

To evaluate cross-embodiment transfer and robustness in real-world deployment, we conduct extensive real-robot experiments covering gripper-based manipulation, dexterous manipulation, long-horizon tasks, and fine-grained operations. The green cross indicates the predicted intention.

Tighten the screw.

Stack the blocks.

Put the remote in the drawer.

Type "VLIA" on the keyboard.

Put the bottle on the plate.

Generalization

Put the fruit on the plate.

Object OOD

Background OOD

Quantitative Results

Real experiment success rate

Quantitative comparison between our method and baseline methods on the real-robot experiment.

Ablation Study

| Method | ID | OOD-Position | OOD-Object | OOD-Scene |
|---|---|---|---|---|
| ACT | 16/20 | 1/10 | 4/10 | 0/10 |
| DP | 13/20 | 2/10 | 3/10 | 0/10 |
| π0.5 | 17/20 | 3/10 | 4/10 | 3/10 |
| ours w/o CoT | 16/20 | 5/10 | 6/10 | 5/10 |
| ours (robot only) | 14/20 | 2/10 | 4/10 | 0/10 |
| ours (robot+human finetune) | 13/20 | 1/10 | 2/10 | 0/10 |
| ours (robot+human pretrain) | 17/20 | 3/10 | 5/10 | 2/10 |
| ours | 19/20 | 6/10 | 8/10 | 6/10 |

Quantitative evaluation of generalization performance.

Experiments

Success rates per task and method (lfa, dp, hrdt, π0.5, ours), under the ID, OOD-Distractors, and OOD-Lighting conditions.

ID:

| Task | lfa | dp | hrdt | π0.5 | ours |
|---|---|---|---|---|---|
| cube transfer | 87 | 75 | 89 | 94 | 100 |
| hook package | 23 | 7 | 13 | 20 | 32 |
| peg insertion | 10 | 2 | 17 | 15 | 18 |
| pour test tube | 41 | 23 | 24 | 34 | 39 |
| slot insertion | 43 | 32 | 42 | 54 | 60 |
| thread needle | 56 | 30 | 46 | 33 | 43 |
| average | 43 | 28 | 39 | 41 | 49 |

OOD-Distractors:

| Task | lfa | dp | hrdt | π0.5 | ours |
|---|---|---|---|---|---|
| cube transfer | 12 | 9 | 28 | 25 | 35 |
| hook package | 8 | 3 | 0 | 11 | 14 |
| peg insertion | 7 | 0 | 2 | 6 | 9 |
| pour test tube | 11 | 7 | 14 | 24 | 28 |
| slot insertion | 23 | 12 | 19 | 47 | 56 |
| thread needle | 21 | 9 | 23 | 19 | 23 |
| average | 14 | 7 | 14 | 22 | 28 |

OOD-Lighting:

| Task | lfa | dp | hrdt | π0.5 | ours |
|---|---|---|---|---|---|
| cube transfer | 0 | 0 | 12 | 32 | 36 |
| hook package | 0 | 0 | 0 | 10 | 19 |
| peg insertion | 0 | 0 | 0 | 7 | 16 |
| pour test tube | 0 | 0 | 3 | 25 | 22 |
| slot insertion | 0 | 0 | 11 | 44 | 50 |
| thread needle | 0 | 0 | 12 | 20 | 21 |
| average | 0 | 0 | 6 | 23 | 27 |

Quantitative comparison between our method and baseline methods on AV-ALOHA benchmark.

Cube transfer

Hook package

Peg insertion

Pour test tube

Slot insertion

Thread needle

BibTeX