TL;DR: We propose a novel framework for learning from human data that explicitly models intention to capture the causal structure of manipulation behavior.
Embodied foundation models have achieved significant breakthroughs in robotic manipulation, but they rely heavily on large amounts of robot demonstrations. Although recent works have explored leveraging human data to alleviate this dependency, effectively extracting transferable knowledge remains challenging due to the inherent human-robot embodiment gap. To address this, we argue that the intention underlying human actions can serve as a powerful intermediate representation for bridging this gap. In this paper, we introduce VLIA, a novel framework that explicitly learns and transfers human intention to facilitate robotic manipulation. Specifically, we model intention through gaze, which naturally precedes physical action and serves as an observable proxy for human intent. VLIA is first pretrained on a large-scale egocentric human dataset to capture human intention and its synergy with action, then finetuned on a small set of robot and human data. During inference, the model adopts a Chain-of-Thought reasoning paradigm, sequentially predicting intention before executing the action. Extensive evaluations, spanning simulation and real-world experiments, long-horizon and fine-grained tasks, and few-shot learning and robustness assessments, demonstrate that our method outperforms existing baselines, exhibits strong generalization, and achieves state-of-the-art performance.
We pretrain on a large-scale egocentric human dataset containing hand and gaze annotations with validity masks, unified coordinates, and diverse backgrounds, actions, and objects. Long videos are segmented into shorter clips, yielding more than 150M frames.
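The clip segmentation above can be sketched as follows. This is an illustrative implementation, not the authors' pipeline: the clip length, the validity threshold, and the boolean per-frame mask format are all assumptions for the sake of a concrete example.

```python
def segment_clips(valid_mask, clip_len=16, min_valid=0.8):
    """Split a long video into fixed-length clips, keeping only clips
    whose fraction of frames with valid gaze/hand annotations (per the
    dataset's validity mask) is at least min_valid.

    valid_mask: sequence of 0/1 flags, one per frame.
    Returns a list of (start, end) frame-index ranges.
    """
    clips = []
    # Step through the video in non-overlapping windows of clip_len frames.
    for start in range(0, len(valid_mask) - clip_len + 1, clip_len):
        window = valid_mask[start:start + clip_len]
        if sum(window) / clip_len >= min_valid:
            clips.append((start, start + clip_len))
    return clips
```

For example, a 48-frame video whose middle 16 frames lack valid annotations would yield two usable clips, covering frames 0-15 and 32-47.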
Our model receives a task description, an egocentric observation, and the human or robot state as inputs. It first predicts discrete intention tokens, followed by continuous action generation via an intention–action reasoning chain. By explicitly modeling intention as an intermediate representation, the framework bridges high-level task understanding and low-level control. We instantiate intention as gaze, parameterized as 2D image coordinates.
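One way to make the intention-action reasoning chain concrete is to discretize the 2D gaze coordinate into per-axis bin tokens and condition the action head on them. The sketch below is a hedged illustration, not the authors' code: the bin count, the token layout, and the `policy.predict_intention` / `policy.predict_action` interfaces are hypothetical.

```python
N_BINS = 256  # assumed per-axis token vocabulary size


def gaze_to_tokens(x, y, width, height, n_bins=N_BINS):
    """Discretize a 2D gaze point (pixels) into intention tokens."""
    u = min(max(x / width, 0.0), 1.0)   # clamp to the image
    v = min(max(y / height, 0.0), 1.0)
    return (min(int(u * n_bins), n_bins - 1),
            min(int(v * n_bins), n_bins - 1))


def tokens_to_gaze(tx, ty, width, height, n_bins=N_BINS):
    """Invert the binning to bin-center pixel coordinates (lossy)."""
    return ((tx + 0.5) / n_bins * width,
            (ty + 0.5) / n_bins * height)


def cot_step(observation, task, state, policy):
    """One Chain-of-Thought inference step: intention first, then action.

    `policy` is a stub with two assumed methods; the key point is the
    ordering -- discrete intention tokens are predicted before the
    continuous action, and the action is conditioned on them.
    """
    intention_tokens = policy.predict_intention(observation, task, state)
    action = policy.predict_action(observation, task, state, intention_tokens)
    return intention_tokens, action
```

The round trip through tokens loses at most half a bin per axis, e.g. at most 1.25 px horizontally for a 640-px-wide image with 256 bins.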
To evaluate cross-embodiment transfer and robustness in real-world deployment, we conduct extensive real-robot experiments covering gripper-based manipulation, dexterous manipulation, long-horizon tasks, and fine-grained operations. The green cross indicates the predicted intention.
Tighten the screw.
Stack the blocks.
Put the remote in the drawer.
Type "VLIA" on the keyboard.
Put the bottle on the plate.
Put the fruit on the plate.
Object OOD
Background OOD
Quantitative comparison between our method and baseline methods on the real-robot experiments (entries are successes / trials).
| Method | ID | OOD-Position | OOD-Object | OOD-Scene |
|---|---|---|---|---|
| ACT | 16/20 | 1/10 | 4/10 | 0/10 |
| DP | 13/20 | 2/10 | 3/10 | 0/10 |
| π0.5 | 17/20 | 3/10 | 4/10 | 3/10 |
| ours w/o CoT | 16/20 | 5/10 | 6/10 | 5/10 |
| ours (robot only) | 14/20 | 2/10 | 4/10 | 0/10 |
| ours (robot+human finetune) | 13/20 | 1/10 | 2/10 | 0/10 |
| ours (robot+human pretrain) | 17/20 | 3/10 | 5/10 | 2/10 |
| ours | 19/20 | 6/10 | 8/10 | 6/10 |
Quantitative evaluation of generalization performance.
**ID**

| Task | LFA | DP | HRDT | π0.5 | ours |
|---|---|---|---|---|---|
| Cube transfer | 87 | 75 | 89 | 94 | 100 |
| Hook package | 23 | 7 | 13 | 20 | 32 |
| Peg insertion | 10 | 2 | 17 | 15 | 18 |
| Pour test tube | 41 | 23 | 24 | 34 | 39 |
| Slot insertion | 43 | 32 | 42 | 54 | 60 |
| Thread needle | 56 | 30 | 46 | 33 | 43 |
| Average | 43 | 28 | 39 | 41 | 49 |

**OOD-Distractors**

| Task | LFA | DP | HRDT | π0.5 | ours |
|---|---|---|---|---|---|
| Cube transfer | 12 | 9 | 28 | 25 | 35 |
| Hook package | 8 | 3 | 0 | 11 | 14 |
| Peg insertion | 7 | 0 | 2 | 6 | 9 |
| Pour test tube | 11 | 7 | 14 | 24 | 28 |
| Slot insertion | 23 | 12 | 19 | 47 | 56 |
| Thread needle | 21 | 9 | 23 | 19 | 23 |
| Average | 14 | 7 | 14 | 22 | 28 |

**OOD-Lighting**

| Task | LFA | DP | HRDT | π0.5 | ours |
|---|---|---|---|---|---|
| Cube transfer | 0 | 0 | 12 | 32 | 36 |
| Hook package | 0 | 0 | 0 | 10 | 19 |
| Peg insertion | 0 | 0 | 0 | 7 | 16 |
| Pour test tube | 0 | 0 | 3 | 25 | 22 |
| Slot insertion | 0 | 0 | 11 | 44 | 50 |
| Thread needle | 0 | 0 | 12 | 20 | 21 |
| Average | 0 | 0 | 6 | 23 | 27 |
Quantitative comparison between our method and baseline methods on AV-ALOHA benchmark.
Cube transfer
Hook package
Peg insertion
Pour test tube
Slot insertion
Thread needle