Learning to use tools or objects in common scenes, particularly handling them in various ways as instructed, is a key challenge for developing interactive robots. Training models to generate such manipulation trajectories requires a large and diverse collection of detailed manipulation demonstrations for various objects, which is nearly infeasible to gather at scale. In this paper, we propose a framework that leverages the large-scale ego- and exo-centric video dataset Ego-Exo4D, constructed globally with substantial effort, to extract diverse manipulation trajectories at scale. From these extracted trajectories and their associated textual action descriptions, we develop trajectory generation models based on visual and point cloud-based language models. On the recently proposed high-quality egocentric trajectory dataset HOT3D, we confirm that our models successfully generate valid object trajectories, establishing a training dataset and baseline models for the novel task of generating 6DoF manipulation trajectories from action descriptions in egocentric vision. Our dataset and code are available upon acceptance.
Our framework consists of four stages. First, given an egocentric video, we determine the start and end timestamps of the action and identify the manipulated object within the scene. Second, we extract the position sequence of the manipulated object using an open-vocabulary segmentation model and a dense 3D point tracker. Third, we project the sequence into the camera coordinate system of the first frame using point cloud registration. Fourth, we extract a rotation sequence by computing the transformation between the two object point clouds using SVD.
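As a concrete illustration of the fourth stage, the rotation between two tracked object point clouds can be recovered with an SVD-based (Kabsch) procedure. The sketch below is a minimal NumPy example under the assumption that the tracker provides per-point correspondences between the two frames; the function name and array shapes are illustrative and not the exact implementation used in the framework.

```python
import numpy as np

def estimate_rotation_svd(src_points, dst_points):
    """Estimate the rotation aligning two corresponding object point clouds
    (e.g., the tracked object points at two timestamps) via SVD (Kabsch).

    src_points, dst_points: (N, 3) arrays with row-wise correspondence.
    Returns a 3x3 rotation matrix R such that, after centering,
    dst_points ~= src_points rotated by R.
    """
    # Center both point sets at their centroids.
    src_centered = src_points - src_points.mean(axis=0)
    dst_centered = dst_points - dst_points.mean(axis=0)

    # Cross-covariance matrix between the two centered point sets.
    H = src_centered.T @ dst_centered
    U, _, Vt = np.linalg.svd(H)

    # Correct for a possible reflection so the result is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    return Vt.T @ D @ U.T
```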
Action Description: "Cut the red pepper seeds with the knife in his right hand."
(video source: Ego-Exo4D dataset)
Action Description: "Stir the garlic and oil in the skillet with the spatula in her right hand."
(video source: Ego-Exo4D dataset)
Action Description: "Move a plate on the countertop with her left hand."
(video source: Ego-Exo4D dataset)
This task aims to generate a sequence of 6DoF object poses from an action description and an initial state comprising the visual input and the object's initial pose.
Considering recent advancements in multi-modal language models, we develop object manipulation trajectory generation models based on visual and point cloud-based language models (VLMs), formalizing our task as a next-token prediction task. This is achieved by incorporating an extended vocabulary for trajectory tokenization into the VLMs.
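As a hedged sketch of how trajectory tokenization with an extended vocabulary might look, the snippet below uniformly discretizes each pose dimension into bins and maps bin indices to special tokens added to the language model's vocabulary. The bin count, token naming scheme (`<traj_XXX>`), and normalization range are assumptions for illustration, not the exact scheme used in the paper.

```python
import numpy as np

NUM_BINS = 256          # assumed number of discretization bins per dimension
LOW, HIGH = -1.0, 1.0   # assumed normalization range for pose values

# Extended vocabulary: one special token per bin, "<traj_000>" ... "<traj_255>".
TRAJ_TOKENS = [f"<traj_{i:03d}>" for i in range(NUM_BINS)]

def pose_to_tokens(pose):
    """Map one 6DoF pose (x, y, z, roll, pitch, yaw), already normalized to
    [LOW, HIGH], onto discrete trajectory tokens for next-token prediction."""
    pose = np.clip(np.asarray(pose, dtype=np.float64), LOW, HIGH)
    bins = np.round((pose - LOW) / (HIGH - LOW) * (NUM_BINS - 1)).astype(int)
    return [TRAJ_TOKENS[b] for b in bins]

def tokens_to_pose(tokens):
    """Inverse mapping: recover approximate continuous pose values from tokens."""
    bins = np.array([TRAJ_TOKENS.index(t) for t in tokens], dtype=np.float64)
    return bins / (NUM_BINS - 1) * (HIGH - LOW) + LOW

# A trajectory is then serialized as the concatenation of per-step tokens,
# so the VLM can be trained with a standard next-token prediction objective.
```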
Action Description: "Pick up the cellphone from the table with the right hand."
Action Description: "stirs the contents of the bowl with the wooden spoon using the right hand."
Action Description: "Pick up the bamboo plate from the table with both hands."
@inproceedings{yoshida_2025_CVPR,
author = {Yoshida, Tomoya and Kurita, Shuhei and Nishimura, Taichi and Mori, Shinsuke},
title = {Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision},
booktitle = {CVPR},
year = {2025}
}