Video-Language Egocentric Dataset for EXpert-TRAinee assistance
We present Ego-EXTRA, a video-language Egocentric Dataset for EXpert-TRAinee assistance. Ego-EXTRA features 50 hours of unscripted egocentric videos of subjects (the trainees) performing procedural activities while real-world experts guide them and answer specific questions in natural language. Following a "Wizard of Oz" data collection paradigm, the expert enacts a wearable intelligent assistant: the expert observes the trainee's activities exclusively from the trainee's egocentric point of view, answers questions when asked, and proactively offers suggestions during the procedures. This data collection protocol enables Ego-EXTRA to capture high-quality dialogue in which expert-level feedback is provided to the trainee. Two-way dialogues between experts and trainees are recorded, transcribed, and used to create a novel benchmark comprising more than 15k high-quality Visual Question Answer sets, which we use to evaluate Multimodal Large Language Models. The results show that Ego-EXTRA is challenging and highlight the limitations of current models when used to provide expert-level assistance to the user.
Unscripted egocentric videos captured using Aria glasses with rich multimodal signals including RGB, SLAM, eye gaze, IMU, and audio.
Real-time expert-trainee dialogues following a "Wizard of Oz" paradigm, providing authentic guidance and assistance.
Bike Workshop, Kitchen, Bakery, and Assembly scenarios with 10 different procedural activities.
Diverse group of 33 trainees and 4 experts across 19 distinct occupations, ensuring representative data collection.
Distribution of video hours across different scenarios and activities, showing the comprehensive coverage of procedural tasks.
Analysis of conversation patterns between experts and trainees, highlighting the natural flow of instructional dialogue.
Analysis of trainee language usage, showing common verbs, nouns, and question patterns during procedural tasks.
Expert language analysis revealing instructional patterns, guidance strategies, and domain-specific terminology.
Trainee: "Which of the two wheels should I remove?"
Expert: "The front wheel."
Trainee: "What tool should I use to tighten the black bolt?"
Expert: "Use the wrench that's in the second chest on your left."
Trainee: "How much flour should I weigh on the scale?"
Expert: "You need to weigh 1.5 Kg of flour."
Trainee: "Is this fine or should I add more?"
Expert: "Add a little more."
Trainee: "What should I do with the wooden pegs?"
Expert: "You can insert them into the large holes."
Trainee: "How do I know that it's attached?"
Expert: "Try turning it slightly. You should hear a click."
Trainee: "Is the consistency of the mixture good now or does it need more breadcrumbs?"
Expert: "It needs a little more breadcrumbs."
Trainee: "Should I add more?"
Expert: "Yes, more."
For complete documentation and usage guidelines, please refer to the Ego-EXTRA Documentation.
Egocentric video clips related to the extracted questions (5s, 15s, 30s)
Transcribed dialogues between experts and trainees
Eye gaze tracking data from Aria glasses
Visual Question Answering benchmark with ~15k QA sets, including conversation turn and context
Egocentric videos captured using Aria glasses
Inertial Measurement Unit data (accelerometer, gyroscope)
Simultaneous Localization and Mapping trajectories
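As a rough illustration of how the question-related clips listed above might be cut, the following sketch computes 5s, 15s, and 30s windows around a question. It is not the released data format: the `QASample` field names, the `bike_001` identifier, the timestamp, and the choice to end each clip at the moment the question is asked are all assumptions made for this example. Refer to the Ego-EXTRA Documentation for the actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Clip durations (seconds) listed in the dataset description.
CLIP_LENGTHS_S = (5, 15, 30)


@dataclass
class QASample:
    """Hypothetical record layout; field names are assumptions, not the released schema."""
    video_id: str
    question_time_s: float  # assumed: time at which the trainee asks the question
    question: str
    answer: str
    context: List[str] = field(default_factory=list)  # preceding dialogue turns


def clip_window(question_time_s: float, length_s: float) -> Tuple[float, float]:
    """Return (start, end) in seconds for a clip that ends at the question time.

    Ending the clip at the question is an assumption of this sketch;
    the start is clamped so it never falls before the beginning of the video.
    """
    start = max(0.0, question_time_s - length_s)
    return (start, question_time_s)


# Example built from one of the dialogues shown on this page (ID and timestamp are made up).
sample = QASample(
    video_id="bike_001",
    question_time_s=12.0,
    question="Which of the two wheels should I remove?",
    answer="The front wheel.",
)

windows: Dict[int, Tuple[float, float]] = {
    n: clip_window(sample.question_time_s, n) for n in CLIP_LENGTHS_S
}
```

Longer windows give a model more visual context about the steps leading up to the question, at the cost of more frames to process.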
Department of Mathematics and Computer Science - University of Catania, Italy; Next Vision s.r.l. - Spinoff of the University of Catania, Italy
Department of Mathematics and Computer Science - University of Catania, Italy; Next Vision s.r.l. - Spinoff of the University of Catania, Italy
Department of Mathematics and Computer Science - University of Catania, Italy
Department of Mathematics and Computer Science - University of Catania, Italy
Department of Mathematics and Computer Science - University of Catania, Italy; Next Vision s.r.l. - Spinoff of the University of Catania, Italy
Department of Mathematics and Computer Science - University of Catania, Italy; Next Vision s.r.l. - Spinoff of the University of Catania, Italy
@inproceedings{ragusa2026egoextra,
title={Ego-EXTRA: Video-Language Egocentric Dataset for EXpert-TRAinee assistance},
author={Ragusa, Francesco and Mazzamuto, Michele and Forte, Rosario and D'Ambra, Irene and Fort, James and Engel, Jakob and Furnari, Antonino and Farinella, Giovanni Maria},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year={2026}
}