Ego-EXTRA

Video-Language Egocentric Dataset for EXpert-TRAinee assistance

Francesco Ragusa*1,2, Michele Mazzamuto*1,2, Rosario Forte1, Irene D'Ambra1, James Fort3, Jakob Engel3, Antonino Furnari1,2, Giovanni Maria Farinella1,2
* Co-first authors
1 Department of Mathematics and Computer Science - University of Catania, Italy
2 Next Vision s.r.l. - Spinoff of the University of Catania, Italy
3 Meta Reality Labs Research, USA

We present Ego-EXTRA, a video-language Egocentric dataset for EXpert-TRAinee assistance. Ego-EXTRA features 50 hours of unscripted egocentric videos of subjects performing procedural activities (the trainees) while guided by real-world experts, who provide guidance and answer specific questions in natural language. Following a "Wizard of Oz" data collection paradigm, the expert plays the role of a wearable intelligent assistant, observing the activities performed by the trainee exclusively from their egocentric point of view, answering questions when asked by the trainee, and proactively offering suggestions during the procedures. This unique data collection protocol enables Ego-EXTRA to capture high-quality dialogues in which expert-level feedback is provided to the trainee. Two-way dialogues between experts and trainees are recorded, transcribed, and used to create a novel benchmark comprising more than 15k high-quality Visual Question Answer sets, which we use to evaluate Multimodal Large Language Models. The results show that Ego-EXTRA is challenging and highlight the limitations of current models when used to provide expert-level assistance to the user.

Dataset Overview

50 Hours of Video

Unscripted egocentric videos captured using Aria glasses with rich multimodal signals including RGB, SLAM, eye gaze, IMU, and audio (see the loading sketch after this overview).

Natural Conversations

Real-time expert-trainee dialogues following the "Wizard of Oz" paradigm, providing authentic guidance and assistance.

4 Scenarios

Bike Workshop, Kitchen, Bakery, and Assembly scenarios with 10 different procedural activities.
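
Since the recordings are captured with Project Aria glasses, the raw multimodal streams can typically be read with Meta's open-source projectaria_tools package. The following is a minimal, illustrative sketch assuming the recordings ship as standard Aria VRS files; the file name is hypothetical, and the stream labels actually available depend on the recording profile used.

# Minimal sketch: reading multimodal streams from an Aria VRS recording
# with projectaria_tools (pip install projectaria-tools).
# "ego_extra_recording.vrs" is a hypothetical file name.
from projectaria_tools.core import data_provider

provider = data_provider.create_vrs_data_provider("ego_extra_recording.vrs")

# Look up streams by their Aria sensor labels.
rgb_stream = provider.get_stream_id_from_label("camera-rgb")
imu_stream = provider.get_stream_id_from_label("imu-right")

# Iterate over RGB frames; each call returns (image data, metadata record).
for i in range(provider.get_num_data(rgb_stream)):
    image, record = provider.get_image_data_by_index(rgb_stream, i)
    frame = image.to_numpy_array()               # H x W x 3 uint8 array
    timestamp_ns = record.capture_timestamp_ns   # device capture time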

Dataset Statistics

Participant Demographics

Diverse group of 33 trainees and 4 experts across 19 distinct occupations, ensuring representative data collection.

Video Statistics

Distribution of video hours across different scenarios and activities, showing the comprehensive coverage of procedural tasks.

Conversation Turns

Analysis of conversation patterns between experts and trainees, highlighting the natural flow of instructional dialogue.

Trainee Language Patterns

Analysis of trainee language usage, showing common verbs, nouns, and question patterns during procedural tasks.

Expert Language Patterns

Expert language analysis revealing instructional patterns, guidance strategies, and domain-specific terminology.

Dataset Examples

Sample Q&A

Trainee: "Which of the two wheels should I remove?"

Expert: "The front wheel."

Trainee: "What tool should I use to tighten the black bolt?"

Expert: "Use the wrench that's in the second chest on your left."

Sample Q&A

Trainee: "How much flour should I weigh on the scale?"

Expert: "You need to weigh 1.5 Kg of flour."

Trainee: "Is this fine or should I add more?"

Expert: "Add a little more."

Sample Q&A

Trainee: "What should I do with the wooden pegs?"

Expert: "You can insert them into the large holes."

Trainee: "How do I know that it's attached?"

Expert: "Try turning it slightly. You should hear a click."

Sample Q&A

Trainee: "Is the consistency of the mixture good now or does it need more breadcrumbs?"

Expert: "It needs a little more breadcrumbs"

Trainee: "Should I add more?"

Expert: "Yes, more"

Download Dataset

For complete documentation and usage guidelines, please refer to the Ego-EXTRA Documentation.

Video Clips

Egocentric video clips related to the extracted questions (5s, 15s, 30s)

Transcripts

Transcribed dialogues between experts and trainees

Gaze Data

Eye gaze tracking data from Aria glasses

VQA Benchmark

Visual Question Answering benchmark with ~15k QA sets, including conversation turns and dialogue context (see the evaluation sketch after this list)

Videos

Egocentric videos captured using Aria glasses

IMU Data

Inertial Measurement Unit data (accelerometer, gyroscope)

SLAM Data

Simultaneous Localization and Mapping trajectories
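
To make the intended use of the VQA benchmark concrete, here is a minimal evaluation sketch. The field names ("clip", "context", "question", "answer") and the exact-match metric are illustrative assumptions, not the released schema; refer to the Ego-EXTRA Documentation for the official format and evaluation protocol.

import json

# Hypothetical evaluation loop for the Ego-EXTRA VQA benchmark.
# Field names and the exact-match metric are assumptions for illustration.
def evaluate(model, qa_path):
    """Score a model mapping (video clip, dialogue context, question) -> answer."""
    with open(qa_path) as f:
        qa_sets = json.load(f)

    correct = 0
    for qa in qa_sets:
        # `model` is any callable, e.g. a wrapper around a Multimodal LLM.
        prediction = model(qa["clip"], qa["context"], qa["question"])
        correct += int(prediction.strip().lower() == qa["answer"].strip().lower())
    return correct / len(qa_sets)

# Example: accuracy = evaluate(my_mllm_wrapper, "ego_extra_vqa.json")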

Research Team

Francesco Ragusa

Department of Mathematics and Computer Science - University of Catania, Italy; Next Vision s.r.l. - Spinoff of the University of Catania, Italy

Michele Mazzamuto

Department of Mathematics and Computer Science - University of Catania, Italy; Next Vision s.r.l. - Spinoff of the University of Catania, Italy

Rosario Forte

Department of Mathematics and Computer Science - University of Catania, Italy

Irene D'Ambra

Department of Mathematics and Computer Science - University of Catania, Italy

James Fort

Meta Reality Labs Research, USA

Jakob Engel

Meta Reality Labs Research, USA

Antonino Furnari

Department of Mathematics and Computer Science - University of Catania, Italy; Next Vision s.r.l. - Spinoff of the University of Catania, Italy

Giovanni Maria Farinella

Department of Mathematics and Computer Science - University of Catania, Italy; Next Vision s.r.l. - Spinoff of the University of Catania, Italy

Citation

Read the article on arXiv

@inproceedings{ragusa2026egoextra,
    title={Ego-EXTRA: Video-Language Egocentric Dataset for EXpert-TRAinee assistance},
    author={Ragusa, Francesco and Mazzamuto, Michele and Forte, Rosario and D'Ambra, Irene and Fort, James and Engel, Jakob and Furnari, Antonino and Farinella, Giovanni Maria},
    booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    year={2026}
}