EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation

EgoInteract generates temporally coherent videos of humans interacting with diverse objects, enabling the study of egocentric interaction understanding at multiple levels.

Abstract

Collecting large-scale egocentric video datasets with dense spatial and temporal annotations is costly, slow, and often constrained by environmental biases, privacy constraints, and limited coverage of interaction patterns. While synthetic data has shown strong potential in several vision domains, its use for egocentric perception remains relatively underexplored, especially for tasks requiring temporally coherent human-object interactions. In this work, we introduce EgoInteract, a controllable simulator for egocentric video generation designed to model fine-grained egocentric interactions and their temporal dynamics. The simulator enables precise control over camera, human body and hand motion, object manipulation, and scene composition across diverse environments. Building on this framework, we generate a synthetic egocentric video dataset with dense spatial and temporal annotations for temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection. We evaluate models trained with simulated data on multiple real-world egocentric benchmarks spanning diverse environments, object categories, and interaction patterns. Results show consistent improvements over strong baselines across tasks and datasets, demonstrating the effectiveness and transferability of our simulation-based approach.

EgoInteract is a controllable simulation framework for generating synthetic egocentric video data with fine-grained spatial and temporal annotations. It supports precise modeling of camera motion, hand-object interactions, and scene dynamics, enabling large-scale dataset generation for tasks such as temporal action segmentation, next-active object detection, and interaction anticipation. Models trained on EgoInteract data show consistent improvements over strong baselines and transfer well to multiple real-world egocentric benchmarks.

BibTeX

@article{leonardi2026egointeract,
  title={EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation},
  author={Leonardi, Rosario and Ragusa, Francesco and Materia, Daniele and Passanisi, Alessandro and Fort, James and Engel, Jakob and Farinella, Giovanni Maria},
  journal={arXiv preprint arXiv:2605.18214},
  year={2026}
}

Acknowledgments

Research at the University of Catania has been supported by Meta, Next Vision, and by the project Future Artificial Intelligence Research (FAIR) – PNRR MUR Cod. PE0000013 - CUP: E63C22001940006.
