Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection?

We investigate the effectiveness of synthetic data in enhancing egocentric hand-object interaction (HOI) detection. Via extensive experiments and comparative analyses on three egocentric datasets, EPIC-KITCHENS VISOR, EgoHOS, and ENIGMA-51, our findings reveal how to exploit synthetic data for the HOI detection task when real labeled data are scarce or unavailable. Specifically, by leveraging only 10% of the real labeled data, we achieve improvements in Overall AP over baselines trained exclusively on real data of +5.67% on EPIC-KITCHENS VISOR, +8.24% on EgoHOS, and +11.69% on ENIGMA-51. Our analysis is supported by a novel data generation pipeline and the newly introduced HOI-Synth benchmark, which augments existing datasets with synthetic images of hand-object interactions automatically labeled with hand-object contact states, bounding boxes, and pixel-wise segmentation masks.
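For readers who want to picture the low-data regime, the minimal PyTorch sketch below combines a random 10% subset of real labels with the full synthetic set. Random tensors stand in for the actual datasets, and simple mixing is only the most basic strategy: HOI-Synth is framed as a domain adaptation benchmark, so adaptive training schemes also apply.

import random
import torch
from torch.utils.data import ConcatDataset, DataLoader, Subset, TensorDataset

# Stand-ins for the real and synthetic HOI datasets (labels omitted for brevity).
real = TensorDataset(torch.randn(1000, 3, 224, 224))
synthetic = TensorDataset(torch.randn(5000, 3, 224, 224))

# Keep only a random 10% of the real labeled data.
real_10pct = Subset(real, random.sample(range(len(real)), k=len(real) // 10))

# Train on the union of the 10% real subset and the full synthetic set.
train_set = ConcatDataset([real_10pct, synthetic])
loader = DataLoader(train_set, batch_size=16, shuffle=True)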

GitHub | Paper

Data Generation Pipeline

Our pipeline relies on state-of-the-art datasets and components to enable accurate generation of egocentric images of hand-object interactions. We first select a random hand-object grasp from the DexGraspNet dataset, fit it to a randomly generated human model, and attach the object mesh specified by the grasp. We then select a random environment from the HM3D dataset and place the human-object model in the environment. Finally, we place a virtual camera at human eye level to capture the scene from the first-person point of view.
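As a structural sketch of this control flow, the Python skeleton below walks through the four steps. All callables and types are hypothetical placeholders for the simulator components described above, not the released implementation.

import random
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class Frame:
    image: Any   # rendered RGB frame
    labels: Any  # contact states, bounding boxes, segmentation masks

def generate_frame(
    grasps: Sequence[Any],            # DexGraspNet hand-object grasps
    scenes: Sequence[Any],            # HM3D environments
    fit_human: Callable[[Any], Any],  # fits grasp to a random human model
    place: Callable[[Any, Any], Any], # places human + object in the scene
    render: Callable[[Any], Frame],   # eye-level egocentric render + labels
) -> Frame:
    grasp = random.choice(grasps)     # 1. sample a random hand-object grasp
    human = fit_human(grasp)          # 2. fit it to a random human model,
                                      #    attaching the grasped object mesh
    scene = place(random.choice(scenes), human)  # 3. drop into a random scene
    return render(scene)              # 4. render the first-person view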


HOI-Synth Benchmark

The HOI-Synth benchmark extends three established datasets of egocentric images designed to study hand-object interaction detection (EPIC-KITCHENS VISOR, EgoHOS, and ENIGMA-51) with automatically labeled synthetic data obtained through the proposed HOI generation pipeline.

HOI-Synth statistics:

RGB images:          75,460
Hand annotations:    141,778
Object annotations:  101,525
Interaction frames:  101,625
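To make the label format concrete, the record below shows one plausible shape for a frame's automatic annotations. The field names are illustrative assumptions for this page, not the released schema; refer to the downloadable annotation files for the exact format.

# Illustrative (assumed) structure of one frame's automatic labels;
# the released annotation files define the actual schema.
example_annotation = {
    "image_id": 12345,
    "hands": [
        {
            "side": "right",
            "bbox": [412.0, 287.0, 96.0, 88.0],       # [x, y, w, h], pixels
            "segmentation": "<RLE or polygon mask>",  # pixel-wise mask
            "contact_state": "in_contact",            # hand-object contact
            "object_id": 7,                           # interacted object, if any
        }
    ],
    "objects": [
        {
            "id": 7,
            "bbox": [430.0, 301.0, 64.0, 57.0],
            "segmentation": "<RLE or polygon mask>",
        }
    ],
}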


Download

Data (Frames)
Data (Annotations)
Baselines
Pipeline (GitHub)
Simulator (GitHub)


Paper

Leonardi, R., Furnari, A., Ragusa, F., & Farinella, G. M. (2023). Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection? An Investigation and the HOI-Synth Domain Adaptation Benchmark. arXiv preprint arXiv:2312.02672.

[01/07/2024] Accepted at the European Conference on Computer Vision (ECCV) 2024!

Cite our paper:

@inproceedings{leonardi2025synthetic,
  title={Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection?},
  author={Leonardi, Rosario and Furnari, Antonino and Ragusa, Francesco and Farinella, Giovanni Maria},
  booktitle={European Conference on Computer Vision},
  pages={36--54},
  year={2025},
  organization={Springer}
}

Visit our page dedicated to First Person Vision Research for other related publications.


People
Rosario Leonardi (FPV@IPLAB, Next Vision s.r.l.)
Antonino Furnari (FPV@IPLAB, Next Vision s.r.l.)
Francesco Ragusa (FPV@IPLAB, Next Vision s.r.l.)
Giovanni Maria Farinella (FPV@IPLAB, Next Vision s.r.l.)