Pixel-level recognition of objects manipulated by the user from egocentric images enables key applications spanning assistive technologies, industrial safety, and activity monitoring. However, progress in this area is currently hindered by the scarcity of annotated datasets, as existing approaches rely on costly manual labels.
In this paper, we propose to learn human-object interaction detection leveraging narrations – natural language descriptions of the actions performed by the camera wearer which contain clues about manipulated objects (e.g., "I am pouring vegetables from the chopping board to the pan"). Narrations provide a form of weak supervision that is cheap to acquire and readily available in state-of-the-art egocentric datasets.
We introduce Narration-Supervised in-Hand Object Segmentation (NS-iHOS), a novel task in which models learn to segment in-hand objects from natural-language narrations. We showcase the potential of the task by proposing Weakly-Supervised In-hand Object Segmentation from Human Narrations (WISH), an end-to-end model that distills knowledge from narrations to learn plausible hand-object associations and enables in-hand object segmentation without using narrations at test time.
We benchmark WISH against different baselines based on open-vocabulary object detectors and vision-language models, showing the superiority of its design. Experiments on EPIC-Kitchens and Ego4D show that WISH surpasses all baselines, recovering more than 50% of the performance of fully supervised methods, without employing fine-grained pixel-wise annotations.
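As a concrete illustration of the weak signal narrations carry, the following is a minimal sketch (not part of WISH itself; the library choice and filtering rule are our own assumptions) that mines candidate manipulated-object phrases from a narration using spaCy noun chunks:

```python
# Illustrative sketch: mining candidate manipulated-object phrases from a narration.
# This is NOT the parsing pipeline of the paper, only an example of the weak signal.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with a parser

def candidate_objects(narration: str) -> list[str]:
    """Return noun phrases that may refer to objects handled by the camera wearer."""
    doc = nlp(narration)
    # Keep noun chunks, dropping pronouns such as the "I" referring to the camera wearer.
    return [chunk.text for chunk in doc.noun_chunks if chunk.root.pos_ != "PRON"]

print(candidate_objects("I am pouring vegetables from the chopping board to the pan"))
# e.g. ['vegetables', 'the chopping board', 'the pan']
```

Phrases of this kind are cheap to obtain at scale, which is what makes narrations an attractive substitute for pixel-wise labels.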
- We introduce NS-iHOS, the novel task of training models to segment in-hand objects using only narrations as supervision.
- We propose WISH, a new architecture that leverages narrations at training time to learn in-hand object segmentation and performs inference on images alone.
- We establish a curated benchmark for NS-iHOS, relying on existing hand-object segmentation datasets built on EPIC-Kitchens and Ego4D.
The Architecture of WISH: Our model operates in two stages sharing a common backbone. (a) An object segmenter and a CLIP-based backbone extract visual embeddings for all object and hand proposals. (b) In Stage 1, we learn a shared embedding space to align hand-specific noun phrases from narrations with their corresponding visual object embeddings. (c) In Stage 2, we generate pseudo-labels from this alignment to train two specialized heads: a Contactness head (C) and a Matching head (M). At test time, only the backbone and Stage 2 are used for narration-free in-hand object segmentation.
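To make the two-stage design above concrete, the following PyTorch sketch shows one plausible shape of the model; all module names, dimensions, and head designs are hypothetical placeholders rather than the authors' implementation:

```python
# Minimal sketch of the two-stage design described in the caption above
# (hypothetical names and dimensions, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class WISHSketch(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=512, emb_dim=256):
        super().__init__()
        # Stage 1: projections into a shared visual-language embedding space.
        self.vis_proj = nn.Linear(vis_dim, emb_dim)
        self.txt_proj = nn.Linear(txt_dim, emb_dim)
        # Stage 2: Contactness head (is this hand holding something?) and
        # Matching head (does this hand-object pair belong together?).
        self.contact_head = nn.Sequential(nn.Linear(vis_dim, 256), nn.ReLU(), nn.Linear(256, 1))
        self.match_head = nn.Sequential(nn.Linear(2 * vis_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def align(self, obj_feats, phrase_feats):
        # Stage 1 (training only): cosine similarity between object-proposal
        # embeddings and hand-specific noun phrases from the narration.
        v = F.normalize(self.vis_proj(obj_feats), dim=-1)      # (N_obj, emb_dim)
        t = F.normalize(self.txt_proj(phrase_feats), dim=-1)   # (N_phrase, emb_dim)
        return v @ t.T                                          # (N_obj, N_phrase)

    def forward(self, hand_feats, obj_feats):
        # Stage 2 (training and test): narration-free prediction.
        contact = self.contact_head(hand_feats).squeeze(-1)     # (N_hand,)
        pairs = torch.cat(
            [hand_feats.unsqueeze(1).expand(-1, obj_feats.size(0), -1),
             obj_feats.unsqueeze(0).expand(hand_feats.size(0), -1, -1)], dim=-1)
        match = self.match_head(pairs).squeeze(-1)              # (N_hand, N_obj)
        return contact, match

# Example: 2 hand proposals, 5 object proposals, CLIP-like 512-d features.
hands, objs = torch.randn(2, 512), torch.randn(5, 512)
contact, match = WISHSketch()(hands, objs)   # shapes: (2,), (2, 5)
```

In this reading, the Stage 1 similarities (computed with narration noun phrases during training) would be thresholded into pseudo-labels for the Contactness and Matching heads, and only `forward`, which never sees a narration, would run at test time.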
Results on EPIC-Kitchens:

| Method | E | L | R | B |
|---|---|---|---|---|
| WISH + GSAM | 27.66 | 19.49 | 15.63 | 13.55 |
| Fully Supervised | 50.29 | 37.86 | 37.34 | 19.25 |
Results on Ego4D:

| Method | E | L | R | B |
|---|---|---|---|---|
| WISH + GSAM | 23.61 | 14.23 | 20.18 | 9.96 |
| Fully Supervised | 55.29 | 34.55 | 39.43 | 30.18 |
The results demonstrate that WISH significantly outperforms all baselines, recovering more than 50% of the performance of fully supervised methods without using fine-grained pixel-wise annotations.
If you use this work in your research, please cite:
Coming soon!
This work was partially funded by: Spoke 8, Tuscany Health Ecosystem (THE) Project (CUP B83C22003930001), funded by the National Recovery and Resilience Plan (NRRP), within the NextGenerationEU (NGEU) Program; SUN – Social and hUman ceNtered XR (EC, Horizon Europe No. 101092612); the project Future Artificial Intelligence Research (FAIR) – PNRR MUR Cod. PE0000013 – CUP: E63C22001940006. We acknowledge the CINECA award under the ISCRA initiative, for the availability of high-performance computing resources and support (GEPPETHO project).