Abstract
Skill assessment in procedural videos is crucial for the objective evaluation of human performance in settings such as manufacturing and procedural daily tasks. Current research on skill assessment has predominantly focused on sports and lacks large-scale datasets for complex procedural activities. Existing studies typically involve only a limited number of actions and focus either on pairwise assessments (e.g., A is better than B) or on binary labels (e.g., good execution vs. needs improvement). In response to these shortcomings, we introduce PROSKILL, the first benchmark dataset for action-level skill assessment in procedural tasks. PROSKILL provides absolute skill assessment annotations along with pairwise ones. This is enabled by a novel, scalable annotation protocol that builds an absolute skill ranking from pairwise assessments: a Swiss Tournament scheme selects informative pairwise comparisons, which are then aggregated into consistent, continuous global scores using an Elo-based rating system. We use our dataset to benchmark state-of-the-art skill assessment algorithms covering both ranking-based and pairwise paradigms. The suboptimal results achieved by the current state of the art highlight the challenges, and thus the value, of PROSKILL in the context of skill assessment for procedural videos.
The PROSKILL Dataset
PROSKILL spans multiple domains and provides segment-level annotations that enable fine-grained skill assessment.
| Subset | Clips | Actions | Hours | Clip Duration, Mean ± Std (s) |
|---|---|---|---|---|
| IKEA ASM | 160 | 10 | 1.28 | 28.88 ± 19.69 |
| Meccano | 80 | 5 | 1.06 | 47.59 ± 21.45 |
| Assembly101 | 560 | 35 | 5.49 | 35.30 ± 25.27 |
| EgoExo4D | 191 | 12 | 4.70 | 88.14 ± 90.93 |
| EpicTent | 144 | 9 | 1.59 | 39.71 ± 34.18 |
| Total | 1135 | 71 | 14.12 | 44.75 ± 48.46 |
Annotation Approach
A three-stage protocol converts pairwise judgments into absolute skill scores that stabilize over annotation rounds.
Stage 1 — Pair Selection
We use a Swiss Tournament scheme to pair video segments efficiently, so that segments with similar current standings face each other. This avoids trivial comparisons and maximizes the robustness of the collected judgments.
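For illustration, here is a minimal sketch of Swiss-style pairing, assuming each segment carries a current rating and a record of past opponents; the function and variable names are hypothetical and do not reflect the released code.

```python
def swiss_pairings(segments, ratings, played):
    """Pair segments with similar current standings, Swiss-tournament style."""
    # Sort by current rating so that neighbours in the list are close in standing.
    pool = sorted(segments, key=lambda s: ratings[s], reverse=True)
    pairs = []
    while len(pool) > 1:
        a = pool.pop(0)
        # Prefer the closest-rated opponent this segment has not faced yet.
        for i, b in enumerate(pool):
            if frozenset((a, b)) not in played:
                pairs.append((a, pool.pop(i)))
                break
        else:
            # Every remaining opponent was already faced: allow a rematch.
            pairs.append((a, pool.pop(0)))
    return pairs

# Example: four segments with hypothetical ratings; s1 and s2 already met.
ratings = {"s1": 1032.0, "s2": 1015.0, "s3": 990.0, "s4": 963.0}
print(swiss_pairings(list(ratings), ratings, {frozenset(("s1", "s2"))}))
# -> [('s1', 's3'), ('s2', 's4')]
```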
Stage 2 — Pairwise Ranking
Through a crowdsourcing platform (Amazon Mechanical Turk), qualified workers judge which of two performances demonstrates the higher skill, producing pairwise labels. In total, we collected 16,372 unique comparisons across datasets and rounds.
Stage 3 — Absolute Scoring
We leverage an Elo-based rating system, originally designed for chess, to aggregate pairwise outcomes into consistent, continuous global scores and a final absolute ranking.
The protocol ran for R = 6 rounds, achieving stable absolute ratings and convergent rankings on subsets such as IKEA ASM and EgoExo4D.
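Below is a minimal sketch of the Elo aggregation, using the standard logistic expected score; the K-factor of 32 and the 1000-point initialization are illustrative defaults, not necessarily the values used in our protocol. `run_round` shows how one round ties together pairing (via `swiss_pairings` above), crowd judgment, and rating updates.

```python
def expected_score(r_a, r_b):
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, outcome, k=32):
    """outcome is 1.0 if A was judged more skilled, 0.0 otherwise.

    k=32 is an illustrative default, not necessarily the paper's value.
    """
    e_a = expected_score(r_a, r_b)
    return r_a + k * (outcome - e_a), r_b + k * (e_a - outcome)

def run_round(segments, ratings, played, collect_judgment):
    """One annotation round: Swiss pairing, crowd judgment, Elo update."""
    for a, b in swiss_pairings(segments, ratings, played):
        outcome = collect_judgment(a, b)  # pairwise label from Stage 2
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome)
        played.add(frozenset((a, b)))

# All segments start at an equal rating (e.g., 1000); after R = 6 rounds
# the final ratings define the absolute skill ranking.
```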
Results
We evaluate state-of-the-art models on the IKEA ASM, Meccano, Assembly101, EgoExo4D, and EpicTent subsets. Global models generally outperform pairwise setups in rank correlation, with CoFInAl reaching ρ = 0.59 on Meccano. The pairwise task remains hardest on Assembly101 (≈0.60 accuracy), while textual conditioning with MiniLM brings consistent, though moderate, gains.
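We read the rank correlations reported below as Spearman's ρ between predicted scores and the ground-truth (Elo-derived) scores; here is a toy computation with scipy, using made-up values:

```python
from scipy.stats import spearmanr

# Toy example: predicted skill scores vs. Elo-derived ground-truth ratings.
predicted = [0.3, 0.9, 0.1, 0.7]
ground_truth = [1020.0, 1180.0, 950.0, 1100.0]

rho, _ = spearmanr(predicted, ground_truth)
print(f"Spearman's rho = {rho:.2f}")  # 1.00: the toy rankings agree perfectly
```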
Rank correlation (ρ) of state-of-the-art methods on each subset; higher is better:
| Method | Features | IKEA ASM | Meccano | Assembly101 | EgoExo4D | EpicTent |
|---|---|---|---|---|---|---|
| USDL | I3D | 0.12 | 0.38 | 0.12 | 0.33 | 0.17 |
| | VideoMAE | 0.19 | 0.43 | 0.13 | 0.39 | 0.23 |
| DAE-AQA | I3D | 0.20 | 0.24 | 0.20 | 0.16 | 0.23 |
| | VideoMAE | 0.10 | 0.42 | 0.03 | 0.33 | 0.26 |
| CoFInAl | I3D | 0.26 | 0.59 | 0.14 | 0.20 | 0.21 |
| | VideoMAE | 0.26 | 0.31 | 0.11 | 0.28 | 0.23 |
| AQA-TPT | I3D | 0.14 | 0.12 | -0.02 | 0.17 | -0.01 |
| | VideoMAE | 0.21 | 0.35 | 0.15 | 0.36 | -0.01 |
| CoRe | I3D | 0.22 | -0.12 | 0.22 | 0.33 | 0.04 |
| | VideoMAE | 0.19 | 0.24 | 0.06 | 0.35 | 0.12 |
| Average | I3D | 0.20 | 0.24 | 0.13 | 0.24 | 0.13 |
| | VideoMAE | 0.20 | 0.35 | 0.10 | 0.34 | 0.17 |
USDL compared with the USDL-Single variant (ρ):
| Method | Features | IKEA ASM | Meccano | Assembly101 | EgoExo4D | EpicTent |
|---|---|---|---|---|---|---|
| USDL | I3D | 0.12 | 0.38 | 0.12 | 0.33 | 0.17 |
| | VideoMAE | 0.19 | 0.43 | 0.13 | 0.39 | 0.23 |
| USDL-Single | I3D | -0.18 | 0.09 | -0.09 | 0.19 | 0.17 |
| | VideoMAE | 0.08 | 0.01 | -0.31 | 0.04 | 0.06 |
Effect of textual grounding with MiniLM features on USDL (ρ):
| Method | Features | IKEA ASM | Meccano | Assembly101 | EgoExo4D | EpicTent |
|---|---|---|---|---|---|---|
| USDL | I3D | 0.19 | 0.38 | 0.12 | 0.33 | 0.17 |
| | VideoMAE | 0.22 | 0.43 | 0.13 | 0.39 | 0.23 |
| USDL + Grounding | I3D + MiniLM | 0.24 | 0.36 | 0.12 | 0.33 | 0.20 |
| | VideoMAE + MiniLM | 0.27 | 0.50 | 0.13 | 0.41 | 0.18 |
Download
Get the dataset, benchmark, documentation, and code.
People

- Michele Mazzamuto, IPLAB, University of Catania
- Daniele Di Mauro, Next Vision s.r.l., Italy
- Gianpiero Francesca, Toyota Motor Europe, Belgium
- Giovanni Maria Farinella, IPLAB, University of Catania
- Antonino Furnari, IPLAB, University of Catania
Resources and Acknowledgements
- Availability: Labels, code implementing the annotation protocol, and experimental pipelines will be publicly released.
- Support: Supported by Toyota Motor Europe, Next Vision s.r.l., and the project Future Artificial Intelligence Research (FAIR).
Citation
If you find this work useful, please cite our paper:
@inproceedings{mazzamuto2025proskill,
  title={PROSKILL: Segment-Level Skill Assessment in Procedural Videos},
  author={Mazzamuto, Michele and Di Mauro, Daniele and Francesca, Gianpiero and Farinella, Giovanni Maria and Furnari, Antonino},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2025}
}