Abstract
Skill assessment in procedural videos is crucial for the objective evaluation of human performance in settings such as manufacturing and procedural daily tasks. Current research on skill assessment has predominantly focused on sports and lacks large-scale datasets for complex procedural activities. Existing studies typically involve only a limited number of actions and focus either on pairwise assessments (e.g., A is better than B) or on binary labels (e.g., good execution vs. needs improvement). In response to these shortcomings, we introduce PROSKILL, the first benchmark dataset for action-level skill assessment in procedural tasks. PROSKILL provides absolute skill assessment annotations along with pairwise ones. This is enabled by a novel, scalable annotation protocol that builds an absolute skill ranking from pairwise assessments: a Swiss Tournament scheme selects informative pairwise comparisons, which are then aggregated into consistent, continuous global scores using an Elo-based rating system. We use our dataset to benchmark state-of-the-art skill assessment algorithms covering both ranking-based and pairwise paradigms. The suboptimal results achieved by the current state of the art highlight the challenges, and thus the value, of PROSKILL in the context of skill assessment for procedural videos.
The PROSKILL Dataset
PROSKILL spans multiple domains and provides segment-level annotations that enable fine-grained skill assessment.
| Subset | Clips | Actions | Hours | Clip Duration, Mean ± Std (s) |
|---|---|---|---|---|
| IKEA ASM | 160 | 10 | 1.28 | 28.88 ± 19.69 |
| Meccano | 80 | 5 | 1.06 | 47.59 ± 21.45 |
| Assembly101 | 560 | 35 | 5.49 | 35.30 ± 25.27 |
| EgoExo4D | 191 | 12 | 4.70 | 88.14 ± 90.93 |
| EpicTent | 144 | 9 | 1.59 | 39.71 ± 34.18 |
| Total | 1135 | 71 | 14.12 | 44.75 ± 48.46 |
Annotation Approach
A three-stage protocol converts pairwise judgments into absolute skill scores that stabilize over annotation rounds.
Stage 1 — Pair Selection
We use a Swiss Tournament scheme to pair video segments efficiently, so that segments with similar current standings face each other. This avoids trivial comparisons and maximizes the robustness of the collected judgments.
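For illustration, here is a minimal sketch of Swiss-style pairing, assuming each segment carries a current rating and a record of past opponents; the function and variable names are hypothetical and do not reflect the released code.

```python
def swiss_pairings(segments, ratings, played):
    """Pair segments with similar current standings, Swiss-tournament style."""
    # Sort by current rating so that neighbours in the list are close in standing.
    pool = sorted(segments, key=lambda s: ratings[s], reverse=True)
    pairs = []
    while len(pool) > 1:
        a = pool.pop(0)
        # Prefer the closest-rated opponent this segment has not faced yet.
        for i, b in enumerate(pool):
            if frozenset((a, b)) not in played:
                pairs.append((a, pool.pop(i)))
                break
        else:
            # Every remaining opponent was already faced: allow a rematch.
            pairs.append((a, pool.pop(0)))
    return pairs

# Example: four segments with hypothetical ratings; s1 and s2 already met.
ratings = {"s1": 1032.0, "s2": 1015.0, "s3": 990.0, "s4": 963.0}
print(swiss_pairings(list(ratings), ratings, {frozenset(("s1", "s2"))}))
# -> [('s1', 's3'), ('s2', 's4')]
```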
Stage 2 — Pairwise Ranking
Through a crowdsourcing platform (Amazon Mechanical Turk), qualified workers judge which of two performances demonstrates the higher skill, producing pairwise labels. In total, we collected 16,372 unique comparisons across datasets and rounds.
Stage 3 — Absolute Scoring
We leverage an Elo-based rating system, originally designed for chess, to aggregate pairwise outcomes into consistent, continuous global scores and a final absolute ranking.
The protocol ran for R = 6 rounds, achieving stable absolute ratings and convergent rankings on subsets such as IKEA ASM and EgoExo4D.
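Below is a minimal sketch of the Elo aggregation, using the standard logistic expected score; the K-factor of 32 and the 1000-point initialization are illustrative defaults, not necessarily the values used in our protocol. `run_round` shows how one round ties together pairing (via `swiss_pairings` above), crowd judgment, and rating updates.

```python
def expected_score(r_a, r_b):
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, outcome, k=32):
    """outcome is 1.0 if A was judged more skilled, 0.0 otherwise.

    k=32 is an illustrative default, not necessarily the paper's value.
    """
    e_a = expected_score(r_a, r_b)
    return r_a + k * (outcome - e_a), r_b + k * (e_a - outcome)

def run_round(segments, ratings, played, collect_judgment):
    """One annotation round: Swiss pairing, crowd judgment, Elo update."""
    for a, b in swiss_pairings(segments, ratings, played):
        outcome = collect_judgment(a, b)  # pairwise label from Stage 2
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome)
        played.add(frozenset((a, b)))

# All segments start at an equal rating (e.g., 1000); after R = 6 rounds
# the final ratings define the absolute skill ranking.
```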
Results
We evaluate state-of-the-art models on the IKEA ASM, Meccano, Assembly101, EgoExo4D, and EpicTent subsets. Global models generally outperform pairwise setups in rank correlation, with CoFInAl reaching ρ = 0.59 on Meccano. The pairwise task remains hardest on Assembly101 (≈0.60 accuracy), while textual conditioning with MiniLM brings consistent, though moderate, gains.
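We read the rank correlations reported below as Spearman's ρ between predicted scores and the ground-truth (Elo-derived) scores; here is a toy computation with scipy, using made-up values:

```python
from scipy.stats import spearmanr

# Toy example: predicted skill scores vs. Elo-derived ground-truth ratings.
predicted = [0.3, 0.9, 0.1, 0.7]
ground_truth = [1020.0, 1180.0, 950.0, 1100.0]

rho, _ = spearmanr(predicted, ground_truth)
print(f"Spearman's rho = {rho:.2f}")  # 1.00: the toy rankings agree perfectly
```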
Rank correlation (ρ) of state-of-the-art methods on each subset; higher is better:
| Method | Features | IKEA ASM | Meccano | Assembly101 | EgoExo4D | EpicTent |
|---|---|---|---|---|---|---|
| USDL | I3D | 0.12 | 0.38 | 0.12 | 0.33 | 0.17 |
| | VideoMAE | 0.19 | 0.43 | 0.13 | 0.39 | 0.23 |
| DAE-AQA | I3D | 0.20 | 0.24 | 0.20 | 0.16 | 0.23 |
| | VideoMAE | 0.10 | 0.42 | 0.03 | 0.33 | 0.26 |
| CoFInAl | I3D | 0.26 | 0.59 | 0.14 | 0.20 | 0.21 |
| | VideoMAE | 0.26 | 0.31 | 0.11 | 0.28 | 0.23 |
| AQA-TPT | I3D | 0.14 | 0.12 | -0.02 | 0.17 | -0.01 |
| | VideoMAE | 0.21 | 0.35 | 0.15 | 0.36 | -0.01 |
| CoRe | I3D | 0.22 | -0.12 | 0.22 | 0.33 | 0.04 |
| | VideoMAE | 0.19 | 0.24 | 0.06 | 0.35 | 0.12 |
| Average | I3D | 0.20 | 0.24 | 0.13 | 0.24 | 0.13 |
| | VideoMAE | 0.20 | 0.35 | 0.10 | 0.34 | 0.17 |
USDL compared with the USDL-Single variant (ρ):
| Method | Features | IKEA ASM | Meccano | Assembly101 | EgoExo4D | EpicTent |
|---|---|---|---|---|---|---|
| USDL | I3D | 0.12 | 0.38 | 0.12 | 0.33 | 0.17 |
| | VideoMAE | 0.19 | 0.43 | 0.13 | 0.39 | 0.23 |
| USDL-Single | I3D | -0.18 | 0.09 | -0.09 | 0.19 | 0.17 |
| | VideoMAE | 0.08 | 0.01 | -0.31 | 0.04 | 0.06 |
Effect of textual grounding with MiniLM features on USDL (ρ):
| Method | Features | IKEA ASM | Meccano | Assembly101 | EgoExo4D | EpicTent |
|---|---|---|---|---|---|---|
| USDL | I3D | 0.19 | 0.38 | 0.12 | 0.33 | 0.17 |
| | VideoMAE | 0.22 | 0.43 | 0.13 | 0.39 | 0.23 |
| USDL + Grounding | I3D + MiniLM | 0.24 | 0.36 | 0.12 | 0.33 | 0.20 |
| | VideoMAE + MiniLM | 0.27 | 0.50 | 0.13 | 0.41 | 0.18 |
Download
Get the dataset, benchmark, documentation, and code.
People

- Michele Mazzamuto, IPLAB, University of Catania
- Daniele Di Mauro, Next Vision s.r.l., Italy
- Gianpiero Francesca, Toyota Motor Europe, Belgium
- Giovanni Maria Farinella, IPLAB, University of Catania
- Antonino Furnari, IPLAB, University of Catania
Resources and Acknowledgements
- Availability: Labels, code implementing the annotation protocol, and experimental pipelines will be publicly released.
- Support: Supported by Toyota Motor Europe, Next Vision s.r.l., and the project Future Artificial Intelligence Research (FAIR).
Citation
If you find this work useful, please cite our paper:
@inproceedings{mazzamuto2025proskill,
  title={PROSKILL: Segment-Level Skill Assessment in Procedural Videos},
  author={Mazzamuto, Michele and Di Mauro, Daniele and Francesca, Gianpiero and Farinella, Giovanni Maria and Furnari, Antonino},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2025}
}