PROSKILL

Segment-Level Skill Assessment in Procedural Videos

Michele Mazzamuto*1,2, Daniele Di Mauro*2, Gianpiero Francesca3, Giovanni Maria Farinella1, Antonino Furnari1
* Co-first authors
1 IPLAB, University of Catania
2 Next Vision s.r.l., Italy
3 Toyota Motor Europe, Belgium

The first benchmark dataset for action-level skill assessment with fine-grained segment-level annotations in complex procedural tasks. It provides absolute skill assessment annotations alongside pairwise ones.

PROSKILL Visual Overview

Abstract

Skill assessment in procedural videos is crucial for the objective evaluation of human performance in settings such as manufacturing and procedural daily tasks. Current research on skill assessment has predominantly focused on sports and lacks large-scale datasets for complex procedural activities. Existing studies typically involve only a limited number of actions, focusing either on pairwise assessments (e.g., A is better than B) or on binary labels (e.g., good execution vs. needs improvement). In response to these shortcomings, we introduce PROSKILL, the first benchmark dataset for action-level skill assessment in procedural tasks. PROSKILL provides absolute skill assessment annotations, along with pairwise ones. This is enabled by a novel and scalable annotation protocol that allows for the creation of an absolute skill assessment ranking starting from pairwise assessments. This protocol leverages a Swiss Tournament scheme for efficient pairwise comparisons, which are then aggregated into consistent, continuous global scores using an ELO-based rating system. We use our dataset to benchmark the main state-of-the-art skill assessment algorithms, including both ranking-based and pairwise paradigms. The suboptimal results achieved by the current state of the art highlight the challenges, and thus the value, of PROSKILL in the context of skill assessment for procedural videos.

The PROSKILL Dataset

PROSKILL spans multiple domains with segment-level annotations enabling fine-grained skill assessment.

Subset       Clips  Actions  Hours  Avg ± Std duration (s)
Ikea ASM       160       10   1.28           28.88 ± 19.69
Meccano         80        5   1.06           47.59 ± 21.45
Assembly101    560       35   5.49           35.30 ± 25.27
EgoExo4D       191       12   4.70           88.14 ± 90.93
EpicTent       144        9   1.59           39.71 ± 34.18
Total         1135       71  14.12           44.75 ± 48.46

Annotation Approach

A three-stage protocol converts pairwise judgments into absolute skill scores that stabilize over rounds.

Annotation protocol overview: Swiss Tournament pairing, AMT pairwise judgments, and ELO-based absolute scoring.

Stage 1 — Pair Selection

We use a Swiss Tournament scheme to efficiently pair video segments, ensuring that segments with similar current standings face each other. This avoids trivial comparisons and maximizes the robustness of the collected judgments.
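
As a concrete illustration, below is a minimal Python sketch of Swiss-style pairing, assuming ratings live in a dict and previously compared pairs in a set; the function and variable names are ours, not taken from the paper's released code.

def swiss_pairings(ratings, history):
    """Pair segments with similar current standings, avoiding rematches.

    ratings: dict mapping segment id -> current rating
    history: set of frozensets of segment ids already compared
    """
    # Sort by current standing so neighbours have similar estimated skill.
    pool = sorted(ratings, key=ratings.get, reverse=True)
    pairs = []
    while len(pool) > 1:
        a = pool.pop(0)
        # Pick the closest-ranked opponent not faced before.
        for i, b in enumerate(pool):
            if frozenset((a, b)) not in history:
                pairs.append((a, pool.pop(i)))
                break
        else:  # every candidate already faced: allow a rematch
            pairs.append((a, pool.pop(0)))
    # With an odd pool, the last remaining segment sits out this round.
    return pairs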

Stage 2 — Pairwise Ranking

On a crowdsourcing platform (Amazon Mechanical Turk), qualified workers judge which of the two performances demonstrates higher skill, producing pairwise labels. In total, we collected 16,372 unique comparisons across datasets and rounds.

Stage 3 — Absolute Scoring

We leverage an ELO-based rating system—originally designed for chess—to aggregate pairwise outcomes into consistent, continuous global scores and a final absolute ranking.
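
For reference, the standard Elo update for a single comparison can be sketched as follows; the K-factor and initial rating shown are generic chess defaults, not necessarily the values used in the paper.

def elo_update(r_a, r_b, winner_a, k=32):
    """Standard Elo update for one pairwise comparison.

    r_a, r_b : current ratings of segments A and B
    winner_a : 1.0 if A was judged more skilled, 0.0 otherwise
    k        : update step (32 is a common default; assumed here)
    """
    # Expected score of A under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (winner_a - expected_a)
    r_b_new = r_b + k * ((1.0 - winner_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Example: A (1500) beats B (1500) -> A gains 16 points, B loses 16.
print(elo_update(1500.0, 1500.0, winner_a=1.0))  # (1516.0, 1484.0)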

The protocol ran for R = 6 rounds, achieving stable absolute ratings with convergent rankings in subsets such as Ikea ASM and EgoExo4D.
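
Putting the three stages together, here is a hedged end-to-end sketch that reuses the two helpers above; the judge callback stands in for an AMT judgment, and apart from the round count every constant is an assumption.

def run_tournament(segments, judge, rounds=6, init=1500.0, k=32):
    """Run Swiss rounds, updating ELO ratings after every comparison.

    segments: list of segment ids
    judge(a, b): returns 1.0 if a is judged more skilled than b, else 0.0
                 (in PROSKILL this role is played by AMT workers)
    """
    ratings = {s: init for s in segments}
    history = set()
    for _ in range(rounds):
        for a, b in swiss_pairings(ratings, history):
            history.add(frozenset((a, b)))
            ratings[a], ratings[b] = elo_update(
                ratings[a], ratings[b], judge(a, b), k)
    # Absolute ranking: highest final rating first.
    return sorted(ratings, key=ratings.get, reverse=True), ratings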

Results

We evaluate the benchmarked models across Ikea, Meccano, Assembly101, EgoExo4D, and EpicTent. Global models generally outperform pairwise setups in rank correlation, with CoFInAl reaching ρ = 0.59 on Meccano. Pairwise tasks remain hardest on Assembly101 (≈0.60 accuracy), while textual conditioning with MiniLM brings consistent though moderate gains.
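
For concreteness, the two metrics reported below, Spearman's ρ over the global ranking and pairwise accuracy, can be computed from predicted scores as in this sketch (the paper's exact evaluation code may differ).

from itertools import combinations
from scipy.stats import spearmanr

def evaluate(pred, gt):
    """Rank correlation and pairwise accuracy of predicted skill scores.

    pred, gt: sequences of scores aligned by segment index.
    """
    rho, _ = spearmanr(pred, gt)
    correct = total = 0
    for i, j in combinations(range(len(gt)), 2):
        if gt[i] == gt[j]:
            continue  # ignore ground-truth ties
        total += 1
        correct += (pred[i] > pred[j]) == (gt[i] > gt[j])
    return rho, correct / total if total else 0.0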

Spearman’s ρ for global ranking (I3D and VideoMAE features).

Method    Features  Ikea   Meccano  Assembly101  EgoExo4D  EpicTent
USDL      I3D       0.12   0.38     0.12         0.33      0.17
          VideoMAE  0.19   0.43     0.13         0.39      0.23
DAE-AQA   I3D       0.20   0.24     0.20         0.16      0.23
          VideoMAE  0.10   0.42     0.03         0.33      0.26
CoFInAl   I3D       0.26   0.59     0.14         0.20      0.21
          VideoMAE  0.26   0.31     0.11         0.28      0.23
AQA-TPT   I3D       0.14   0.12    -0.02         0.17     -0.01
          VideoMAE  0.21   0.35     0.15         0.36     -0.01
CoRe      I3D       0.22  -0.12     0.22         0.33      0.04
          VideoMAE  0.19   0.24     0.06         0.35      0.12
Average   I3D       0.20   0.24     0.13         0.24      0.13
          VideoMAE  0.20   0.35     0.10         0.34      0.17

Single-action vs. unified model (USDL).

Method       Features  Ikea   Meccano  Assembly101  EgoExo4D  EpicTent
USDL         I3D       0.12   0.38     0.12         0.33      0.17
             VideoMAE  0.19   0.43     0.13         0.39      0.23
USDL-Single  I3D      -0.18   0.09    -0.09         0.19      0.17
             VideoMAE  0.08   0.01    -0.31         0.04      0.06

Textual grounding with MiniLM (USDL).

Method            Features           Ikea   Meccano  Assembly101  EgoExo4D  EpicTent
USDL              I3D                0.19   0.38     0.12         0.33      0.17
                  VideoMAE           0.22   0.43     0.13         0.39      0.23
USDL + Grounding  I3D + MiniLM       0.24   0.36     0.12         0.33      0.20
                  VideoMAE + MiniLM  0.27   0.50     0.13         0.41      0.18
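
The grounding rows condition the skill model on the textual action label. One plausible realization, shown purely as an assumption-laden sketch, encodes the label with sentence-transformers' all-MiniLM-L6-v2 and concatenates the 384-d embedding to each visual feature; the paper's actual fusion may differ, and the example label is illustrative.

import numpy as np
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-d sentence embeddings.
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def ground_features(video_feats, action_label):
    """Concatenate a MiniLM embedding of the action label to each
    per-clip visual feature (I3D or VideoMAE).

    video_feats: (T, D) array of visual features for one segment.
    """
    text_emb = text_encoder.encode(action_label)          # (384,)
    tiled = np.tile(text_emb, (video_feats.shape[0], 1))  # (T, 384)
    return np.concatenate([video_feats, tiled], axis=1)   # (T, D + 384)

# Example: ground 8 I3D features (1024-d) with a hypothetical action label.
feats = ground_features(np.zeros((8, 1024)), "attach shelf pin")
print(feats.shape)  # (8, 1408)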

Download

Get the dataset, benchmark, documentation, and code.

People

Michele Mazzamuto
IPLAB, University of Catania

Daniele Di Mauro
Next Vision s.r.l., Italy

Gianpiero Francesca
Toyota Motor Europe, Belgium

Giovanni Maria Farinella
IPLAB, University of Catania

Antonino Furnari
IPLAB, University of Catania

Resources and Acknowledgements

  • Availability: Labels, code implementing the annotation protocol, and experimental pipelines will be publicly released.
  • Support: Supported by Toyota Motor Europe, Next Vision s.r.l., and the project Future Artificial Intelligence Research (FAIR).

Citation

If you find this work useful, please cite our paper:

@inproceedings{mazzamuto2025proskill,
  title={PROSKILL: Segment-Level Skill Assessment in Procedural Videos},
  author={Mazzamuto, Michele and Di Mauro, Daniele and Francesca, Gianpiero and Farinella, Giovanni Maria and Furnari, Antonino},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2025}
}