SignIT

A Comprehensive Dataset and Multimodal Analysis for Italian Sign Language Recognition

This work presents SignIT, a novel dataset designed to facilitate the study of Italian Sign Language (LIS) recognition.

Abstract

In this work we present SignIT, a new dataset for studying the task of Italian Sign Language (LIS) recognition. The dataset is composed of 644 videos covering 3.33 hours. We manually annotated the videos according to a taxonomy of 94 distinct sign classes belonging to 5 macro-categories: Animals, Food, Colors, Emotions, and Family. We also extracted 2D keypoints for the hands, face, and body of the signers. With the dataset, we propose a benchmark for the sign recognition task, adopting several state-of-the-art models and showing how temporal information, 2D keypoints, and RGB frames influence their performance. Results show the limitations of these models on this challenging LIS dataset.

Dataset Description

The SignIT dataset contains 94 Italian Sign Language (LIS) gesture classes, organized into five macro-categories: Animals, Food, Colors, Emotions, and Family. The dataset was collected from 644 publicly available videos, each featuring a single signer performing multiple LIS signs. These videos were manually segmented into individual clips, resulting in one clip per gesture instance.

Recordings were captured indoors across 37 different environments, with variations in background and lighting. The full dataset includes 3 hours and 34 minutes of footage, with video resolutions ranging from 426×240 to 1024×1024 pixels and frame rates between 24 and 30 FPS. To prevent models from exploiting textual cues (e.g., background labels), a blurring preprocessing step was applied to obscure these regions.
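
The exact blurring procedure is not detailed here; the snippet below is a minimal sketch of how such text regions could be obscured with OpenCV, assuming the bounding boxes of the textual cues are known (e.g., annotated manually). The function, file names, and box coordinates are illustrative, not the pipeline used for SignIT.

import cv2

def blur_regions(frame, boxes, kernel=(51, 51)):
    # Blur each (x, y, w, h) region in-place to hide textual cues.
    for x, y, w, h in boxes:
        roi = frame[y:y + h, x:x + w]
        frame[y:y + h, x:x + w] = cv2.GaussianBlur(roi, kernel, 0)
    return frame

# Example: obscure a hypothetical label region in the top-left corner of a frame.
frame = cv2.imread("frame_0001.jpg")
frame = blur_regions(frame, boxes=[(10, 10, 200, 60)])
cv2.imwrite("frame_0001_blurred.jpg", frame)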

All videos were annotated according to the 94 sign classes and divided into the following splits:

Training: 311 videos (~1h 43m) — 48.5%
Validation: 138 videos (~46m) — 21.2%
Test: 195 videos (~1h 05m) — 30.3%

This distribution is consistent across both the number of videos and the total annotated frames (≈99k frames).

Class distributions vary by macro-category. Some categories (e.g., Animals, Food) exhibit a long-tail imbalance, with frequent classes such as dog or banana exceeding 1,200–1,500 samples, while rarer signs have fewer than 200. Categories like Colors are more uniform, whereas Emotions includes only five classes with strong frequency differences (anger being the most common).

For every frame, 2D keypoints were extracted using MediaPipe, including:

  • 21 hand keypoints per hand
  • 51 facial keypoints (lips, nose, eyes, eyebrows, contour)
  • 33 body keypoints (upper-body pose)

These multimodal annotations support both pose-based and appearance-based approaches to LIS recognition.
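
As a reference for working with the released keypoints, the sketch below shows how comparable per-frame 2D keypoints can be extracted with MediaPipe Holistic. It is a minimal sketch, not necessarily the authors' exact configuration: in particular, the 51 facial keypoints in SignIT are a subset of MediaPipe's 468-point face mesh, and the subset indices are not reproduced here.

import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def extract_keypoints(video_path):
    # Per-frame 2D keypoints (normalized x, y) for body, hands, and face.
    frames_kp = []
    cap = cv2.VideoCapture(video_path)
    with mp_holistic.Holistic(static_image_mode=False) as holistic:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

            def to_xy(landmarks, n):
                if landmarks is None:              # missing detection: zero-fill
                    return [(0.0, 0.0)] * n
                return [(lm.x, lm.y) for lm in landmarks.landmark]

            frames_kp.append({
                "pose": to_xy(results.pose_landmarks, 33),          # 33 body keypoints
                "left_hand": to_xy(results.left_hand_landmarks, 21),
                "right_hand": to_xy(results.right_hand_landmarks, 21),
                "face": to_xy(results.face_landmarks, 468),         # subsample to 51 downstream
            })
    cap.release()
    return frames_kp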


Figure 1: Dataset visualization

Dataset Statistics

The SignIT dataset contains 94 sign categories across 5 macro-categories: Animals, Food, Colors, Emotions, and Family. These categories represent fundamental vocabulary commonly taught first in sign language education.

SignIT Dataset Distribution

Multi-level visualization: Inner ring shows macro-categories, outer ring shows individual sign classes


Dataset Split Distribution

Training Set: 311 videos (~1h 43m), 48.5%
Validation Set: 138 videos (~46m), 21.2%
Test Set: 195 videos (~1h 05m), 30.3%

Total: ~99,000 annotated frames across all splits

Number of videos per macro-category

Animals: 180 videos; 32 sign classes (34% of classes)
Food: 75 videos; 20 sign classes (21.3%)
Colors: 196 videos; 17 sign classes (18.1%)
Emotions: 42 videos; 5 sign classes (5.3%)
Family: 151 videos; 20 sign classes (21.3%)

📦 Dataset Downloads

Download the different components of the SignIT dataset. All files are compressed in ZIP format. The CSV Macro and CSV Micro files list the frame names together with their macro/micro labels and their assignment to the training, validation, and test splits.
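
As an illustration only, annotation files of this kind can be loaded with pandas as sketched below; the file and column names ("signit_macro.csv", "frame", "label", "split") are hypothetical placeholders and may differ from the released CSVs.

import pandas as pd

df = pd.read_csv("signit_macro.csv")            # hypothetical file name
train_df = df[df["split"] == "train"]           # hypothetical column names
print(train_df["label"].value_counts())         # class distribution of the training split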

Benchmark and Results

We benchmarked several baseline models to evaluate gesture recognition performance across different input modalities: 2D keypoints, RGB appearance, and multimodal inputs.

K-NN: A standard K-Nearest Neighbors classifier operating on 2D hand, face, and body keypoints concatenated into a single normalized feature vector (a minimal sketch of this feature construction follows the model list).
MLP: A three-layer fully connected network using the same keypoint features as K-NN.
ResNet18: A 2D convolutional network that predicts signs directly from RGB images.
I3D: A 3D convolutional network processing clips of 16 consecutive frames resized to 224×224 pixels.
LLaVA-OneVision: A multimodal large language model (based on Qwen2-7B) prompted to classify LIS signs from RGB frames, associated pose keypoints, and category-based prompts.
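
The snippet below is a minimal sketch of the keypoint feature construction and a K-NN classifier using scikit-learn. The feature layout, normalization, and number of neighbors are assumptions for illustration, not the configuration reported in the paper; random arrays stand in for real keypoint annotations.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def keypoint_feature(pose, left_hand, right_hand, face):
    # Concatenate all 2D keypoints into one flat, L2-normalized feature vector.
    feat = np.concatenate([np.asarray(k, dtype=np.float32).ravel()
                           for k in (pose, left_hand, right_hand, face)])
    norm = np.linalg.norm(feat)
    return feat / norm if norm > 0 else feat

# Toy example with random keypoints standing in for real annotations.
rng = np.random.default_rng(0)
X_train = np.stack([keypoint_feature(rng.random((33, 2)), rng.random((21, 2)),
                                     rng.random((21, 2)), rng.random((51, 2)))
                    for _ in range(100)])
y_train = rng.integers(0, 94, size=100)        # 94 sign classes
knn = KNeighborsClassifier(n_neighbors=5)      # k is an assumption
knn.fit(X_train, y_train)
print(knn.predict(X_train[:3]))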

Evaluation Metrics: Performance was measured using accuracy, precision, recall, and F1-score to account for class imbalance.
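
For reference, these metrics can be computed with scikit-learn as sketched below; macro averaging is shown because it weights all sign classes equally under class imbalance, though the exact averaging scheme used in the paper is an assumption here.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 2, 1, 2, 0, 1]     # ground-truth sign-class indices (toy example)
y_pred = [0, 2, 2, 2, 0, 1]     # model predictions

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")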

Macro Results

Figure: macro-label results, with per-category views (e.g., Food category results).

Micro Results

Figure: micro-label results, with per-category views (e.g., Food category results).

Citation

📚 If you find our work useful, cite our paper!

@misc{micieli2025signitcomprehensivedatasetmultimodal,
  title={SignIT: A Comprehensive Dataset and Multimodal Analysis for Italian Sign Language Recognition},
  author={Alessia Micieli and Giovanni Maria Farinella and Francesco Ragusa},
  year={2025},
  eprint={2512.14489},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.14489},
}

Acknowledgements

This study has been supported by Next Vision s.r.l. and by the Research Program PIAno di inCEntivi per la Ricerca di Ateneo 2024/2026, project "Multi-Agent Simulator for Real-Time Decision-Making Strategies in Uncertain Egocentric Scenarios" - University of Catania.

Meet the Authors 🤝

Alessia Micieli, Giovanni Maria Farinella, Francesco Ragusa