A multi-view, synchronized dataset of surgical hand gestures captured from expert surgeons and medical students at Geneva University Hospital.
Geneva University Hospital · University of Geneva
Our method consists of three main components: (1) a multi-view synchronized dataset collection pipeline, (2) a gesture recognition framework with multi-view fusion and knowledge distillation for efficient single-view deployment, and (3) an LLM-based feedback module (SurgEventRAG) that generates actionable coaching feedback for medical students based on recognized gesture sequences.
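To make the second component concrete, below is a minimal PyTorch sketch of the multi-view fusion and knowledge-distillation idea: a teacher classifies gestures from all five synchronized views, and a single-view student is trained against the teacher's softened predictions for efficient deployment. The module names, feature sizes, and mean-pooling fusion strategy are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_GESTURES = 15  # frame-level gesture primitives, per the dataset description

class ViewEncoder(nn.Module):
    """Per-view frame encoder; a tiny stand-in for any CNN/transformer backbone."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, x):  # x: (B, 3, H, W) RGB frames from one camera
        return self.backbone(x)

class MultiViewTeacher(nn.Module):
    """Encodes each of the five views and mean-pools features before classifying."""
    def __init__(self, num_views: int = 5, feat_dim: int = 256):
        super().__init__()
        self.encoders = nn.ModuleList(ViewEncoder(feat_dim) for _ in range(num_views))
        self.head = nn.Linear(feat_dim, NUM_GESTURES)

    def forward(self, views):  # views: list of (B, 3, H, W) tensors, one per camera
        fused = torch.stack([enc(v) for enc, v in zip(self.encoders, views)]).mean(dim=0)
        return self.head(fused)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard knowledge distillation: soften teacher predictions with
    temperature T and mix the KL term with the hard-label cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage: distill the fused teacher into a single-view student.
views = [torch.randn(2, 3, 64, 64) for _ in range(5)]
labels = torch.randint(0, NUM_GESTURES, (2,))
teacher = MultiViewTeacher()
student = nn.Sequential(ViewEncoder(), nn.Linear(256, NUM_GESTURES))
with torch.no_grad():
    teacher_logits = teacher(views)
loss = distillation_loss(student(views[0]), teacher_logits, labels)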
Five cameras capturing simultaneous viewpoints, enabling multi-view and single-view recognition benchmarks.
Recorded at Geneva University Hospital with expert surgeons and medical students performing incision and suturing tasks.
Frame-level gesture labels, skill-level annotations, and synchronized temporal segments across all views (see the loading sketch below).
Covers the full expertise spectrum — from novice medical students to board-certified expert surgeons.
Designed to support downstream tasks: gesture recognition, skill assessment, and LLM-based feedback generation.
Incision and suturing — covering distinct fine motor skills critical to surgical training curricula.
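For concreteness, the following sketch shows one way the synchronized recordings and frame-level labels could be consumed. The directory layout, file names, and CSV annotation format are assumptions for illustration, not the dataset's released structure.

from pathlib import Path
import csv

def load_trial(root: str, participant: str, task: str, trial: int, num_views: int = 5):
    """Collect the per-view video paths and frame-level gesture labels
    for one (participant, task, trial) recording."""
    trial_dir = Path(root) / participant / task / f"trial_{trial}"
    # One synchronized video file per camera (hypothetical naming scheme).
    videos = [trial_dir / f"view_{v}.mp4" for v in range(num_views)]
    # One gesture label per synchronized frame index (hypothetical CSV columns).
    labels = []
    with open(trial_dir / "gestures.csv") as f:
        for row in csv.DictReader(f):  # assumed columns: frame, gesture
            labels.append((int(row["frame"]), row["gesture"]))
    return videos, labels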
In surgical training for medical students, proficiency development relies on expert-led skill assessment, which is costly, time-limited, and difficult to scale, and whose expertise remains confined to institutions with available specialists. Automated AI-based assessment offers a viable alternative, but progress is constrained by the lack of datasets containing realistic trainee errors and the multi-view variability needed to train robust computer vision approaches. To address this gap, we present Surgical-Hands (SHANDS), a large-scale multi-view video dataset for surgical hand-gesture and error recognition for medical training. SHANDS captures linear incision and suturing performed by 52 participants (20 experts and 32 trainees), each completing three standardized trials per procedure, recorded by five RGB cameras from complementary viewpoints. The videos are annotated at the frame level with 15 gesture primitives and include a validated taxonomy of 8 trainee error types, enabling both gesture recognition and error detection. We further define standardized evaluation protocols for single-view, multi-view, and cross-view generalization, and benchmark state-of-the-art deep learning models on the dataset. SHANDS will be publicly released to support the development of robust and scalable AI systems for surgical training grounded in clinically curated domain knowledge.
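As one concrete reading of the three evaluation regimes named above, the sketch below enumerates a plausible set of camera splits. The exact protocol definitions are those specified in the paper; the split choices here are illustrative assumptions only.

VIEWS = [0, 1, 2, 3, 4]  # the five synchronized RGB cameras

def protocol_splits():
    return {
        # train and test on the same single camera, one configuration per view
        "single_view": [{"train_views": [v], "test_views": [v]} for v in VIEWS],
        # train and test with all five views available for fusion
        "multi_view": [{"train_views": VIEWS, "test_views": VIEWS}],
        # hold out one camera entirely to measure viewpoint generalization
        "cross_view": [{"train_views": [v for v in VIEWS if v != held],
                        "test_views": [held]}
                       for held in VIEWS],
    }

for name, splits in protocol_splits().items():
    print(f"{name}: {len(splits)} configuration(s)")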
The SHANDS dataset will be made available to the research community for academic and non-commercial research purposes in June 2026. Downloading will require completing a data access agreement.
If you use our dataset or models in your research, please cite:
@inproceedings{le2026surggest,
  title     = {SHANDS: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training},
  author    = {Le, Ma and Freitas dos Santos, Thiago and Magnenat-Thalmann, Nadia and Wac, Katarzyna},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
Funding: Supported by IDS (100.133 IP-ICT) and INDUX-R (GA No. 101135556; DOI: 10.3030/101135556). Funded by the European Union and the Swiss State Secretariat for Education, Research and Innovation (SERI). Disclaimer: Opinions expressed are the authors' alone and do not necessarily represent the EU or CINEA. Neither the EU nor the granting authority is responsible for the content.