---
license: apache-2.0
language:
- es
base_model:
- pyannote/segmentation-3.0
library_name: pyannote-audio
tags:
- pyannote
- pyannote-audio
- audio
- voice
- speech
- speaker
- speaker-diarization
- segmentation
pipeline_tag: automatic-speech-recognition
---

# pyannote-segmentation-3.0-RTVE-primary

## Model Details

This system is a collection of three fine-tuned models, to be fused with [DOVER-Lap](https://github.com/desh2608/dover-lap). Each model is fine-tuned while monitoring a different component of the Diarization Error Rate (False Alarm, Missed Detection, or Speaker Confusion). More information about the fusion of these models can be found in this [paper](https://www.isca-archive.org/iberspeech_2024/souganidis24_iberspeech.html).

Each model is a fine-tuned version of [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0) on [the RTVE database](https://catedrartve.unizar.es/rtvedatabase.html) used for the Albayzin Evaluations of IberSPEECH 2024.

On the RTVE2024 test set, the fused system achieves the following results (rounded to two decimals), making it the best-performing system of the Albayzin Evaluations 2024:

- Diarization Error Rate (DER): 14.98%
  - False Alarm: 2.64%
  - Missed Detection: 4.54%
  - Speaker Confusion: 7.80%

## Uses

This system is intended for speaker diarization of TV shows.

## Usage

The instructions to obtain the RTTM output of each model can be found [here](https://huggingface.co/pyannote/speaker-diarization-3.1), using this [configuration file](config.yaml).

Once the RTTM outputs are obtained, [this script](primary_fusion.py) can be modified to fuse the outputs of the three models.

## Training Details

### Training Data

The [train.lst](train.lst) file includes the URIs of the training data.

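The list files referenced in this card can be inspected with a few lines of Python. The following is a minimal sketch, assuming the file holds one URI per line (the usual layout for `.lst` files in pyannote-style setups; the actual file contents are not shown in this card). The helper name `read_uri_list` is hypothetical and not part of pyannote.

```python
from pathlib import Path


def read_uri_list(path):
    """Return the URIs listed in a .lst file.

    Assumption: one URI per line. Blank lines are skipped and
    surrounding whitespace is stripped.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

For example, `read_uri_list("train.lst")` would return the training URIs as a list of strings, which can then be counted or compared against the development and test lists to check that the splits do not overlap.
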
#### Training Hyperparameters

**Model:**
- duration: 10.0
- max_speakers_per_chunk: 3
- max_speakers_per_frame: 2
- train_batch_size: 32
- powerset_max_classes: 2

**Adam Optimizer:**
- lr: 0.0001

**Early Stopping:**
- direction: 'min'
- max_epochs: 20

### Development Data

The [development.lst](development.lst) file includes the URIs of the development data.

## Evaluation

- Forgiveness collar: 250 ms
- Skip overlap: False

### Testing Data & Metrics

#### Testing Data

The [test.lst](test.lst) file includes the URIs of the testing data.

#### Metrics

Diarization Error Rate, False Alarm, Missed Detection, Speaker Confusion.

## Citation

If you use these models, please cite:

**BibTeX:**

```bibtex
@inproceedings{souganidis24_iberspeech,
  title     = {HiTZ-Aholab Speaker Diarization System for Albayzin Evaluations of IberSPEECH 2024},
  author    = {Christoforos Souganidis and Gemma Meseguer and Asier Herranz and Inma {Hernáez Rioja} and Eva Navas and Ibon Saratxaga},
  year      = {2024},
  booktitle = {IberSPEECH 2024},
  pages     = {327--330},
  doi       = {10.21437/IberSPEECH.2024-68},
}
```

## Acknowledgments

This project, with reference 2022/TL22/00215335, has been partially funded by the Ministerio de Transformación Digital and by the Plan de Recuperación, Transformación y Resiliencia – Funded by the European Union – NextGenerationEU [ILENIA](https://proyectoilenia.es/) and by the project [IkerGaitu](https://www.hitz.eus/iker-gaitu/) funded by the Basque Government.