---
tags:
- mms
language:
- ab
- af
- ak
- am
- ar
- as
- av
- ay
- az
- ba
- bm
- be
- bn
- bi
- bo
- sh
- br
- bg
- ca
- cs
- ce
- cv
- ku
- cy
- da
- de
- dv
- dz
- el
- en
- eo
- et
- eu
- ee
- fo
- fa
- fj
- fi
- fr
- fy
- ff
- ga
- gl
- gn
- gu
- zh
- ht
- ha
- he
- hi
- hu
- hy
- ig
- ia
- ms
- is
- it
- jv
- ja
- kn
- ka
- kk
- kr
- km
- ki
- rw
- ky
- ko
- kv
- lo
- la
- lv
- ln
- lt
- lb
- lg
- mh
- ml
- mr
- mk
- mg
- mt
- mn
- mi
- my
- nl
- 'no'
- ne
- ny
- oc
- om
- or
- os
- pa
- pl
- pt
- ps
- qu
- ro
- rn
- ru
- sg
- sk
- sl
- sm
- sn
- sd
- so
- es
- sq
- su
- sv
- sw
- ta
- tt
- te
- tg
- tl
- th
- ti
- ts
- tr
- uk
- vi
- wo
- xh
- yo
- zu
- za
license: cc-by-nc-4.0
datasets:
- google/fleurs
metrics:
- acc
---

# Massively Multilingual Speech (MMS) - Finetuned LID

This checkpoint is a model fine-tuned for speech language identification (LID) and part of Facebook's [Massively Multilingual Speech project](https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/).
This checkpoint is based on the [Wav2Vec2 architecture](https://huggingface.co/docs/transformers/model_doc/wav2vec2) and maps raw audio input to a probability distribution over 512 output classes (each class representing a language).
The checkpoint consists of **1 billion parameters** and has been fine-tuned from [facebook/mms-1b](https://huggingface.co/facebook/mms-1b) on 512 languages.

## Table of Contents

- [Example](#example)
- [Supported Languages](#supported-languages)
- [Model details](#model-details)
- [Additional links](#additional-links)

## Example

This MMS checkpoint can be used with [Transformers](https://github.com/huggingface/transformers) to identify the spoken language of an audio clip. It can recognize the [following 512 languages](#supported-languages).

Let's look at a simple example.

First, we install transformers and some other libraries:

```
pip install torch accelerate torchaudio datasets
pip install --upgrade transformers
```

**Note**: In order to use MMS you need to have at least `transformers >= 4.30` installed. If the `4.30` version is not yet available [on PyPI](https://pypi.org/project/transformers/), make sure to install `transformers` from source:

```
pip install git+https://github.com/huggingface/transformers.git
```

Next, we load a couple of audio samples via `datasets`. Make sure that the audio data is sampled at 16,000 Hz (16 kHz).

```py
from datasets import load_dataset, Audio

# English
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]

# Arabic
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "ar", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
ar_sample = next(iter(stream_data))["audio"]["array"]
```

Next, we load the model and processor:

```py
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch

model_id = "facebook/mms-lid-512"

processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
```

Now we process the audio data and pass it to the model to classify it into a language, just as we usually do for Wav2Vec2 audio classification models such as [harshit345/xlsr-wav2vec-speech-emotion-recognition](https://huggingface.co/harshit345/xlsr-wav2vec-speech-emotion-recognition):

```py
# English
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

# Pick the class with the highest logit and map it to its language code
lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'eng'

# Arabic
inputs = processor(ar_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'ara'
```

To see all the supported languages of a checkpoint, you can print out the language ids as follows:

```py
# The label mapping lives on the model config, not on the feature extractor
model.config.id2label.values()
```

For more details about the architecture, please have a look at [the official docs](https://huggingface.co/docs/transformers/main/en/model_doc/mms).

## Supported Languages

This model supports 512 languages. Click the following to toggle all supported languages of this checkpoint in [ISO 639-3 code](https://en.wikipedia.org/wiki/ISO_639-3).
You can find more details about the languages and their ISO 639-3 codes in the [MMS Language Coverage Overview](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html).

Click to toggle - ara - cmn - eng - spa - fra - mlg - swe - por - vie - ful - sun - asm - ben - zlm - kor - ind - hin - tuk - urd - aze - slv - mon - hau - tel - swh - bod - rus - tur - heb - mar - som - tgl - tat - tha - cat - ron - mal - bel - pol - yor - nld - bul - hat - afr - isl - amh - tam - hun - hrv - lit - cym - fas - mkd - ell - bos - deu - sqi - jav - kmr - nob - uzb - snd - lat - nya - grn - mya - orm - lin - hye - yue - pan - jpn - kaz - npi - kik - kat - guj - kan - tgk - ukr - ces - lav - bak - khm - cak - fao - glg - ltz - xog - lao - mlt - sin - aka - sna - che - mam - ita - quc - srp - mri - tuv - nno - pus - eus - kbp - ory - lug - bre - luo - nhx - slk - ewe - fin - rif - dan - yid - yao - mos - quh - hne - xon - new - quy - est - dyu - ttq - bam - pse - uig - sck - ngl - tso - mup - dga - seh - lis - wal - ctg - bfz - bxk - ceb - kru - war - khg - bbc - thl - vmw - zne - sid - tpi - nym - bgq - bfy - hlb - teo - fon - kfx - bfa - mag - ayr - any - mnk - adx - ava - hyw - san - kek - chv - kri - btx - nhy - dnj - lon - men - ium - nga - nsu - prk - kir - bom - run - hwc - mnw - ubl - kin - rkt - xmm - iba - gux - ses - wsg - tir - gbm - mai - nyy - nan - nyn - gog - ngu - hoc - nyf - sus - bcc - hak - grt - suk - nij - kaa - bem - rmy - nus - ach - awa - dip - rim - nhe - pcm - kde - tem - quz - bba - kbr - taj - dik - dgo - bgc - xnr - kac - laj - dag - ktb - mgh - shn - oci - zyb - alz - wol - guw - nia - bci - sba - kab - nnb - ilo - mfe - xpe - bcl - haw - mad - ljp - gmv - nyo - kxm - nod - sag - sas - myx - sgw - mak - kfy - jam - lgg - nhi - mey - sgj - hay - pam - heh - nhw - yua - shi - mrw - hil - pag - cce - npl - ace - kam - min - pko - toi - ncj - umb - hno - ban - syl - bxg - nse - xho - mkw - nch - mas - bum - mww - epo - tzm - zul - lrc - ibo - abk - azz - guz - ksw - lus - ckb - mer - pov - rhg - knc - tum - nso - bho - ndc - ijc - qug - lub - srr - mni - zza - dje - tiv - gle - lua - swk - ada - lic - skr - mfa - bto - unr - 
hdy - kea - glk - ast - nup - sat - ktu - bhb - sgc - dks - ncl - emk - urh - tsc - idu - igb - its - kng - kmb - tsn - bin - gom - ven - sef - sco - trp - glv - haq - kha - rmn - sot - sou - gno - igl - efi - nde - rki - kjg - fan - wci - bjn - pmy - bqi - ina - hni - the - nuz - ajg - ymm - fmu - nyk - snk - esg - thq - pht - wes - pnb - phr - mui - tkt - bug - mrr - kas - zgb - lir - vah - ssw - iii - brx - rwr - kmc - dib - pcc - zyn - hea - hms - thr - wbr - bfb - wtm - blk - dhd - swv - zzj - niq - mtr - gju - kjp - haz - shy - nbl - aii - sjp - bns - brh - msi - tsg - tcy - kbl - noe - tyz - ahr - aar - wuu - kbd - bca - pwr - hsn - kua - tdd - bgp - abs - zlj - ebo - bra - nhp - tts - zyj - lmn - cqd - dcc - cjk - bfr - bew - arg - drs - chw - bej - bjj - ibb - tig - nut - jax - tdg - nlv - pch - fvr - mlq - kfr - nhn - tji - hoj - cpx - cdo - bgn - btm - trf - daq - max - nba - mut - hnd - ryu - abr - sop - odk - nap - gbr - czh - vls - gdx - yaf - sdh - anw - ttj - nhg - cgg - ifm - mdh - scn - lki - luz - stv - kmz - nds - mtq - knn - mnp - bar - mzn - gsw - fry
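
The example above keeps only the single most likely language. Since the model produces one score for each of the ISO 639-3 labels listed in this section, it can be useful to look at the top few candidates and their probabilities. The following is a minimal sketch and not part of the official MMS examples: `top_k_languages` is a hypothetical helper, and the toy logits and four-label mapping are made-up stand-ins for the real model outputs.

```python
import math


def top_k_languages(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs for one utterance.

    `logits` is a flat list of floats (one per class) and `id2label` maps
    class indices to language codes. Both names are assumptions made for
    this sketch.
    """
    # Numerically stable softmax: subtract the max logit before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Sort class indices by descending probability and keep the first k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id2label[i], probs[i]) for i in ranked]


# Toy stand-ins (hypothetical values, 4 classes instead of 512).
toy_id2label = {0: "ara", 1: "cmn", 2: "eng", 3: "spa"}
toy_logits = [0.5, -1.0, 3.0, 1.0]

for label, prob in top_k_languages(toy_logits, toy_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the real checkpoint, the same helper could presumably be called as `top_k_languages(outputs[0].tolist(), model.config.id2label)`, where `outputs` holds the logits from the example section.
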

## Model details

- **Developed by:** Vineel Pratap et al.
- **Model type:** Multilingual speech language identification (LID) model
- **Language(s):** 512 languages, see [supported languages](#supported-languages)
- **License:** CC-BY-NC 4.0 license
- **Num parameters**: 1 billion
- **Audio sampling rate**: 16,000 Hz (16 kHz)
- **Cite as:**

      @article{pratap2023mms,
        title={Scaling Speech Technology to 1,000+ Languages},
        author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
        journal={arXiv},
        year={2023}
      }

## Additional Links

- [Blog post](https://ai.facebook.com/blog/multilingual-model-speech-recognition/)
- [Transformers documentation](https://huggingface.co/docs/transformers/main/en/model_doc/mms)
- [Paper](https://arxiv.org/abs/2305.13516)
- [GitHub Repository](https://github.com/facebookresearch/fairseq/tree/main/examples/mms#asr)
- [Other **MMS** checkpoints](https://huggingface.co/models?other=mms)
- MMS base checkpoints:
  - [facebook/mms-1b](https://huggingface.co/facebook/mms-1b)
  - [facebook/mms-300m](https://huggingface.co/facebook/mms-300m)
- [Official Space](https://huggingface.co/spaces/facebook/MMS)