---

license: mit
language:
- zh
- en
metrics:
- cer
- bleu
tags:
- asr
- automatic-speech-recognition
- automatic-speech-translation
- speech-translation
- speech-recognition
---



# MooER (摩耳): an LLM-based Speech Recognition and Translation Model from Moore Threads

**Online Demo**: https://mooer-speech.mthreads.com:10077/

## 🔥 Update

We have released a new model, *MooER-80K-v2*, trained on 80K hours of data. Currently, *MooER-80K-v2* supports the ASR task only. The AST and multi-task models will be released soon.

## 📖 Introduction

We introduce **MooER (摩耳)**, an LLM-based speech recognition and translation model developed by Moore Threads. With the *MooER* framework, you can transcribe speech into text (automatic speech recognition, or ASR) and translate it into other languages (automatic speech translation, or AST) in an end-to-end manner. The performance of *MooER* is presented in the following section; our insights into model configurations, training strategies, and more are provided in our [technical report](https://arxiv.org/abs/2408.05101).

For usage of the model files, please refer to our [GitHub](https://github.com/MooreThreads/MooER).

<br>
<p align="center">
    <img src="assets/framework.png" width="600"/>
</p>

<br>


## 🥊 Evaluation Results

We present the training data and the evaluation results below. For more comprehensive information, please refer to our [report](https://arxiv.org/pdf/2408.05101).

### Training data

We utilize 5k hours of data (MT5K) to train our basic *MooER-5K* model. The data sources include:

| Dataset       | Duration |
|---------------|----------|
| aishell2      | 137h     |
| librispeech   | 131h     |
| multi_cn      | 100h     |
| wenetspeech   | 1361h    |
| in-house data | 3274h    |



Note that data from the open-source datasets were randomly selected from their full training sets. The in-house data, collected internally without transcripts, were transcribed using a third-party ASR service.



Since all of the above datasets were originally designed only for speech recognition, no translation labels are available. To train our speech translation model, we used a third-party translation service to generate pseudo-labels. No data filtering techniques were applied.
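The labeling flow described above can be sketched as follows. This is only an illustration under stated assumptions: `transcribe` and `translate` are hypothetical stand-ins for the third-party ASR and translation services (which are not named here), and the sample layout is invented for the example rather than taken from the MooER code base.

```python
from typing import Callable, Dict, Optional


def build_sample(
    audio_path: str,
    transcript: Optional[str],
    transcribe: Callable[[str], str],  # hypothetical third-party ASR service
    translate: Callable[[str], str],   # hypothetical third-party translation service
) -> Dict[str, str]:
    """Return one training sample with an ASR label and a pseudo AST label."""
    # Open-source data already has a transcript; in-house audio does not.
    if transcript is None:
        transcript = transcribe(audio_path)
    # The translation target is always a pseudo-label derived from the transcript.
    translation = translate(transcript)
    # No filtering is applied: every sample is kept as-is.
    return {"audio": audio_path, "asr": transcript, "ast": translation}


# Stub services so the sketch runs without any external dependency.
fake_asr = lambda path: "你好"
fake_mt = lambda text: "hello" if text == "你好" else "<unk>"

labeled = build_sample("open/0001.wav", "你好", fake_asr, fake_mt)
unlabeled = build_sample("inhouse/0001.wav", None, fake_asr, fake_mt)
print(labeled)
print(unlabeled)
```

Passing the services in as callables keeps the sketch independent of any particular provider.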



At this moment, we are also developing a new model trained with 80K hours of data.



### Speech Recognition



Speech recognition performance is measured by character error rate (CER) for Chinese and word error rate (WER) for English; lower is better.
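For readers unfamiliar with these metrics, here is a minimal sketch of how WER and CER are commonly computed from an edit distance. The text normalization and tokenization behind the numbers reported in this section are not specified here, so the whitespace splitting (WER) and per-character splitting (CER) below are assumptions, and the helper names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match when equal)
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit):
    """Corpus-level error rate in percent; unit is 'word' (WER) or 'char' (CER)."""
    def tokenize(text):
        return text.split() if unit == "word" else list(text.replace(" ", ""))

    errors = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in zip(refs, hyps))
    total = sum(len(tokenize(r)) for r in refs)
    return 100.0 * errors / total


wer = error_rate(["the cat sat on the mat"], ["the cat sat on mat"], "word")
cer = error_rate(["今天天气很好"], ["今天天汽很好"], "char")
print(f"WER = {wer:.2f}%, CER = {cer:.2f}%")
```

Errors are summed over the whole test set before dividing by the total reference length, which is the usual corpus-level convention.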



<table>
  <tr>
    <th>Language</th>
    <th>Testset</th>
    <th>Paraformer-large</th>
    <th>SenseVoice-small</th>
    <th>Qwen-audio</th>
    <th>Whisper-large-v3</th>
    <th>SeamlessM4T-v2</th>
    <th>MooER-5K</th>
    <th>MooER-80K</th>
    <th>MooER-80K-v2</th>
  </tr>
  <tr>
    <td rowspan="7">Chinese</td>
    <td>aishell1</td>
    <td>1.93</td>
    <td>3.03</td>
    <td>1.43</td>
    <td>7.86</td>
    <td>4.09</td>
    <td>1.93</td>
    <td>1.25</td>
    <td>1.00</td>
  </tr>
  <tr>
    <td>aishell2_ios</td>
    <td>2.85</td>
    <td>3.79</td>
    <td>3.57</td>
    <td>5.38</td>
    <td>4.81</td>
    <td>3.17</td>
    <td>2.67</td>
    <td>2.62</td>
  </tr>
  <tr>
    <td>test_magicdata</td>
    <td>3.66</td>
    <td>3.81</td>
    <td>5.31</td>
    <td>8.36</td>
    <td>9.69</td>
    <td>3.48</td>
    <td>2.52</td>
    <td>2.17</td>
  </tr>
  <tr>
    <td>test_thchs</td>
    <td>3.99</td>
    <td>5.17</td>
    <td>4.86</td>
    <td>9.06</td>
    <td>7.14</td>
    <td>4.11</td>
    <td>3.14</td>
    <td>3.00</td>
  </tr>
  <tr>
    <td>fleurs cmn_dev</td>
    <td>5.56</td>
    <td>6.39</td>
    <td>10.54</td>
    <td>4.54</td>
    <td>7.12</td>
    <td>5.81</td>
    <td>5.23</td>
    <td>5.15</td>
  </tr>
  <tr>
    <td>fleurs cmn_test</td>
    <td>6.92</td>
    <td>7.36</td>
    <td>11.07</td>
    <td>5.24</td>
    <td>7.66</td>
    <td>6.77</td>
    <td>6.18</td>
    <td>6.14</td>
  </tr>
  <tr>
    <td>average</td>
    <td><strong>4.15</strong></td>
    <td><strong>4.93</strong></td>
    <td><strong>6.13</strong></td>
    <td><strong>6.74</strong></td>
    <td><strong>6.75</strong></td>
    <td><strong>4.21</strong></td>
    <td><strong>3.50</strong></td>
    <td><strong>3.35</strong></td>
  </tr>
  <tr>
    <td rowspan="7">English</td>
    <td>librispeech test_clean</td>
    <td>14.15</td>
    <td>4.07</td>
    <td>2.15</td>
    <td>3.42</td>
    <td>2.77</td>
    <td>7.78</td>
    <td>4.11</td>
    <td>3.57</td>
  </tr>
  <tr>
    <td>librispeech test_other</td>
    <td>22.99</td>
    <td>8.26</td>
    <td>4.68</td>
    <td>5.62</td>
    <td>5.25</td>
    <td>15.25</td>
    <td>9.99</td>
    <td>9.09</td>
  </tr>
  <tr>
    <td>fleurs eng_dev</td>
    <td>24.93</td>
    <td>12.92</td>
    <td>22.53</td>
    <td>11.63</td>
    <td>11.36</td>
    <td>18.89</td>
    <td>13.32</td>
    <td>13.12</td>
  </tr>
  <tr>
    <td>fleurs eng_test</td>
    <td>26.81</td>
    <td>13.41</td>
    <td>22.51</td>
    <td>12.57</td>
    <td>11.82</td>
    <td>20.41</td>
    <td>14.97</td>
    <td>14.74</td>
  </tr>
  <tr>
    <td>gigaspeech dev</td>
    <td>24.23</td>
    <td>19.44</td>
    <td>12.96</td>
    <td>19.18</td>
    <td>28.01</td>
    <td>23.46</td>
    <td>16.92</td>
    <td>17.34</td>
  </tr>
  <tr>
    <td>gigaspeech test</td>
    <td>23.07</td>
    <td>16.65</td>
    <td>13.26</td>
    <td>22.34</td>
    <td>28.65</td>
    <td>22.09</td>
    <td>16.64</td>
    <td>16.97</td>
  </tr>
  <tr>
    <td>average</td>
    <td><strong>22.70</strong></td>
    <td><strong>12.46</strong></td>
    <td><strong>13.02</strong></td>
    <td><strong>12.46</strong></td>
    <td><strong>14.64</strong></td>
    <td><strong>17.98</strong></td>
    <td><strong>12.66</strong></td>
    <td><strong>12.47</strong></td>
  </tr>
</table>
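
The `average` rows are consistent with an unweighted mean over the six test sets of each language, rounded to two decimals. The short sketch below checks this for the three MooER columns using the values from the table; the variable and function names are illustrative only.

```python
# Per-test-set error rates copied from the table, in row order.
chinese = {
    "MooER-5K": [1.93, 3.17, 3.48, 4.11, 5.81, 6.77],
    "MooER-80K": [1.25, 2.67, 2.52, 3.14, 5.23, 6.18],
    "MooER-80K-v2": [1.00, 2.62, 2.17, 3.00, 5.15, 6.14],
}
english = {
    "MooER-5K": [7.78, 15.25, 18.89, 20.41, 23.46, 22.09],
    "MooER-80K": [4.11, 9.99, 13.32, 14.97, 16.92, 16.64],
    "MooER-80K-v2": [3.57, 9.09, 13.12, 14.74, 17.34, 16.97],
}


def macro_average(scores):
    """Unweighted mean over test sets, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


for language, table in (("Chinese", chinese), ("English", english)):
    for model, scores in table.items():
        print(f"{language:8s} {model:13s} {macro_average(scores):.2f}")
```

Because the mean is unweighted, small test sets count as much as large ones, so the average is not the same as an error rate pooled over all utterances.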


### Speech Translation (zh -> en)

For speech translation, performance is evaluated using the BLEU score (higher is better).

| Testset | Speech-LLaMA | Whisper-large-v3 | Qwen-audio | Qwen2-audio | SeamlessM4T-v2 | MooER-5K | MooER-5K-MTL |
|--------|-------------|-------------------|------------|-------------|-----------------|--------|--------------|
|CoVoST1 zh2en | - |  13.5 | 13.5 | - | 25.3 | - | **30.2** |
|CoVoST2 zh2en | 12.3 | 12.2 | 15.7 | 24.4 | 22.2 | 23.4 | **25.2** |
|CCMT2019 dev | -  | 15.9 | 12.0 | - | 14.8 | - | **19.6** |
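
As background on the metric, the following is a minimal, dependency-free sketch of corpus-level BLEU with a single reference per hypothesis. It is not the scoring script behind the table above: the tokenization (plain whitespace splitting) and the lack of smoothing are assumptions made for brevity, and established tools apply their own tokenization rules, so absolute numbers will differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tok, hyp_tok = ref.split(), hyp.split()
        ref_len += len(ref_tok)
        hyp_len += len(hyp_tok)
        for n in range(1, max_n + 1):
            ref_counts, hyp_counts = ngrams(ref_tok, n), ngrams(hyp_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty only applies when the output is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


score = corpus_bleu(["the cat is on the mat"], ["the cat is on a mat"])
print(f"BLEU = {score:.2f}")
```

The score is the geometric mean of the clipped 1- to 4-gram precisions multiplied by the brevity penalty, accumulated over the whole corpus rather than averaged per sentence.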


## 🏁 Getting Started

Please visit our [GitHub](https://github.com/MooreThreads/MooER) for the setup and usage.


## 🧾 License

Please see the [LICENSE](LICENSE).


## 💖 Citation

If you find MooER useful for your research, please 🌟 this repo and cite our work using the following BibTeX:

```bibtex
@article{liang2024mooer,
  title   = {MooER: an LLM-based Speech Recognition and Translation Model from Moore Threads},
  author  = {Zhenlin Liang and Junhao Xu and Yi Liu and Yichao Hu and Jian Li and Yajun Zheng and Meng Cai and Hua Wang},
  journal = {arXiv preprint arXiv:2408.05101},
  url     = {https://arxiv.org/abs/2408.05101},
  year    = {2024}
}
```

## 📧 Contact

If you encounter any problems, feel free to create a discussion.

Moore Threads Website: **https://www.mthreads.com/**

<br>
<p align="left">
    <img src="assets/MTLogo.png" width="300"/>
</p>

<br>