---
license: mit
language:
- zh
- en
metrics:
- cer
- bleu
tags:
- asr
- automatic-speech-recognition
- automatic-speech-translation
- speech-translation
- speech-recognition
---
# MooER (摩耳): an LLM-based Speech Recognition and Translation Model from Moore Threads
**Online Demo**: https://mooer-speech.mthreads.com:10077/
## 🔥 Update
We release a new model *MooER-80K-v2* using 80K hours of data. Currently, *MooER-80K-v2* supports the ASR task. The AST and multi-task models will be released soon.
## 📖 Introduction
We introduce **MooER (摩耳)**, an LLM-based speech recognition and translation model developed by Moore Threads. With the *MooER* framework, you can transcribe speech into text (automatic speech recognition, ASR) and translate it into other languages (automatic speech translation, AST) in an end-to-end manner. The next section reports the performance of *MooER*; our insights into model configurations, training strategies, and more are provided in our [technical report](https://arxiv.org/abs/2408.05101).
For usage of the model files, please refer to our [GitHub](https://github.com/MooreThreads/MooER) repository.
<br>
<p align="center">
<img src="assets/framework.png" width="600"/>
</p>
<br>
## 🥊 Evaluation Results
We present the training data and the evaluation results below. For more comprehensive information, please refer to our [report](https://arxiv.org/pdf/2408.05101).
### Training data
We utilize 5k hours of data (MT5K) to train our basic *MooER-5K* model. The data sources include:
| Dataset | Duration |
|---------------|---------------|
| aishell2 | 137h |
| librispeech | 131h |
| multi_cn | 100h |
| wenetspeech | 1361h |
| in-house data | 3274h |
Note that the data from the open-source datasets were randomly selected from their full training sets. The in-house data, collected internally without transcripts, were transcribed using a third-party ASR service.
Since all the above datasets were originally designed only for the speech recognition task, no translation labels are available. To train our speech translation model, we used a third-party translation service to generate pseudo-labels. No data filtering techniques were applied.
At this moment, we are also developing a new model trained with 80K hours of data.
### Speech Recognition
Speech recognition performance is evaluated using WER/CER (word/character error rate, in %; lower is better).
<table>
<tr>
<th>Language</th>
<th>Testset</th>
<th>Paraformer-large</th>
<th>SenseVoice-small</th>
<th>Qwen-audio</th>
<th>Whisper-large-v3</th>
<th>SeamlessM4T-v2</th>
<th>MooER-5K</th>
<th>MooER-80K</th>
<th>MooER-80K-v2</th>
</tr>
<tr>
<td rowspan="7">Chinese</td>
<td>aishell1</td>
<td>1.93</td>
<td>3.03</td>
<td>1.43</td>
<td>7.86</td>
<td>4.09</td>
<td>1.93</td>
<td>1.25</td>
<td>1.00</td>
</tr>
<tr>
<td>aishell2_ios</td>
<td>2.85</td>
<td>3.79</td>
<td>3.57</td>
<td>5.38</td>
<td>4.81</td>
<td>3.17</td>
<td>2.67</td>
<td>2.62</td>
</tr>
<tr>
<td>test_magicdata</td>
<td>3.66</td>
<td>3.81</td>
<td>5.31</td>
<td>8.36</td>
<td>9.69</td>
<td>3.48</td>
<td>2.52</td>
<td>2.17</td>
</tr>
<tr>
<td>test_thchs</td>
<td>3.99</td>
<td>5.17</td>
<td>4.86</td>
<td>9.06</td>
<td>7.14</td>
<td>4.11</td>
<td>3.14</td>
<td>3.00</td>
</tr>
<tr>
<td>fleurs cmn_dev</td>
<td>5.56</td>
<td>6.39</td>
<td>10.54</td>
<td>4.54</td>
<td>7.12</td>
<td>5.81</td>
<td>5.23</td>
<td>5.15</td>
</tr>
<tr>
<td>fleurs cmn_test</td>
<td>6.92</td>
<td>7.36</td>
<td>11.07</td>
<td>5.24</td>
<td>7.66</td>
<td>6.77</td>
<td>6.18</td>
<td>6.14</td>
</tr>
<tr>
<td>average</td>
<td><strong>4.15</strong></td>
<td><strong>4.93</strong></td>
<td><strong>6.13</strong></td>
<td><strong>6.74</strong></td>
<td><strong>6.75</strong></td>
<td><strong>4.21</strong></td>
<td><strong>3.50</strong></td>
<td><strong>3.35</strong></td>
</tr>
<tr>
<td rowspan="7">English</td>
<td>librispeech test_clean</td>
<td>14.15</td>
<td>4.07</td>
<td>2.15</td>
<td>3.42</td>
<td>2.77</td>
<td>7.78</td>
<td>4.11</td>
<td>3.57</td>
</tr>
<tr>
<td>librispeech test_other</td>
<td>22.99</td>
<td>8.26</td>
<td>4.68</td>
<td>5.62</td>
<td>5.25</td>
<td>15.25</td>
<td>9.99</td>
<td>9.09</td>
</tr>
<tr>
<td>fleurs eng_dev</td>
<td>24.93</td>
<td>12.92</td>
<td>22.53</td>
<td>11.63</td>
<td>11.36</td>
<td>18.89</td>
<td>13.32</td>
<td>13.12</td>
</tr>
<tr>
<td>fleurs eng_test</td>
<td>26.81</td>
<td>13.41</td>
<td>22.51</td>
<td>12.57</td>
<td>11.82</td>
<td>20.41</td>
<td>14.97</td>
<td>14.74</td>
</tr>
<tr>
<td>gigaspeech dev</td>
<td>24.23</td>
<td>19.44</td>
<td>12.96</td>
<td>19.18</td>
<td>28.01</td>
<td>23.46</td>
<td>16.92</td>
<td>17.34</td>
</tr>
<tr>
<td>gigaspeech test</td>
<td>23.07</td>
<td>16.65</td>
<td>13.26</td>
<td>22.34</td>
<td>28.65</td>
<td>22.09</td>
<td>16.64</td>
<td>16.97</td>
</tr>
<tr>
<td>average</td>
<td><strong>22.70</strong></td>
<td><strong>12.46</strong></td>
<td><strong>13.02</strong></td>
<td><strong>12.46</strong></td>
<td><strong>14.64</strong></td>
<td><strong>17.98</strong></td>
<td><strong>12.66</strong></td>
<td><strong>12.47</strong></td>
</tr>
</table>
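
As a rough illustration of how a WER/CER figure like those in the table above is computed, here is a minimal sketch in pure Python. It assumes the usual definition (total edit distance divided by total reference length, in percent); the function names `edit_distance` and `error_rate` are hypothetical, and the text normalization applied in our actual evaluation is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing in hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def tokenize(text, unit):
    """Split into words (for WER) or into characters without spaces (for CER)."""
    if unit == "word":
        return text.split()
    return list(text.replace(" ", ""))


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate in percent: total edits / total reference tokens."""
    errors = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = tokenize(ref, unit), tokenize(hyp, unit)
        errors += edit_distance(r, h)
        total += len(r)
    return 100.0 * errors / total
```

For example, `error_rate(["the cat sat"], ["the cat sit"], unit="word")` counts one substitution over three reference words, and `unit="char"` gives a character-level rate suited to Chinese text.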
### Speech Translation (zh -> en)
For speech translation, the performance is evaluated using the BLEU score (higher is better).
| Testset | Speech-LLaMA | Whisper-large-v3 | Qwen-audio | Qwen2-audio | SeamlessM4T-v2 | MooER-5K | MooER-5K-MTL |
|--------|-------------|-------------------|------------|-------------|-----------------|--------|--------------|
|CoVoST1 zh2en | - | 13.5 | 13.5 | - | 25.3 | - | **30.2** |
|CoVoST2 zh2en | 12.3 | 12.2 | 15.7 | 24.4 | 22.2 | 23.4 | **25.2** |
|CCMT2019 dev | - | 15.9 | 12.0 | - | 14.8 | - | **19.6** |
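
To make the metric concrete, the following is a minimal sketch of corpus-level BLEU in pure Python, assuming one reference per hypothesis, whitespace tokenization, up to 4-grams, and no smoothing. The function names `ngrams` and `corpus_bleu` are hypothetical. This README does not state the scoring tool or tokenization behind the numbers above, so this sketch is not expected to reproduce them exactly; a standard tool such as sacreBLEU is the usual choice for reported scores.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(refs, hyps, max_n=4):
    """Corpus-level BLEU on a 0-100 scale (single reference, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = 0
    hyp_len = 0
    for ref, hyp in zip(refs, hyps):
        r, h = ref.split(), hyp.split()
        ref_len += len(r)
        hyp_len += len(h)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: each hypothesis n-gram counts at most as often
            # as it appears in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

A hypothesis identical to its reference scores 100, while a hypothesis that drops the final word keeps perfect n-gram precision but is scaled down by the brevity penalty.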
## 🏁 Getting Started
Please visit our [GitHub](https://github.com/MooreThreads/MooER) for the setup and usage.
## 🧾 License
Please see the [LICENSE](LICENSE).
## 💖 Citation
If you find MooER useful for your research, please 🌟 this repo and cite our work using the following BibTeX:
```bibtex
@article{liang2024mooer,
title = {MooER: an LLM-based Speech Recognition and Translation Model from Moore Threads},
author = {Zhenlin Liang and Junhao Xu and Yi Liu and Yichao Hu and Jian Li and Yajun Zheng and Meng Cai and Hua Wang},
journal = {arXiv preprint arXiv:2408.05101},
url = {https://arxiv.org/abs/2408.05101},
year = {2024}
}
```
## 📧 Contact
If you encounter any problems, feel free to create a discussion.
Moore Threads Website: **https://www.mthreads.com/**
<br>
<p align="left">
<img src="assets/MTLogo.png" width="300"/>
</p>
<br>