Updated README
README.md
CHANGED
@@ -12,9 +12,9 @@ base_model:
 pipeline_tag: text-to-speech
 ---
 
-# Model Card for indri-0.1-
+# Model Card for indri-0.1-124m-tts
 
-Indri is a series of audio models that can do TTS, ASR, and audio continuation. This is the smallest model (
+Indri is a series of audio models that can do TTS, ASR, and audio continuation. This is the smallest model (124M) in our series and supports TTS tasks in 2 languages:
 
 1. English
 2. Hindi
@@ -29,7 +29,7 @@ We have open-sourced our training scripts, inference, and other details.
 
 ### Model Description
 
-`indri-0.1-
+`indri-0.1-124m-tts` is a novel, ultra-small, and lightweight TTS model based on the transformer architecture.
 It models audio as tokens and can generate high-quality audio with consistent style cloning of the speaker.
 
 ### Key features
@@ -42,7 +42,7 @@ It models audio as tokens and can generate high-quality audio with consistent st
 ### Details
 
 1. Model Type: GPT-2 based language model
-2. Size:
+2. Size: 124M parameters
 3. Language Support: English, Hindi
 4. License: CC BY 4.0
 
@@ -52,7 +52,7 @@ Here's a brief of how the model works:
 
 1. Converts input text into tokens
 2. Runs autoregressive decoding on GPT-2 based transformer model and generates audio tokens
-3. Decodes audio tokens (
+3. Decodes audio tokens (using [Kyutai/mimi](https://huggingface.co/kyutai/mimi)) to audio
 
 Please read our blog [here](#TODO) for more technical details on how it was built.
 
@@ -65,11 +65,11 @@ import torch
 import torchaudio
 from transformers import pipeline
 
+model_id = '11mlabs/indri-0.1-124m-tts'
 task = 'indri-tts'
-model_id = '11mlabs/indri-0.1-125m-tts'
 
 pipe = pipeline(
-
+task,
 model=model_id,
 device=torch.device('cuda:0'), # Update this based on your hardware,
 trust_remote_code=True
@@ -80,22 +80,59 @@ output = pipe(['Hi, my name is Indri and I like to talk.'])
 torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)
 ```
 
-## Credits
-
-1. [Kyutai/mimi](https://huggingface.co/kyutai/mimi)
-2. [nanoGPT](https://github.com/karpathy/nanoGPT)
-
 ## Citation
 
-
+If you use this model in your research, please cite:
 
-```
-@misc{indri-
+```bibtex
+@misc{indri-multimodal-alm,
 author = {11mlabs},
-title = {
-year = 2024,
-publisher = {
+title = {Indri: Multimodal audio language model},
+year = {2024},
+publisher = {GitHub},
 journal = {GitHub Repository},
 howpublished = {\url{https://github.com/cmeraki/indri}},
+email = {[email protected]}
+}
+```
+
+## BibTex
+1. [nanoGPT](https://github.com/karpathy/nanoGPT)
+2. [Kyutai/mimi](https://huggingface.co/kyutai/mimi)
+```bibtex
+@techreport{kyutai2024moshi,
+title={Moshi: a speech-text foundation model for real-time dialogue},
+author={Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and
+Am\'elie Royer and Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
+year={2024},
+eprint={2410.00037},
+archivePrefix={arXiv},
+primaryClass={eess.AS},
+url={https://arxiv.org/abs/2410.00037},
+}
+```
+3. [Whisper](https://github.com/openai/whisper)
+```bibtex
+@misc{radford2022whisper,
+doi = {10.48550/ARXIV.2212.04356},
+url = {https://arxiv.org/abs/2212.04356},
+author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
+title = {Robust Speech Recognition via Large-Scale Weak Supervision},
+publisher = {arXiv},
+year = {2022},
+copyright = {arXiv.org perpetual, non-exclusive license}
+}
+```
+4. [silero-vad](https://github.com/snakers4/silero-vad)
+```bibtex
+@misc{Silero VAD,
+author = {Silero Team},
+title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
+year = {2024},
+publisher = {GitHub},
+journal = {GitHub repository},
+howpublished = {\url{https://github.com/snakers4/silero-vad}},
+commit = {insert_some_commit_here},
+email = {[email protected]}
 }
 ```
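
As a rough illustration of the three steps the README lists under "Here's a brief of how the model works" (text to tokens, autoregressive decoding into audio tokens, decoding audio tokens to audio), the control flow might be sketched as below. This is a toy under stated assumptions, not the Indri implementation: every function name and constant is hypothetical, the "model" is a pseudo-random stand-in rather than a GPT-2 transformer, and the "codec" emits sine bursts instead of calling Kyutai/mimi.

```python
import math
import random
from typing import List

# Hypothetical sizes, chosen only for illustration.
TEXT_VOCAB = 256        # one text token per UTF-8 byte
AUDIO_VOCAB = 16        # toy audio codebook size
SAMPLES_PER_TOKEN = 4   # toy codec frame length, in samples


def text_to_tokens(text: str) -> List[int]:
    """Step 1: convert input text into tokens (here: raw UTF-8 bytes)."""
    return list(text.encode("utf-8"))


def toy_next_token(context: List[int], rng: random.Random) -> int:
    """Stand-in for the transformer: pick one audio token id given the context."""
    # A real model would compute logits over the vocabulary from the context.
    return (sum(context) + rng.randrange(AUDIO_VOCAB)) % AUDIO_VOCAB


def generate_audio_tokens(text_tokens: List[int], n_tokens: int, seed: int = 0) -> List[int]:
    """Step 2: autoregressive decoding, one audio token at a time."""
    rng = random.Random(seed)
    context = list(text_tokens)
    audio_tokens: List[int] = []
    for _ in range(n_tokens):
        tok = toy_next_token(context, rng)
        audio_tokens.append(tok)
        # Feed the new token back in; the offset keeps audio ids apart from text ids.
        context.append(TEXT_VOCAB + tok)
    return audio_tokens


def decode_audio_tokens(audio_tokens: List[int]) -> List[float]:
    """Step 3: stand-in codec decoder; each token becomes a short sine burst."""
    samples: List[float] = []
    for tok in audio_tokens:
        freq = (tok + 1) / (2 * AUDIO_VOCAB)  # cycles per sample
        for n in range(SAMPLES_PER_TOKEN):
            samples.append(math.sin(2 * math.pi * freq * n))
    return samples


def synthesize(text: str, n_tokens: int = 8, seed: int = 0) -> List[float]:
    """Chain the three steps: text -> text tokens -> audio tokens -> waveform."""
    text_tokens = text_to_tokens(text)
    audio_tokens = generate_audio_tokens(text_tokens, n_tokens, seed)
    return decode_audio_tokens(audio_tokens)


if __name__ == "__main__":
    wave = synthesize("Hi, my name is Indri and I like to talk.")
    print(len(wave), "samples")
```

The loop in `generate_audio_tokens` is the part that mirrors step 2: each generated token is appended to the context before the next one is produced, which is what makes the decoding autoregressive. In the README's own usage snippet, all three stages are hidden behind the single `pipe(...)` call.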