cmeraki committed on
Commit
215033b
·
1 Parent(s): 6cb714c

Updated README

Files changed (1)
  1. README.md +55 -18
README.md CHANGED
@@ -12,9 +12,9 @@ base_model:
12
  pipeline_tag: text-to-speech
13
  ---
14
 
15
- # Model Card for indri-0.1-125m-tts
16
 
17
- Indri is a series of audio models that can do TTS, ASR, and audio continuation. This is the smallest model (125M) in our series and supports TTS tasks in 2 languages:
18
 
19
  1. English
20
  2. Hindi
@@ -29,7 +29,7 @@ We have open-sourced our training scripts, inference, and other details.
29
 
30
  ### Model Description
31
 
32
- `indri-0.1-125m-tts` is a novel, ultra-small, and lightweight TTS model based on the transformer architecture.
33
  It models audio as tokens and can generate high-quality audio with consistent style cloning of the speaker.
34
 
35
  ### Key features
@@ -42,7 +42,7 @@ It models audio as tokens and can generate high-quality audio with consistent st
42
  ### Details
43
 
44
  1. Model Type: GPT-2 based language model
45
- 2. Size: 125M parameters
46
  3. Language Support: English, Hindi
47
  4. License: CC BY 4.0
48
 
@@ -52,7 +52,7 @@ Here's a brief of how the model works:
52
 
53
  1. Converts input text into tokens
54
  2. Runs autoregressive decoding on GPT-2 based transformer model and generates audio tokens
55
- 3. Decodes audio tokens (from [Kyutai/mimi](https://huggingface.co/kyutai/mimi)) to audio
56
 
57
  Please read our blog [here](#TODO) for more technical details on how it was built.
58
 
@@ -65,11 +65,11 @@ import torch
65
  import torchaudio
66
  from transformers import pipeline
67
 
 
68
  task = 'indri-tts'
69
- model_id = '11mlabs/indri-0.1-125m-tts'
70
 
71
  pipe = pipeline(
72
- task,
73
  model=model_id,
74
  device=torch.device('cuda:0'), # Update this based on your hardware,
75
  trust_remote_code=True
@@ -80,22 +80,59 @@ output = pipe(['Hi, my name is Indri and I like to talk.'])
80
  torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)
81
  ```
82
 
83
- ## Credits
84
-
85
- 1. [Kyutai/mimi](https://huggingface.co/kyutai/mimi)
86
- 2. [nanoGPT](https://github.com/karpathy/nanoGPT)
87
-
88
  ## Citation
89
 
90
- To cite our work
91
 
92
- ```
93
- @misc{indri-0.1-125m-tts,
94
  author = {11mlabs},
95
- title = {indri-0.1-125m-tts},
96
- year = 2024,
97
- publisher = {Hugging Face},
98
  journal = {GitHub Repository},
99
  howpublished = {\url{https://github.com/cmeraki/indri}},
100
  }
101
  ```
 
12
  pipeline_tag: text-to-speech
13
  ---
14
 
15
+ # Model Card for indri-0.1-124m-tts
16
 
17
+ Indri is a series of audio models that can do TTS, ASR, and audio continuation. This is the smallest model (124M) in our series and supports TTS tasks in 2 languages:
18
 
19
  1. English
20
  2. Hindi
 
29
 
30
  ### Model Description
31
 
32
+ `indri-0.1-124m-tts` is a novel, ultra-small, and lightweight TTS model based on the transformer architecture.
33
  It models audio as tokens and can generate high-quality audio with consistent style cloning of the speaker.
34
 
35
  ### Key features
 
42
  ### Details
43
 
44
  1. Model Type: GPT-2 based language model
45
+ 2. Size: 124M parameters
46
  3. Language Support: English, Hindi
47
  4. License: CC BY 4.0
48
 
 
52
 
53
  1. Converts input text into tokens
54
  2. Runs autoregressive decoding on GPT-2 based transformer model and generates audio tokens
55
+ 3. Decodes audio tokens (using [Kyutai/mimi](https://huggingface.co/kyutai/mimi)) to audio
56
 
57
  Please read our blog [here](#TODO) for more technical details on how it was built.
58
 
 
65
  import torchaudio
66
  from transformers import pipeline
67
 
68
+ model_id = '11mlabs/indri-0.1-124m-tts'
69
  task = 'indri-tts'
 
70
 
71
  pipe = pipeline(
72
+ task,
73
  model=model_id,
74
  device=torch.device('cuda:0'), # Update this based on your hardware,
75
  trust_remote_code=True
 
80
  torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)
81
  ```
82
 
83
  ## Citation
84
 
85
+ If you use this model in your research, please cite:
86
 
87
+ ```bibtex
88
+ @misc{indri-multimodal-alm,
89
  author = {11mlabs},
90
+ title = {Indri: Multimodal audio language model},
91
+ year = {2024},
92
+ publisher = {GitHub},
93
  journal = {GitHub Repository},
94
  howpublished = {\url{https://github.com/cmeraki/indri}},
95
+ email = {[email protected]}
96
+ }
97
+ ```
98
+
99
+ ## BibTex
100
+ 1. [nanoGPT](https://github.com/karpathy/nanoGPT)
101
+ 2. [Kyutai/mimi](https://huggingface.co/kyutai/mimi)
102
+ ```bibtex
103
+ @techreport{kyutai2024moshi,
104
+ title={Moshi: a speech-text foundation model for real-time dialogue},
105
+ author={Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and
106
+ Am\'elie Royer and Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
107
+ year={2024},
108
+ eprint={2410.00037},
109
+ archivePrefix={arXiv},
110
+ primaryClass={eess.AS},
111
+ url={https://arxiv.org/abs/2410.00037},
112
+ }
113
+ ```
114
+ 3. [Whisper](https://github.com/openai/whisper)
115
+ ```bibtex
116
+ @misc{radford2022whisper,
117
+ doi = {10.48550/ARXIV.2212.04356},
118
+ url = {https://arxiv.org/abs/2212.04356},
119
+ author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
120
+ title = {Robust Speech Recognition via Large-Scale Weak Supervision},
121
+ publisher = {arXiv},
122
+ year = {2022},
123
+ copyright = {arXiv.org perpetual, non-exclusive license}
124
+ }
125
+ ```
126
+ 4. [silero-vad](https://github.com/snakers4/silero-vad)
127
+ ```bibtex
128
+ @misc{Silero VAD,
129
+ author = {Silero Team},
130
+ title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
131
+ year = {2024},
132
+ publisher = {GitHub},
133
+ journal = {GitHub repository},
134
+ howpublished = {\url{https://github.com/snakers4/silero-vad}},
135
+ commit = {insert_some_commit_here},
136
+ email = {[email protected]}
137
  }
138
  ```
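
The README changed in this commit describes a three-step flow: text is converted to tokens, a GPT-2 based model autoregressively generates audio tokens, and those tokens are decoded to audio. The sketch below illustrates only the shape of that loop. It is a toy under stated assumptions, not Indri's implementation: every name in it (`toy_text_to_tokens`, `toy_next_token`, `toy_tokens_to_audio`, `AUDIO_OFFSET`, `STOP_TOKEN`) is hypothetical, the "model" is a deterministic arithmetic rule, and the "decoder" stands in for a real codec such as mimi without using its API.

```python
from typing import Callable, List

# Hypothetical vocabulary layout: text ids are 1..50, audio ids are 100..149.
AUDIO_OFFSET = 100
STOP_TOKEN = 0  # hypothetical end-of-audio marker


def toy_text_to_tokens(text: str) -> List[int]:
    """Step 1 stand-in: map each character to an id in 1..50 (not a real tokenizer)."""
    return [ord(ch) % 50 + 1 for ch in text]


def toy_next_token(context: List[int]) -> int:
    """Stand-in for the language model: a deterministic function of the context."""
    n_audio = sum(1 for t in context if t >= AUDIO_OFFSET)
    n_text = len(context) - n_audio
    # Toy stopping rule: emit two audio tokens per text token, then stop.
    if n_audio >= 2 * n_text:
        return STOP_TOKEN
    return AUDIO_OFFSET + (sum(context) * 31 + len(context)) % 50


def autoregressive_decode(
    prompt: List[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int = 64,
) -> List[int]:
    """Step 2: generate one token at a time, feeding each output back as input."""
    context = list(prompt)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == STOP_TOKEN:
            break
        generated.append(token)
        context.append(token)
    return generated


def toy_tokens_to_audio(audio_tokens: List[int], samples_per_token: int = 4) -> List[float]:
    """Step 3 stand-in for a codec decoder: expand each token into samples in [-1, 1]."""
    samples: List[float] = []
    for token in audio_tokens:
        code = token - AUDIO_OFFSET          # 0..49
        value = code / 49.0 * 2.0 - 1.0      # scale to [-1, 1]
        samples.extend([value] * samples_per_token)
    return samples


def synthesize(text: str) -> List[float]:
    """Chain the three steps: text tokens -> audio tokens -> waveform samples."""
    text_tokens = toy_text_to_tokens(text)
    audio_tokens = autoregressive_decode(text_tokens, toy_next_token)
    return toy_tokens_to_audio(audio_tokens)
```

Calling `synthesize("Hi")` runs the loop until the toy stopping rule fires and returns a short list of floats. In the actual model the next-token function would be the transformer's sampled output and the final step would be the codec decoder, neither of which is reproduced here.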