---
license: cc-by-4.0
---


## ESPnet2 ENH model

### `kohei0209/tfgridnet_urgent25`

This model was trained by Kohei Saijo using the [urgent25](https://github.com/kohei0209/espnet/tree/urgent2025/egs2/urgent25/enh1) recipe based on [espnet](https://github.com/espnet/espnet/).

Note that **the recipe has not been merged into the ESPnet main branch yet; the code lives in the [fork repository](https://github.com/kohei0209/espnet/tree/urgent2025/egs2/urgent25/enh1)**.

This model is provided as a pre-trained baseline for the [URGENT 2025 Challenge](https://urgent-challenge.github.io/urgent2025).

### Demo: How to use in ESPnet2

Follow the [ESPnet installation instructions](https://espnet.github.io/espnet/installation.html)
if you haven't done that already.

<!--
```bash
cd espnet
pip install -e .
cd egs2/urgent25/enh1
./run.sh --skip_data_prep false --skip_train true --is_tse_task true --download_model kohei0209/tfgridnet_urgent25
```

To use the model in the Python interface, you could use the following code:
> Please make sure you are using the latest ESPnet by installing from the source:
> ```
> python -m pip install git+https://github.com/espnet/espnet
> ```
-->

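The following Python snippet downloads the model from the Hugging Face Hub and enhances a single noisy recording: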
```python
import soundfile as sf
from espnet2.bin.enh_inference import SeparateSpeech

# For model downloading + loading
model = SeparateSpeech.from_pretrained(
    model_tag="kohei0209/tfgridnet_urgent25",
    normalize_output_wav=True,
    device="cuda",
)
# For loading a downloaded model
# model = SeparateSpeech(
#     train_config="exp/xxx/config.yaml",
#     model_file="exp/xx/valid.loss.best.pth",
#     normalize_output_wav=True,
#     device="cuda",
# )
audio, fs = sf.read("/path/to/noisy/utt1.flac")
enhanced = model(audio[None, :], fs=fs)[0]
```
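
If you want to enhance a whole folder of recordings, a loop like the one below works as well. This is only a minimal sketch building on the snippet above (`model` is the `SeparateSpeech` instance created there); the input/output directories are placeholders, not paths from the recipe.

```python
from pathlib import Path

import numpy as np
import soundfile as sf

in_dir = Path("/path/to/noisy")      # placeholder: directory with noisy recordings
out_dir = Path("/path/to/enhanced")  # placeholder: where to write the enhanced audio
out_dir.mkdir(parents=True, exist_ok=True)

for wav_path in sorted(in_dir.glob("*.flac")):
    audio, fs = sf.read(str(wav_path))
    # SeparateSpeech expects a (batch, num_samples) array; `fs` tells it the input rate.
    enhanced = model(audio[None, :], fs=fs)[0]
    # Squeeze in case the output keeps the batch dimension, then write at the original rate.
    sf.write(str(out_dir / wav_path.name), np.squeeze(enhanced), fs)
```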

## ENH config

<details><summary>expand</summary>

```
config: conf/tuning/train_enh_tfgridnet_dm.yaml
print_config: false
log_level: INFO
drop_last_iter: false
dry_run: false
iterator_type: chunk
valid_iterator_type: null
output_dir: exp/enh_train_enh_tfgridnet_dm_raw
ngpu: 1
seed: 0
num_workers: 4
num_att_plot: 3
dist_backend: nccl
dist_init_method: env://
dist_world_size: null
dist_rank: null
local_rank: 0
dist_master_addr: null
dist_master_port: null
dist_launcher: null
multiprocessing_distributed: false
unused_parameters: false
sharded_ddp: false
use_deepspeed: false
deepspeed_config: null
cudnn_enabled: true
cudnn_benchmark: false
cudnn_deterministic: true
use_tf32: false
collect_stats: false
write_collected_feats: false
max_epoch: 30
patience: 5
val_scheduler_criterion:
- valid
- loss
early_stopping_criterion:
- valid
- loss
- min
best_model_criterion:
- - valid
  - loss
  - min
keep_nbest_models: 5
nbest_averaging_interval: 0
grad_clip: 1.0
grad_clip_type: 2.0
grad_noise: false
accum_grad: 1
no_forward_run: false
resume: true
train_dtype: float32
use_amp: false
log_interval: null
use_matplotlib: true
use_tensorboard: true
create_graph_in_tensorboard: false
use_wandb: false
wandb_project: null
wandb_id: null
wandb_entity: null
wandb_name: null
wandb_model_log_interval: -1
detect_anomaly: false
use_adapter: false
adapter: lora
save_strategy: all
adapter_conf: {}
pretrain_path: null
init_param:
- exp/enh_train_enh_tfgridnet_raw_1stchallenge/21epoch.pth
ignore_init_mismatch: false
freeze_param: []
num_iters_per_epoch: 4000
batch_size: 2
valid_batch_size: 4
batch_bins: 1000000
valid_batch_bins: null
category_sample_size: 10
train_shape_file:
- exp/enh_stats_16k/train/speech_mix_shape
- exp/enh_stats_16k/train/speech_ref1_shape
valid_shape_file:
- exp/enh_stats_16k/valid/speech_mix_shape
- exp/enh_stats_16k/valid/speech_ref1_shape
batch_type: folded
valid_batch_type: null
fold_length:
- 80000
- 80000
sort_in_batch: descending
shuffle_within_batch: false
sort_batch: descending
multiple_iterator: false
chunk_length: 200
chunk_shift_ratio: 0.5
num_cache_chunks: 128
chunk_excluded_key_prefixes: []
chunk_default_fs: 50
chunk_max_abs_length: 144000
chunk_discard_short_samples: true
train_data_path_and_name_and_type:
- - dump/raw/speech_train_track1/wav.scp
  - speech_mix
  - sound
- - dump/raw/speech_train_track1/spk1.scp
  - speech_ref1
  - sound
- - dump/raw/speech_train_track1/utt2category
  - category
  - text
- - dump/raw/speech_train_track1/utt2fs
  - fs
  - text_int
valid_data_path_and_name_and_type:
- - dump/raw/validation/wav.scp
  - speech_mix
  - sound
- - dump/raw/validation/spk1.scp
  - speech_ref1
  - sound
- - dump/raw/validation/utt2category
  - category
  - text
- - dump/raw/validation/utt2fs
  - fs
  - text_int
multi_task_dataset: false
allow_variable_data_keys: false
max_cache_size: 0.0
max_cache_fd: 32
allow_multi_rates: true
valid_max_cache_size: null
exclude_weight_decay: false
exclude_weight_decay_conf: {}
optim: adam
optim_conf:
  lr: 0.0001
  eps: 1.0e-08
  weight_decay: 1.0e-05
scheduler: warmupsteplr
scheduler_conf:
  step_size: 1
  gamma: 0.98
  warmup_steps: 4000
init: null
model_conf:
  normalize_variance_per_ch: true
  categories:
  - 1ch_8000Hz
  - 1ch_16000Hz
  - 1ch_22050Hz
  - 1ch_24000Hz
  - 1ch_32000Hz
  - 1ch_44100Hz
  - 1ch_48000Hz
  - 1ch_8000Hz_reverb
  - 1ch_16000Hz_reverb
  - 1ch_22050Hz_reverb
  - 1ch_24000Hz_reverb
  - 1ch_32000Hz_reverb
  - 1ch_44100Hz_reverb
  - 1ch_48000Hz_reverb
criterions:
- name: mr_l1_tfd
  conf:
    window_sz:
    - 256
    - 512
    - 768
    - 1024
    hop_sz: null
    eps: 1.0e-08
    time_domain_weight: 0.5
    normalize_variance: true
  wrapper: fixed_order
  wrapper_conf:
    weight: 1.0
- name: si_snr
  conf:
    eps: 1.0e-07
  wrapper: fixed_order
  wrapper_conf:
    weight: 0.0
speech_volume_normalize: null
rir_scp: null
rir_apply_prob: 1.0
noise_scp: null
noise_apply_prob: 1.0
noise_db_range: '13_15'
short_noise_thres: 0.5
use_reverberant_ref: false
num_spk: 1
num_noise_type: 1
sample_rate: 8000
force_single_channel: false
channel_reordering: false
categories: []
speech_segment: null
avoid_allzero_segment: true
flexible_numspk: false
dynamic_mixing: false
utt2spk: null
dynamic_mixing_gain_db: 0.0
encoder: stft
encoder_conf:
  n_fft: 256
  hop_length: 128
  use_builtin_complex: true
  default_fs: 8000
separator: tfgridnetv3
separator_conf:
  n_srcs: 1
  n_imics: 1
  n_layers: 6
  lstm_hidden_units: 200
  attn_n_head: 4
  attn_qk_output_channel: 2
  emb_dim: 48
  emb_ks: 4
  emb_hs: 1
  activation: prelu
  eps: 1.0e-05
decoder: stft
decoder_conf:
  n_fft: 256
  hop_length: 128
  default_fs: 8000
mask_module: multi_mask
mask_module_conf: {}
preprocessor: enh
preprocessor_conf:
  speech_volume_normalize: 0.5_1.0
  rir_scp: dump/raw/rir_train.scp
  rir_apply_prob: 0.5
  noise_scp: dump/raw/noise_train.scp
  noise_apply_prob: 1.0
  noise_db_range: '-5_15'
  force_single_channel: true
  channel_reordering: true
  categories:
  - 1ch_8000Hz
  - 1ch_16000Hz
  - 1ch_22050Hz
  - 1ch_24000Hz
  - 1ch_32000Hz
  - 1ch_44100Hz
  - 1ch_48000Hz
  - 1ch_8000Hz_reverb
  - 1ch_16000Hz_reverb
  - 1ch_22050Hz_reverb
  - 1ch_24000Hz_reverb
  - 1ch_32000Hz_reverb
  - 1ch_44100Hz_reverb
  - 1ch_48000Hz_reverb
  data_aug_effects:
  - - 1.0
    - bandwidth_limitation
    - res_type: random
  - - 1.0
    - clipping
    - min_quantile: 0.1
      max_quantile: 0.9
  - - 1.0
    - - - 0.5
        - codec
        - format: mp3
          encoder: null
          qscale:
          - 1
          - 10
      - - 0.5
        - codec
        - format: ogg
          encoder:
          - vorbis
          - opus
          qscale:
          - -1
          - 10
  - - 1.0
    - packet_loss
    - packet_duration_ms: 20
      packet_loss_rate:
      - 0.05
      - 0.25
      max_continuous_packet_loss: 10
  data_aug_num:
  - 1
  - 3
  data_aug_prob: 0.75
diffusion_model: null
diffusion_model_conf: {}
required:
- output_dir
version: '202409'
distributed: false

```

</details>

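
The same settings can also be read back programmatically. The snippet below is only an illustrative sketch (not part of the recipe) that parses the training config with PyYAML and prints a few key entries; the path is a placeholder for wherever the `config.yaml` of the downloaded snapshot lives.

```python
# Illustrative sketch: inspect the training configuration shown above.
import yaml

# Placeholder path; point it at the config.yaml that comes with the model snapshot.
with open("exp/enh_train_enh_tfgridnet_dm_raw/config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["separator"], cfg["separator_conf"]["n_layers"])  # tfgridnetv3, 6 blocks
print(cfg["encoder"], cfg["encoder_conf"]["n_fft"])         # stft, 256-point FFT
print([c["name"] for c in cfg["criterions"]])               # ['mr_l1_tfd', 'si_snr']
```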

### Citing ESPnet

```BibTex
@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
@inproceedings{ESPnet-SE,
  author    = {Chenda Li and Jing Shi and Wangyou Zhang and Aswin Shanmugam Subramanian and Xuankai Chang and
               Naoyuki Kamo and Moto Hira and Tomoki Hayashi and Christoph B{\"{o}}ddeker and Zhuo Chen and Shinji Watanabe},
  title     = {ESPnet-SE: End-To-End Speech Enhancement and Separation Toolkit Designed for {ASR} Integration},
  booktitle = {{IEEE} Spoken Language Technology Workshop, {SLT} 2021, Shenzhen, China, January 19-22, 2021},
  pages     = {785--792},
  publisher = {{IEEE}},
  year      = {2021},
  url       = {https://doi.org/10.1109/SLT48900.2021.9383615},
  doi       = {10.1109/SLT48900.2021.9383615},
  timestamp = {Mon, 12 Apr 2021 17:08:59 +0200},
  biburl    = {https://dblp.org/rec/conf/slt/Li0ZSCKHHBC021.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

or arXiv:

```bibtex
@misc{watanabe2018espnet,
  title={ESPnet: End-to-End Speech Processing Toolkit},
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  year={2018},
  eprint={1804.00015},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```