|
--- |
|
license: mit |
|
tags: |
|
- AudioMAE |
|
- PyTorch |
|
--- |
|
|
|
# AudioMAE |
|
This model card provides an easy-to-use API for the *pretrained AudioMAE encoder* [1], with weights taken from [the original repository](https://github.com/facebookresearch/AudioMAE).
|
The model is designed to make it easy to obtain learned representations from an input audio file.
|
The resulting representation $z$ has a dimension of $(d, h, w)$ for a single audio file, where $d$, $h$, and $w$ denote the latent dimension size, the latent frequency dimension, and the latent temporal dimension, respectively.
|
As noted in [2], both the frequency and temporal semantics of the input are preserved in $z$.
|
|
|
# Dependencies
|
See `requirements.txt`.
|
|
|
|
|
# Usage |
|
```python |
|
from transformers import AutoModel |
|
|
|
device = 'cpu' # 'cpu' or 'cuda' |
|
model = AutoModel.from_pretrained("hance-ai/audiomae", trust_remote_code=True).to(device) # load the pretrained model |
|
z = model('path/audio_fname.wav') # (768, 8, 64) = (latent_dim_size, latent_freq_dim, latent_temporal_dim) |
|
``` |
|
|
|
Depending on the task, a different pooling strategy over $z$ may be appropriate.

For instance, global average pooling can be used for a classification task, while [2] uses adaptive pooling.
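
A minimal sketch of both strategies (assuming the returned `z` is a `torch.Tensor` of shape `(768, 8, 64)`; the adaptive target size below is illustrative, not the one used in [2]):

```python
import torch
import torch.nn.functional as F

z = model('path/audio_fname.wav')  # (768, 8, 64)

# Global average pooling over frequency and time -> a single 768-dim
# vector, e.g., as input to a classification head.
z_clf = z.mean(dim=(1, 2))  # (768,)

# Adaptive average pooling resizes the latent grid to a target shape
# while preserving the frequency/temporal layout.
z_pooled = F.adaptive_avg_pool2d(z.unsqueeze(0), output_size=(8, 16)).squeeze(0)  # (768, 8, 16)
```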
|
|
|
⚠️ AudioMAE accepts audio with a maximum length of 10 s (as described in [1]). Any audio longer than 10 s is truncated to 10 s; the excess is discarded.
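
If representations for longer recordings are needed, one workaround (a sketch, not part of the provided API; it assumes `torchaudio` is installed and that the model accepts any wav file path, as in the usage example above) is to split the audio into 10 s chunks and encode each chunk separately:

```python
import tempfile

import torch
import torchaudio

def encode_long_audio(model, path, chunk_sec=10):
    """Encode audio longer than 10 s by encoding consecutive 10 s chunks."""
    wav, sr = torchaudio.load(path)  # (channels, samples)
    chunk_len = chunk_sec * sr
    zs = []
    for start in range(0, wav.shape[1], chunk_len):
        chunk = wav[:, start:start + chunk_len]
        # The model takes a file path, so each chunk is written to a
        # temporary wav first (POSIX-style sketch). The last chunk may be
        # shorter than 10 s; it is assumed the model handles that internally.
        with tempfile.NamedTemporaryFile(suffix='.wav') as f:
            torchaudio.save(f.name, chunk, sr)
            zs.append(model(f.name))  # (768, 8, 64) per chunk
    return torch.stack(zs)  # (n_chunks, 768, 8, 64)
```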
|
|
|
|
|
# Sanity Check Result |
|
In the following, the spectrogram of an input audio clip and the corresponding $z$ are visualized.

The input audio is 10 s long, containing baby coughing, hiccuping, and adult sneezing.

For visualization, the latent dimension size of $z$ is reduced to 8 using PCA.
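
A minimal sketch of that reduction (assuming scikit-learn is available and `z` is the `(768, 8, 64)` tensor from the usage example): each of the 8×64 latent positions is treated as a sample, and the 768-dim channel axis is projected onto 8 principal components.

```python
import numpy as np
from sklearn.decomposition import PCA

z_np = z.detach().cpu().numpy()  # (768, 8, 64)
d, h, w = z_np.shape

# Treat each latent position as one sample with d features: (h*w, d).
samples = z_np.reshape(d, h * w).T  # (512, 768)
pcs = PCA(n_components=8).fit_transform(samples)  # (512, 8)

# Reshape back to the latent grid: one (h, w) map per principal component.
pc_maps = pcs.T.reshape(8, h, w)  # (8, 8, 64)
```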
|
|
|
<p align="center"> |
|
<img src=".fig/sanity_check_result_audiomae.png" alt="" width=100%> |
|
</p> |
|
|
|
The result shows that the presence of the labeled sounds is clearly captured in the 3rd principal component (PC).

While the baby coughing and hiccuping sounds are hardly distinguishable up to the 5th PC, they become distinguishable in the 6th PC.

This result briefly demonstrates the effectiveness of the pretrained AudioMAE.
|
|
|
|
|
# References |
|
|
|
[1] Huang, Po-Yao, et al. "Masked autoencoders that listen." Advances in Neural Information Processing Systems 35 (2022): 28708-28720. |
|
|
|
[2] Liu, Haohe, et al. "AudioLDM 2: Learning holistic audio generation with self-supervised pretraining." IEEE/ACM Transactions on Audio, Speech, and Language Processing (2024).
|
|