---
license: mit
tags:
- AudioMAE
- PyTorch
---

# AudioMAE

This model card provides an easy-to-use API for a pretrained AudioMAE [1] whose weights come from [its original repository](https://github.com/facebookresearch/AudioMAE). The provided model is designed to make it easy to obtain learned representations from an input audio file.

The resulting representation $z$ has shape $(d, h, w)$ for a single audio file, where $d$, $h$, and $w$ denote the latent dimension size, the latent frequency dimension, and the latent temporal dimension, respectively. [2] indicates that both the frequency and temporal semantics are preserved in $z$.

# Usage

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("hance-ai/audiomae")  # load the pretrained model
z = model('path/audio_fname.wav')  # (768, 8, 64) = (latent_dim_size, latent_freq_dim, latent_temporal_dim)
```

Depending on the task, a different pooling strategy should be applied to $z$. For instance, global average pooling can be used for a classification task, while [2] uses adaptive pooling (a minimal pooling sketch is given at the end of this card).

⚠️ AudioMAE accepts audio with a maximum length of 10s (as described in [1]). Any audio longer than 10s is clipped to 10s, meaning the excess beyond 10s is discarded.

# Sanity Check Result

In the following, a spectrogram of an input audio clip and the corresponding $z$ are visualized. The input audio is 10s long and contains baby coughing, hiccuping, and adult sneezing. For visualization, the latent dimension size of $z$ is reduced to 8 using PCA.
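The PCA reduction used for this visualization can be reproduced roughly as follows. This is a minimal sketch, not the exact plotting code; it assumes `z` is the `(768, 8, 64)` tensor returned by the model above and uses scikit-learn's `PCA`.

```python
from sklearn.decomposition import PCA

# z: (768, 8, 64) = (latent_dim, freq_dim, time_dim), e.g. z = model('path/audio_fname.wav')
# Assuming z is a torch.Tensor; convert it to a NumPy array first.
z_np = z.detach().cpu().numpy()
d, h, w = z_np.shape

# Treat every (freq, time) position as one sample with a d-dimensional feature vector.
x = z_np.reshape(d, h * w).T                      # (h*w, d)

# Reduce the latent dimension from d=768 to 8 components for visualization.
x_reduced = PCA(n_components=8).fit_transform(x)  # (h*w, 8)
z_reduced = x_reduced.T.reshape(8, h, w)          # (8, h, w)
```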
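For the pooling step mentioned in the Usage section, the snippet below is a minimal sketch of global average pooling over the latent frequency and temporal dimensions, with adaptive pooling shown as an alternative. The classifier head and the number of classes are illustrative assumptions, not part of this repository.

```python
import torch.nn as nn

# z: (768, 8, 64) latent from the model above (assumed to be a torch.Tensor);
# add a batch dimension so the shapes below read (batch, latent, freq, time).
z_b = z.unsqueeze(0)                              # (1, 768, 8, 64)

# Global average pooling over frequency and time -> one 768-dim vector per clip.
pooled = z_b.mean(dim=(2, 3))                     # (1, 768)

# A linear classification head on top of the pooled vector (num_classes is hypothetical).
num_classes = 10
logits = nn.Linear(768, num_classes)(pooled)      # (1, num_classes)

# Alternatively, adaptive average pooling keeps a coarse (freq, time) grid, e.g. 2x2.
adaptive = nn.AdaptiveAvgPool2d((2, 2))(z_b)      # (1, 768, 2, 2)
```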