---
license: apache-2.0
language:
- en
tags:
- mechanistic interpretability
- sparse autoencoder
- llama
- llama-3
---
# Model Information
An SAE (Sparse Autoencoder) for meta-llama/Llama-3.2-1B-Instruct.
It is trained specifically on the layer-9 activations of Llama 3.2 1B and reaches a final L0 of 63 during training, i.e. on average about 63 features are active per token.
This model is used to decompose Llama's activations into interpretable features.
The SAE weights are released under the Apache 2.0 license; Llama 3.2 1B itself remains subject to Meta's Llama 3.2 Community License.
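For reference, a standard ReLU SAE (the usual architecture for releases like this; the exact variant used here isn't stated, so treat this as an assumption) computes

$$f(x) = \mathrm{ReLU}\big(W_{\text{enc}}(x - b_{\text{dec}}) + b_{\text{enc}}\big), \qquad \hat{x} = W_{\text{dec}}\, f(x) + b_{\text{dec}},$$

where $x$ is a layer-9 activation vector and $f(x)$ is the sparse feature vector; the L0 statistic is the average number of non-zero entries of $f(x)$ per token.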
# How to use
A Jupyter notebook is provided to test the model.
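Below is a minimal sketch of loading the base model, capturing layer-9 activations, and encoding them with the SAE. The `SAE` class, the dictionary size, and the weight-file name are illustrative assumptions; the provided notebook is the authoritative reference.

```python
# Minimal sketch: capture layer-9 activations from Llama-3.2-1B-Instruct
# and encode them with the SAE. The SAE architecture, d_sae, and the
# weight filename below are assumptions, not the release's exact layout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

class SAE(torch.nn.Module):
    """A standard ReLU sparse autoencoder (assumed architecture)."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = torch.nn.Parameter(torch.zeros(d_sae))
        self.W_dec = torch.nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))

    def encode(self, x):
        # Sparse feature activations: most entries are zero (L0 ~ 63/token).
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        return f @ self.W_dec + self.b_dec

# d_sae is a guess; load the released weights for the real shapes.
sae = SAE(d_model=2048, d_sae=65536)
# sae.load_state_dict(torch.load("sae_layer9.pt"))  # hypothetical filename

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
    acts = out.hidden_states[9]      # (batch, seq, d_model) at layer 9
    features = sae.encode(acts)      # interpretable feature activations
    recon = sae.decode(features)     # reconstruction of the activations

print("active features per token:", (features > 0).sum(-1).float().mean().item())
```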
# Training
Our SAE was trained on the LMSYS-Chat-1M dataset on a single RTX 3090. The training script will be provided soon in the following repository: https://github.com/qrsch/SAE
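Until the script lands, the sketch below shows a generic L1-regularized SAE training step of the kind commonly used for this setup; the hyperparameters (`l1_coeff`, learning rate, shapes) are placeholder assumptions, not the values used for this release.

```python
# Illustrative SAE training step: MSE reconstruction loss plus an L1
# sparsity penalty on feature activations. All hyperparameters are
# placeholders; the released training script is the real reference.
import torch

d_model, d_sae = 2048, 65536
W_enc = torch.nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
b_enc = torch.nn.Parameter(torch.zeros(d_sae))
W_dec = torch.nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
b_dec = torch.nn.Parameter(torch.zeros(d_model))
opt = torch.optim.Adam([W_enc, b_enc, W_dec, b_dec], lr=1e-4)
l1_coeff = 5e-4  # placeholder; in practice tuned to hit the target L0

def train_step(acts: torch.Tensor) -> dict:
    """One step on a batch of layer-9 activations, shape (batch, d_model)."""
    feats = torch.relu((acts - b_dec) @ W_enc + b_enc)
    recon = feats @ W_dec + b_dec
    mse = (recon - acts).pow(2).sum(-1).mean()
    l1 = feats.abs().sum(-1).mean()
    loss = mse + l1_coeff * l1
    opt.zero_grad()
    loss.backward()
    opt.step()
    # L0 = mean number of non-zero features per token (63 at end of training).
    return {"loss": loss.item(), "l0": (feats > 0).sum(-1).float().mean().item()}
```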
# Acknowledgements
This release wouldn't have been possible without the work of Goodfire and Anthropic.