---
license: apache-2.0
language:
- en
tags:
- mechanistic interpretability
- sparse autoencoder
- llama
- llama-3
---
# Model Information
An SAE (Sparse Autoencoder) for meta-llama/Llama-3.2-1B-Instruct.
It is trained specifically on the layer-9 activations of Llama 3.2 1B and reaches a final L0 of 63 during training, i.e. on average about 63 features are active per token.
This model is used to decompose Llama's activations into interpretable features.
The SAE weights are released under the Apache 2.0 license; Llama 3.2 1B itself remains subject to Meta's Llama 3.2 Community License.
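For reference, a standard ReLU SAE (the usual architecture for releases like this; the exact variant used here isn't stated, so treat this as an assumption) computes

$$f(x) = \mathrm{ReLU}\big(W_{\text{enc}}(x - b_{\text{dec}}) + b_{\text{enc}}\big), \qquad \hat{x} = W_{\text{dec}}\, f(x) + b_{\text{dec}},$$

where $x$ is a layer-9 activation vector and $f(x)$ is the sparse feature vector; the L0 statistic is the average number of non-zero entries of $f(x)$ per token.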
# How to use
A Jupyter notebook is provided to test the model.
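Below is a minimal sketch of loading the base model, capturing layer-9 activations, and encoding them with the SAE. The `SAE` class, the dictionary size, and the weight-file name are illustrative assumptions; the provided notebook is the authoritative reference.

```python
# Minimal sketch: capture layer-9 activations from Llama-3.2-1B-Instruct
# and encode them with the SAE. The SAE architecture, d_sae, and the
# weight filename below are assumptions, not the release's exact layout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

class SAE(torch.nn.Module):
    """A standard ReLU sparse autoencoder (assumed architecture)."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = torch.nn.Parameter(torch.zeros(d_sae))
        self.W_dec = torch.nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))

    def encode(self, x):
        # Sparse feature activations: most entries are zero (L0 ~ 63/token).
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        return f @ self.W_dec + self.b_dec

# d_sae is a guess; load the released weights for the real shapes.
sae = SAE(d_model=2048, d_sae=65536)
# sae.load_state_dict(torch.load("sae_layer9.pt"))  # hypothetical filename

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
    acts = out.hidden_states[9]      # (batch, seq, d_model) at layer 9
    features = sae.encode(acts)      # interpretable feature activations
    recon = sae.decode(features)     # reconstruction of the activations

print("active features per token:", (features > 0).sum(-1).float().mean().item())
```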
# Training
Our SAE was trained on the LMSYS-Chat-1M dataset on a single RTX 3090. The training script will be provided soon in the following repository: https://github.com/qrsch/SAE
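Until the script lands, the sketch below shows a generic L1-regularized SAE training step of the kind commonly used for this setup; the hyperparameters (`l1_coeff`, learning rate, shapes) are placeholder assumptions, not the values used for this release.

```python
# Illustrative SAE training step: MSE reconstruction loss plus an L1
# sparsity penalty on feature activations. All hyperparameters are
# placeholders; the released training script is the real reference.
import torch

d_model, d_sae = 2048, 65536
W_enc = torch.nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
b_enc = torch.nn.Parameter(torch.zeros(d_sae))
W_dec = torch.nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
b_dec = torch.nn.Parameter(torch.zeros(d_model))
opt = torch.optim.Adam([W_enc, b_enc, W_dec, b_dec], lr=1e-4)
l1_coeff = 5e-4  # placeholder; in practice tuned to hit the target L0

def train_step(acts: torch.Tensor) -> dict:
    """One step on a batch of layer-9 activations, shape (batch, d_model)."""
    feats = torch.relu((acts - b_dec) @ W_enc + b_enc)
    recon = feats @ W_dec + b_dec
    mse = (recon - acts).pow(2).sum(-1).mean()
    l1 = feats.abs().sum(-1).mean()
    loss = mse + l1_coeff * l1
    opt.zero_grad()
    loss.backward()
    opt.step()
    # L0 = mean number of non-zero features per token (63 at end of training).
    return {"loss": loss.item(), "l0": (feats > 0).sum(-1).float().mean().item()}
```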
# Acknowledgements
This release wouldn't have been possible without the work of Goodfire and Anthropic.