This repository contains sparse autoencoders (SAEs) trained to analyze the internal representations of the Llama 3.1 8B Instruct model. The autoencoders are trained on residual stream activations collected while the model processes code-related instruction data.

We apply these specialized, lightweight SAEs to a coding task in our blog post Sieve.

Model Details

  • Model Type: TopK Sparse Autoencoder
  • Base Model: Llama 3.1 8B Instruct
  • Training Data: 1B tokens of code data from:
    • StackOverflow Python dataset
    • Tested-143k Python Alpaca dataset
  • Architecture: Linear encoder-decoder with ReLU and TopK activation (k=64, 512)
  • File Format: PyTorch .pt files containing the following tensors (loaded in the sketch after this list):
    • W_enc_DF: Encoder weight matrix
    • b_enc_F: Encoder bias vector
    • W_dec_FD: Decoder weight matrix
    • b_dec_D: Decoder bias vector
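
A minimal loading and forward-pass sketch, assuming the tensor names above and a standard TopK SAE forward pass. The checkpoint filename is a placeholder, and the exact pre-processing (e.g. subtracting b_dec_D before encoding) is an assumption based on common TopK SAE implementations, not confirmed by this card:

```python
import torch

# Placeholder filename -- substitute an actual checkpoint from this repo.
state = torch.load("sae_layer_12.pt", map_location="cpu")

W_enc = state["W_enc_DF"]  # (d_model, n_features) encoder weights
b_enc = state["b_enc_F"]   # (n_features,) encoder bias
W_dec = state["W_dec_FD"]  # (n_features, d_model) decoder weights
b_dec = state["b_dec_D"]   # (d_model,) decoder bias

def encode(x, k=64):
    # ReLU pre-activations, then keep only the k largest features per token.
    pre = torch.relu((x - b_dec) @ W_enc + b_enc)
    vals, idx = torch.topk(pre, k=k, dim=-1)
    acts = torch.zeros_like(pre)
    acts.scatter_(-1, idx, vals)
    return acts

def decode(acts):
    # Linear decoder: reconstruct the residual stream from sparse features.
    return acts @ W_dec + b_dec
```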

Usage

The autoencoders can be used to analyze and interpret the internal representations that Llama 3.1 8B Instruct forms when processing code. Because they were trained on a narrow, code-specific data mixture, they are not recommended for general-purpose use; their intended use is reproducing the Sieve evaluation results for Llama 3.1 8B Instruct.
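
A minimal end-to-end sketch using the Hugging Face transformers API and the encode/decode helpers from the loading sketch above. Treating hidden_states[12] as the layer-12 residual stream point these SAEs were trained on is an assumption; consult the Sieve repo for the canonical setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Write a Python function that reverses a string."
inputs = tokenizer(prompt, return_tensors="pt")

# output_hidden_states returns the residual stream after the embedding
# (index 0) and after each transformer layer; index 12 is the
# post-layer-12 stream.
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
resid = out.hidden_states[12].float()  # (batch, seq, d_model)

feats = encode(resid)   # sparse SAE features (see loading sketch above)
recon = decode(feats)   # SAE reconstruction of the residual stream
mse = (recon - resid).pow(2).mean()
```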

Full example usage can be found in the Sieve repo.

Training Details

  • Training Data Size: 1B tokens
  • Domain: Python code and code-related instructions
  • Target: Residual stream activations from layers 8, 10, and 12 of Llama 3.1 8B Instruct
  • Compute: Approximately 9 A100 GPU-hours

License

MIT

Citation

If you use these models in your research, please cite:

@article{karvonen2024sieve,
    title={Sieve: SAEs Beat Baselines on a Real-World Task (A Code Generation Case Study)},
    author={Karvonen, Adam and Pai, Dhruv and Wang, Mason and Keigwin, Ben},
    journal={Tilde Research Blog},
    year={2024},
    month={12},
    url={https://www.tilderesearch.com/blog/sieve},
    note={Blog post}
}