AxBench Release
Collection
Open supervised dictionary learning models and datasets for Gemma 2 2B and 9B.
AxBench evaluates interpretability methods on two tasks: concept detection and model steering. The AxBench release includes two supervised dictionary learning methods that outperform existing methods, including sparse autoencoders (SAEs). These dictionaries contain 1D subspaces that map to high-level concepts.
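To make the idea of a 1D subspace concrete, here is a minimal NumPy sketch of how a single direction could be used for both concept detection and steering. It is an illustration only, not the AxBench or ReFT implementation: the function names `detect` and `steer`, the hidden size `d_model`, and the random vectors standing in for a learned direction and for residual-stream activations are all assumptions made for this example.

```python
import numpy as np


def detect(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Score each token by projecting its hidden state onto the unit direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden @ unit


def steer(hidden: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Shift every token's hidden state by `strength` along the unit direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden + strength * unit


rng = np.random.default_rng(0)
d_model = 16                              # hypothetical hidden size
direction = rng.normal(size=d_model)      # stand-in for one learned 1D subspace
hidden = rng.normal(size=(4, d_model))    # stand-in for 4 tokens of activations

scores_before = detect(hidden, direction)
steered = steer(hidden, direction, strength=5.0)
scores_after = detect(steered, direction)

# Steering along a unit direction raises each token's score by the strength.
print(scores_after - scores_before)
```

In this toy setup, detection is a dot product and steering is a vector addition, so steering with strength 5.0 raises every token's detection score by 5.0. The released dictionaries may define their interventions differently; see the paper for the actual formulation.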
gemma-reft-9b-it-res
gemma-: Refers to the Gemma 2 models.
reft-: The dictionary learning model is trained with representation finetuning (ReFT); see the ReFT paper for details.
9b-it-: The dictionary is for the Gemma 2 9B instruction-tuned model.
res: The dictionary is trained on the model's residual stream.
import pyvene as pv
Point of contact: Zhengxuan Wu or Aryaman Arora
Contact by email: {wuzhengx, aryamana}@stanford.edu
Paper: