zhengxuanzenwu
committed on
Update README.md
README.md
CHANGED
@@ -1,3 +1,36 @@
|
|
1 |
-
---
|
2 |
-
license: cc-by-4.0
|
3 |
-
1 |
+
---
|
2 |
+
license: cc-by-4.0
|
3 |
+
tags:
|
4 |
+
- ReFT
|
5 |
+
---
|
6 |
+
|
7 |
+
# 1. AxBench
|
8 |
+
|
9 |
+
AxBench evaluates interpretability methods on two tasks: concept detection and model steering. AxBench also releases two supervised dictionary learning methods that outperform existing methods, including sparse autoencoders (SAEs). These dictionaries contain 1D subspaces that map to high-level concepts.
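
To make the idea of a 1D subspace concrete, here is a minimal NumPy sketch. It is not the AxBench implementation: the dictionary is random, and the sizes, the `detect` and `steer` helpers, and the steering strength are all hypothetical. It assumes each subspace is a unit-norm direction in hidden-state space, that detection is a projection onto that direction, and that steering adds a scaled copy of the direction.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8      # hypothetical hidden size (a real model's is much larger)
n_concepts = 3   # hypothetical number of subspaces in the dictionary

# Hypothetical dictionary: one unit-norm direction per concept,
# shape (n_concepts, d_model). Real dictionaries are learned, not random.
dictionary = rng.normal(size=(n_concepts, d_model))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)


def detect(hidden, dictionary):
    """Project hidden states (seq_len, d_model) onto every direction.

    Returns an array of shape (seq_len, n_concepts) with one score
    per token and concept.
    """
    return hidden @ dictionary.T


def steer(hidden, direction, strength):
    """Add `strength` times a concept direction to every token's hidden state."""
    return hidden + strength * direction


# Stand-in for residual-stream activations of a 5-token sequence.
hidden = rng.normal(size=(5, d_model))

scores_before = detect(hidden, dictionary)
steered = steer(hidden, dictionary[1], strength=4.0)
scores_after = detect(steered, dictionary)

# Because the direction has unit norm, the score for concept 1
# rises by `strength` (up to floating-point error) on every token.
shift = scores_after[:, 1] - scores_before[:, 1]
print(shift)
```

In this toy setup the same direction serves both tasks: reading the projection gives a detection score, and adding the direction moves that score by a controlled amount.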
|
10 |
+
|
11 |
+
# 2. What is `gemma-reft-2b-it-res`?
|
12 |
+
|
13 |
+
- `gemma-`: Refers to the Gemma 2 model family.
|
14 |
+
- `reft-`: The dictionary is trained with representation finetuning (ReFT); see the [ReFT paper](https://arxiv.org/abs/2404.03592) for details.
|
15 |
+
- `2b-it-`: The dictionary is for the Gemma 2 2B instruction-tuned model.
|
16 |
+
- `res`: The dictionary is trained on the model's residual stream.
|
17 |
+
- We release the weights as well as the annotated concepts for all subspaces.
|
18 |
+
|
19 |
+
# 3. How can I use these dictionaries straight away?
|
20 |
+
|
21 |
+
```python
import pyvene as pv
```
|
25 |
+
|
26 |
+
# 4. Point of Contact
|
27 |
+
|
28 |
+
Point of contact: Zhengxuan Wu or Aryaman Arora
|
29 |
+
|
30 |
+
Contact by email:
|
31 |
+
|
32 |
+
{wuzhengx, aryamana}@stanford.edu
|
33 |
+
|
34 |
+
# 5. Citation
|
35 |
+
|
36 |
+
Paper:
|