Commit a7b49bf by RichardErkhov (verified) · parent: d21d873

uploaded readme

Files changed (1): README.md (+110 lines)

Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


activation-beacon-llama2-7b-chat - bnb 4bits
- Model creator: https://huggingface.co/namespace-Pt/
- Original model: https://huggingface.co/namespace-Pt/activation-beacon-llama2-7b-chat/

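Since this repository hosts the bitsandbytes 4-bit quantization, a minimal loading sketch may be useful. The repository id below is hypothetical (substitute the actual id of this quantized repo), and it assumes `bitsandbytes` and `accelerate` are installed so the pre-quantized weights load through the standard `transformers` path:

```python
# Minimal sketch for loading the bnb 4-bit quantization.
# NOTE: the repository id is an assumption -- replace it with the id of this
# quantized repo. Requires `bitsandbytes` and `accelerate` to be installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

quant_id = "RichardErkhov/activation-beacon-llama2-7b-chat-4bits"  # hypothetical id

tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)
# Weights are already serialized in 4-bit, so no quantization_config is passed;
# device_map="auto" places the quantized layers on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    quant_id,
    trust_remote_code=True,
    device_map="auto",
)
model.eval()
```
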
Original model description:
---
license: mit
datasets:
- togethercomputer/RedPajama-Data-1T-Sample
- Yukang/LongAlpaca-12k
pipeline_tag: text-generation
---
<div align="center">
<h1>Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon</h1>

[<a href="https://arxiv.org/abs/2401.03462">Paper</a>] [<a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon">Github</a>]

<img src="imgs/impress.png" width="80%" class="center">
</div>

We introduce Activation Beacon, an effective, efficient, compatible, and low-cost (in training) method to extend the context length of LLMs by **100x**. Currently, Activation Beacon is only applied to [Llama-2-chat-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf); more LLMs will be supported in the future.

## Features
- **Effectiveness**
  - Significantly improves the performance of Llama-2 on long-context generation (language modeling) and long-context understanding (e.g. long-document QA).
- **Efficiency**
  - Low memory usage; low inference latency (competitive with FlashAttention2); inference latency grows linearly with the input length.
- **Compatibility**
  - Preserves the short-context capability of Llama-2;
  - can be combined with context window extension techniques for further context extension (e.g. 1M with NTK-Aware);
  - can be combined with retrieval for higher memory accuracy (*ongoing*).
- **Low-Cost Training**
  - Trains on 80,000 texts within 9 hours;
  - most training samples are shorter than 4096 tokens.

## Environment
The main dependencies are:
```
pytorch==2.1.2 transformers==4.36.1 accelerate==0.25.0 datasets==2.14.7 numpy==1.26.2 flash-attn==2.4.2
```

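A quick sanity check of the pinned versions (a convenience sketch, not part of the original environment notes):

```python
# Print installed versions to compare against the pins listed above.
import torch, transformers, accelerate, datasets, numpy

for name, module in [("pytorch", torch), ("transformers", transformers),
                     ("accelerate", accelerate), ("datasets", datasets),
                     ("numpy", numpy)]:
    print(f"{name}: {module.__version__}")
```
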
## Usage
```python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "namespace-Pt/activation-beacon-llama2-7b-chat"

# trust_remote_code is required because the beacon mechanism lives in custom modeling code
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.bfloat16)

model = model.cuda().eval()

with torch.no_grad():
    # short context
    text = "Tell me about yourself."
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(f"Input Length: {inputs['input_ids'].shape[1]}")
    print(f"Output: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")

    # reset memory before a new generation task
    model.memory.reset()

    # long context
    with open("data/toy/narrativeqa.json", encoding="utf-8") as f:
        example = json.load(f)
    inputs = tokenizer(example["context"], return_tensors="pt").to("cuda")
    # slice off the prompt tokens so only the newly generated answer is decoded
    outputs = model.generate(**inputs, do_sample=False, top_p=1, temperature=1, max_new_tokens=20)[:, inputs["input_ids"].shape[1]:]
    print("*" * 20)
    print(f"Input Length: {inputs['input_ids'].shape[1]}")
    print(f"Answer: {example['answer']}")
    print(f"Prediction: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
```
**NOTE**: It is okay to see warnings like `This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.` Just ignore them.

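When running several independent tasks back to back, the generate-then-reset pattern above can be wrapped in a small helper. This is only a convenience sketch built from the calls shown in the snippet (`model.generate`, `model.memory.reset()`), not an API of the model itself:

```python
# Convenience wrapper around the pattern above: generate for one task, then
# clear the beacon memory so the next task starts from a clean state.
def generate_once(model, tokenizer, text, **gen_kwargs):
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)
    # decode only the newly generated tokens
    answer = tokenizer.decode(
        outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    model.memory.reset()
    return answer

# example: two unrelated tasks, each starting with fresh memory
print(generate_once(model, tokenizer, "Tell me about yourself.", max_new_tokens=20))
print(generate_once(model, tokenizer, "Summarize the following paragraph: ...", max_new_tokens=20))
```
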
## Training
*coming soon*

## Evaluation
See the [evaluation section](https://github.com/FlagOpen/FlagEmbedding/blob/master/Long_LLM/activation_beacon/docs/evaluation.md).

## Citation
If you find this model useful, please give us a like ❤️.

To cite our work:
```
@misc{zhang2024soaring,
      title={Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon},
      author={Peitian Zhang and Zheng Liu and Shitao Xiao and Ninglu Shao and Qiwei Ye and Zhicheng Dou},
      year={2024},
      eprint={2401.03462},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```