|
--- |
|
license: mit |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
<div align="center"> |
|
<h1>Llama-3-8B-Instruct-80K-QLoRA-Merged</h1> |
|
|
|
<a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora">[Data&Code]</a> |
|
</div> |
|
|
|
We extend the context length of Llama-3-8B-Instruct to 80K using QLoRA and 3.5K long-context training samples synthesized with GPT-4. The entire training cycle is highly efficient, taking 8 hours on a single 8xA800 (80G) machine, yet the resulting model achieves strong performance on a series of downstream long-context evaluation benchmarks.
|
|
|
**NOTE**: This repo contains quantized versions (Q4_K_M and Q8_0) of [namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged](https://huggingface.co/namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged), produced with [llama.cpp](https://github.com/ggerganov/llama.cpp).
|
|
|
All of the following evaluation results are based on the [UNQUANTIZED MODEL](https://huggingface.co/namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged). They can be reproduced by following the instructions [here](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora). Note that the quantized models may show some **quality degradation**.
|
|
|
## Needle in a Haystack |
|
We evaluate the model on the Needle-In-A-Haystack task using the official setting. The blue vertical line indicates the training context length, i.e., 80K.
|
|
|
<img src="data/needle.png"></img> |
|
|
|
|
|
## LongBench |
|
We evaluate the model on [LongBench](https://arxiv.org/abs/2308.14508) using a 32K context length and the official prompt template. For [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), we use an 8K context length.
|
|
|
|Model|Single-Doc QA|Multi-Doc QA|Summarization|Few-Shot Learning|Synthetic|Code|Avg|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|[meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)|37.33|36.04|26.83|**69.56**|37.75|53.24|43.20|
|[gradientai/Llama-3-8B-Instruct-262k](https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k)|37.29|31.20|26.18|67.25|44.25|**62.71**|43.73|
|Llama-3-8B-Instruct-80K-QLoRA-Merged|**43.57**|**43.07**|**28.93**|69.15|**48.50**|51.95|**47.19**|
|
|
|
## InfiniteBench |
|
We evaluate the model on [InfiniteBench](https://arxiv.org/pdf/2402.13718.pdf) using an 80K context length and the official prompt template. The results for GPT-4 are copied from the [paper](https://arxiv.org/pdf/2402.13718.pdf). For [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), we use an 8K context length.
|
|
|
|Model|LongBookQA Eng|LongBookSum Eng|
|:-:|:-:|:-:|
|GPT-4|22.22|14.73|
|[meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)|7.00|**16.40**|
|[gradientai/Llama-3-8B-Instruct-262k](https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k)|20.30|10.34|
|Llama-3-8B-Instruct-80K-QLoRA-Merged|**30.92**|14.73|
|
|
|
## Topic Retrieval |
|
We evaluate the model on the [Topic Retrieval](https://lmsys.org/blog/2023-06-29-longchat/) task with `[5,10,15,20,25,30,40,50,60,70]` topics.
|
|
|
<img src="data/topic.png"></img> |
|
|
|
|
|
## MMLU |
|
We evaluate the model's zero-shot performance on the MMLU benchmark as a reflection of its short-context capability.
|
|
|
|Model|STEM|Social Sciences|Humanities|Others|Avg|
|:-:|:-:|:-:|:-:|:-:|:-:|
|[Llama-2-7B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)|35.92|54.37|51.74|51.42|47.22|
|[Mistral-7B-v0.2-Instruct](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)|48.79|69.95|64.99|61.64|60.10|
|[meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)|**53.87**|**75.66**|**69.44**|69.75|**65.91**|
|[gradientai/Llama-3-8B-Instruct-262k](https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k)|52.10|73.26|67.15|**69.80**|64.34|
|Llama-3-8B-Instruct-80K-QLoRA-Merged|53.10|73.24|67.32|68.79|64.44|
|
|
|
## Environment
|
```bash |
|
llama_cpp  # Python bindings for llama.cpp (pip install llama-cpp-python)
torch==2.1.2
transformers==4.39.3
|
``` |
|
|
|
## Usage
|
```bash |
|
huggingface-cli download namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged-GGUF --local-dir . --local-dir-use-symlinks False |
|
``` |
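Alternatively, a single GGUF file can be fetched from Python with `huggingface_hub` (a minimal sketch, assuming the Q4_K_M filename used in the snippet below):

```python
from huggingface_hub import hf_hub_download

# download only the Q4_K_M file instead of the whole repo
gguf_path = hf_hub_download(
    repo_id="namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged-GGUF",
    filename="Llama-3-8B-Instruct-80K-QLoRA-Merged-Q4_K_M.gguf",
    local_dir=".",
)
print(gguf_path)
```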
|
|
|
To run the model in Python:
|
```python |
|
from llama_cpp import Llama

# load the Q4_K_M GGUF with an 80K-token context window
llm = Llama(
    model_path="./Llama-3-8B-Instruct-80K-QLoRA-Merged-Q4_K_M.gguf",  # path to the GGUF file
    n_ctx=81920,      # 80K training context length
    n_threads=96,
    n_gpu_layers=32,  # number of layers to offload to the GPU
)

# build a long-context prompt from the needle-in-a-haystack test file
with open("./data/needle.txt") as f:
    text = f.read()
inputs = f"{text}\n\nWhat is the best thing to do in San Francisco?"

print(
    llm.create_chat_completion(
        messages=[
            {
                "role": "user",
                "content": inputs
            }
        ],
        temperature=0,
        max_tokens=50
    )
)

# The best thing to do in San Francisco is sitting in Helmer Dolores Park on a sunny day, eating a double cheeseburger with ketchup, and watching kids playing around.
|
``` |
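`create_chat_completion` returns an OpenAI-style response dict, so the generated answer can be extracted directly instead of printing the whole object. A small follow-up to the snippet above:

```python
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": inputs}],
    temperature=0,
    max_tokens=50,
)
# the generated text lives under choices[0]["message"]["content"]
print(response["choices"][0]["message"]["content"])
```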