|
--- |
|
license: mit |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
<div align="center"> |
|
<h1>Llama-3-8B-Instruct-80K-QLoRA-Merged</h1> |
|
|
|
<a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora">[Data&Code]</a> |
|
</div> |
|
|
|
We extend the context length of Llama-3-8B-Instruct to 80K using QLoRA and 3.5K long-context training samples synthesized with GPT-4. The entire training cycle is highly efficient, taking 8 hours on a single 8xA800 (80G) machine, yet the resulting model achieves strong performance on a series of downstream long-context evaluation benchmarks.
|
|
|
**NOTE**: This repo contains quantized versions (Q4_K_M and Q8_0) of [namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged](https://huggingface.co/namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged), produced with [llama.cpp](https://github.com/ggerganov/llama.cpp).
|
|
|
All of the following evaluation results are based on the [UNQUANTIZED MODEL](https://huggingface.co/namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged). They can be reproduced by following the instructions [here](https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/longllm_qlora). Note that the quantized models may show some **quality degradation**.
|
|
|
## Needle in a Haystack |
|
We evaluate the model on the Needle-In-A-Haystack task using the official setting. The blue vertical line indicates the training context length, i.e., 80K.
|
|
|
<img src="data/needle.png"></img> |
|
|
|
|
|
## LongBench |
|
We evaluate the model on [LongBench](https://arxiv.org/abs/2308.14508) using a 32K context length and the official prompt template. For [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), we use an 8K context length.
|
|
|
|Model|Single-Doc QA|Multi-Doc QA|Summarization|Few-Shot Learning|Synthetic|Code|Avg|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|[meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)|37.33|36.04|26.83|**69.56**|37.75|53.24|43.20|
|[gradientai/Llama-3-8B-Instruct-262k](https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k)|37.29|31.20|26.18|67.25|44.25|**62.71**|43.73|
|Llama-3-8B-Instruct-80K-QLoRA-Merged|**43.57**|**43.07**|**28.93**|69.15|**48.50**|51.95|**47.19**|
|
|
|
## InfiniteBench |
|
We evaluate the model on [InfiniteBench](https://arxiv.org/pdf/2402.13718.pdf) using an 80K context length and the official prompt template. The results for GPT-4 are copied from the [paper](https://arxiv.org/pdf/2402.13718.pdf). For [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), we use an 8K context length.
|
|
|
|Model|LongBookQA Eng|LongBookSum Eng|
|:-:|:-:|:-:|
|GPT-4|22.22|14.73|
|[meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)|7.00|**16.40**|
|[gradientai/Llama-3-8B-Instruct-262k](https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k)|20.30|10.34|
|Llama-3-8B-Instruct-80K-QLoRA-Merged|**30.92**|14.73|
|
|
|
## Topic Retrieval |
|
We evaluate the model on the [Topic Retrieval](https://lmsys.org/blog/2023-06-29-longchat/) task with `[5,10,15,20,25,30,40,50,60,70]` topics.
|
|
|
<img src="data/topic.png"></img> |
|
|
|
|
|
## MMLU |
|
We evaluate the model's zero-shot performance on the MMLU benchmark as a reflection of its short-context capability.
|
|
|
|Model|STEM|Social Sciences|Humanities|Others|Avg|
|:-:|:-:|:-:|:-:|:-:|:-:|
|[Llama-2-7B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)|35.92|54.37|51.74|51.42|47.22|
|[Mistral-7B-v0.2-Instruct](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)|48.79|69.95|64.99|61.64|60.10|
|[meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)|**53.87**|**75.66**|**69.44**|69.75|**65.91**|
|[gradientai/Llama-3-8B-Instruct-262k](https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k)|52.10|73.26|67.15|**69.80**|64.34|
|Llama-3-8B-Instruct-80K-QLoRA-Merged|53.10|73.24|67.32|68.79|64.44|
|
|
|
## Environment
|
```bash |
|
llama_cpp  # Python bindings for llama.cpp (pip install llama-cpp-python)
torch==2.1.2
transformers==4.39.3
|
``` |
|
|
|
## Usage
|
```bash |
|
huggingface-cli download namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged-GGUF --local-dir . --local-dir-use-symlinks False |
|
``` |
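Alternatively, a single GGUF file can be fetched from Python with `huggingface_hub` (a minimal sketch, assuming the Q4_K_M filename used in the snippet below):

```python
from huggingface_hub import hf_hub_download

# download only the Q4_K_M file instead of the whole repo
gguf_path = hf_hub_download(
    repo_id="namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged-GGUF",
    filename="Llama-3-8B-Instruct-80K-QLoRA-Merged-Q4_K_M.gguf",
    local_dir=".",
)
print(gguf_path)
```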
|
|
|
To run the model in Python:
|
```python |
|
from llama_cpp import Llama

# load the Q4_K_M GGUF with an 80K-token context window
llm = Llama(
    model_path="./Llama-3-8B-Instruct-80K-QLoRA-Merged-Q4_K_M.gguf",  # path to the GGUF file
    n_ctx=81920,      # 80K training context length
    n_threads=96,
    n_gpu_layers=32,  # number of layers to offload to the GPU
)

# build a long-context prompt from the needle-in-a-haystack test file
with open("./data/needle.txt") as f:
    text = f.read()
inputs = f"{text}\n\nWhat is the best thing to do in San Francisco?"

print(
    llm.create_chat_completion(
        messages=[
            {
                "role": "user",
                "content": inputs
            }
        ],
        temperature=0,
        max_tokens=50
    )
)

# The best thing to do in San Francisco is sitting in Helmer Dolores Park on a sunny day, eating a double cheeseburger with ketchup, and watching kids playing around.
|
``` |
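`create_chat_completion` returns an OpenAI-style response dict, so the generated answer can be extracted directly instead of printing the whole object. A small follow-up to the snippet above:

```python
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": inputs}],
    temperature=0,
    max_tokens=50,
)
# the generated text lives under choices[0]["message"]["content"]
print(response["choices"][0]["message"]["content"])
```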