---
license: mit
pipeline_tag: text-generation
---
<div align="center">
<a href="">[Data&Code]</a>
</div>
We extend the context length of Llama-3-8B-Instruct to 80K using QLoRA and 3.5K long-context training samples synthesized by GPT-4. Training is highly efficient: the full cycle takes 8 hours on a single 8xA800 (80G) machine. The resulting model nevertheless performs strongly on a series of downstream long-context benchmarks.
**NOTE**: This repo contains quantized versions of `namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged`, produced with llama.cpp in the Q4_K_M and Q8_0 formats.
All evaluation results below were obtained with the **unquantized model** and can be reproduced by following the authors' instructions. After quantization, you may observe some **quality degradation**.
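For a rough sense of what the two formats cost on disk: llama.cpp's Q8_0 stores weights in blocks of 32 int8 values that share one fp16 scale, which works out to 8.5 bits per weight. The sketch below derives that figure and converts a bits-per-weight value into an approximate size for the weights alone. Q4_K_M mixes several quantization types across tensors, so its effective bits per weight is left as an input you would read off the actual GGUF file; the 8.03B parameter count used here is Llama-3-8B's published size and is an assumption of the example.

```python
def q8_0_bits_per_weight(block_size: int = 32, scale_bits: int = 16) -> float:
    """Q8_0 layout: each block holds `block_size` int8 weights plus one fp16 scale."""
    return (block_size * 8 + scale_bits) / block_size


def approx_weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Size of the quantized weights only; ignores GGUF metadata and unquantized tensors."""
    return n_params * bits_per_weight / 8 / 2**30


bpw = q8_0_bits_per_weight()
print(f"Q8_0: {bpw} bits/weight, ~{approx_weights_gib(8.03e9, bpw):.1f} GiB of weights for 8.03B parameters")
```

Real files come out slightly larger because some small tensors (e.g. norms) are usually kept at higher precision.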
## Needle in a Haystack
We evaluate the model on the Needle-In-A-Haystack task using the official setting. The blue vertical line marks the training context length, i.e., 80K.
<img src="data/needle.png"></img>
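The test hides one fact (the "needle") at a controlled depth inside long filler text and asks the model to retrieve it; the heatmap sweeps both context length and depth. A minimal sketch of how a single test prompt can be assembled, assuming lengths measured in characters (the official harness counts tokens) and a back-off to the nearest sentence end; `build_haystack_prompt` is a hypothetical helper, not the official harness, and the needle sentence is illustrative.

```python
def build_haystack_prompt(
    haystack: str,
    needle: str,
    context_len: int,
    depth_percent: float,
    question: str,
) -> str:
    """Hypothetical helper: place `needle` at `depth_percent` of a truncated haystack."""
    if not 0 <= depth_percent <= 100:
        raise ValueError("depth_percent must be between 0 and 100")
    # Leave room for the needle plus one separating space.
    budget = max(context_len - len(needle) - 1, 0)
    context = haystack[:budget]
    if depth_percent >= 100:
        pos = len(context)
    else:
        pos = int(len(context) * depth_percent / 100)
        # Step back to the previous sentence end so the needle is not spliced mid-sentence.
        boundary = context.rfind(". ", 0, pos)
        pos = boundary + 2 if boundary != -1 else 0
    document = context[:pos] + needle + " " + context[pos:]
    return f"{document}\n\n{question}"


# A sweep in the spirit of the figure: several context lengths times several depths.
haystack = "Filler sentence about nothing in particular. " * 2000
needle = "The best thing to do in San Francisco is to sit in Dolores Park on a sunny day."
prompts = [
    build_haystack_prompt(haystack, needle, length, depth, "What is the best thing to do in San Francisco?")
    for length in (1000, 4000, 16000)
    for depth in (0, 25, 50, 75, 100)
]
```

Each prompt is then sent to the model, and the answer is checked for the needle's content.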
## LongBench
We evaluate the model on LongBench with a 32K context length and the official prompt template. For `meta-llama/Meta-Llama-3-8B-Instruct`, we use an 8K context length, its native context window.
|Model|Single-Doc QA|Multi-Doc QA|Summarization|Few-Shot Learning|Synthetic|Code|Avg|
|---|---|---|---|---|---|---|---|
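When an input is longer than the evaluation window (32K tokens here, 8K for the base model), the prediction script in the LongBench repository keeps the beginning and the end of the tokenized prompt and drops the middle, because instructions and questions usually sit at the two ends. A minimal sketch of that rule on a list of token ids, assuming tokenization happens elsewhere; `truncate_middle` is a hypothetical helper name.

```python
from typing import List, Sequence


def truncate_middle(token_ids: Sequence[int], max_length: int) -> List[int]:
    """Keep the first and last max_length // 2 tokens of an over-long input."""
    if len(token_ids) <= max_length:
        return list(token_ids)
    half = max_length // 2
    # Explicit start index avoids the token_ids[-0:] pitfall when half == 0.
    tail_start = len(token_ids) - half
    return list(token_ids[:half]) + list(token_ids[tail_start:])


print(truncate_middle(list(range(10)), 4))  # [0, 1, 8, 9]
```

With an odd `max_length` the result is one token shorter than the budget, matching the even split on both sides.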
## InfiniteBench
We evaluate the model on InfiniteBench with an 80K context length and the official prompt template. The GPT-4 results are copied from the InfiniteBench paper. For `meta-llama/Meta-Llama-3-8B-Instruct`, we use an 8K context length.
|Model|LongBookQA Eng|LongBookSum Eng|
|---|---|---|
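Free-form QA answers such as those in LongBookQA Eng are commonly scored with a token-overlap F1 rather than exact match, so a partially correct answer still earns credit. A minimal sketch of such an F1, assuming SQuAD-style normalization (lowercasing, dropping punctuation and English articles); it is not the benchmark's own scoring code, which should be consulted for the exact rules.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """SQuAD-style normalization: lowercase, strip punctuation and articles, squeeze spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most min(pred_count, ref_count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("The cat sat", "cat sat down"))  # about 0.8
```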
## Topic Retrieval
We evaluate the model on the Topic Retrieval task with `[5,10,15,20,25,30,40,50,60,70]` topics.
<img src="data/topic.png"></img>
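In the LongChat-style version of this task, the model reads a long conversation that covers several topics and is asked to name the first one; adding topics lengthens the input. A minimal sketch of how one query can be assembled and scored, assuming a simple prompt wording and lenient substring matching; `build_topic_prompt` and `is_correct` are hypothetical helpers, not the official harness.

```python
from typing import List, Tuple


def build_topic_prompt(conversations: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Hypothetical helper: return (prompt, expected_answer) for a first-topic query."""
    if not conversations:
        raise ValueError("need at least one (topic, text) pair")
    parts = ["Below is a record of our previous conversation on several topics."]
    for i, (topic, text) in enumerate(conversations, start=1):
        parts.append(f"[Conversation {i}] Let's talk about {topic}. {text}")
    parts.append("What is the first topic we discussed? Only give the topic name.")
    return "\n".join(parts), conversations[0][0]


def is_correct(answer: str, expected: str) -> bool:
    """Lenient scoring: the expected topic must appear in the answer (case-insensitive)."""
    return expected.lower() in answer.lower()


conversations = [("cooking", "We compared pasta recipes."), ("chess", "We reviewed a few openings.")]
prompt, expected = build_topic_prompt(conversations)
print(is_correct("The first topic was Cooking.", expected))  # True
```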
## MMLU
We evaluate the model's zero-shot performance on the MMLU benchmark as a measure of its short-context capability.
|Model|STEM|Social Sciences|Humanities|Others|Avg|
|---|---|---|---|---|---|
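Zero-shot means each question is posed without any solved examples in the prompt; the model's answer letter is compared with the gold letter, and accuracy is grouped into the categories shown in the table. A minimal sketch, assuming a plain A-to-D template (not necessarily the authors' exact prompt) and predictions that have already been generated; `format_zero_shot` and `category_accuracy` are hypothetical helpers, and since how the Avg column aggregates is not stated, the sketch stops at per-category accuracy.

```python
from typing import Dict, List

CHOICE_LABELS = ("A", "B", "C", "D")


def format_zero_shot(question: str, choices: List[str], subject: str) -> str:
    """Hypothetical zero-shot template: no worked examples precede the question."""
    options = "\n".join(f"{label}. {text}" for label, text in zip(CHOICE_LABELS, choices))
    return (
        f"The following is a multiple choice question about {subject}.\n\n"
        f"{question}\n{options}\nAnswer:"
    )


def category_accuracy(records: List[Dict[str, str]]) -> Dict[str, float]:
    """Accuracy per category; each record has 'category', 'prediction' and 'answer'."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for record in records:
        category = record["category"]
        # Compare only the first letter the model produced, e.g. "B. 4" -> "B".
        predicted = record["prediction"].strip().upper()[:1]
        totals[category] = totals.get(category, 0) + 1
        correct[category] = correct.get(category, 0) + int(predicted == record["answer"])
    return {category: correct[category] / totals[category] for category in totals}
```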
# Environment
# Usage
```bash
huggingface-cli download namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged-GGUF --local-dir . --local-dir-use-symlinks False
```
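In Python:
```python
from llama_cpp import Llama

llm = Llama(
    model_path="./Llama-3-8B-Instruct-80K-QLoRA-Merged-Q4_K_M.gguf",  # path to GGUF file
    n_ctx=81920,  # assumed: 80K window to match training; the library default is far smaller
    n_gpu_layers=-1,  # assumed: offload all layers to GPU; use 0 for CPU-only
)

with open("./data/needle.txt", encoding="utf-8") as f:
    text = f.read()

inputs = f"{text}\n\nWhat is the best thing to do in San Francisco?"
messages = [{"role": "user", "content": inputs}]
result = llm.create_chat_completion(messages=messages, max_tokens=50, temperature=0.0)  # assumed generation settings
print(result["choices"][0]["message"]["content"])
# The best thing to do in San Francisco is sitting in Helmer Dolores Park on a sunny day, eating a double cheeseburger with ketchup, and watching kids playing around.
```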
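Setting `n_ctx` near 80K makes the KV cache, not the quantized weights, the dominant memory cost. A back-of-the-envelope sketch, assuming the standard Llama-3-8B shape (32 layers, 8 key/value heads under grouped-query attention, head dimension 128) and an fp16 cache; it ignores compute buffers and any KV-cache quantization, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(
    n_ctx: int,
    n_layers: int = 32,
    n_kv_heads: int = 8,
    head_dim: int = 128,
    bytes_per_elem: int = 2,
) -> int:
    """Keys and values: one n_kv_heads * head_dim vector each, per token, per layer."""
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return n_ctx * n_layers * per_token_per_layer


for n_ctx in (8192, 32768, 81920):
    gib = kv_cache_bytes(n_ctx) / 2**30
    print(f"n_ctx={n_ctx:>6}: {gib:.1f} GiB")
```

Under these assumptions the cache needs 1.0, 4.0 and 10.0 GiB respectively, on top of the model weights, so a smaller `n_ctx` is a reasonable choice when memory is tight.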