---
language:
- en
pipeline_tag: text-generation
tags:
- meta
- llama-3
license: llama3
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/pf4d6FA7DriRtVq5HCkxd.png)


![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/VcZWbW_eZkJAZZ5ricL4B.png)

# Llama-3-Giraffe-70B

Abacus.AI presents our longer-necked variant of Llama 3 70B!

This model has an effective context length of approximately 128k.

We have trained on ~1B tokens so far.
This is an initial release, and we hope to improve the heatmap below further as we continue training.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/_NVEuQ2ZT-sBtDBNjgmbt.png)

## Training Methodology

Training uses [PoSE](https://arxiv.org/abs/2309.10400) together with dynamic-NTK interpolation.

### NTK-scaling

The NTK scale factor is 4. We also tried theta-scaling, but it did not work as well as NTK scaling in our experiments.
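
As a rough illustration, the snippet below mirrors the dynamic-NTK recomputation of RoPE inverse frequencies as implemented in Hugging Face `transformers`. The Llama 3 defaults assumed here (`base=500000.0`, 8192-token original window) and the exact way the rule was combined with PoSE in our run are assumptions for the sketch, not the training code itself.

```python
import torch

def dynamic_ntk_inv_freq(dim, seq_len, max_position_embeddings=8192,
                         base=500000.0, scaling_factor=4.0):
    """Recompute RoPE inverse frequencies with dynamic NTK scaling.

    Once the sequence grows past the original context window, the RoPE
    base is inflated as a function of the current sequence length, so
    high-frequency components are preserved while low frequencies are
    stretched. Default values (Llama 3 base, 8192 window, scale 4) are
    assumptions for illustration.
    """
    if seq_len > max_position_embeddings:
        base = base * (
            (scaling_factor * seq_len / max_position_embeddings) - (scaling_factor - 1)
        ) ** (dim / (dim - 2))
    return 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

# Example: head_dim=128, extending to a 32k-position training window.
inv_freq = dynamic_ntk_inv_freq(dim=128, seq_len=32768)
```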

### PoSE

We utilise Positional Skip-wise Training (PoSE) with the following parameters (a sketch of the resulting position-id construction follows the list):

- **Number of Chunks**: 5
- **Max position ID**: 32768
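
The snippet below is a minimal sketch of how PoSE-style skip-wise position ids could be built for these settings: a short sample is split into chunks, each chunk keeps consecutive positions, and random skips between chunks push the final position id toward the 32768 limit. It illustrates the idea rather than reproducing our exact training implementation.

```python
import random

def pose_position_ids(seq_len, num_chunks=5, max_position_id=32768):
    """Illustrative PoSE-style position ids (not the exact training code)."""
    chunk_len = seq_len // num_chunks
    total_skip_budget = max_position_id - seq_len
    # Randomly split the skip budget across the gaps between chunks.
    cuts = sorted(random.randint(0, total_skip_budget) for _ in range(num_chunks - 1))
    skips = [cuts[0]] + [b - a for a, b in zip(cuts, cuts[1:])]

    position_ids, pos = [], 0
    for i in range(num_chunks):
        # Last chunk absorbs any remainder from the integer division.
        length = chunk_len if i < num_chunks - 1 else seq_len - chunk_len * (num_chunks - 1)
        if i > 0:
            pos += skips[i - 1]  # jump ahead before starting the next chunk
        position_ids.extend(range(pos, pos + length))
        pos += length
    return position_ids

ids = pose_position_ids(seq_len=8192)
assert len(ids) == 8192 and ids[-1] < 32768
```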

### Data

We use samples from [RedPajama](https://github.com/togethercomputer/RedPajama-Data) that are ~8K long on average.

### Hardware

We train on 8x H100 GPUs with DeepSpeed ZeRO Stage 3.
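
For reference, a ZeRO Stage 3 configuration of the kind used for such a run might look like the dict below. The values are illustrative placeholders; the actual training configuration is not published here.

```python
# Illustrative DeepSpeed ZeRO Stage 3 configuration (assumed values).
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": True},
    # "auto" lets the Hugging Face Trainer integration fill these in.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```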

## Evaluation Methodology

We use the [EasyContext](https://github.com/abacusai/EasyContext/blob/eval_runs/eval_needle.py) implementation of Needle-in-a-Haystack to evaluate Llama-3-Giraffe-70B.

We evaluate with the following parameters (a sketch of the resulting evaluation grid follows the list):

- **Min context length**: 2000
- **Max context length**: 128000
- **Context interval**: 4000
- **Depth interval**: 0.1
- **Num samples**: 2
- **Random number digits**: 7
- **Haystack dir**: PaulGrahamEssays
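
The snippet below sketches the evaluation grid implied by these parameters. The exact sweep (endpoint handling, needle placement, scoring) is defined by the linked `eval_needle.py` script, not by this illustration.

```python
# Rough sketch of the needle-in-a-haystack grid implied by the parameters above.
min_ctx, max_ctx, ctx_interval = 2000, 128000, 4000
depth_interval, num_samples = 0.1, 2

context_lengths = list(range(min_ctx, max_ctx + 1, ctx_interval))
depths = [round(i * depth_interval, 1) for i in range(int(1 / depth_interval) + 1)]

total_runs = len(context_lengths) * len(depths) * num_samples
print(f"{len(context_lengths)} context lengths x {len(depths)} depths x "
      f"{num_samples} samples = {total_runs} needle retrievals")
```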