File size: 10,525 Bytes

---
license: llama3
library_name: transformers
datasets:
- aqua_rat
- microsoft/orca-math-word-problems-200k
- m-a-p/CodeFeedback-Filtered-Instruction
model-index:
- name: Smaug-Llama-3-70B-Instruct
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: ENEM Challenge (No Images)
      type: eduagarcia/enem_challenge
      split: train
      args:
        num_few_shot: 3
    metrics:
    - type: acc
      value: 77.89
      name: accuracy
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BLUEX (No Images)
      type: eduagarcia-temp/BLUEX_without_images
      split: train
      args:
        num_few_shot: 3
    metrics:
    - type: acc
      value: 69.54
      name: accuracy
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: OAB Exams
      type: eduagarcia/oab_exams
      split: train
      args:
        num_few_shot: 3
    metrics:
    - type: acc
      value: 63.64
      name: accuracy
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Assin2 RTE
      type: assin2
      split: test
      args:
        num_few_shot: 15
    metrics:
    - type: f1_macro
      value: 93.62
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Assin2 STS
      type: eduagarcia/portuguese_benchmark
      split: test
      args:
        num_few_shot: 15
    metrics:
    - type: pearson
      value: 78.52
      name: pearson
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: FaQuAD NLI
      type: ruanchaves/faquad-nli
      split: test
      args:
        num_few_shot: 15
    metrics:
    - type: f1_macro
      value: 80.01
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HateBR Binary
      type: ruanchaves/hatebr
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: f1_macro
      value: 91.78
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: PT Hate Speech Binary
      type: hate_speech_portuguese
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: f1_macro
      value: 68.36
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: tweetSentBR
      type: eduagarcia/tweetsentbr_fewshot
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: f1_macro
      value: 70.29
      name: f1-macro
    source:
      url: https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct
      name: Open Portuguese LLM Leaderboard
---

# Smaug-Llama-3-70B-Instruct

### Built with Meta Llama 3


![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/ZxYuHKmU_AtuEJbGtuEBC.png)

This model was built using a new Smaug recipe  for improving performance on real world multi-turn conversations applied to 
[meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct).

The model outperforms Llama-3-70B-Instruct substantially, and is on par with GPT-4-Turbo, on MT-Bench (see below).

EDIT: Smaug-Llama-3-70B-Instruct is the top open source model on Arena-Hard currently! It is also nearly on par with Claude Opus - see below.

We are conducting additional benchmark evaluations and will add those when available.

### Model Description

- **Developed by:** [Abacus.AI](https://abacus.ai)
- **License:** https://llama.meta.com/llama3/license/
- **Finetuned from model:** [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct).

## How to use

The prompt format is unchanged from Llama 3 70B Instruct.

### Use with transformers

See the snippet below for usage with Transformers:

```python
import transformers
import torch

model_id = "abacusai/Smaug-Llama-3-70B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompt = pipeline.tokenizer.apply_chat_template(
		messages, 
		tokenize=False, 
		add_generation_prompt=True
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])
```


## Evaluation

### Arena-Hard

Score vs selected others (sourced from: (https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge)). GPT-4o and Gemini-1.5-pro-latest were missing from the original blob post, and we produced those numbers from a local run using the same methodology. 

| Model | Score | 95% Confidence Interval | Average Tokens |
| :---- | ---------: | ----------: | ------: |
| GPT-4-Turbo-2024-04-09 | 82.6  | (-1.8, 1.6)  | 662 |
| GPT-4o | 78.3  | (-2.4, 2.1)  | 685 |
| Gemini-1.5-pro-latest | 72.1  | (-2.3, 2.2)  | 630 |
| Claude-3-Opus-20240229 | 60.4  | (-3.3, 2.4)  | 541 |
| **Smaug-Llama-3-70B-Instruct** | 56.7  | (-2.2, 2.6)  | 661 |
| GPT-4-0314 | 50.0  | (-0.0, 0.0)  | 423 |
| Claude-3-Sonnet-20240229 | 46.8  | (-2.1, 2.2)  | 552 |
| Llama-3-70B-Instruct | 41.1  | (-2.5, 2.4)  | 583 |
| GPT-4-0613 | 37.9  | (-2.2, 2.0)  | 354 |
| Mistral-Large-2402 | 37.7 | (-1.9, 2.6)  | 400 |
| Mixtral-8x22B-Instruct-v0.1 | 36.4  | (-2.7, 2.9)  | 430 |
| Qwen1.5-72B-Chat | 36.1 | (-2.5, 2.2)  | 474 |
| Command-R-Plus | 33.1 | (-2.1, 2.2)  | 541 |
| Mistral-Medium | 31.9  | (-2.3, 2.4)  | 485 |
| GPT-3.5-Turbo-0613 | 24.8 | (-1.6, 2.0)  | 401 |

### MT-Bench

```
########## First turn ##########
                   score
model             turn
Smaug-Llama-3-70B-Instruct         1     9.40000                                                                                                                            
GPT-4-Turbo                        1     9.37500
Meta-Llama-3-70B-Instruct          1     9.21250 
########## Second turn ##########
                   score
model             turn
Smaug-Llama-3-70B-Instruct         2     9.0125
GPT-4-Turbo                        2     9.0000
Meta-Llama-3-70B-Instruct          2     8.8000
########## Average ##########
                 score
model
Smaug-Llama-3-70B-Instruct          9.206250
GPT-4-Turbo                         9.187500
Meta-Llama-3-70B-Instruct           9.006250
```

| Model | First turn | Second Turn | Average |
| :---- | ---------: | ----------: | ------: |
| **Smaug-Llama-3-70B-Instruct**  | 9.40 | 9.01 | 9.21 |
| GPT-4-Turbo | 9.38 |  9.00 | 9.19 |
| Meta-Llama-3-70B-Instruct | 9.21 |  8.80 | 9.01 |

### OpenLLM Leaderboard Manual Evaluation

| Model | ARC  | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K* | Average |
| :---- | ---: | ------:   | ---: | ---:       | ---:       | ---:   | ---:   |
| Smaug-Llama-3-70B-Instruct | 70.6 | 86.1 | 79.2 | 62.5 | 83.5 | 90.5 | 78.7 |
| Llama-3-70B-Instruct | 71.4 | 85.7 | 80.0 | 61.8 | 82.9 | 91.1 | 78.8 |

**GSM8K** The GSM8K numbers quoted here are computed using a recent release
of the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness/).
The commit used by the leaderboard has a significant issue that impacts models that
tend to use `:` in their responses due to a bug in the stop word configuration for
GSM8K. The issue is covered in more detail in this
[GSM8K evaluation discussion](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/770).
The score for both Llama-3 and this model are significantly different when evaluated
with the updated harness as the issue with stop words has been addressed.


This version of Smaug uses new techniques and new data compared to [Smaug-72B](https://huggingface.co/abacusai/Smaug-72B-v0.1), and more information will be released later on. For now, see the previous Smaug paper: https://arxiv.org/abs/2402.13228.


# Open Portuguese LLM Leaderboard Evaluation Results  

Detailed results can be found [here](https://huggingface.co/datasets/eduagarcia-temp/llm_pt_leaderboard_raw_results/tree/main/abacusai/Smaug-Llama-3-70B-Instruct) and on the [🚀 Open Portuguese LLM Leaderboard](https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard)

|          Metric          |  Value  |
|--------------------------|---------|
|Average                   |**77.07**|
|ENEM Challenge (No Images)|    77.89|
|BLUEX (No Images)         |    69.54|
|OAB Exams                 |    63.64|
|Assin2 RTE                |    93.62|
|Assin2 STS                |    78.52|
|FaQuAD NLI                |    80.01|
|HateBR Binary             |    91.78|
|PT Hate Speech Binary     |    68.36|
|tweetSentBR               |    70.29|