Quantized Octopus V2: On-device language model for super agent

This repo provides two types of quantized models, GGUF and AWQ, for our Octopus V2 model at NexaAIDev/Octopus-v2.


GGUF Quantization

To run the models, download them to your local machine using either git clone or the Hugging Face Hub:

git clone https://huggingface.co/NexaAIDev/Octopus-v2-gguf-awq
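For the Hugging Face Hub route, a minimal Python sketch might look like the following. It assumes the `huggingface_hub` package is installed; the helper names `repo_id_from_url` and `download_model_file` are hypothetical and only illustrate the idea. The file name is left as a parameter because it must match a file actually listed in the repo.

```python
from urllib.parse import urlparse

# Clone URL taken from the git command above.
CLONE_URL = "https://huggingface.co/NexaAIDev/Octopus-v2-gguf-awq"


def repo_id_from_url(url: str) -> str:
    """Return the 'owner/name' repo id from a Hugging Face repo URL."""
    return urlparse(url).path.strip("/")


def download_model_file(filename: str, local_dir: str = ".") -> str:
    """Download a single file from the repo and return its local path.

    Assumption: `filename` is the exact name of a file in the repo.
    """
    # Imported lazily so repo_id_from_url works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=repo_id_from_url(CLONE_URL),
        filename=filename,
        local_dir=local_dir,
    )
```

Downloading a single quantized file this way avoids cloning every variant in the repo, which matters when only one GGUF file is needed.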

Run with llama.cpp (Recommended)

  1. Clone and compile:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Compile the source code:
make
  2. Execute the model:

Run the following command in the terminal:

./main -m ./path/to/octopus-v2-Q4_K_M.gguf -n 256 -p "Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Take a selfie for me with front camera\n\nResponse:"
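The same invocation can be scripted. The sketch below builds the prompt from the template used in the command above and assembles the argument list for the `./main` binary. The function names are hypothetical, and the binary path is an assumption that depends on where llama.cpp was compiled.

```python
import subprocess

# Prompt template copied from the command above; only the query changes.
PROMPT_TEMPLATE = (
    "Below is the query from the users, please call the correct function "
    "and generate the parameters to call the function.\n\n"
    "Query: {query}\n\nResponse:"
)


def build_prompt(query: str) -> str:
    """Wrap a user query in the prompt format shown above."""
    return PROMPT_TEMPLATE.format(query=query.strip())


def llama_cpp_command(model_path: str, query: str, n_predict: int = 256,
                      binary: str = "./main") -> list:
    """Build the argument list for the llama.cpp binary (path is an assumption)."""
    return [binary, "-m", model_path, "-n", str(n_predict), "-p", build_prompt(query)]


def run_query(model_path: str, query: str) -> str:
    """Run llama.cpp once and return whatever it writes to stdout."""
    result = subprocess.run(
        llama_cpp_command(model_path, query),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Because the arguments are passed as a list, the prompt reaches the binary with real newline characters and no shell quoting is involved.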

Run with Ollama

Since our models have not been uploaded to the Ollama server, please download the models and manually import them into Ollama by following these steps:

  1. Install Ollama on your local machine. You can also follow the guide in the Ollama GitHub repository:
git clone https://github.com/ollama/ollama.git ollama
  2. Change into the local Ollama directory:
cd ollama
  3. Create a Modelfile in your directory:
touch Modelfile
  4. In the Modelfile, include a FROM statement with the path to your local model, and the default parameters:
FROM ./path/to/octopus-v2-Q4_K_M.gguf
  5. Use the following command to add the model to Ollama:
ollama create octopus-v2-Q4_K_M -f Modelfile
  6. Verify that the model has been successfully imported:
ollama ls
  7. Run the model:
ollama run octopus-v2-Q4_K_M "Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Take a selfie for me with front camera\n\nResponse:"
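The Modelfile and import steps above can also be scripted. This is a minimal sketch rather than an official workflow: the helper names are hypothetical, and it only writes the one-line Modelfile and assembles the `ollama create` and `ollama ls` commands shown in the steps.

```python
from pathlib import Path


def write_modelfile(gguf_path: str, directory: str = ".") -> Path:
    """Write a Modelfile whose only statement is FROM <gguf_path>."""
    modelfile = Path(directory) / "Modelfile"
    modelfile.write_text(f"FROM {gguf_path}\n", encoding="utf-8")
    return modelfile


def ollama_import_commands(model_name: str, modelfile: Path) -> list:
    """Return the commands from the steps above as argument lists."""
    return [
        ["ollama", "create", model_name, "-f", str(modelfile)],  # add the model
        ["ollama", "ls"],                                        # verify the import
    ]
```

Each returned argument list can be passed to `subprocess.run(cmd, check=True)` once Ollama is installed and its server is running.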

AWQ Quantization

Python example:

from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
import torch
import time
import numpy as np


def inference(input_text):
    """Generate a response and report latency (s) and throughput (tokens/s)."""
    start_time = time.time()
    # Tokenize the prompt and move the tensors to the GPU
    input_ids = tokenizer(input_text, return_tensors="pt").to('cuda')
    input_length = input_ids["input_ids"].shape[1]
    # Greedy decoding; max_length counts prompt tokens plus generated tokens
    generation_output = model.generate(
        input_ids["input_ids"],
        do_sample=False,
        max_length=1024
    )
    end_time = time.time()
    # Decode only the generated part
    generated_sequence = generation_output[:, input_length:].tolist()
    res = tokenizer.decode(generated_sequence[0])
    latency = end_time - start_time
    num_output_tokens = len(generated_sequence[0])
    throughput = num_output_tokens / latency
    return {"output": res, "latency": latency, "throughput": throughput}


# Initialize tokenizer and model
model_id = "/path/to/Octopus-v2-AWQ-NexaAIDev"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)

prompts = ["Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Can you take a photo using the back camera and save it to the default location? \n\nResponse:"]

# Run each prompt once and collect its throughput
avg_throughput = []
for prompt in prompts:
    out = inference(prompt)
    avg_throughput.append(out["throughput"])
    print("nexa model result:\n", out["output"])

print("avg throughput:", np.mean(avg_throughput))

Quantized GGUF & AWQ Models Benchmark

| Name | Quant method | Bits | Size | Response (tokens/s) | Use cases |
| --- | --- | --- | --- | --- | --- |
| Octopus-v2-AWQ | AWQ | 4 | 3.00 GB | 63.83 | fast, high quality, recommended |
| Octopus-v2-Q2_K.gguf | Q2_K | 2 | 1.16 GB | 57.81 | fast but high loss, not recommended |
| Octopus-v2-Q3_K.gguf | Q3_K | 3 | 1.38 GB | 57.81 | extremely not recommended |
| Octopus-v2-Q3_K_S.gguf | Q3_K_S | 3 | 1.19 GB | 52.13 | extremely not recommended |
| Octopus-v2-Q3_K_M.gguf | Q3_K_M | 3 | 1.38 GB | 58.67 | moderate loss, not very recommended |
| Octopus-v2-Q3_K_L.gguf | Q3_K_L | 3 | 1.47 GB | 56.92 | not very recommended |
| Octopus-v2-Q4_0.gguf | Q4_0 | 4 | 1.55 GB | 68.80 | moderate speed, recommended |
| Octopus-v2-Q4_1.gguf | Q4_1 | 4 | 1.68 GB | 68.09 | moderate speed, recommended |
| Octopus-v2-Q4_K.gguf | Q4_K | 4 | 1.63 GB | 64.70 | moderate speed, recommended |
| Octopus-v2-Q4_K_S.gguf | Q4_K_S | 4 | 1.56 GB | 62.16 | fast and accurate, very recommended |
| Octopus-v2-Q4_K_M.gguf | Q4_K_M | 4 | 1.63 GB | 64.74 | fast, recommended |
| Octopus-v2-Q5_0.gguf | Q5_0 | 5 | 1.80 GB | 64.80 | fast, recommended |
| Octopus-v2-Q5_1.gguf | Q5_1 | 5 | 1.92 GB | 63.42 | very big, prefer Q4 |
| Octopus-v2-Q5_K.gguf | Q5_K | 5 | 1.84 GB | 61.28 | big, recommended |
| Octopus-v2-Q5_K_S.gguf | Q5_K_S | 5 | 1.80 GB | 62.16 | big, recommended |
| Octopus-v2-Q5_K_M.gguf | Q5_K_M | 5 | 1.71 GB | 61.54 | big, recommended |
| Octopus-v2-Q6_K.gguf | Q6_K | 6 | 2.06 GB | 55.94 | very big, not very recommended |
| Octopus-v2-Q8_0.gguf | Q8_0 | 8 | 2.67 GB | 56.35 | very big, not very recommended |
| Octopus-v2-f16.gguf | f16 | 16 | 5.02 GB | 36.27 | extremely big |
| Octopus-v2.gguf | | | 10.00 GB | | |
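To show how the benchmark figures might be used, the sketch below picks the fastest variant that fits a given size budget. The sizes and tokens/s values are copied from the table above. The function name `fastest_within` is hypothetical, and the selection looks at speed only, so the quality notes in the last column still need to be weighed by hand.

```python
# (name, size in GB, response speed in tokens/s), copied from the benchmark table.
BENCHMARK = [
    ("Octopus-v2-AWQ", 3.00, 63.83),
    ("Octopus-v2-Q2_K.gguf", 1.16, 57.81),
    ("Octopus-v2-Q3_K.gguf", 1.38, 57.81),
    ("Octopus-v2-Q3_K_S.gguf", 1.19, 52.13),
    ("Octopus-v2-Q3_K_M.gguf", 1.38, 58.67),
    ("Octopus-v2-Q3_K_L.gguf", 1.47, 56.92),
    ("Octopus-v2-Q4_0.gguf", 1.55, 68.80),
    ("Octopus-v2-Q4_1.gguf", 1.68, 68.09),
    ("Octopus-v2-Q4_K.gguf", 1.63, 64.70),
    ("Octopus-v2-Q4_K_S.gguf", 1.56, 62.16),
    ("Octopus-v2-Q4_K_M.gguf", 1.63, 64.74),
    ("Octopus-v2-Q5_0.gguf", 1.80, 64.80),
    ("Octopus-v2-Q5_1.gguf", 1.92, 63.42),
    ("Octopus-v2-Q5_K.gguf", 1.84, 61.28),
    ("Octopus-v2-Q5_K_S.gguf", 1.80, 62.16),
    ("Octopus-v2-Q5_K_M.gguf", 1.71, 61.54),
    ("Octopus-v2-Q6_K.gguf", 2.06, 55.94),
    ("Octopus-v2-Q8_0.gguf", 2.67, 56.35),
    ("Octopus-v2-f16.gguf", 5.02, 36.27),
]


def fastest_within(max_size_gb: float):
    """Return the (name, size, tokens/s) row with the highest speed that fits, or None."""
    candidates = [row for row in BENCHMARK if row[1] <= max_size_gb]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row[2])
```

For example, `fastest_within(1.6)` considers only the variants of 1.6 GB or less and returns the row with the highest measured tokens/s among them.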

Quantized with llama.cpp

Acknowledgement:
We sincerely thank our community members Mingyuan, Zoey, Brian, Perry, Qi, and David for their extraordinary contributions to this quantization effort.

Base model: google/gemma-2b