Quantized Octo-planner: On-device Language Model for Planner-Action Agents Framework

This repo contains GGUF-quantized versions of our Octo-planner model, NexaAIDev/octopus-planning.

GGUF Quantization

To run the models, download them to your local machine using either git clone or the Hugging Face Hub:

git clone https://huggingface.co/NexaAIDev/octo-planner-gguf
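
The git clone route fetches every quantization in the repo, and large files on the Hub are stored with Git LFS, so git-lfs should be installed before cloning. To pull a single file through the Hugging Face Hub instead, something like the following may work. This is a minimal sketch, assuming the `huggingface_hub` Python package is installed and provides the `huggingface-cli` command (depending on the installed version, the same subcommand may be exposed as `hf download`). The helper names `quant_filename` and `download_quant` and the default target directory are illustrative, not part of this repo.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: fetch one quantized file instead of the whole repo.
# Assumes `huggingface-cli` (from the huggingface_hub package) is on PATH.
REPO_ID="NexaAIDev/octo-planner-gguf"

quant_filename() {
  # Map a quant method (e.g. Q4_K_M) to the file naming used in this repo.
  printf 'octopus-planning-%s.gguf' "$1"
}

download_quant() {
  local quant="${1:-Q4_K_M}"              # quant method from the benchmark table
  local dest="${2:-./octo-planner-gguf}"  # assumed local target directory
  huggingface-cli download "$REPO_ID" "$(quant_filename "$quant")" --local-dir "$dest"
}
```

Calling `download_quant Q4_K_M` would then place octopus-planning-Q4_K_M.gguf under ./octo-planner-gguf.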

Run with llama.cpp (Recommended)

  1. Clone and compile:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Compile the source code:
make
  2. Execute the model:

Run the following command in the terminal:

./llama-cli -m ./path/to/octopus-planning-Q4_K_M.gguf -p "<|user|>Find my presentation for tomorrow's meeting, connect to the conference room projector via Bluetooth, increase the screen brightness, take a screenshot of the final summary slide, and email it to all participants<|end|><|assistant|>"
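
The prompt passed with -p wraps the user request in chat markers: `<|user|>`, the query, then `<|end|><|assistant|>`. A small sketch of a wrapper that builds this string, so only the query changes between runs. The names `build_prompt`, `run_planner`, and `LLAMA_CLI` are assumptions made for illustration. Recent llama.cpp checkouts build with CMake and place the binary under build/bin, so point `LLAMA_CLI` at wherever llama-cli ended up on your machine.

```shell
#!/usr/bin/env bash
# If `make` is not supported by your llama.cpp checkout, the CMake build is:
#   cmake -B build && cmake --build build --config Release
# which typically puts the binary at ./build/bin/llama-cli.
LLAMA_CLI="${LLAMA_CLI:-./llama-cli}"

build_prompt() {
  # Wrap a plain-text query in the chat markers used in the example above.
  printf '<|user|>%s<|end|><|assistant|>' "$1"
}

run_planner() {
  # Usage: run_planner ./path/to/octopus-planning-Q4_K_M.gguf "your query"
  local model="$1" query="$2"
  "$LLAMA_CLI" -m "$model" -p "$(build_prompt "$query")"
}
```

For example, `run_planner ./path/to/octopus-planning-Q4_K_M.gguf "Increase the screen brightness"` issues the same kind of command as the one shown above.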

Run with Ollama

Since our models have not been uploaded to the Ollama server, please download the models and manually import them into Ollama by following these steps:

  1. Install Ollama on your local machine. You can also follow the guide in the Ollama GitHub repository:
git clone https://github.com/ollama/ollama.git ollama
  2. Locate the local Ollama directory:
cd ollama
  3. Create a Modelfile in your directory:
touch Modelfile
  4. In the Modelfile, include a FROM statement with the path to your local model. Parameters not set in the Modelfile keep their defaults:
FROM ./path/to/octopus-planning-Q4_K_M.gguf
  5. Use the following command to add the model to Ollama:
ollama create octopus-planning-Q4_K_M -f Modelfile
  6. Verify that the model has been successfully imported:
ollama ls
  7. Run the model:
ollama run octopus-planning-Q4_K_M "<|user|>Find my presentation for tomorrow's meeting, connect to the conference room projector via Bluetooth, increase the screen brightness, take a screenshot of the final summary slide, and email it to all participants<|end|><|assistant|>"
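
The Modelfile in the steps above contains only a FROM line. If you want to set the template and parameters explicitly, a slightly fuller Modelfile can be written from the shell. `FROM`, `TEMPLATE`, and `PARAMETER` are standard Modelfile instructions, but the values below are assumptions rather than settings published with this model: a pass-through template so that the `<|user|>` markers typed in the prompt reach the model unchanged, and `<|end|>` as a stop sequence. The helper name `write_modelfile` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a Modelfile for a local GGUF file.
write_modelfile() {
  local model_path="$1"        # path to the downloaded .gguf file
  local out="${2:-Modelfile}"  # where to write the Modelfile
  cat > "$out" <<EOF
FROM $model_path
TEMPLATE "{{ .Prompt }}"
PARAMETER stop "<|end|>"
EOF
}
```

After `write_modelfile ./path/to/octopus-planning-Q4_K_M.gguf`, the `ollama create` and `ollama run` commands from the steps above apply unchanged.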

Quantized GGUF Models Benchmark

| Name | Quant method | Bits | Size | Use Cases |
|---|---|---|---|---|
| octopus-planning-Q2_K.gguf | Q2_K | 2 | 1.42 GB | fast but high loss, not recommended |
| octopus-planning-Q3_K.gguf | Q3_K | 3 | 1.96 GB | extremely not recommended |
| octopus-planning-Q3_K_S.gguf | Q3_K_S | 3 | 1.68 GB | extremely not recommended |
| octopus-planning-Q3_K_M.gguf | Q3_K_M | 3 | 1.96 GB | moderate loss, not very recommended |
| octopus-planning-Q3_K_L.gguf | Q3_K_L | 3 | 2.09 GB | not very recommended |
| octopus-planning-Q4_0.gguf | Q4_0 | 4 | 2.18 GB | moderate speed, recommended |
| octopus-planning-Q4_1.gguf | Q4_1 | 4 | 2.41 GB | moderate speed, recommended |
| octopus-planning-Q4_K.gguf | Q4_K | 4 | 2.39 GB | moderate speed, recommended |
| octopus-planning-Q4_K_S.gguf | Q4_K_S | 4 | 2.19 GB | fast and accurate, very recommended |
| octopus-planning-Q4_K_M.gguf | Q4_K_M | 4 | 2.39 GB | fast, recommended |
| octopus-planning-Q5_0.gguf | Q5_0 | 5 | 2.64 GB | fast, recommended |
| octopus-planning-Q5_1.gguf | Q5_1 | 5 | 2.87 GB | very big, prefer Q4 |
| octopus-planning-Q5_K.gguf | Q5_K | 5 | 2.82 GB | big, recommended |
| octopus-planning-Q5_K_S.gguf | Q5_K_S | 5 | 2.64 GB | big, recommended |
| octopus-planning-Q5_K_M.gguf | Q5_K_M | 5 | 2.82 GB | big, recommended |
| octopus-planning-Q6_K.gguf | Q6_K | 6 | 3.14 GB | very big, not very recommended |
| octopus-planning-Q8_0.gguf | Q8_0 | 8 | 4.06 GB | very big, not very recommended |
| octopus-planning-F16.gguf | F16 | 16 | 7.64 GB | extremely big |
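
One possible way to read the table is to take the largest file that fits a size budget among the rows marked as recommended. The sketch below encodes that heuristic. It is an assumption about how to choose, not guidance from the model authors, and `pick_quant` is a hypothetical helper. Sizes are copied from the table. File size is only a rough lower bound on memory use, since the runtime and context cache need additional room.

```shell
#!/usr/bin/env bash
# name:size_in_GB for the rows the table marks as recommended (table order).
QUANTS="Q4_0:2.18 Q4_1:2.41 Q4_K:2.39 Q4_K_S:2.19 Q4_K_M:2.39 Q5_0:2.64 Q5_K:2.82 Q5_K_S:2.64 Q5_K_M:2.82"

pick_quant() {
  # Print the largest listed quant whose file size is <= the budget (in GB).
  # On equal sizes the earlier table row wins. Returns non-zero if nothing fits.
  local budget="$1"
  printf '%s\n' "$QUANTS" | tr ' ' '\n' | awk -F: -v b="$budget" '
    $2 + 0 <= b + 0 && $2 + 0 > best { best = $2 + 0; name = $1 }
    END { if (name != "") print name; else exit 1 }'
}
```

With these numbers, `pick_quant 2.2` prints Q4_K_S, and the matching file name is octopus-planning-Q4_K_S.gguf.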

Quantized with llama.cpp

Model size: 3.82B params
Architecture: phi3
