๐Ÿš€ al-baka-llama3-8b (Quantized 4bit)

Al Baka is an Fine Tuned Model based on the new released LLAMA3-8B Model on the Stanford Alpaca dataset Arabic version Yasbok/Alpaca_arabic_instruct. ** The model is directly quantized 4bit model with bitsandbytes

Model Summary

Model Details

  • The model was fine-tuned in 4-bit precision using unsloth

How to Get Started with the Model

Setup

# Install packages
%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass

First, Load the Model

from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Omartificial-Intelligence-Space/al-baka-4bit-llama3-8b",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

Second, Try the model

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
       "ุงุณุชุฎุฏู… ุงู„ุจูŠุงู†ุงุช ุงู„ู…ุนุทุงุฉ ู„ุญุณุงุจ ุงู„ูˆุณูŠุท.", # instruction
        "[2 ุŒ 3 ุŒ 7 ุŒ 8 ุŒ 10]", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

Recommendations

  • unsloth for finetuning models. You can get a 2x faster finetuned model which can be exported to any format or uploaded to Hugging Face.
Downloads last month
10
Safetensors
Model size
4.65B params
Tensor type
BF16
ยท
F32
ยท
U8
ยท
Inference Examples
Inference API (serverless) has been turned off for this model.

Collection including Omartificial-Intelligence-Space/al-baka-4bit-llama3-8b