I wanted to be able to go from the Meta model weights to an AWQ quantised model myself, rather than grab the weights from casperhansen or elsewhere.

First Attempt

Initially I tried running AutoAWQ on an AWS g5.12xlarge instance (4x A10G), Ubuntu 22, CUDA 12.2, NVIDIA 535.113.01 drivers.

I tried different combinations of torch (2.1.2, 2.2.2), autoawq (0.2.4, 0.2.5) and transformers (4.38.2, 4.41.2), but I couldn't get it to work, even with the 8B model (which all of the errors below are from). I kept getting errors like:

  • autoawq 0.2.4, transformers 4.38.2, torch 2.1.2, no device map: failed at 3% with RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)
  • autoawq 0.2.4, transformers 4.38.2, torch 2.1.2, device map "auto": failed at 16% with an assertion ending in: index < sizes[i] && "index out of bounds" failed.
  • autoawq 0.2.5, transformers 4.40.0, torch 2.1.2, no device map: failed at 3% in File "{redacted}/.venv/lib/python3.11/site-packages/awq/quantize/quantizer.py", line 69, in pseudo_quantize_tensor: assert torch.isnan(w).sum() == 0
  • autoawq 0.2.4, transformers 4.38.2, torch 2.2.2, no device map: failed at 3% with RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)

The only thing that worked was setting CUDA_VISIBLE_DEVICES=0 (or any other single device), but that would not work for the 70B model, which needs more VRAM than one A10G has. The comment from casper here makes me think quantising Llama 3 70B across multiple GPUs should be possible, though.
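The single-GPU workaround can also be applied from inside Python instead of on the command line. This is a minimal sketch, assuming the variable is set before CUDA is initialised (safest: before torch is imported). pin_to_single_gpu is a hypothetical helper name, not part of AutoAWQ.

```python
import os


def pin_to_single_gpu(device_index: int = 0) -> str:
    """Restrict this process to one GPU via CUDA_VISIBLE_DEVICES.

    Call this before CUDA is initialised (safest: before importing torch),
    because the variable is only read when the CUDA runtime starts up.
    """
    value = str(device_index)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Hypothetical usage at the very top of a quantisation script:
#   pin_to_single_gpu(0)
#   import torch  # imported only after the variable is in place
```

With this in place the process sees exactly one device, so it is only suitable for models that fit on a single card, such as the 8B model.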

Working Approach

The following worked for me:

Machine: vast.ai 2x A100 PCIe instance with AMD EPYC 9554, CUDA 12.2 (roughly half the price of the g5.12xlarge!)

Container: pytorch:2.2.0-cuda12.1-cudnn8-devel image

AutoAWQ @ 5f3785dc

Followed commands in the readme:

git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .

Installed vim to edit the example script (apt install vim), then opened it with vi examples/quantize.py

Changed model path to: meta-llama/Meta-Llama-3-70B-Instruct

Changed output path to: Meta-Llama-3-70B-Instruct-awq
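For reference, the edited script would look roughly like the sketch below. This is an assumption based on the AutoAWQ example script as published around that time, not a copy of the file at 5f3785dc. The quantisation settings and the from_pretrained keyword arguments in particular are the example's defaults as I understand them, so check them against your own checkout. Only the two paths come from the steps above.

```python
# Paths from the steps above.
MODEL_PATH = "meta-llama/Meta-Llama-3-70B-Instruct"
QUANT_PATH = "Meta-Llama-3-70B-Instruct-awq"

# Assumed defaults from the AutoAWQ example: 4-bit weights, group size 128.
QUANT_CONFIG = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}


def quantise(model_path: str = MODEL_PATH, quant_path: str = QUANT_PATH) -> None:
    """Load the FP16 model, run AWQ quantisation, and save the result."""
    # Heavy imports are kept inside the function so the settings above can be
    # inspected on a machine without AutoAWQ or a GPU.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    # Extra keyword arguments are assumed to be forwarded to transformers.
    model = AutoAWQForCausalLM.from_pretrained(
        model_path, low_cpu_mem_usage=True, use_cache=False
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Calibrate and quantise the weights.
    model.quantize(tokenizer, quant_config=QUANT_CONFIG)

    # Write the quantised weights and the tokenizer files to the output directory.
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
```

Calling quantise() starts the run. It needs the GPUs, an installed AutoAWQ, and a Hugging Face token with access to the gated Llama 3 repository.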

Used a script to set the token so we can pull Llama 3, then ran it from the directory containing quantize.py:

#!/usr/bin/env bash

# Hugging Face access token, used to download the gated Llama weights.
export HF_TOKEN="<your token here>"

python quantize.py

This worked, and took about 100 minutes to quantise the 70B model. I'm not sure whether the second A100 was used. Once the run had started I couldn't figure out how to open a second SSH session to run nvidia-smi or similar without joining the same tmux session that was running the quantisation, so I just left it to it.
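One way to check whether both GPUs are busy is to poll nvidia-smi from a small helper, for example from another tmux window or from a background thread in the quantisation script. The sketch below is illustrative only and was not part of the run described above. parse_gpu_csv and query_gpus are hypothetical names, and it assumes nvidia-smi is on the PATH.

```python
import subprocess

# nvidia-smi query returning one CSV line per GPU: index, memory used (MiB), utilisation (%).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text: str) -> list[dict]:
    """Turn the CSV output of the query above into one dict per GPU."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, mem_used, util = (field.strip() for field in line.split(","))
        rows.append(
            {
                "index": int(index),
                "memory_used_mib": int(mem_used),
                "utilization_pct": int(util),
            }
        )
    return rows


def query_gpus() -> list[dict]:
    """Run nvidia-smi once and return the parsed per-GPU readings."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

If the second entry reports only a few MiB of memory and 0% utilisation for the whole run, the second card was most likely idle.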
