
16-bit version?

#13
by saattrupdan - opened

Do you have plans to upload a 16-bit version of your model? That would make it a lot more accessible for inference on smaller GPUs.

@dirkgr can correct me, but I am not aware of such plans. You should be able to load the model and then call, say, model = model.bfloat16() to convert the weights to 16 bits. You may need to load the model on the CPU, downcast to 16 bits, and then move the model to the GPU. An alternative with a higher memory requirement (which we used while training the model) is to use torch.autocast with a 16-bit type.
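A minimal sketch of the load-on-CPU, downcast, then move-to-GPU sequence described above. The helper name downcast_then_move and the tiny nn.Sequential stand-in are assumptions for illustration only; with the real checkpoint you would pass in the model you loaded yourself (for example via transformers' AutoModelForCausalLM.from_pretrained with trust_remote_code=True), which is not done here to keep the example small.

```python
from typing import Optional

import torch
from torch import nn


def downcast_then_move(model: nn.Module, device: Optional[str] = None) -> nn.Module:
    """Cast floating-point weights to bfloat16 on the CPU, then move to `device`.

    Hypothetical helper: doing the cast on the CPU first means the full
    32-bit copy of the weights never has to fit in GPU memory.
    """
    if device is None:
        # Fall back to the CPU when no GPU is available.
        device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to("cpu").bfloat16()  # downcast while still on the CPU
    return model.to(device)             # only the 16-bit copy reaches the GPU


def param_bytes(model: nn.Module) -> int:
    """Total bytes occupied by the model's parameters."""
    return sum(p.numel() * p.element_size() for p in model.parameters())


# Stand-in for the real model (assumption): a tiny float32 network.
tiny = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
bytes_before = param_bytes(tiny)

tiny = downcast_then_move(tiny)
bytes_after = param_bytes(tiny)

print(f"parameter memory: {bytes_before} bytes -> {bytes_after} bytes")
```

The same call pattern should apply to a full-size model, as long as the 32-bit weights fit in CPU RAM during loading.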

@shanearora I completely get that, but if I’m loading the model with vLLM then I get OOM errors before any conversion can happen. I guess I could convert it and upload it myself, but it would just be a bit more official if you all had a 16-bit version uploaded. Same thing with quantised and GGUF versions for that matter, as these are required by other applications like llama.cpp and LM Studio. But it’s up to you - feel free to close this issue if you’re not planning on it 🙂

@akshitab Do you know about OLMo plans in relation to vLLM?

vLLM integration for OLMo is currently in progress here: https://github.com/vllm-project/vllm/issues/2763
