NeMo

Add to HF Inference APIs

#4
by mrfakename - opened

Hi,
It would be really useful to be able to use this through the Hugging Face Inference APIs (which would require this model to be compatible with Transformers). Are there any plans to add Transformers support to the model?
Thanks!

cc @reach-vb

Seconding this. Please make this model Transformers-compatible. I would like to release a GPTQ quant of it, but that requires an HF Transformers-compatible model.

Someone made the tokenizer Hugging Face compatible, but I'm not sure how much that helps if the weights themselves are only available in the NeMo format: https://huggingface.co/Xenova/Nemotron-4-340B-Instruct-Tokenizer

Working on this here: https://huggingface.co/failspy/Nemotron-4-340B-Instruct-SafeTensors

There is no HF Transformers class for it yet. I'm still working on that part, if anyone wants to help. The weights have been ported to resemble Llama-3's architecture (though not perfectly: for example, the QKV projection is not split), and the repo has a plausible, hypothetical config.json. It also includes the tokenizer from @Xenova.
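For anyone picking up the unsplit QKV projection mentioned above, here is a minimal sketch of how a fused QKV weight could be separated into Q, K and V. It assumes a grouped, interleaved row layout (for each KV group: that group's query heads, then one key head, then one value head), which is an assumption about the checkpoint rather than something confirmed in this thread. The function name `split_fused_qkv` and its parameters are hypothetical, and plain Python lists stand in for tensors so the example runs without extra dependencies.

```python
def split_fused_qkv(fused_rows, num_heads, num_kv_heads, head_dim):
    """Split a fused QKV weight (a list of rows) into Q, K and V row lists.

    ASSUMPTION: rows are grouped per KV head as
    [q heads of group 0, k0, v0, q heads of group 1, k1, v1, ...].
    Verify this against the real checkpoint before relying on it.
    """
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")

    heads_per_group = num_heads // num_kv_heads
    expected_rows = (num_heads + 2 * num_kv_heads) * head_dim
    if len(fused_rows) != expected_rows:
        raise ValueError(
            f"expected {expected_rows} rows, got {len(fused_rows)}"
        )

    q_rows, k_rows, v_rows = [], [], []
    rows_per_group = (heads_per_group + 2) * head_dim
    for group in range(num_kv_heads):
        start = group * rows_per_group
        q_end = start + heads_per_group * head_dim  # end of this group's Q heads
        k_end = q_end + head_dim                    # end of this group's K head
        v_end = k_end + head_dim                    # end of this group's V head
        q_rows.extend(fused_rows[start:q_end])
        k_rows.extend(fused_rows[q_end:k_end])
        v_rows.extend(fused_rows[k_end:v_end])
    return q_rows, k_rows, v_rows


# Toy example: 4 query heads, 2 KV heads, head_dim of 1, labelled rows.
toy = [["q0"], ["q1"], ["k0"], ["v0"], ["q2"], ["q3"], ["k1"], ["v1"]]
q, k, v = split_fused_qkv(toy, num_heads=4, num_kv_heads=2, head_dim=1)
print(q, k, v)
```

With real weights the same slicing would be applied along the output dimension of the fused tensor. If the checkpoint instead stores all Q rows, then all K rows, then all V rows, three contiguous slices would be enough.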

NVIDIA org

Hi all -- regarding inference APIs, you can use the model on https://build.nvidia.com/nvidia/nemotron-4-340b-instruct. There's an interactive widget there as well as an API you can use.

@nealv I think one of the main reasons people would like the model released in HF format is to more easily create quantizations with the intent of running inference on a local stack.

Anything in the works from your team that might assist with that effort?

NVIDIA org

@ZQ-Dev yep, we're working on it. As @failspy pointed out, we'd need to modify and upstream the model class as well.
We're looking at fp8 quantization too. Hopefully that will make it easier to deploy.

@nealv +1 to fp8, as 8xA100 nodes are much more readily available than 16x at this time.
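The 8x versus 16x point can be checked with rough arithmetic. The sketch below counts weight memory only and assumes 80 GB per A100 and the 340B parameter count implied by the model name. It ignores KV cache, activations and framework overhead, so real requirements are higher. The helper names are hypothetical.

```python
NUM_PARAMS = 340_000_000_000  # 340B, taken from the model name
GPU_MEMORY_GB = 80            # ASSUMPTION: 80 GB A100s


def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def weights_fit(num_params, bytes_per_param, num_gpus, gpu_memory_gb=GPU_MEMORY_GB):
    """True if the weights alone fit in the combined GPU memory."""
    return weight_memory_gb(num_params, bytes_per_param) <= num_gpus * gpu_memory_gb


for name, bytes_per_param in (("bf16", 2), ("fp8", 1)):
    needed = weight_memory_gb(NUM_PARAMS, bytes_per_param)
    for num_gpus in (8, 16):
        verdict = "fits" if weights_fit(NUM_PARAMS, bytes_per_param, num_gpus) else "does not fit"
        print(f"{name}: {needed:.0f} GB of weights {verdict} on {num_gpus}x{GPU_MEMORY_GB} GB")
```

Under these assumptions the 16-bit weights come to 680 GB, more than the 640 GB on an 8-GPU node, while 8-bit weights come to 340 GB.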

There's now a paid bounty for this to get closed ASAP. $175 and growing.
https://x.com/natolambert/status/1814735390877884823
