---
base_model: HuggingFaceH4/zephyr-7b-beta
inference: true
model_type: mistral
quantized_by: robertgshaw2
tags:
- nm-vllm
- marlin
- int4
---

## zephyr-7b-beta-marlin

This repo contains model files for [zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) optimized for [nm-vllm](https://github.com/neuralmagic/nm-vllm), a high-throughput serving engine for compressed LLMs.

This model was quantized with [GPTQ](https://arxiv.org/abs/2210.17323) and saved in the Marlin format for efficient 4-bit inference. Marlin is a highly optimized inference kernel for 4-bit models.

## Inference

Install [nm-vllm](https://github.com/neuralmagic/nm-vllm) for fast inference and low memory usage:

```bash
pip install nm-vllm[sparse]
```

Run in a Python pipeline for local inference:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "neuralmagic/zephyr-7b-beta-marlin"

# Load the Marlin-format checkpoint and the matching tokenizer.
model = LLM(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "What is quantization in machine learning?"},
]

# Render the chat messages into the prompt format the model expects.
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling_params = SamplingParams(max_tokens=200)
outputs = model.generate(formatted_prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

"""
Sure! Here's a simple recipe for banana bread:

Ingredients:
- 3-4 ripe bananas,mashed
- 1 large egg
- 2 Tbsp. Flour
- 2 tsp. Baking powder
- 1 tsp. Baking soda
- 1/2 tsp. Ground cinnamon
- 1/4 tsp. Salt
- 1/2 cup butter, melted
- 3 Cups All-purpose flour
- 1/2 tsp. Ground cinnamon

Instructions:

1. Preheat your oven to 350 F (175 C).
"""
```

## Quantization

To see how this model was quantized and converted to the Marlin format, refer to the `quantization/apply_gptq_save_marlin.py` script. To reproduce the process, run:

```bash
pip install -r quantization/requirements.txt
python3 quantization/apply_gptq_save_marlin.py --model-id HuggingFaceH4/zephyr-7b-beta --save-dir ./zephyr-marlin
```

## Slack

For further support, and discussions on these models and AI in general, join [Neural Magic's Slack Community](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ)