---
base_model: Writer/Palmyra-Med-70B
tags:
- fp8
- vllm
- medical
- med
license: other
license_name: writer-open-model-license
license_link: https://writer.com/legal/open-model-license/
language:
- en
quantized_by: bprice9
base_model_relation: quantized
pipeline_tag: text-generation
---

# Palmyra-Medical-70B-FP8

This is a quantized version of [Palmyra-Med-70B](https://huggingface.co/Writer/Palmyra-Med-70B), which was developed by Writer.

The original model scores an average of 85.87% on biomedical benchmarks. **This quantized version achieves an average score of 85.62%.**

## Model Overview:

- **Model:** Llama-based model fine-tuned to form Palmyra-X-004, then fine-tuned again to form Palmyra-Med-70B.
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Intended Use Cases:** Palmyra-Medical-70B-FP8 is intended for non-commercial and research use in English. Instruction-tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- **License(s):** [writer-open-model-license](https://writer.com/legal/open-model-license/)

### Writer Resources and Technical Documentation:

+ [Writer Blog](https://writer.com/blog/palmyra-med-fin-models/)
+ [Writer Developer Website](https://dev.writer.com/home/models)
+ [Writer AI Studio](https://writer.com/product/ai-studio/)
+ [Palmyra Model API](https://dev.writer.com/api-guides/chat-completion)

### Model Optimizations

This model was quantized using the [LLM_Compressor](https://github.com/vllm-project/llm-compressor) library. With this optimization, the original FP16 weights and linear activations within the transformer blocks are converted to FP8, which reduces the model size and VRAM requirements by approximately 50%.
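The size reduction described above follows from storage width: FP16 stores each weight in two bytes, FP8 in one. The snippet below is a minimal sketch of that arithmetic, assuming roughly 70 billion parameters (inferred from the model name, not stated in this card) and counting weights only. The names `NUM_PARAMS`, `BYTES_PER_PARAM`, and `weight_memory_gb` are illustrative and not part of any library.

```python
# Back-of-the-envelope weight-memory estimate (weights only; KV cache,
# activations, and runtime overhead are not included).

NUM_PARAMS = 70e9  # assumption: ~70B parameters, inferred from the model name
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


fp16_gb = weight_memory_gb(NUM_PARAMS, "fp16")
fp8_gb = weight_memory_gb(NUM_PARAMS, "fp8")
reduction = 1 - fp8_gb / fp16_gb

print(f"FP16 weights: ~{fp16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
print(f"Reduction:    ~{reduction:.0%}")
```

The real saving is likely slightly below this estimate, because the quantization recipe in the Creation section leaves `lm_head` unquantized.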
## Deployment with vLLM

This model can be deployed using the [vLLM](https://docs.vllm.ai/en/latest/) library, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "bprice9/Palmyra-Medical-70B-FP8"
number_gpus = 2  # number of GPUs to shard the model across

# Greedy decoding (temperature=0.0); stop on the Llama 3 end-of-text
# and end-of-turn token IDs.
sampling_params = SamplingParams(
    temperature=0.0,
    top_p=0.9,
    max_tokens=512,
    stop_token_ids=[128001, 128009],
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": "Give a differential for an intrahepatic lesion with early arterial phase enhancement and rapid washout.",
    },
]

# Render the chat messages into a single prompt string.
prompts = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

## Creation

This model was created by applying [LLM Compressor with calibration samples from UltraChat](https://github.com/vllm-project/llm-compressor/blob/sa/big_model_support/examples/big_model_offloading/big_model_w8a8_calibrate.py), as presented in the code below.
```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import (
    calculate_offload_device_map,
    custom_offload_device_map,
)

# Static per-tensor FP8 quantization of weights and input activations
# for every Linear layer except lm_head.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
"""

model_stub = "Writer/Palmyra-Med-70B"
model_name = model_stub.split("/")[-1]

# Build a device map that spreads the FP16 model across 2 GPUs,
# offloading whatever does not fit.
device_map = calculate_offload_device_map(
    model_stub, reserve_for_hessians=False, num_gpus=2, torch_dtype=torch.float16
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.float16, device_map=device_map
)
tokenizer = AutoTokenizer.from_pretrained(model_stub)

output_dir = f"./{model_name}-FP8"

DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 128
MAX_SEQUENCE_LENGTH = 4096

# Draw a fixed random subset of UltraChat as calibration data.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


def preprocess(example):
    # Render each conversation with the model's chat template.
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Calibrate, quantize, and save the compressed checkpoint in one pass.
oneshot(
    model=model,
    output_dir=output_dir,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    save_compressed=True,
)
```

## Evaluation

| Biomedical Benchmark | Med-PaLM-2 (5-shot) | GPT-4 | Palmyra-Med-70B (Original FP16) | Palmyra-Medical-70B-FP8 (This Model) |
| :--- | :---: | :---: | :---: | :---: |
| MMLU Clinical Knowledge | 88.3 | 86.0 | 90.9 | 90.2 |
| MMLU Medical Genetics | 90.0 | 91.0 | 94.0 | 93.0 |
| MMLU Anatomy | 77.8 | 80.0 | 83.7 | 83.7 |
| MMLU Professional Medicine | 95.2 | 93.0 | 92.7 | 92.3 |
| MMLU College Biology | 94.4 | 95.1 | 94.4 | 93.8 |
| MMLU College Medicine | 80.9 | 76.9 | 84.4 | 84.4 |
| MedQA 4-options | 79.9 | 78.9 | 78.6 | 79.5 |
| PubMed QA | 79.2 | 75.2 | 79.6 | 78.0 |
| MedMCQA | 71.3 | 69.5 | 74.4 | 75.7 |
| **Average** | 84.1 | 82.8 | 85.9 | 85.6 |
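
The Average row is the unweighted mean of the nine benchmark scores. As a quick sanity check, the sketch below recomputes it for the two Palmyra columns and derives the share of the original average that the FP8 model retains. The scores are copied from the table; the variable names are illustrative.

```python
# Scores copied from the evaluation table, in row order
# (MMLU Clinical Knowledge through MedMCQA).
FP16 = "Palmyra-Med-70B (Original FP16)"
FP8 = "Palmyra-Medical-70B-FP8 (This Model)"

scores = {
    FP16: [90.9, 94.0, 83.7, 92.7, 94.4, 84.4, 78.6, 79.6, 74.4],
    FP8: [90.2, 93.0, 83.7, 92.3, 93.8, 84.4, 79.5, 78.0, 75.7],
}

# Unweighted mean across the nine benchmarks.
averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:.1f}")

# Fraction of the original model's average score kept after quantization.
retention = averages[FP8] / averages[FP16]
print(f"Average score retained after quantization: {retention:.1%}")
```

The recomputed means round to 85.9 and 85.6, matching the table, so the FP8 model retains about 99.7% of the original average score.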