This is an MoE of Llama-3-8B with 4 experts, built on the DeepSeek-MoE architecture. It does not use semantic routing; in fact there is no router and no gate at all: every expert is active on every token.
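For intuition, here is a minimal sketch of what a gateless MoE feed-forward block looks like when every expert runs on every token. This is illustrative only, not the modeling code shipped with this repo; the expert MLP layout (`hidden_size`/`intermediate_size` projections with SiLU) and the mean-combining of expert outputs are assumptions made for the sketch.

```python
import torch
import torch.nn as nn


class GatelessMoEMLP(nn.Module):
    """Illustrative only: all experts run on every token, with no router or gate.

    The expert layout and the averaging combiner are assumptions for this
    sketch, not the exact layers used by llama-3-4x8b.
    """

    def __init__(self, hidden_size: int, intermediate_size: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size, bias=False),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size, bias=False),
            )
            for _ in range(num_experts)
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # No gating network: every expert processes every token.
        expert_outputs = torch.stack([expert(hidden_states) for expert in self.experts])
        # Combine by averaging (an assumption for this sketch).
        return expert_outputs.mean(dim=0)
```

Because nothing is routed, the feed-forward cost scales with the number of experts; that is the trade-off for dropping the gate entirely. The snippet below shows how to run inference with the released checkpoint via Transformers: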
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_path = "Crystalcareai/llama-3-4x8b"

# Load the model in bfloat16, sharded across available GPUs.
# trust_remote_code=True allows loading the repo's custom MoE modeling code.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Stream generated tokens to stdout as they are produced.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# The prompt follows the Alpaca instruction template.
prompt = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Sam is faster than Joe. Joe is faster than Jane. Is Sam faster than Jane? Explain your reasoning step by step.

### Input:

### Response:
"""

tokens = tokenizer(
    prompt,
    return_tensors="pt",
).input_ids.cuda()

generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512,
)
```
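Note that `attn_implementation="flash_attention_2"` requires the `flash-attn` package and a supported CUDA GPU; if that is unavailable, dropping the argument (or switching it to `"sdpa"`) should still work, just more slowly.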