---
license: mit
---
|
### environment |
|
- optimum-neuron 0.0.25.dev0 (deprecated)
- optimum-neuron 0.0.25
- neuron 2.20.0
- transformers-neuronx 0.12.313
- transformers 4.43.2
|
|
|
|
|
### export |
|
```shell
optimum-cli export neuron \
  --model NousResearch/Meta-Llama-3.1-8B-Instruct \
  --batch_size 1 \
  --sequence_length 4096 \
  --num_cores 2 \
  --auto_cast_type fp16 \
  ./models-hf/NousResearch/Meta-Llama-3.1-8B-Instruct
```
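As a quick sanity check after the export, you can inspect the exported model's `config.json`: optimum-neuron records its compilation arguments there, typically under a `neuron` key. The key name is an assumption here, so this is only a sketch; open the file directly if the layout differs.

```python
import json
from pathlib import Path
from typing import Optional


def read_neuron_section(model_dir: str) -> Optional[dict]:
    """Return the neuron-specific export settings from config.json, if present.

    Assumption: optimum-neuron stores its compilation arguments (batch size,
    sequence length, number of cores, cast type) under a "neuron" key in the
    exported config.json. Returns None when that key is absent.
    """
    config_path = Path(model_dir) / "config.json"
    with config_path.open() as f:
        config = json.load(f)
    return config.get("neuron")


# Example (path from the export step above):
# print(read_neuron_section("./models-hf/NousResearch/Meta-Llama-3.1-8B-Instruct"))
```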
|
|
|
### run |
|
```shell
docker run -it --name llama-31 --rm \
  -p 8080:80 \
  -v /home/ec2-user/models-hf/:/models \
  -e HF_MODEL_ID=/models/NousResearch/Meta-Llama-3.1-8B-Instruct \
  -e MAX_INPUT_TOKENS=256 \
  -e MAX_TOTAL_TOKENS=4096 \
  -e MAX_BATCH_SIZE=1 \
  -e LOG_LEVEL="info,text_generation_router=debug,text_generation_launcher=debug" \
  --device=/dev/neuron0 \
  neuronx-tgi:latest \
  --model-id /models/NousResearch/Meta-Llama-3.1-8B-Instruct \
  --max-batch-size 1 \
  --max-input-tokens 256 \
  --max-total-tokens 4096
```
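Loading the compiled model onto the Neuron cores can take a few minutes, during which requests fail. A small polling sketch against TGI's `/health` endpoint (which answers 200 once the model is ready) avoids sending requests too early; the URL and timing values below are illustrative:

```python
import time
import urllib.error
import urllib.request


def wait_ready(base_url: str, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Poll the server's /health endpoint until it responds, or give up.

    Returns True on the first successful response, False once `timeout`
    seconds have elapsed without one. Connection errors (server still
    starting) are swallowed and retried after `interval` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass
        time.sleep(interval)
    return False


# Example, matching the port mapping above:
# wait_ready("http://127.0.0.1:8080")
```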
|
|
|
### test |
|
```shell
curl 127.0.0.1:8080/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
  -H 'Content-Type: application/json'
```
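The same request can be made from Python with only the standard library. This sketch mirrors the curl call above: the payload shape (`{"inputs": ..., "parameters": {...}}`) is TGI's `/generate` API, and the URL matches the port mapping from the run step.

```python
import json
import urllib.request


def build_payload(prompt: str, max_new_tokens: int) -> bytes:
    """Build the JSON body for TGI's /generate endpoint."""
    return json.dumps(
        {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    ).encode("utf-8")


def generate(prompt: str, max_new_tokens: int = 20,
             url: str = "http://127.0.0.1:8080/generate") -> dict:
    """POST a generation request and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=build_payload(prompt, max_new_tokens),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Example (requires the container from the run step to be up):
# print(generate("What is Deep Learning?")["generated_text"])
```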