---
license: mit
---

### environment

- optimum-neuron 0.0.25
- neuron 2.20.0
- transformers-neuronx 0.12.313
- transformers 4.45.2

### export

```
optimum-cli export neuron \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --batch_size 1 \
  --sequence_length 1024 \
  --num_cores 2 \
  --auto_cast_type fp16 \
  ./models-hf/meta-llama/Llama-3.2-1B-Instruct
```

### run

```
docker run -it --name llama-31 --rm \
  -p 8080:80 \
  -v /home/ec2-user/models-hf/:/models \
  -e HF_MODEL_ID=/models/meta-llama/Llama-3.2-1B-Instruct \
  -e MAX_INPUT_TOKENS=256 \
  -e MAX_TOTAL_TOKENS=4096 \
  -e MAX_BATCH_SIZE=1 \
  -e LOG_LEVEL="info,text_generation_router=debug,text_generation_launcher=debug" \
  --device=/dev/neuron0 \
  neuronx-tgi:latest \
  --model-id /models/meta-llama/Llama-3.2-1B-Instruct \
  --max-batch-size 1 \
  --max-input-tokens 256 \
  --max-total-tokens 1024
```

### test

```
curl 127.0.0.1:8080/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
  -H 'Content-Type: application/json'
```
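The same request can be sent from Python instead of curl. This is a minimal sketch using only the standard library; `build_generate_payload` is a hypothetical helper (not part of TGI) that just assembles the JSON body the `/generate` endpoint above expects.

```python
import json
import urllib.request

# Hypothetical helper: builds the JSON body for the TGI /generate endpoint,
# matching the curl example above ("inputs" plus a "parameters" object).
def build_generate_payload(prompt: str, max_new_tokens: int = 20) -> str:
    body = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return json.dumps(body)

# Sending the request requires the container from the "run" step to be
# listening on 127.0.0.1:8080; uncomment to try it against a live server.
# req = urllib.request.Request(
#     "http://127.0.0.1:8080/generate",
#     data=build_generate_payload("What is Deep Learning?").encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```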