Given setup scripts don't work
Hi, thanks for making this resource available. I've been trying to get inference for this model running on a Linux machine. The Hugging Face documentation doesn't say anything about setting up the Docker container beyond the container name, so I do the following to get it running:
docker pull nvcr.io/nvidia/nemo:24.01.framework
(I keep an external mount to store huggingface hub and models)
docker run \
  --gpus all \
  -it \
  --rm \
  --shm-size=16g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v /mnt/nvme/:/mnt/nvme/ \
  -v /mnt/efs/:/mnt/efs/ \
  --net=host \
  nvcr.io/nvidia/nemo:24.01.framework
(inside container)
pip install nemo-aligner
For the actual run script, just passing in Llama3-70B-SteerLM-RM for rm_model_file didn't work, so I had to run
git clone https://huggingface.co/nvidia/Llama3-70B-SteerLM-RM
(with git lfs installed) to get the model into my hub, and then export HF_HOME and HF_TOKEN. From there I set rm_model_file in the run script to the actual path of the Hugging Face hub download:
python /opt/NeMo-Aligner/examples/nlp/gpt/serve_reward_model.py \
  rm_model_file=/mnt/nvme/prasann/huggingface/hub/Llama3-70B-SteerLM-RM/ \
  trainer.num_nodes=1 \
  trainer.devices=8 \
  ++model.tensor_model_parallel_size=8 \
  ++model.pipeline_model_parallel_size=1 \
  inference.micro_batch_size=1 \
  inference.port=1424
This series of steps was required for me to get the server running in my Docker container (on a machine with eight 80GB GPUs).
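As a possible alternative to the git clone step, here is a minimal sketch that fetches the same repository with the huggingface_hub Python package. It assumes huggingface_hub is installed in the container and that HF_TOKEN is exported as described above. The helper name download_reward_model is hypothetical, and whether serve_reward_model.py treats the resulting directory exactly like the cloned one has not been verified here.

```python
import os

# Repository and target directory taken from the post above.
REPO_ID = "nvidia/Llama3-70B-SteerLM-RM"
LOCAL_DIR = "/mnt/nvme/prasann/huggingface/hub/Llama3-70B-SteerLM-RM"


def download_reward_model(local_dir=LOCAL_DIR, repo_id=REPO_ID):
    """Download every file in the repo into local_dir and return that path."""
    # Imported lazily so this file can be loaded even without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        token=os.environ.get("HF_TOKEN"),  # None falls back to a cached login
    )
```

Calling download_reward_model() would return the directory to pass as rm_model_file.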
The server seems to start up fine, uses the GPUs, and its log ends with this line:
I0621 15:26:05.340455 4623 model_lifecycle.cc:818] successfully loaded 'reward_model'
From here I check that the port and address are exposed, but when I actually run the given calling scripts:
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_openassistant_data.py --output_directory=data/oasst
(runs fine)
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/attribute_annotate.py \
  --input-file=data/oasst/train.jsonl \
  --output-file=data/oasst/train_labeled.jsonl \
  --port=1424
The second script leads to a timeout, and the pytriton server logs don't show any sign of receiving a request.
pytriton.client.exceptions.PyTritonClientTimeoutError: Timeout occurred during inference request. Timeout: 60.0 s Message: timed out
This snippet
import requests
try:
    response = requests.get("http://localhost:1424/v2")
    response.raise_for_status()  # Raises an HTTPError if the status is 4xx or 5xx
    print("Available endpoints:", response.json())
except requests.exceptions.HTTPError as errh:
    print("HTTP Error:", errh)
outputs fine:
Available endpoints: {'name': 'triton', 'version': '2.39.0', 'extensions': ['classification', 'sequence', 'model_repository', 'model_repository(unload_dependents)', 'schedule_policy', 'model_configuration', 'system_shared_memory', 'cuda_shared_memory', 'binary_tensor_data', 'parameters', 'statistics', 'trace', 'logging']}
As does this snippet
import requests
try:
    response = requests.get("http://localhost:1424/v2/health/ready")
    response.raise_for_status()  # Raises an HTTPError if the status is 4xx or 5xx
    print("Server is accessible and ready.")
except requests.exceptions.HTTPError as errh:
    print("HTTP Error:", errh)
This makes it seem that the server is accessible. Does anyone know what may be going wrong? I had to make several jumps in these setup steps since the given starter scripts didn't seem to work, so I was wondering whether the setup I've detailed above sounds OK. Alternatively, does anyone know of more detailed, up-to-date instructions for running and querying an inference server for this model anywhere in NVIDIA's documentation?
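One further check in the same spirit, offered as a sketch rather than a confirmed diagnosis: /v2/health/ready reports on the server as a whole, while Triton's HTTP API also has a per-model readiness route of the form /v2/models/<name>/ready. The model name reward_model is taken from the server log above. The helper names are hypothetical, and the standard library's urllib is used instead of requests only to keep the example dependency-free.

```python
import urllib.error
import urllib.request


def model_ready_url(port, model_name, host="localhost"):
    """Build the per-model readiness URL of Triton's HTTP API."""
    return f"http://{host}:{port}/v2/models/{model_name}/ready"


def is_model_ready(port, model_name, timeout_s=5.0):
    """Return True if the readiness probe is answered with HTTP 200."""
    url = model_ready_url(port, model_name)
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx answers.
        return False
```

For example, is_model_ready(1424, "reward_model") could be called before running attribute_annotate.py.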
Actually, I think I resolved the issue: it seems this was just a GPU allocation problem. Regardless, I'd be curious whether the setup procedure I followed is OK (especially loading the weights using git clone from the Hugging Face repo). If so, other people may be able to use this procedure to get it running.
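Since the cause turned out to be GPU allocation, a quick pre-flight check before launching the server may help others. The sketch below assumes nvidia-smi is on the PATH inside the container. The helper names and the 1 GiB "busy" threshold are illustrative choices rather than anything from the NeMo-Aligner scripts, and the exact nature of the allocation problem is not stated above.

```python
import subprocess

# nvidia-smi query that prints one "index, used, total" line per GPU (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used_MiB, total_MiB' lines into integer tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def busy_gpus(rows, threshold_mib=1024):
    """Return indices of GPUs already holding more than threshold_mib."""
    return [index for index, used, _total in rows if used > threshold_mib]


def check_gpus(required=8):
    """True if at least `required` GPUs are visible and none look occupied."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    rows = parse_gpu_memory(out)
    return len(rows) >= required and not busy_gpus(rows)
```

The default of 8 matches trainer.devices=8 and tensor_model_parallel_size=8 in the run script.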
Hi @PrasannSinghal, thank you for your interest in this model. We appreciate your patience.
I see a few issues, specifically in your docker command:
- --shm-size=16g: this seems too little for a 70B model. Please increase it to, say, 200GB if possible.
- --net=host: this shouldn't be used, as the intended use case is to have the server and client both running in the same container. If you need to call it from outside the container (i.e. have your client script outside), you can use port mapping directly instead of --net=host, which might have unexpected issues.
- No need to do pip install nemo-aligner: it is already installed in the container.
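Applied to the original command, these suggestions might look like the sketch below, which assembles the docker run arguments in Python so they can be inspected or passed to subprocess. The 200g value comes from the first point above, and the mounts are the ones from the original post. Publishing port 1424 with -p is only needed if the client runs outside the container, and mapping just that one port is an assumption (the server may expose others). The helper name build_docker_run is hypothetical.

```python
import shlex

IMAGE = "nvcr.io/nvidia/nemo:24.01.framework"


def build_docker_run(shm_size="200g", publish_port=None):
    """Assemble a docker run argv that does not use --net=host."""
    cmd = [
        "docker", "run", "--gpus", "all", "-it", "--rm",
        f"--shm-size={shm_size}",
        "--ulimit", "memlock=-1",
        "--ulimit", "stack=67108864",
        "-v", "/mnt/nvme/:/mnt/nvme/",
        "-v", "/mnt/efs/:/mnt/efs/",
    ]
    if publish_port is not None:
        # Map the host port to the same container port instead of --net=host.
        cmd += ["-p", f"{publish_port}:{publish_port}"]
    cmd.append(IMAGE)
    return cmd


# Print a copy-pasteable command line for the external-client case.
print(shlex.join(build_docker_run(publish_port=1424)))
```

With publish_port left as None, the command has no port mapping at all, which matches the case where server and client share one container.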
We tested our scripts in a SLURM environment, which is why we didn't share the specifics: they differ slightly for every user depending on their setup. If you have further questions, I'm happy to follow up here or through the email listed in the contact section of the README.
Sounds good, thanks for the info!