NeMo

Apex Error

#10
by RoshanJoe - opened

We have already installed Apex, but we are still getting this import error. Could you please look into it and suggest a solution?

Error executing job with overrides: ['gpt_model_file=/home/new_env/Nemotron-4-340B-Instruct', 'pipeline_model_parallel_split_rank=0', 'server=True', 'tensor_model_parallel_size=8', 'trainer.precision=bf16', 'pipeline_model_parallel_size=2', 'trainer.devices=8', 'trainer.num_nodes=2', 'web_server=False', 'port=1424']
Traceback (most recent call last):
File "/home/new_env/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py", line 178, in main
strategy=NLPDDPStrategy(timeout=datetime.timedelta(seconds=18000)),
File "home/new_env/lib/python3.10/site-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 172, in __init__
raise ImportError(
ImportError: Apex was not found. Please see the NeMo README for installation instructions: https://github.com/NVIDIA/NeMo#megatron-gpt.

Apex Details

Name: apex
Version: 0.1
Summary: PyTorch Extensions written by NVIDIA
Home-page: UNKNOWN
Author:
Author-email:
License: UNKNOWN
Location: /home/new_env/lib/python3.10/site-packages
Requires: packaging
Required-by:

Would be helpful to have some additional info:

  1. Are you using a Docker container? If yes, what is the Dockerfile? If not, which version of NeMo are you using?
  2. Can you run Python manually, try the following import statements, and report which ones fail?
import apex
from apex.transformer.pipeline_parallel.utils import get_num_microbatches
from nemo.core.optim.distributed_adam import MegatronDistributedFusedAdam
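
One way to run these checks in a single pass is a small script that attempts each import and records the error, if any. This is only a minimal sketch: the helper name `check_imports` is hypothetical and not part of NeMo or Apex, and it imports whole modules rather than the specific names listed above, so a missing function inside an importable module would still need a manual check.

```python
import importlib


def check_imports(module_names):
    """Try importing each module; map name -> None on success or the error text."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # a broken build can raise more than ImportError
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    # Modules behind the three import statements suggested in this thread.
    modules = [
        "apex",
        "apex.transformer.pipeline_parallel.utils",
        "nemo.core.optim.distributed_adam",
    ]
    for name, error in check_imports(modules).items():
        print(f"{name}: {'OK' if error is None else error}")
```

Posting the full output of such a script makes it easier to see which layer (Apex itself, its transformer utilities, or the NeMo optimizer wrapper) is the first to fail.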

We tried the import statements above as suggested, but we are still facing issues. Could you please check?

NVIDIA-SMI 555.42.06 Driver Version: 555.42.06 CUDA Version: 12.5
python 3.10
open-clip-torch 2.24.0
pytorch-lightning 2.0.7
torch 2.3.1
torchdiffeq 0.2.4
torchmetrics 1.4.0.post0
torchsde 0.2.6
torchvision 0.18.1

This is our environment.

Traceback (most recent call last):
File "/home/Setup/new_env/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py", line 26, in <module>
from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/__init__.py", line 15, in <module>
from nemo.collections.nlp import data, losses, models, modules
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/data/__init__.py", line 42, in <module>
from nemo.collections.nlp.data.zero_shot_intent_recognition.zero_shot_intent_dataset import (
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/data/zero_shot_intent_recognition/__init__.py", line 16, in <module>
from nemo.collections.nlp.data.zero_shot_intent_recognition.zero_shot_intent_dataset import (
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/data/zero_shot_intent_recognition/zero_shot_intent_dataset.py", line 30, in <module>
from nemo.collections.nlp.parts.utils_funcs import tensor2list
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/parts/__init__.py", line 17, in <module>
from nemo.collections.nlp.parts.utils_funcs import list2str, tensor2list
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/parts/utils_funcs.py", line 37, in <module>
from nemo.collections.nlp.modules.common.megatron.utils import erf_gelu
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/modules/__init__.py", line 16, in <module>
from nemo.collections.nlp.modules.common import (
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/modules/common/__init__.py", line 36, in <module>
from nemo.collections.nlp.modules.common.tokenizer_utils import get_tokenizer, get_tokenizer_list
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/modules/common/tokenizer_utils.py", line 29, in <module>
from nemo.collections.nlp.parts.nlp_overrides import HAVE_MEGATRON_CORE
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 23, in <module>
from nemo.core.optim.distributed_adam import MegatronDistributedFusedAdam
File "/home/Setup/new_env/lib/python3.10/site-packages/nemo/core/optim/distributed_adam.py", line 19, in <module>
from apex.contrib.optimizers.distributed_fused_adam import (
File "/home/Setup/new_env/lib/python3.10/site-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 31, in <module>
import amp_C
ModuleNotFoundError: No module named 'amp_C'

Ok thanks, looks like your Apex install is broken somehow. Maybe try what's mentioned in https://github.com/NVIDIA/apex/issues/1757 (or search for more related issues).
It's highly recommended to use the container from the model card as manual setup can indeed be tricky.
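
A "No module named 'amp_C'" error typically means the Python part of Apex is present but its compiled extensions were not built. A quick way to confirm this before reinstalling is to look for the extension modules without importing them. The sketch below is illustrative only: `amp_C` comes from the traceback above, while the other extension names and the helper `missing_modules` are assumptions made for the example.

```python
import importlib.util

# amp_C is the module named in the traceback; the other names are assumed
# examples of modules produced by Apex's optional C++/CUDA build steps.
APEX_EXTENSIONS = ["amp_C", "apex_C", "fused_layer_norm_cuda"]


def missing_modules(names):
    """Return the top-level module names that cannot be located (nothing is imported)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(APEX_EXTENSIONS)
    print("Missing compiled extensions:", missing or "none")
```

If `amp_C` shows up as missing, rebuilding Apex with its C++ and CUDA extensions enabled, as discussed in the linked issue, is the likely fix.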

Could you please share the link to the container mentioned in the model card?

The link is in the model card as far as I can tell (pull command: docker pull nvcr.io/nvidia/nemo:24.05) -- is that not sufficient?

I have executed the following commands:
1. docker pull nvcr.io/nvidia/nemo:24.05
2. docker run --gpus all -it --rm -v :/NeMo --shm-size=8g \
   -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit \
   stack=67108864 --device=/dev/snd nvcr.io/nvidia/pytorch:23.10-py3

After that, how can I connect to my Nemotron-4-340B-Instruct model?

It's likely that you won't be able to run inference on a single node for this model (you'd need an FP8 checkpoint, which hasn't been released yet). This makes things a bit more complex, and is the reason why the model card "Usage" section relies on SLURM for two-node inference.

I'm not actually sure how to run two-node inference manually, but you'd need to execute megatron_gpt_eval.py on both nodes, something like

/usr/bin/python3 /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py \
        gpt_model_file=$NEMO_FILE \
        pipeline_model_parallel_split_rank=0 \
        server=True tensor_model_parallel_size=8 \
        trainer.precision=bf16 pipeline_model_parallel_size=2 \
        trainer.devices=8 \
        trainer.num_nodes=2 \
        web_server=False \
        port=1424

(the part I'm not sure about is how to get the two nodes to know about each other -- the underlying NeMo code is based on PyTorch Lightning so you may need to check its docs on how to do multi-node, maybe with torchrun?)
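
For anyone who wants to try the torchrun route, the sketch below builds the per-node command line so that the same overrides are used on both machines. It is untested with this model: whether megatron_gpt_eval.py behaves correctly under torchrun is an assumption, and the helper name `torchrun_command`, the example address and the example checkpoint path are hypothetical. Only the torchrun flags (`--nnodes`, `--nproc_per_node`, `--node_rank`, `--master_addr`, `--master_port`) and the overrides from the command above are taken as given.

```python
import shlex

EVAL_SCRIPT = "/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py"


def torchrun_command(node_rank, master_addr, nemo_file,
                     num_nodes=2, gpus_per_node=8, master_port=29500):
    """Build the argv list for launching the eval script on one node via torchrun."""
    return [
        "torchrun",
        f"--nnodes={num_nodes}",
        f"--nproc_per_node={gpus_per_node}",
        f"--node_rank={node_rank}",        # 0 on the first node, 1 on the second
        f"--master_addr={master_addr}",    # address of the rank-0 node
        f"--master_port={master_port}",
        EVAL_SCRIPT,
        f"gpt_model_file={nemo_file}",
        "pipeline_model_parallel_split_rank=0",
        "server=True",
        "tensor_model_parallel_size=8",
        "trainer.precision=bf16",
        "pipeline_model_parallel_size=2",
        f"trainer.devices={gpus_per_node}",
        f"trainer.num_nodes={num_nodes}",
        "web_server=False",
        "port=1424",
    ]


if __name__ == "__main__":
    # Hypothetical address and path, for illustration only.
    for rank in (0, 1):
        cmd = torchrun_command(rank, "10.0.0.1", "/models/checkpoint.nemo")
        print(f"# run on node {rank}:")
        print(shlex.join(cmd))
```

Each printed command would be run on its own node, with both nodes able to reach the rank-0 address on the chosen port.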

Once you manage to launch the model server, it should be easy to call it by doing something like the call_server.py script in the model card.
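
As a rough illustration of what such a client call might look like, the sketch below prepares an HTTP request for the server port used above. It is not the model card's script: the `/generate` path, the PUT method and the JSON field names `sentences` and `tokens_to_generate` are assumptions about the NeMo text generation server and should be checked against call_server.py, and the helper name `build_generate_request` is hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt, host="localhost", port=1424, tokens_to_generate=64):
    """Prepare (but do not send) a generation request for the text generation server."""
    payload = {
        "sentences": [prompt],                     # assumed field name
        "tokens_to_generate": tokens_to_generate,  # assumed field name
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/generate",      # assumed endpoint
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Sending the request requires a running server, for example:
#   with urllib.request.urlopen(build_generate_request("Hello")) as resp:
#       print(json.loads(resp.read()))
```

Building the request separately from sending it makes it easy to inspect the URL and payload before pointing the client at a live server.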

RoshanJoe changed discussion status to closed
RoshanJoe changed discussion status to open
