Experiencing issue with 'Can't load tokenizer'

#1
by metagenix-ai - opened

Hi there,

First, thank you for this wonderful project!

Unfortunately, I ran into a problem right at the start when executing the code "pipe = pipeline("text-generation", model="Esperanto/Protein-Llama-3-8B")".

As a result, the following error was raised:
OSError: Can't load tokenizer for 'Esperanto/Protein-Llama-3-8B'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'Esperanto/Protein-Llama-3-8B' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

I tried to solve this by upgrading transformers (pip install --upgrade transformers), but this did not help. I also downloaded the large files, with the same result. Do you have any suggestions?

Thanks in advance!

Esperanto Technologies org

Hey,
Thanks for reporting this bug! This should be fixed now; the tokenizer has been uploaded.
If it works, we'll close this issue!
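
One way to confirm the upload independently of the pipeline is to list the repository files and look for the tokenizer files. The sketch below is only an illustration: `tokenizer_files_present` and `check_repo` are hypothetical helper names, the file names `tokenizer.json` and `tokenizer.model` are an assumption about what a fast Llama tokenizer is built from, and `list_repo_files` comes from the `huggingface_hub` package and needs network access.

```python
from typing import Iterable, List

# Assumption: a fast Llama tokenizer can be loaded from either of these files.
TOKENIZER_FILES = ("tokenizer.json", "tokenizer.model")


def tokenizer_files_present(repo_files: Iterable[str]) -> List[str]:
    """Return the tokenizer files found at the top level of a file listing."""
    names = set(repo_files)
    return [name for name in TOKENIZER_FILES if name in names]


def check_repo(repo_id: str) -> List[str]:
    """Fetch the file listing from the Hub and report tokenizer files.

    Imported lazily so the pure helper above works without huggingface_hub.
    """
    from huggingface_hub import list_repo_files

    found = tokenizer_files_present(list_repo_files(repo_id))
    print("tokenizer files found:", found if found else "none")
    return found
```

Calling `check_repo("Esperanto/Protein-Llama-3-8B")` prints which of the assumed tokenizer files are present. If the list is non-empty and the error persists, the cause is more likely on the local side than in the repository.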

Thank you for your swift update!
This bug still exists.

Here is the complete error message:


OSError Traceback (most recent call last)
Input In [1], in
15 from transformers import pipeline
17 messages = [
18 {"role": "user", "content": "Who are you?"},
19 ]
---> 20 pipe = pipeline("text-generation", model="Esperanto/Protein-Llama-3-8B")
21 pipe(messages)

File /opt/miniconda3/lib/python3.9/site-packages/transformers/pipelines/__init__.py:1033, in pipeline(task, model, config, tokenizer, feature_extractor, image_processor, processor, framework, revision, use_fast, token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
1030 tokenizer_kwargs = model_kwargs.copy()
1031 tokenizer_kwargs.pop("torch_dtype", None)
-> 1033 tokenizer = AutoTokenizer.from_pretrained(
1034 tokenizer_identifier, use_fast=use_fast, _from_pipeline=task, **hub_kwargs, **tokenizer_kwargs
1035 )
1037 if load_image_processor:
1038 # Try to infer image processor from model or config name (if provided as str)
1039 if image_processor is None:

File /opt/miniconda3/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:939, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
936 tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
938 if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
--> 939 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
940 else:
941 if tokenizer_class_py is not None:

File /opt/miniconda3/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2197, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
2194 # If one passes a GGUF file path to gguf_file there is no need for this check as the tokenizer will be
2195 # loaded directly from the GGUF file.
2196 if all(full_file_name is None for full_file_name in resolved_vocab_files.values()) and not gguf_file:
-> 2197 raise EnvironmentError(
2198 f"Can't load tokenizer for '{pretrained_model_name_or_path}'. If you were trying to load it from "
2199 "'https://huggingface.co/models', make sure you don't have a local directory with the same name. "
2200 f"Otherwise, make sure '{pretrained_model_name_or_path}' is the correct path to a directory "
2201 f"containing all relevant files for a {cls.__name__} tokenizer."
2202 )
2204 for file_id, file_path in vocab_files.items():
2205 if file_id not in resolved_vocab_files:

OSError: Can't load tokenizer for 'Esperanto/Protein-Llama-3-8B'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'Esperanto/Protein-Llama-3-8B' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

Esperanto Technologies org

Hi,
Could you share the version of the transformers library you are currently using? Instead of upgrading, you might try downgrading to an earlier version (e.g., 4.38.0), as this could resolve the issue.

I used transformers version 4.42.0 (current) and downgraded to 4.38.0 (pip install --upgrade transformers==4.38.0).
The error remains unchanged after downgrade. Any further suggestions?

Esperanto Technologies org

Can you share your notebook? We can take a look at it.

Sure.
Jupyter notebook with miniconda3 using Python 3.9.5 as Kernel:

import transformers
print(transformers.__version__)


Output: 4.38.0


# Use a pipeline as a high-level helper

from transformers import pipeline
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="Esperanto/Protein-Llama-3-8B")
pipe(messages)


Output: 2024-11-22 06:26:21.652864: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

OSError
Input In [2], in
16 from transformers import pipeline
18 messages = [
19 {"role": "user", "content": "Who are you?"},
20 ]
---> 21 pipe = pipeline("text-generation", model="Esperanto/Protein-Llama-3-8B")
22 pipe(messages)

File /opt/miniconda3/lib/python3.9/site-packages/transformers/pipelines/__init__.py:1004, in pipeline(task, model, config, tokenizer, feature_extractor, image_processor, framework, revision, use_fast, token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
1001 tokenizer_kwargs = model_kwargs.copy()
1002 tokenizer_kwargs.pop("torch_dtype", None)
-> 1004 tokenizer = AutoTokenizer.from_pretrained(
1005 tokenizer_identifier, use_fast=use_fast, _from_pipeline=task, **hub_kwargs, **tokenizer_kwargs
1006 )
1008 if load_image_processor:
1009 # Try to infer image processor from model or config name (if provided as str)
1010 if image_processor is None:

File /opt/miniconda3/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:843, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
841 tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
842 if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
--> 843 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
844 else:
...
2037 )
2039 for file_id, file_path in vocab_files.items():
2040 if file_id not in resolved_vocab_files:

OSError: Can't load tokenizer for 'Esperanto/Protein-Llama-3-8B'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'Esperanto/Protein-Llama-3-8B' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.


# Uncontrollable generation can be handled via prompting the model with the phrase 'Seq=<'.
generator = pipeline('text-generation', model="Esperanto/Protein-Llama-3-8B")

sequences = generator(
    "Seq=<",
    temperature=0.2,
    top_k=40,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.2,
    max_new_tokens=30,
    num_return_sequences=500,
)

for sequence in sequences:
    print(sequence['generated_text'])


# Controllable generation can be done by prompting the model with '[Generate xxx protein] Seq=<'. Here, xxx can be any family from the 10 classes supported by this model.
generator = pipeline('text-generation', model="Esperanto/Protein-Llama-3-8B")

sequences = generator(
    "[Generate Ligase enzyme protein] Seq=<",
    temperature=0.2,
    top_k=40,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.2,
    max_new_tokens=30,
    num_return_sequences=500,
)

for sequence in sequences:
    print(sequence['generated_text'])

Problem is solved.
The Jupyter notebook file was not in the right directory. After moving it into 'Esperanto/Protein-Llama-3-8B', the script works fine now.
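
For anyone landing here with the same error: the message itself points at a local directory with the same name as the model id. Below is a minimal sketch of a check for that, assuming the model id is resolved as a relative path against the current working directory. `find_shadowing_dir` is a hypothetical helper written for this post, not part of transformers.

```python
import os
from typing import Optional

MODEL_ID = "Esperanto/Protein-Llama-3-8B"


def find_shadowing_dir(model_id: str, base_dir: Optional[str] = None) -> Optional[str]:
    """Return a local directory whose relative path equals `model_id`, or None.

    Assumption: such a directory is picked up as a local checkpoint
    instead of the Hub repository with the same name.
    """
    base_dir = base_dir or os.getcwd()
    candidate = os.path.join(base_dir, *model_id.split("/"))
    return candidate if os.path.isdir(candidate) else None


shadow = find_shadowing_dir(MODEL_ID)
if shadow is None:
    print("No local directory shadows the model id.")
else:
    # Listing the contents shows whether the tokenizer files are actually there.
    print("Local directory found:", shadow)
    print("Contents:", sorted(os.listdir(shadow)))
```

If a directory is reported and it contains only the weight files, that would be consistent with the error above; renaming the directory or running the notebook from a different location are two ways to test this.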


Thanks for your support!!

metagenix-ai changed discussion status to closed
