Crash while loading tokenizer

#1
by legraphista - opened
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('THUDM/LongCite-llama3.1-8b', trust_remote_code=True)

results in

FileNotFoundError                         Traceback (most recent call last)
Cell In[2], line 1
----> 1 tokenizer = AutoTokenizer.from_pretrained('THUDM/LongCite-llama3.1-8b', trust_remote_code=True)

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:847, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    845     if os.path.isdir(pretrained_model_name_or_path):
    846         tokenizer_class.register_for_auto_class()
--> 847     return tokenizer_class.from_pretrained(
    848         pretrained_model_name_or_path, *inputs, trust_remote_code=trust_remote_code, **kwargs
    849     )
    850 elif config_tokenizer_class is not None:
    851     tokenizer_class = None

File ~/.cache/huggingface/modules/transformers_modules/THUDM/LongCite-llama3.1-8b/8265f5e5bceab232605db43e6e0c6579ff941354/tiktoken_tokenizer.py:58, in TikTokenizer.from_pretrained(path, *inputs, **kwargs)
     56 @staticmethod
     57 def from_pretrained(path, *inputs, **kwargs):
---> 58     return TikTokenizer(vocab_file=os.path.join(path, "tokenizer.tiktoken"))

File ~/.cache/huggingface/modules/transformers_modules/THUDM/LongCite-llama3.1-8b/8265f5e5bceab232605db43e6e0c6579ff941354/tiktoken_tokenizer.py:67, in TikTokenizer.__init__(self, vocab_file)
     65 if vocab_file is not None:
     66     mergeable_ranks = {}
---> 67     with open(vocab_file) as f:
     68         for line in f:
     69             token, rank = line.strip().split()

FileNotFoundError: [Errno 2] No such file or directory: 'THUDM/LongCite-llama3.1-8b/tokenizer.tiktoken'
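
For anyone wondering why this fails: judging from the traceback, the custom tokenizer joins whatever string it is given with "tokenizer.tiktoken" and opens the result directly, so a Hub model id becomes a relative path on disk. The sketch below only reproduces that join; vocab_path and can_open_vocab are hypothetical helper names for illustration, not part of the model's code.

```python
import os
import tempfile


def vocab_path(path: str) -> str:
    # Mirrors the join shown in the traceback (tiktoken_tokenizer.py, line 58).
    return os.path.join(path, "tokenizer.tiktoken")


def can_open_vocab(path: str) -> bool:
    # True only if the joined path is a real file on the local disk.
    return os.path.isfile(vocab_path(path))


# With a Hub model id, the joined path is relative to the current directory.
print(vocab_path("THUDM/LongCite-llama3.1-8b"))

# With a local directory that really contains the file, the lookup succeeds.
with tempfile.TemporaryDirectory() as local_dir:
    open(os.path.join(local_dir, "tokenizer.tiktoken"), "w").close()
    print(can_open_vocab(local_dir))
```

This would explain why passing a local directory that contains tokenizer.tiktoken avoids the error.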

yes, the same issue.

A workaround is to download the model locally (with huggingface-cli download) and load it from the local path instead of the model id.

ok, thanks for your workaround!

Awesome, thank you for the workaround.

Here is a bit more detail for those who, like me, are using paths instead of ids for the first time :)

  1. huggingface-cli download THUDM/LongCite-llama3.1-8b

  2. Adjust for the local path. Important: the path must include the snapshot directory. Passing only /home/someuser/.cache/huggingface/hub/models--THUDM--LongCite-llama3.1-8b/ won't work.
    import torch  # needed for torch.bfloat16 below

    tokenizer = AutoTokenizer.from_pretrained('/home/someuser/.cache/huggingface/hub/models--THUDM--LongCite-llama3.1-8b/snapshots/58260b89bc2a547b814f44b89914b1e282b2d5cd/', trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        '/home/someuser/.cache/huggingface/hub/models--THUDM--LongCite-llama3.1-8b/snapshots/58260b89bc2a547b814f44b89914b1e282b2d5cd/',
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map='auto'
    )
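
If you would rather not copy the snapshot hash by hand, here is a minimal sketch that looks it up, assuming the cache layout shown in the path above (models--ORG--NAME/snapshots/HASH). find_snapshot_dir is a hypothetical helper, not a transformers or huggingface_hub function, and picking the most recently modified snapshot is my own assumption.

```python
import tempfile
from pathlib import Path


def find_snapshot_dir(hub_cache: str, repo_id: str) -> Path:
    """Return the most recently modified snapshot directory for repo_id.

    Assumes the layout <hub_cache>/models--ORG--NAME/snapshots/<hash>/.
    """
    repo_dir = Path(hub_cache) / ("models--" + repo_id.replace("/", "--"))
    snapshots_dir = repo_dir / "snapshots"
    if not snapshots_dir.is_dir():
        raise FileNotFoundError(f"No snapshots directory at {snapshots_dir}")
    snapshots = [p for p in snapshots_dir.iterdir() if p.is_dir()]
    if not snapshots:
        raise FileNotFoundError(f"No snapshots found in {snapshots_dir}")
    # Assumption: the newest snapshot is the one you want to load.
    return max(snapshots, key=lambda p: p.stat().st_mtime)


# Demo against a throwaway directory that imitates the cache layout.
with tempfile.TemporaryDirectory() as cache:
    fake = Path(cache) / "models--THUDM--LongCite-llama3.1-8b" / "snapshots" / "abc123"
    fake.mkdir(parents=True)
    print(find_snapshot_dir(cache, "THUDM/LongCite-llama3.1-8b").name)
```

The returned path can then be passed to from_pretrained in place of the hard-coded string. On a real machine, hub_cache would be something like /home/someuser/.cache/huggingface/hub.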

To the developers: Thank you for this amazing model. I had high expectations, and they have been surpassed.

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

Thanks for pointing out this bug. We have fixed it now.

NeoZ123 changed discussion status to closed