Is there any limit to the input text length?
I'm using the following code:
'''
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-morph', cache_dir=r"F:\nlp_project\py\dictabert-morph")
model = AutoModel.from_pretrained('dicta-il/dictabert-morph', trust_remote_code=True, cache_dir=r"F:\nlp_project\py\dictabert-morph")

# txt holds the full input text
res = model.predict([txt], tokenizer)
'''
The result only tokenizes and processes the first few tokens of the text.
I can't attach a JSON or txt file here, so I'm pasting the full output in the comments.
The model dictabert was pretrained with a window of 512 tokens, and when you input a longer sentence the predict function truncates it to the maximum length.
The model dictabert-morph was finetuned on sentence units (of lengths <512). Therefore, the model can't handle inputs longer than 512 tokens, and it is probably not ideal for handling multiple concatenated sentences.
I'd recommend splitting your input into a list of sentence units and then sending the list of sentences to the predict function, as in the sketch below. (Note: sending multiple sentences will result in them being run through the model as a single batch, so if resources are limited it's probably best to send them one at a time.)
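For illustration, here is a minimal sketch of that approach, assuming txt holds your original long text and that a naive regex split is good enough for your data (a dedicated sentence splitter would be more robust):

'''
import re

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-morph')
model = AutoModel.from_pretrained('dicta-il/dictabert-morph', trust_remote_code=True)

# Naive split on sentence-ending punctuation; replace with a proper
# Hebrew sentence splitter for production use.
sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', txt) if s.strip()]

# One sentence per call keeps memory usage low; pass the full list in a
# single call instead if you have the resources to run one batch.
results = []
for sent in sentences:
    # Assumes predict returns one result per input sentence, matching
    # the single-sentence call shown above.
    results.extend(model.predict([sent], tokenizer))
'''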
Thank you