Potentially incorrect tokenizer

#23
by drodel - opened

Hello,

I was trying to play with your model for a summarization task and the output is not understandable.

When I input text, I get output like this:

[{'summary_text': ' academic us ascend carpet than ask commence '
'concentrate content and raise de total at '}]

This is my code using the text from the huggingface transformers tutorial for your reference.

Thank you,
Dale


from pprint import pprint
from transformers import pipeline


input_text = """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""

summarizer = pipeline("summarization",
                      model="pszemraj/long-t5-tglobal-base-16384-book-summary")
params = {
    "max_length": 100,
    "min_length": 8,
    "no_repeat_ngram_size": 3,
    "early_stopping": True,
    "repetition_penalty": 3.5,
    "length_penalty": 0.3,
    "encoder_no_repeat_ngram_size": 3,
    "num_beams": 4,
}  # parameters for text generation out of model

pprint(summarizer(input_text, **params))

Hi, what package versions are you using? Try updating with pip install -U transformers

BTW, I don't want to discourage you from using this model, but unless there is a specific reason you want to use models trained on the BookSum dataset, I would recommend these models, which I trained more recently for general use. If you continue to have problems, you can try the pegasus-x model instead: https://hf.co/BEE-spoke-data/pegasus-x-base-synthsumm_open-16k

Hello pszemraj,

Thank you for the reply. I am not using your model for a specific reason. I am just playing with a variety of seq2seq models to see how they perform on summarization tasks, and yours landed on the list.

I forgot to mention that I am not running on a GPU. I am just doing toy problems on a laptop to get a feel for the models. This may be the root of the problem.

The version of the packages I think might be relevant are:
torch 2.5.1
transformers 4.47.1

transformers was installed using this command.

pip install -U 'transformers[torch]'

After running the upgrade command it changed transformers to version 4.48.0, but generated nonsense again.
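For anyone else reporting versions in a thread like this, here is a minimal sketch that reads them with the standard library's importlib.metadata. The helper name installed_versions is made up for illustration; the default package names are simply the two mentioned above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("torch", "transformers")):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Running it before and after an upgrade shows whether the upgrade actually landed in the environment the script uses.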

If this issue is not important to you, I am happy for you to not address it and close this thread. I will take a look at the models you suggested.

I sincerely appreciate your reply and thank you for contributing your work.
Dale

Hello,

After rereading your model card, I added the CUDA check to the pipeline call and things look better.

device=0 if torch.cuda.is_available() else -1
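For context, this is roughly how that line could fit into the pipeline call from the first post. It is a sketch, not the model card's exact code: pick_device and build_summarizer are hypothetical helper names, and the imports sit inside the function only so the small helper can be used without torch installed. On a machine without CUDA the expression resolves to -1, which selects the CPU.

```python
def pick_device(cuda_available):
    """Return the pipeline `device` index: 0 for the first GPU, -1 for CPU."""
    return 0 if cuda_available else -1


def build_summarizer():
    """Build the summarization pipeline on the GPU if one is available."""
    # Imported here so pick_device() can be used without these packages.
    import torch
    from transformers import pipeline

    return pipeline(
        "summarization",
        model="pszemraj/long-t5-tglobal-base-16384-book-summary",
        device=pick_device(torch.cuda.is_available()),
    )
```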

The output now looks like this, not the odd output from before.

[{'summary_text': 'parent origin which  went ran demandstan union back name '
                  'rule fact out over run and gut find " cons carpet When cu '
                  'Stewart more how where when see the up money hair The'}]

I am sorry to have bothered you.
Dale

Don't worry! After digging a bit deeper, it seems to be an issue that arose in transformers after version 4.42. I was able to replicate your issue and confirmed that 4.42 works (link to Colab).

I'll create an issue on the transformers GitHub to help solve it in the coming days. In the meantime, I'd recommend pip install transformers==4.42.0 (or earlier). I'll also update the relevant model card(s) soon so others are aware.
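If pinning is not convenient, a script could at least warn when it runs on a newer release. The sketch below assumes every release after 4.42.0 is affected; in this thread only 4.47.1 and 4.48.0 were reported as broken and 4.42.0 as working, so treat the cutoff as an assumption. The names LAST_KNOWN_GOOD, parse_release and is_known_good are made up for illustration.

```python
import re

# Assumption taken from this thread: 4.42.0 works, later releases may not.
LAST_KNOWN_GOOD = (4, 42, 0)


def parse_release(version_string):
    """Turn '4.48.0' (or '4.42.0.dev0') into a tuple of its leading integers."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    # A missing patch component (e.g. '4.42') is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def is_known_good(version_string, last_good=LAST_KNOWN_GOOD):
    """Return True if the release is at or below the last known-good one."""
    return parse_release(version_string) <= last_good


if __name__ == "__main__":
    # The three versions mentioned in this thread.
    for candidate in ("4.42.0", "4.47.1", "4.48.0"):
        status = "ok" if is_known_good(candidate) else "may produce garbled output"
        print(f"transformers {candidate}: {status}")
```

In a real script the string passed to is_known_good would come from transformers.__version__.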

Thanks again for pointing this out, and good call on creating the issue!

Much improved! I made a virtual environment with version 4.42.0 and it works as expected.

[{'summary_text': 'This paper discusses the recent changes in the field of '
                  'engineering in America. While there are still many talented '
                  'young men entering the field, there is a growing divide '
                  'between traditional and applied sciences. The majority of '
                  'universities concentrate on "engineering science" while '
                  'other areas focus on "high technology subjects."'}]

Thank you for your assistance.

drodel changed discussion status to closed
