9x token reduction #10
opened by Sijuade
Great work! I wanted to clarify the token reduction. Say the hidden size of the vision encoder is 1000. After the 9x reduction of the patches, the new hidden size is 9000. Does the projection layer then project from 9000 to 2560 (assuming that is the embed size of the language model)?
If so, that's a steep reduction; did you do anything else to make it work?
Thanks! To accommodate this, we train the projector for longer. We also ran a hyperparameter search, and the 9x reduction was chosen based on experimental results. If you reduce more aggressively, e.g. 81x, the results become much worse. We will disclose more details in a paper / technical report soon.
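For what it's worth, here is a minimal NumPy sketch of one plausible reading of the question: every group of 9 adjacent patch embeddings is concatenated along the feature axis (1000 -> 9000), and a single linear projector then maps into the language model's embedding space (2560). The sizes, the grouping scheme, and the projector shape are assumptions for illustration, not the authors' confirmed implementation.

```python
import numpy as np

# Assumed sizes from the discussion: vision hidden size 1000,
# language-model embed size 2560, 9x token reduction.
VISION_DIM, LM_DIM, GROUP = 1000, 2560, 9

def reduce_and_project(patches, W, b):
    """Concatenate every GROUP adjacent patch tokens along the feature
    axis, (T, D) -> (T // GROUP, GROUP * D), then apply a linear
    projection into the language model's embedding space."""
    T, D = patches.shape
    assert T % GROUP == 0, "token count must be divisible by the group size"
    merged = patches.reshape(T // GROUP, GROUP * D)  # (T/9, 9000)
    return merged @ W + b                            # (T/9, 2560)

rng = np.random.default_rng(0)
patches = rng.standard_normal((576, VISION_DIM))     # e.g. a 24x24 patch grid
W = rng.standard_normal((GROUP * VISION_DIM, LM_DIM)) * 0.01
b = np.zeros(LM_DIM)

out = reduce_and_project(patches, W, b)
print(out.shape)  # (64, 2560): 9x fewer tokens, projected to the LM width
```

So the projector itself is just one (large) linear layer from 9000 to 2560; the "steepness" comes from the concatenation, which is why a longer projector training schedule plausibly helps.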