9x token reduction #10
opened by Sijuade
Great work! I wanted to clarify the token reduction. Say the hidden size of the vision encoder is 1000. After the 9x reduction of the patches, the new hidden size is 9000. Does the projection layer then project from 9000 to 2560 (assuming that is the embed size of the language model)?
If so, that's a steep reduction; did you do anything else to make it work?
Thanks! To accommodate this, we train the projector for longer. We also ran a hyperparameter search, and the 9x reduction was chosen based on experimental results. If you reduce more aggressively, e.g. 81x, the results become much worse. We will disclose more details in a paper / technical report soon.
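For what it's worth, here is a minimal NumPy sketch of one plausible reading of the question: every group of 9 adjacent patch embeddings is concatenated along the feature axis (1000 -> 9000), and a single linear projector then maps into the language model's embedding space (2560). The sizes, the grouping scheme, and the projector shape are assumptions for illustration, not the authors' confirmed implementation.

```python
import numpy as np

# Assumed sizes from the discussion: vision hidden size 1000,
# language-model embed size 2560, 9x token reduction.
VISION_DIM, LM_DIM, GROUP = 1000, 2560, 9

def reduce_and_project(patches, W, b):
    """Concatenate every GROUP adjacent patch tokens along the feature
    axis, (T, D) -> (T // GROUP, GROUP * D), then apply a linear
    projection into the language model's embedding space."""
    T, D = patches.shape
    assert T % GROUP == 0, "token count must be divisible by the group size"
    merged = patches.reshape(T // GROUP, GROUP * D)  # (T/9, 9000)
    return merged @ W + b                            # (T/9, 2560)

rng = np.random.default_rng(0)
patches = rng.standard_normal((576, VISION_DIM))     # e.g. a 24x24 patch grid
W = rng.standard_normal((GROUP * VISION_DIM, LM_DIM)) * 0.01
b = np.zeros(LM_DIM)

out = reduce_and_project(patches, W, b)
print(out.shape)  # (64, 2560): 9x fewer tokens, projected to the LM width
```

So the projector itself is just one (large) linear layer from 9000 to 2560; the "steepness" comes from the concatenation, which is why a longer projector training schedule plausibly helps.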