Max doc token length

#5
by Meranti - opened

Hi,

Thank you for releasing the model! I am just wondering what max_doc size I should use for this SPLADE v3 model. 256, like the previous ones?

Thanks again,

Yuchen

NAVER LABS Europe org

Hi Yuchen,
In practice, you can use BERT's max length (so, 512). The models have been trained with lower values (128 or 256), but they should still work the same if you increase (or decrease) this value at inference.
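For concreteness, here is a minimal encoding sketch; the naver/splade-v3 checkpoint id and the transformers/torch calls are assumptions on my side, and max_length is the value in question:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "naver/splade-v3"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()

doc = "a long passage from the collection ..."
# BERT's positional limit is 512; the model was trained with 128/256,
# but other values can be used at inference.
inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Standard SPLADE aggregation: log-saturated ReLU, max-pooled over token
# positions, with padding masked out.
weights = torch.log1p(torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1)
doc_rep = weights.max(dim=1).values.squeeze(0)  # vocab-sized, mostly-zero vector
```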
Hope it helps
Thibault

Hi Thibault,

Thank you for the answer and have a nice day!

Best,

Yuchen

Oh by the way, I have another question.

I have now built a ClueWeb22-B index using the GitHub code, and I found that this line, https://github.com/naver/splade/blob/main/splade/indexing/inverted_index.py#L32, takes a long while to load the index into memory, which makes the whole SPLADE retrieval process take about 20 minutes.

I am thinking that, since this is a sparse retrieval model, there should be some way to accelerate the search process. Do you know of any efforts of this kind in the community?

Thanks again,

Yuchen

NAVER LABS Europe org

Hi Yuchen,
You can have a look here: https://github.com/TusKANNy/seismic
They have super fast retrieval algorithms for neural sparse retrievers!
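For what it's worth, engines like this consume explicit term/weight pairs rather than a vocab-sized vector. A rough sketch of that conversion, continuing from the encoding snippet earlier in the thread (doc_rep and tokenizer are assumed to come from there, and the JSON layout below is only a guess; the exact input schema Seismic expects is documented in its repository):

```python
import json

import torch  # doc_rep and tokenizer come from the earlier encoding sketch

# Keep only the non-zero dimensions and map vocabulary ids back to tokens.
idx = torch.nonzero(doc_rep, as_tuple=True)[0]
tokens = tokenizer.convert_ids_to_tokens(idx.tolist())
sparse_vec = {t: round(float(doc_rep[i]), 4) for t, i in zip(tokens, idx.tolist())}

# Hypothetical JSONL record with an id and a token -> weight map; adapt it to
# whatever format the retrieval engine actually ingests.
record = {"id": "doc-0", "vector": sparse_vec}
print(json.dumps(record)[:200])
```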

Thibault

tformal changed discussion status to closed
