Max doc token length

#5
by Meranti - opened

Hi,

Thank you for releasing the model! I am just wondering what max_doc size I should use for this SPLADE v3 model. 256, like the previous ones?

Thanks again,

Yuchen

NAVER LABS Europe org

Hi Yuchen,
In practice, you can use BERT's max length (so, 512). The models have been trained with lower values (128 or 256), but they should still work the same if you increase (or decrease) this value at inference.
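For concreteness, here is a minimal encoding sketch; the naver/splade-v3 checkpoint id and the transformers/torch calls are assumptions on my side, and max_length is the value in question:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "naver/splade-v3"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()

doc = "a long passage from the collection ..."
# BERT's positional limit is 512; the model was trained with 128/256,
# but other values can be used at inference.
inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Standard SPLADE aggregation: log-saturated ReLU, max-pooled over token
# positions, with padding masked out.
weights = torch.log1p(torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1)
doc_rep = weights.max(dim=1).values.squeeze(0)  # vocab-sized, mostly-zero vector
```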
Hope it helps
Thibault

Hi Thibault,

Thank you for the answer and have a nice day!

Best,

Yuchen

Oh by the way, I have another question.

I have now built a ClueWeb22-B index using the GitHub code, and I found that this line, https://github.com/naver/splade/blob/main/splade/indexing/inverted_index.py#L32, takes a long while to load the index into memory, which makes the whole SPLADE retrieval process take about 20 minutes.

I am thinking that, since this is a sparse retrieval model, there should be some way to accelerate the search process. Do you know of any efforts of this kind in the community?

Thanks again,

Yuchen

NAVER LABS Europe org

Hi Yuchen,
You can have a look here: https://github.com/TusKANNy/seismic
They have super fast retrieval algorithms for neural sparse retrievers!
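For what it's worth, engines like this consume explicit term/weight pairs rather than a vocab-sized vector. A rough sketch of that conversion, continuing from the encoding snippet earlier in the thread (doc_rep and tokenizer are assumed to come from there, and the JSON layout below is only a guess; the exact input schema Seismic expects is documented in its repository):

```python
import json

import torch  # doc_rep and tokenizer come from the earlier encoding sketch

# Keep only the non-zero dimensions and map vocabulary ids back to tokens.
idx = torch.nonzero(doc_rep, as_tuple=True)[0]
tokens = tokenizer.convert_ids_to_tokens(idx.tolist())
sparse_vec = {t: round(float(doc_rep[i]), 4) for t, i in zip(tokens, idx.tolist())}

# Hypothetical JSONL record with an id and a token -> weight map; adapt it to
# whatever format the retrieval engine actually ingests.
record = {"id": "doc-0", "vector": sparse_vec}
print(json.dumps(record)[:200])
```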

Thibault

tformal changed discussion status to closed
