---
license: unlicense
pipeline_tag: sentence-similarity
language:
- ru
tags:
- PyTorch
- Transformers
---

Encoder model for the search query similarity task. Fast and accurate. The SentencePiece tokenizer was trained on a log of 269 million Russian search queries. A short maximum sequence length is used to reduce memory consumption. The validation dataset, with manual markup, contains 362 thousand examples.

![Validation results](https://huggingface.co/fkrasnov2/SBE/resolve/main/bvf_recall1k_query_len_eng.svg)

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('fkrasnov2/SBE')
tokenizer = AutoTokenizer.from_pretrained('fkrasnov2/SBE')
model.eval()

# Encode a query ("black dress"); truncate to the model's short maximum length.
input_ids = tokenizer.encode("чёрное платье", max_length=model.config.max_position_embeddings, truncation=True, return_tensors='pt')

with torch.no_grad():
    # Attend only to regular tokens: special-token ids do not exceed mask_token_id.
    vector = model(input_ids=input_ids, attention_mask=input_ids > tokenizer.mask_token_id)[0][0, 0]  # first-token embedding is the query vector

assert model.config.hidden_size == vector.shape[0]
```
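Since the pipeline tag is sentence-similarity, a sketch of scoring two queries against each other may be useful. This is a minimal example, not part of the original card: the `embed` helper is a hypothetical wrapper around the encoding recipe above, and cosine similarity is one reasonable choice of similarity score.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('fkrasnov2/SBE')
tokenizer = AutoTokenizer.from_pretrained('fkrasnov2/SBE')
model.eval()

def embed(query: str) -> torch.Tensor:
    # Hypothetical helper: same encoding recipe as in the usage example above.
    input_ids = tokenizer.encode(query, max_length=model.config.max_position_embeddings, truncation=True, return_tensors='pt')
    with torch.no_grad():
        return model(input_ids=input_ids, attention_mask=input_ids > tokenizer.mask_token_id)[0][0, 0]

# Cosine similarity between two Russian queries: "black dress" vs. "evening dress".
score = F.cosine_similarity(embed("чёрное платье"), embed("вечернее платье"), dim=0)
print(float(score))
```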