Description
We use the MS Marco encoder msmarco-MiniLM-L-6-v3 from the sentence-transformers library to encode the text of the abokbot/wikipedia-first-paragraph dataset.
The dataset contains the first paragraphs of the English "20220301.en" version of the Wikipedia dataset.
The output is an embedding tensor of size [6458670, 384]: one 384-dimensional vector for each of the 6,458,670 paragraphs.
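To give a sense of scale, the tensor shape above translates into a memory footprint that can be estimated with simple arithmetic. The sketch below assumes float32 elements (4 bytes each); the dtype is not stated here, so treat the result as an estimate under that assumption.

```python
# Rough memory estimate for the embedding tensor.
# Assumption: float32 elements (4 bytes each); the dtype is not stated in this card.
NUM_ROWS = 6_458_670      # number of rows, taken from the tensor shape
EMBED_DIM = 384           # embedding dimension, taken from the tensor shape
BYTES_PER_FLOAT32 = 4


def embedding_size_bytes(rows: int, dim: int, bytes_per_element: int = BYTES_PER_FLOAT32) -> int:
    """Return the number of bytes needed to store a dense rows x dim matrix."""
    return rows * dim * bytes_per_element


size_bytes = embedding_size_bytes(NUM_ROWS, EMBED_DIM)
size_gib = size_bytes / 2**30

print(f"{size_bytes:,} bytes (~{size_gib:.2f} GiB)")
# prints "9,920,517,120 bytes (~9.24 GiB)"
```

Under this assumption the full tensor needs roughly 9 to 10 GB, which is worth checking against the available RAM or GPU memory before loading it.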
Code
The embedding was obtained by running the following code.
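from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# Without a split argument, load_dataset returns a DatasetDict keyed by split name,
# so the text column is read from the "train" split (assumed to be the dataset's only split).
dataset = load_dataset("abokbot/wikipedia-first-paragraph")
bi_encoder = SentenceTransformer('msmarco-MiniLM-L-6-v3')
# Truncate inputs longer than 256 tokens.
bi_encoder.max_seq_length = 256
# Encode every first paragraph into one tensor with one row per paragraph.
wikipedia_embedding = bi_encoder.encode(dataset["train"]["text"], convert_to_tensor=True, show_progress_bar=True)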
This operation took 35 minutes on a Google Colab notebook with a GPU.
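A typical use of such embeddings is semantic search: a query is encoded with the same model and compared against every row by cosine similarity. The following is a minimal, dependency-free sketch of that ranking step. The toy vectors and the helper names `cosine_similarity` and `top_k` are hypothetical stand-ins for illustration, not part of this dataset or of sentence-transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus_vecs, k=3):
    """Return the k (row_index, score) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy 3-dimensional stand-ins (hypothetical data); the real rows have 384 dimensions.
toy_corpus = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.1, 0.0]

hits = top_k(toy_query, toy_corpus, k=2)
for idx, score in hits:
    print(f"row {idx}: score {score:.3f}")
```

A pure-Python loop like this is far too slow for millions of rows. At full scale one would more likely keep the vectors as tensors and use a vectorized routine such as `sentence_transformers.util.semantic_search`, with the returned row indices mapped back to the dataset's paragraphs.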
Reference
More information on MS Marco encoders is available at https://www.sbert.net/docs/pretrained-models/ce-msmarco.html