upskyy committed on
Commit 939a953 · verified · 1 Parent(s): 9d2ff40

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +46 -3
  2. config.json +1 -1
README.md CHANGED
@@ -169,7 +169,7 @@ model-index:
  name: Spearman Max
  ---
  
- # SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
+ # upskyy/gte-korean-base
  
  This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
  
@@ -196,7 +196,8 @@ SentenceTransformer(
  
  ## Usage
  
- ### Direct Usage (Sentence Transformers)
+ ### Usage (Sentence-Transformers)
+
  
  First install the Sentence Transformers library:
  
@@ -209,7 +210,7 @@ Then you can load this model and run inference.
  from sentence_transformers import SentenceTransformer
  
  # Download from the 🤗 Hub
- model = SentenceTransformer("upskyy/gte-korean-base")
+ model = SentenceTransformer("upskyy/gte-korean-base", trust_remote_code=True)
  
  # Run inference
  sentences = [
@@ -225,6 +226,48 @@ print(embeddings.shape)
  similarities = model.similarity(embeddings, embeddings)
  print(similarities.shape)
  # [3, 3]
+ print(similarities)
+ # tensor([[1.0000, 0.6274, 0.3788],
+ #         [0.6274, 1.0000, 0.5978],
+ #         [0.3788, 0.5978, 1.0000]])
+ ```
+
+ ### Usage (HuggingFace Transformers)
+
+ Without sentence-transformers, you can use the model like this:
+ First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+
+
+ # Mean Pooling - Take attention mask into account for correct averaging
+ def mean_pooling(model_output, attention_mask):
+     token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+
+ # Sentences we want sentence embeddings for
+ sentences = ["안녕하세요?", "한국어 문장 임베딩을 위한 버트 모델입니다."]
+
+ # Load model from HuggingFace Hub
+ tokenizer = AutoTokenizer.from_pretrained("upskyy/gte-korean-base")
+ model = AutoModel.from_pretrained("upskyy/gte-korean-base", trust_remote_code=True)
+
+ # Tokenize sentences
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
+
+ # Compute token embeddings
+ with torch.no_grad():
+     model_output = model(**encoded_input)
+
+ # Perform pooling. In this case, mean pooling.
+ sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
+
+ print("Sentence embeddings:")
+ print(sentence_embeddings)
  ```
  
  <!--
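Reviewer note: the `mean_pooling` helper added to README.md averages only the token vectors whose attention-mask entry is 1, so padding tokens do not skew the sentence embedding. The sketch below illustrates that arithmetic in plain Python on a made-up toy input. The function name `masked_mean_pool` and the sample values are hypothetical, not part of this commit, and the real code operates on batched `torch` tensors.

```python
from typing import List


def masked_mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                totals[i] += value
    # Guard against an all-padding input, mirroring clamp(min=1e-9) in the README code
    count = max(sum(attention_mask), 1e-9)
    return [total / count for total in totals]


# Hypothetical toy input: two real tokens followed by one padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
# Naive mean over all positions, padding included, for comparison
naive = [sum(column) / len(tokens) for column in zip(*tokens)]

print(pooled)  # [2.0, 3.0]
print(naive)   # much larger, because the padding vector is counted
```

The masked result depends only on the two real tokens, while the naive mean is dominated by the padding vector. That difference is why the README example passes `encoded_input["attention_mask"]` into the pooling step.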
config.json CHANGED
@@ -47,4 +47,4 @@
  "unpad_inputs": false,
  "use_memory_efficient_attention": false,
  "vocab_size": 250048
- }
+ }