---
datasets:
- armvectores/hy_wikipedia_2023
pipeline_tag: feature-extraction
language:
- hy
library_name: fasttext
---

Trained on 414M tokens:

1) 73M from Armenian (hy) Wikipedia

2) 341M from the ARLIS database

Vocabulary: 74951 unique words

Character n-grams: 3 to 5

Context window: 5

Embedding dimension: 300

Model: skipgram

Minimum word count: 150

Training: 100 epochs, initial learning rate 0.05

Training time: 26 hours on 20 Xeon Gold cores
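
The training command itself is not part of this card. As a rough sketch, the settings listed above could map onto the `train_unsupervised` function of the fastText Python package as shown below. The corpus path and output filename are hypothetical, and `thread=20` is only an assumption based on the 20 cores mentioned above.

```python
# Hyperparameters from this card, keyed by fastText argument names.
TRAIN_PARAMS = {
    "model": "skipgram",  # skipgram
    "dim": 300,           # 300 embedding dim
    "ws": 5,              # 5 window length
    "minn": 3,            # shortest character n-gram
    "maxn": 5,            # longest character n-gram
    "minCount": 150,      # minimum word count
    "epoch": 100,         # 100 epochs
    "lr": 0.05,           # start learning rate
    "thread": 20,         # assumption: one thread per core
}


def train(corpus_path, output_path="model.bin"):
    """Train on a plain-text corpus (hypothetical path) and save the model."""
    import fasttext  # imported here so the settings can be inspected without it

    model = fasttext.train_unsupervised(corpus_path, **TRAIN_PARAMS)
    model.save_model(output_path)
    return model
```

Calling `train("corpus.txt")` would start a run with these settings; the actual preprocessing and tokenization used for this model are not described here.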

How to use

1) Install fastText and huggingface_hub

```
pip install fasttext-wheel huggingface_hub
```

2) Download and load the model in Python

```
import fasttext
from huggingface_hub import hf_hub_download

# Download model.bin from the Hugging Face Hub into the current directory
model_path = hf_hub_download(local_dir=".",
                             repo_id="armvectores/wikipedia_arlis_tokens_fasttextskipgram_300_5",
                             filename="model.bin")

# Load the pretrained fastText model
model = fasttext.load_model(model_path)
```

3) Examples of usage

```
word = 'զենքեր'  # "weapons"

# Closest vocabulary words as (similarity, word) pairs
print(model.get_nearest_neighbors(word))

# 300-dimensional embedding of the input text
print(model.get_sentence_vector(word))
```
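
To compare two specific words rather than list neighbors, cosine similarity between their vectors is a common choice. The helper below is a minimal sketch and not part of fastText; the name `cosine_similarity` is introduced here for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for a zero vector; report 0
    return dot / (norm_u * norm_v)


# With the model loaded as in step 2, two words could be compared like this:
# cosine_similarity(model.get_word_vector('զենք'), model.get_word_vector('զենքեր'))
```

Because the model uses character n-grams, `get_word_vector` also returns a vector for words outside the 74951-word vocabulary, so the comparison works for unseen word forms as well.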