Commit 613b72c by cahya · 1 Parent(s): 1d1b1a1

updated the readme and the model

Files changed (2)
  1. README.md +35 -17
  2. pytorch_model.bin +1 -1
README.md CHANGED
@@ -24,23 +24,41 @@ You can use this model directly with a pipeline for masked language modeling:
  ```python
  >>> from transformers import pipeline
  >>> unmasker = pipeline('fill-mask', model='cahya/distilbert-base-indonesian')
- >>> unmasker("Ibu ku sedang bekerja [MASK] supermarket")
+ >>> unmasker("Ayahku sedang bekerja di sawah untuk [MASK] padi")
+
+ [
+ {
+ "sequence": "[CLS] ayahku sedang bekerja di sawah untuk menanam padi [SEP]",
+ "score": 0.6853187084197998,
+ "token": 12712,
+ "token_str": "menanam"
+ },
+ {
+ "sequence": "[CLS] ayahku sedang bekerja di sawah untuk bertani padi [SEP]",
+ "score": 0.03739545866847038,
+ "token": 15484,
+ "token_str": "bertani"
+ },
+ {
+ "sequence": "[CLS] ayahku sedang bekerja di sawah untuk memetik padi [SEP]",
+ "score": 0.02742469497025013,
+ "token": 30338,
+ "token_str": "memetik"
+ },
+ {
+ "sequence": "[CLS] ayahku sedang bekerja di sawah untuk penggilingan padi [SEP]",
+ "score": 0.02214187942445278,
+ "token": 28252,
+ "token_str": "penggilingan"
+ },
+ {
+ "sequence": "[CLS] ayahku sedang bekerja di sawah untuk tanam padi [SEP]",
+ "score": 0.0185895636677742,
+ "token": 11308,
+ "token_str": "tanam"
+ }
+ ]

- [{'sequence': '[CLS] ibu ku sedang bekerja di supermarket [SEP]',
- 'score': 0.7983310222625732,
- 'token': 1495},
- {'sequence': '[CLS] ibu ku sedang bekerja. supermarket [SEP]',
- 'score': 0.090003103017807,
- 'token': 17},
- {'sequence': '[CLS] ibu ku sedang bekerja sebagai supermarket [SEP]',
- 'score': 0.025469014421105385,
- 'token': 1600},
- {'sequence': '[CLS] ibu ku sedang bekerja dengan supermarket [SEP]',
- 'score': 0.017966199666261673,
- 'token': 1555},
- {'sequence': '[CLS] ibu ku sedang bekerja untuk supermarket [SEP]',
- 'score': 0.016971781849861145,
- 'token': 1572}]
  ```
  Here is how to use this model to get the features of a given text in PyTorch:
  ```python
@@ -67,7 +85,7 @@ output = model(encoded_input)

  ## Training data

- This model was pre-trained with 522MB of indonesian Wikipedia and 1GB of
+ This model was distiled with 522MB of indonesian Wikipedia and 1GB of
  [indonesian newspapers](https://huggingface.co/datasets/id_newspapers_2018).
  The texts are lowercased and tokenized using WordPiece and a vocabulary size of 32,000. The inputs of the model are
  then of the form:
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:550663fcbd5a7473047e55ef8778939ecab3a04685fa6666d755159cd929b1a5
+ oid sha256:39b114f8d3260960d4a3a28c2b1ba0543e4ec09a96342d88747f1bed1cd9ab0e
  size 272513919
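
The `pytorch_model.bin` change above touches only a Git LFS pointer file (three `key value` lines: `version`, `oid`, `size`), so the diff shows a new SHA-256 digest while the size stays at 272513919 bytes. As a rough sketch of how a downloaded weights file could be checked against that pointer, assuming the three-line layout shown in the diff; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and not part of this repository or of any library:

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    # The oid field has the form '<algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer: dict) -> bool:
    """Return True if the file at `path` has the size and digest named by the pointer."""
    file_path = Path(path)
    if file_path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with file_path.open("rb") as handle:
        # Hash in 1 MiB chunks so large weight files are not read into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer contents copied from the '+' side of the diff above.
NEW_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:39b114f8d3260960d4a3a28c2b1ba0543e4ec09a96342d88747f1bed1cd9ab0e
size 272513919
"""
pointer = parse_lfs_pointer(NEW_POINTER)
```

Calling `matches_pointer("pytorch_model.bin", pointer)` on a local copy would then indicate whether that copy corresponds to this commit rather than to the parent commit, whose pointer carries a different digest.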