small edits, still requires links and evaluation updates
README.md CHANGED
@@ -100,11 +100,11 @@ print(text_output[0])
 
 ## Training data
 
-Cerebras-GPT is trained using [the Pile](https://pile.eleuther.ai) dataset from [EleutherAI](https://www.eleuther.ai)
+Cerebras-GPT is trained using [the Pile](https://pile.eleuther.ai) dataset from [EleutherAI](https://www.eleuther.ai). See the [Pile paper](https://arxiv.org/abs/2101.00027) for a more detailed breakdown of data sources and methodology.
 
 Recent works find significant duplicate data present in the Pile. Eleuther’s Pythia applies a deduplication process to reduce replicated data, decreasing the total token count by 33%. Our models are trained on the Pile **without deduplication**, which presents an opportunity for further improvement with the deduplicated data set.
 
-Our tokenized version of the Pile has 371B tokens. We used byte-pair encoding, a vocabulary size of 50257, and a maximum sequence length of 2048. We include more details about the training dataset preprocessing in Appendix B.1 of our paper.
+Our tokenized version of the Pile has 371B tokens. We used byte-pair encoding, a vocabulary size of 50257, and a maximum sequence length of 2048. We include more details about the training dataset preprocessing in Appendix B.1 of [TODO: our paper](https://www.cerebras.net).
 
 <br><br>
 
@@ -193,7 +193,7 @@ We evaluate our models on the PILE validation set comprising 380M tokens. We als
 ## Uses and Limitations
 
 ### Intended Use
-The models we train are being open-sourced to further research into LLM scaling laws, but release these models with a fully permissive Apache license for the community to use freely.
+The models we train are being open-sourced to further research into LLM scaling laws, but we release these models with a fully permissive Apache license for the community to use freely.
 
 You may fine-tune and adapt Cerebras-GPT models for deployment via either Cerebras [Model Studio](https://www.cerebras.net/product-cloud/) or the Hugging Face Transformers Library. We recommend assessing potential bias and harms prior to deployment of any LLM.
 
@@ -208,7 +208,6 @@ Like many large text corpora, the Pile contains offensive text. Cerebras-GPT mod
 
 <br><br>
 
-# TODO
 ## Citation and Related Information
 
 ### BibTeX entry
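
The training-data hunk above specifies the tokenizer configuration: byte-pair encoding, a vocabulary of 50257 tokens, and a maximum sequence length of 2048. A minimal sketch of how those settings could be checked with the Hugging Face Transformers library is below; the Hub model ID `cerebras/Cerebras-GPT-111M` is an assumption for illustration and is not stated in the diff itself.

```python
# Sketch: inspect the tokenizer described in the training-data section
# (byte-pair encoding, vocabulary size 50257, maximum sequence length 2048).
# The model ID below is assumed for illustration; substitute the actual
# released checkpoint name.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cerebras/Cerebras-GPT-111M")  # assumed Hub ID

print(tokenizer.vocab_size)        # expected: 50257
print(tokenizer.model_max_length)  # expected: 2048

# Encode a short string to see the byte-pair pieces and their ids.
ids = tokenizer("Cerebras-GPT is trained on the Pile.")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
print(ids)
```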
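
The intended-use hunk notes that the models may be fine-tuned and adapted through the Hugging Face Transformers library. Below is a minimal, self-contained sketch of a single causal-LM fine-tuning step under that assumption; the Hub model ID and the toy training sentence are placeholders for illustration, not details taken from the README.

```python
# Sketch: one causal-LM fine-tuning step with Hugging Face Transformers and
# PyTorch. The Hub ID and the toy training text are placeholders; point this
# at a real corpus (and the released checkpoint name) before actual use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cerebras/Cerebras-GPT-111M"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Toy example standing in for a real fine-tuning dataset.
batch = tokenizer(
    "Generative pretrained language models can be adapted to downstream tasks.",
    return_tensors="pt", truncation=True, max_length=2048,
)

# For causal language modeling the labels are the input ids themselves;
# the model shifts them internally to predict the next token.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"training loss: {outputs.loss.item():.3f}")
```

Any checkpoint adapted this way would still need the bias and harm assessment the section recommends before deployment.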