small edits, still requires links and evaluation updates
README.md CHANGED
@@ -100,11 +100,11 @@ print(text_output[0])
 
 ## Training data
 
-Cerebras-GPT is trained using [the Pile](https://pile.eleuther.ai) dataset from [EleutherAI](https://www.eleuther.ai)
+Cerebras-GPT is trained using [the Pile](https://pile.eleuther.ai) dataset from [EleutherAI](https://www.eleuther.ai). See the [Pile paper](https://arxiv.org/abs/2101.00027) for a more detailed breakdown of data sources and methodology.
 
 Recent works find significant duplicate data present in the Pile. Eleuther’s Pythia applies a deduplication process to reduce replicated data, decreasing the total token count by 33%. Our models are trained on the Pile **without deduplication**, which presents an opportunity for further improvement with the deduplicated data set.
 
-Our tokenized version of the Pile has 371B tokens. We used byte-pair encoding, a vocabulary size of 50257, and a maximum sequence length of 2048. We include more details about the training dataset preprocessing in Appendix B.1 of our paper.
+Our tokenized version of the Pile has 371B tokens. We used byte-pair encoding, a vocabulary size of 50257, and a maximum sequence length of 2048. We include more details about the training dataset preprocessing in Appendix B.1 of [TODO: our paper](https://www.cerebras.net).
 
 <br><br>
 
@@ -193,7 +193,7 @@ We evaluate our models on the PILE validation set comprising 380M tokens. We als
 ## Uses and Limitations
 
 ### Intended Use
-The models we train are being open-sourced to further research into LLM scaling laws, but release these models with a fully permissive Apache license for the community to use freely.
+The models we train are being open-sourced to further research into LLM scaling laws, but we release these models with a fully permissive Apache license for the community to use freely.
 
 You may fine-tune and adapt Cerebras-GPT models for deployment via either Cerebras [Model Studio](https://www.cerebras.net/product-cloud/) or the Hugging Face Transformers Library. We recommend assessing potential bias and harms prior to deployment of any LLM.
 
@@ -208,7 +208,6 @@ Like many large text corpora, the Pile contains offensive text. Cerebras-GPT mod
 
 <br><br>
 
-# TODO
 ## Citation and Related Information
 
 ### BibTeX entry
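
The training-data hunk above specifies the tokenizer configuration: byte-pair encoding, a vocabulary of 50257 tokens, and a maximum sequence length of 2048. A minimal sketch of how those settings could be checked with the Hugging Face Transformers library is below; the Hub model ID `cerebras/Cerebras-GPT-111M` is an assumption for illustration and is not stated in the diff itself.

```python
# Sketch: inspect the tokenizer described in the training-data section
# (byte-pair encoding, vocabulary size 50257, maximum sequence length 2048).
# The model ID below is assumed for illustration; substitute the actual
# released checkpoint name.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cerebras/Cerebras-GPT-111M")  # assumed Hub ID

print(tokenizer.vocab_size)        # expected: 50257
print(tokenizer.model_max_length)  # expected: 2048

# Encode a short string to see the byte-pair pieces and their ids.
ids = tokenizer("Cerebras-GPT is trained on the Pile.")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
print(ids)
```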
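
The intended-use hunk notes that the models may be fine-tuned and adapted through the Hugging Face Transformers library. Below is a minimal, self-contained sketch of a single causal-LM fine-tuning step under that assumption; the Hub model ID and the toy training sentence are placeholders for illustration, not details taken from the README.

```python
# Sketch: one causal-LM fine-tuning step with Hugging Face Transformers and
# PyTorch. The Hub ID and the toy training text are placeholders; point this
# at a real corpus (and the released checkpoint name) before actual use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cerebras/Cerebras-GPT-111M"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Toy example standing in for a real fine-tuning dataset.
batch = tokenizer(
    "Generative pretrained language models can be adapted to downstream tasks.",
    return_tensors="pt", truncation=True, max_length=2048,
)

# For causal language modeling the labels are the input ids themselves;
# the model shifts them internally to predict the next token.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"training loss: {outputs.loss.item():.3f}")
```

Any checkpoint adapted this way would still need the bias and harm assessment the section recommends before deployment.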