Tags: Text Generation · Transformers · PyTorch · English · gpt2 · causal-lm · text-generation-inference · Inference Endpoints
rskuzma committed · Commit 38e7fff · 1 Parent(s): f91e002

small edits, still requires links and evaluation updates

Files changed (1): README.md (+3 -4)
README.md CHANGED
@@ -100,11 +100,11 @@ print(text_output[0])
 
 ## Training data
 
-Cerebras-GPT is trained using [the Pile](https://pile.eleuther.ai) dataset from [EleutherAI](https://www.eleuther.ai) which consists of data from 22 data sources. See the [Pile paper](https://arxiv.org/abs/2101.00027) for a more detailed breakdown of data sources and methodology.
+Cerebras-GPT is trained using [the Pile](https://pile.eleuther.ai) dataset from [EleutherAI](https://www.eleuther.ai). See the [Pile paper](https://arxiv.org/abs/2101.00027) for a more detailed breakdown of data sources and methodology.
 
 Recent works find significant duplicate data present in the Pile. Eleuther’s Pythia applies a deduplication process to reduce replicated data, decreasing the total token count by 33%. Our models are trained on the Pile **without deduplication**, which presents an opportunity for further improvement with the deduplicated data set.
 
-Our tokenized version of the Pile has 371B tokens. We used byte-pair encoding, a vocabulary size of 50257, and a maximum sequence length of 2048. We include more details about the training dataset preprocessing in Appendix B.1 of our paper.
+Our tokenized version of the Pile has 371B tokens. We used byte-pair encoding, a vocabulary size of 50257, and a maximum sequence length of 2048. We include more details about the training dataset preprocessing in Appendix B.1 of [TODO: our paper](https://www.cerebras.net).
 
 <br><br>
 
@@ -193,7 +193,7 @@ We evaluate our models on the PILE validation set comprising 380M tokens. We als
 ## Uses and Limitations
 
 ### Intended Use
-The models we train are being open-sourced to further research into LLM scaling laws, but release these models with a fully permissive Apache license for the community to use freely.
+The models we train are being open-sourced to further research into LLM scaling laws, but we release these models with a fully permissive Apache license for the community to use freely.
 
 You may fine-tune and adapt Cerebras-GPT models for deployment via either Cerebras [Model Studio](https://www.cerebras.net/product-cloud/) or the Hugging Face Transformers Library. We recommend assessing potential bias and harms prior to deployment of any LLM.
 
@@ -208,7 +208,6 @@ Like many large text corpora, the Pile contains offensive text. Cerebras-GPT mod
 
 <br><br>
 
-# TODO
 ## Citation and Related Information
 
 ### BibTeX entry
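
The training-data paragraph in the diff above states the tokenizer settings (byte-pair encoding, a 50257-token vocabulary, a 2048-token maximum sequence length), and the Intended Use section points to fine-tuning and deployment via the Hugging Face Transformers library. The following is a minimal sketch of what that looks like in practice, not part of the card itself; it assumes the `cerebras/Cerebras-GPT-111M` repo id as a stand-in for whichever checkpoint this card describes, and a GPT-2-style config that exposes the context length as `n_positions`.

```python
# Minimal sketch (not from the model card): verify the tokenizer settings
# described above and load a checkpoint for generation or fine-tuning.
# "cerebras/Cerebras-GPT-111M" is an assumed repo id; substitute the size you use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cerebras/Cerebras-GPT-111M"  # assumption: smallest Cerebras-GPT checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The card lists a BPE vocabulary of 50257 and a 2048-token maximum sequence length.
print(tokenizer.vocab_size)      # expected: 50257
print(model.config.n_positions)  # expected: 2048 (GPT-2-style config field)

# Truncate prompts to the 2048-token context window before encoding.
inputs = tokenizer(
    "Generative AI is ",
    return_tensors="pt",
    truncation=True,
    max_length=2048,
)

# Greedy decoding; GPT-2-style tokenizers have no pad token, so reuse EOS.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For fine-tuning, the same `AutoModelForCausalLM` object can be handed to a standard Transformers training loop; the only model-specific constraint implied by the card is keeping sequences within the 2048-token limit.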