awacke1 committed
Commit 1e37fa7 · 1 Parent(s): 68cbcae

Update README.md

Files changed (1):
  1. README.md (+21 -2)
README.md CHANGED
@@ -9,5 +9,24 @@ app_file: app.py
  pinned: false
  license: mit
  ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ## ChatGPT Datasets 📚
+ - WebText
+ - Common Crawl
+ - BooksCorpus
+ - English Wikipedia
+ - Toronto Books Corpus
+ - OpenWebText
+ ## ChatGPT Datasets - Details 📚
+ - **WebText:** A dataset of web pages scraped from outbound Reddit links with at least 3 karma. This dataset was used to pretrain GPT-2.
+ - [Language Models are Unsupervised Multitask Learners](https://paperswithcode.com/dataset/webtext) by Radford et al.
+ - **Common Crawl:** A dataset of web pages from a variety of domains, which is updated regularly. This dataset was used to pretrain GPT-3.
+ - [Language Models are Few-Shot Learners](https://paperswithcode.com/dataset/common-crawl) by Brown et al.
+ - **BooksCorpus:** A dataset of over 11,000 books from a variety of genres.
+ - [Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books](https://paperswithcode.com/dataset/bookcorpus) by Zhu et al.
+ - **English Wikipedia:** A dump of the English-language Wikipedia as of 2018, with articles from 2001-2017.
+ - [WikipediaUltimateAISearch](https://huggingface.co/spaces/awacke1/WikipediaUltimateAISearch?logs=build) - a Space for Wikipedia search
+ - **Toronto Books Corpus:** A dataset of over 7,000 books from a variety of genres, collected by researchers at the University of Toronto.
+ - [Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond](https://paperswithcode.com/dataset/bookcorpus) by Artetxe and Schwenk.
+ - **OpenWebText:** An open-source recreation of the WebText corpus, built from web pages filtered to remove low-quality or spammy content.
+ - [OpenWebText Corpus](https://paperswithcode.com/dataset/openwebtext) by Gokaslan and Cohen.
+
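
Several of these corpora, or open recreations of them, are mirrored on the Hugging Face Hub and can be pulled with the `datasets` library. The snippet below is a minimal sketch only; the dataset IDs (`openwebtext`, `bookcorpus`, `wikipedia`) and the Wikipedia dump config are assumptions that may differ across Hub versions, and streaming is used so nothing is downloaded in full.

```python
# Minimal sketch: loading open mirrors of some of the corpora listed above
# from the Hugging Face Hub. Dataset IDs and the Wikipedia config name are
# assumptions and may vary; streaming avoids downloading entire corpora.
from datasets import load_dataset

# OpenWebText, the open recreation of WebText (assumed Hub ID "openwebtext")
openwebtext = load_dataset("openwebtext", split="train", streaming=True)

# BookCorpus / Toronto Books Corpus (assumed Hub ID "bookcorpus")
bookcorpus = load_dataset("bookcorpus", split="train", streaming=True)

# English Wikipedia (assumed preprocessed dump config "20220301.en")
wikipedia = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)

# Peek at the first document of each streamed corpus
for name, ds in [("openwebtext", openwebtext),
                 ("bookcorpus", bookcorpus),
                 ("wikipedia", wikipedia)]:
    sample = next(iter(ds))
    print(name, list(sample.keys()), sample.get("text", "")[:80])
```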