* Each layer consists of one feedforward block and one self-attention block.
† Although the embedding matrix has 50400 entries, only 50257 of them are used by the GPT-2 tokenizer.
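The mismatch can be checked directly. The snippet below is a minimal sketch using the Hugging Face transformers library rather than this codebase, and it assumes the EleutherAI/gpt-j-6B checkpoint on the Hub together with the standard gpt2 tokenizer:

```python
from transformers import AutoConfig, AutoTokenizer

# The model config declares an embedding matrix with 50400 rows,
# while the GPT-2 BPE tokenizer only produces 50257 distinct token ids.
config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

print(config.vocab_size)  # 50400 -- rows in the embedding matrix
print(len(tokenizer))     # 50257 -- entries actually used by the tokenizer
```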
Models are sorted roughly by performance, or by FLOPs where performance numbers are not available.
* Evaluation numbers reported by their respective authors. All other numbers were obtained by running lm-evaluation-harness with either the released weights or API access. Due to subtle implementation differences as well as differences in zero-shot task framing, these results might not be directly comparable. See this blog post for more details. An example harness invocation is sketched below these notes.
† Megatron-11B provides no comparable metrics, and several implementations using the released weights do not reproduce its generation quality or evaluation results (see 1, 2, 3), so evaluation was not attempted.
‡ These models were trained on data that may contain test-set contamination. The OpenAI GPT-3 models did not deduplicate their training data against certain test sets, while the GPT-Neo models, as well as this one, were trained on the Pile, which has not been deduplicated against any test sets.
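For reference, the following sketch shows roughly how such a harness run can be invoked from Python. It is illustrative only: it assumes a recent lm-evaluation-harness release in which the Hugging Face backend is registered as "hf", and backend names, task names, and argument defaults have changed between harness versions, so the exact call may need to be adapted.

```python
import json

import lm_eval  # pip install lm-eval

# Zero-shot evaluation of GPT-J-6B on a few of the tasks from the table above.
# The backend name ("hf") and the task names vary between harness versions.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/gpt-j-6B,dtype=float16",
    tasks=["lambada_openai", "hellaswag", "piqa", "winogrande"],
    num_fewshot=0,
    batch_size=8,
    device="cuda:0",
)

print(json.dumps(results["results"], indent=2))
```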