--- extra_gated_prompt: "Access to this model is automatically granted upon accepting the [**AI2 ImpACT License – Low Risk Artifacts (“LR Agreement”)**](https://allenai.org/licenses/impact-lr) and completing all fields below. This model is licensed under the LR Agreement." extra_gated_fields: Your full name: text Organization or entity you are affiliated with: text State or country you are located in: text Contact email: text Please describe your intended use of the low risk artifact(s): text I AGREE to the terms and conditions of the LR Agreement above: checkbox I AGREE to AI2’s use of my information for legal notices and administrative matters: checkbox I CERTIFY that the information I have provided is true and accurate: checkbox # For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1 # Doc / guide: https://huggingface.co/docs/hub/model-cards --- # Model Card for Model ID The [Paloma](https://paloma.allen.ai/) 1B baselines are a collection of language models pretrained on popular corpora while controlling all other experimental variables. These models are developed to facilitate scientific comparisons of language model fit using the Paloma benchmark of 585 textual domains. This collection of models includes 6 baseline 1B parameter models each trained on ~150B tokens from one the following corpora: [Dolma](https://github.com/allenai/dolma), [The Pile](https://api.semanticscholar.org/CorpusID:230435736), [RedPajama](https://github.com/togethercomputer/RedPajama-Data), [Falcon-RefinedWeb](https://api.semanticscholar.org/CorpusID:259063761), [C4](https://aclanthology.org/2021.emnlp-main.98), and [MC4-en](https://api.semanticscholar.org/CorpusID:258187051). ## Model Details ### Model Description - **Developed by:** Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hannaneh Hajishirzi, Noah A. Smith, Kyle Richardson, and Jesse Dodge - **Model type:** Decoder-only transformer language model - **Language(s) (NLP):** English - **License:** [**AI2 ImpACT License – Low Risk Artifacts (“LR Agreement”)**](https://allenai.org/licenses/impact-lr) ## Uses ### Direct Use This model is primarily intended as research artifact that is a baseline for the language modeling benchmark [Paloma](https://paloma.allen.ai/). ### Out-of-Scope Use The restrictions to use of this model are described in the model license: [**AI2 ImpACT License – Low Risk Artifacts (“LR Agreement”)**](https://allenai.org/licenses/impact-lr) ## Bias, Risks, and Limitations This model is purely trained as an autoregressive language model. It has not been adapted in any way to prevent bias. It is a model of the language distribution that it is trained on. ### Recommendations This research artifact is a baseline for a language modeling benchmark. Best uses of this model will take advantage of the experimental controls applied to this model and the other Paloma baselines. These enable comparisons of models that vary only in the pretraining corpus used to train them. ## How to Get Started with the Model Install the code needed to run inference with the model ``` pip install ai2-olmo ``` Download and instantiate the model ``` from transformers import AutoModelForCausalLM model = AutoModelForCausalLM.from_pretrained("allenai/", trust_remote_code=True) ``` ## Training Details ### Training Data Each of the Paloma baseline models are trained on one of the following datasets: [Dolma](https://github.com/allenai/dolma), [The Pile](https://api.semanticscholar.org/CorpusID:230435736), [RedPajama](https://github.com/togethercomputer/RedPajama-Data), [Falcon-RefinedWeb](https://api.semanticscholar.org/CorpusID:259063761), [C4](https://aclanthology.org/2021.emnlp-main.98), and [MC4-en](https://api.semanticscholar.org/CorpusID:258187051). ### Training Procedure #### Preprocessing We remove any document in the pretraining data that is contaminated with respect to the [Paloma](https://paloma.allen.ai/) evaluation data. We match overlaps of evaluation and train text at the paragraph level, i.e., newline separated spans of text. To avoid coincidental collisions in the space of small strings, we ignore matches in paragraphs smaller than 13 unicode segmented tokens. Similarly, we ignore paragraphs composed of only punctuation, spaces, and emoji, as, unlike words, these can be arbitrarily repeated when used as formatting, leading to high frequency n-grams greater than our 13-gram threshold. Lastly, as code data consists almost entirely of short and often repeated lines, we forgo any decontamination against the code evaluations in Paloma. #### Training Hyperparameters The Paloma baseline 1B parameter models that we train employ the following architecture: 2048 maximum sequence length, 2048 model dimension, 16 layers, 16 attention heads, RoPE embedding, SwiGLU activation, mixed precision, non-parametric layer normalization, and sequential model blocks for attention and feed-forward networks. We use EleutherAI's GPT NeoX tokenizer but add 3 additional special tokens that are used to mask PII in Dolma. We train to 35k steps (∼150B tokens) with the following LionW optimizer configurations: 2.0e-4 peak learning rate, warm-up of 2000 steps, cosine decay to 70k steps (∼300B tokens), 0.1 weight decay, and betas of 0.9 and 0.95. Note that our batch size varies slightly to accommodate two groups of baselines that were run on different hardware. The Dolma and Falcon-RefinedWeb baselines were run with a batch size of 2112 training instances per step on 24 A100s. The RedPajama, The Pile, C4, and mC4-EN baselines were run with a batch size of 2048 on 64 AMD Instinct MI250X GPUs. In each case we save model checkpoints every 5k steps (∼20B tokens). ## Evaluation ### Testing Data, Factors & Metrics #### Testing Data The [Paloma](https://paloma.allen.ai/) benchmark is used to evaluate these baseline models. #### Factors Paloma evaluates on 585 domains. These are a collection of the most fine-grained domains readily available in current metadata. #### Metrics Paloma measures langauge modeling fit using Perplexity. It is a benchmark of language modeling, so examination of downstream uses is out of scope. ### Results To demonstrate possible uses of results from the Paloma benchmark, we conduct a series of case studies. We show that performance improves in almost all domains as models are scaled, but domains improve unequally. Further, across domains, perplexity is driven by strings in the vocabulary, i.e., types, that occur in most domains, but other types even get worse as models scale. Finally, our experiments isolate change in pretraining corpora and find that pretraining without heterogeneous data sources beyond Common Crawl leads to perplexities that do not improve consistently with tokens seen. ## Environmental Impact The Dolma and Falcon-RefinedWeb baselines were run with on 24 A100s for 9 days per model. The RedPajama, The Pile, C4, and mC4-EN baselines were run on 64 AMD Instinct MI250X GPUs for 2 days per model. ## Citation **BibTeX:** ``` @article{paloma, title={{Paloma}: A Benchmark for Evaluating Language Model Fit}, author={Magnusson, Ian and Bhagia, Akshita and Hofmann, Valentin and Soldaini, Luca and Harsh Jha, Ananya and Tafjord, Oyvind and Schwenk,Dustin and Walsh, Evan Pete and Elazar, Yanai and Lo, Kyle and Groenveld,Dirk and Beltagy,Iz and Hajishirz,Hanneneh and Smith, Noah A. and Richardson,Kyle and Dodge,Jesse}, journal={technical report}, year={2023}, url={https://paloma.allen.ai/} } ``` ## Model Card Contact {ianm,jessed}@allenai.org