---
license: apache-2.0
datasets:
- tiiuae/falcon-refinedweb
pipeline_tag: text-generation
library_name: openlm
tags:
- linear
- mistral
language:
- en
model-index:
- name: mistral-supra
  results:
  - task:
      type: text-generation
    dataset:
      type: MMLU
      name: MMLU
    metrics:
    - name: accuracy
      type: accuracy
      value: 34.2
      verified: false
  - task:
      type: text-generation
    dataset:
      type: HellaSwag
      name: HellaSwag
    metrics:
    - name: accuracy
      type: accuracy
      value: 77.1
      verified: false
  - task:
      type: text-generation
    dataset:
      type: PIQA
      name: PIQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 80.4
      verified: false
  - task:
      type: text-generation
    dataset:
      type: Winogrande
      name: Winogrande
    metrics:
    - name: accuracy
      type: accuracy
      value: 70.3
      verified: false
  - task:
      type: text-generation
    dataset:
      type: ai2_arc
      name: ARC-E
    metrics:
    - name: accuracy
      type: accuracy
      value: 75.9
      verified: false
  - task:
      type: text-generation
    dataset:
      type: ai2_arc
      name: ARC-C
    metrics:
    - name: accuracy
      type: accuracy
      value: 45.8
      verified: false
---

# Mistral-SUPRA

This model was initialized from the weights of the [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) transformer model and uptrained into a linear RNN.

This model accompanies our paper [Linearizing Large Language Models](https://arxiv.org/abs/2405.06640), which details our process for converting a softmax transformer into a linear transformer that, at inference time, can function as both a transformer and a recurrent model.
Our linear attention code can be found at https://github.com/TRI-ML/linear_open_lm/

We uptrained Mistral-7B on 100B tokens of RefinedWeb.

## Model Details

- **Developed by**: [Toyota Research Institute](https://www.tri.global/our-work/robotics)
- **Model Type**: This is an auto-regressive language model initialized from [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) and uptrained into a linear model based on the [SUPRA](https://arxiv.org/abs/2405.06640) architecture.
- **Dataset**: Initialized from [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1). Uptrained on 100B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb).
- **Tokenizer**: `mistralai/Mistral-7B-v0.1`
- **Library**: [OpenLM](https://github.com/mlfoundations/open_lm/) (we use a [fork](https://github.com/TRI-ML/linear_open_lm/) of OpenLM that supports linear attention)
- **License**: This model is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

| Parameters | Hidden Size | Layers | Vocab Size | Sequence Length |
|------------|-------------|--------|------------|-----------------|
| 7B         | 4096        | 32     | 32000      | 2048            |

## Training Details

- Mistral-SUPRA was trained using AWS SageMaker on 128 H100 80GB GPUs.
- Training on 100B tokens finished in 1.5 days.

| **Hyperparameter** | **Value**  |
|--------------------|------------|
| Precision          | `bfloat16` |
| Optimizer          | AdamW      |
| Learning rate      | 3e-5       |
| LR cooldown end    | 1e-5       |
| Warmup steps       | 1000       |
| Batch size         | 2M         |
| QK norm            | False      |

## Usage

This model was trained using [OpenLM](https://github.com/mlfoundations/open_lm/). The weights have been converted to be compatible with HuggingFace.

To use the model, first pip install our fork of OpenLM.
```bash
pip install git+https://github.com/tri-ml/linear_open_lm.git
```

Import the OpenLM classes with

```python
from open_lm.open_lm_hf import *
```

The model can then be loaded normally using `AutoTokenizer` and `AutoModelForCausalLM` as follows:

```python
from open_lm.open_lm_hf import *
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tri-ml/mistral-supra")
model = AutoModelForCausalLM.from_pretrained("tri-ml/mistral-supra")

inputs = tokenizer(["Machine learning is"], return_tensors="pt")
gen_kwargs = {"max_new_tokens": 50, "top_p": 0.8, "temperature": 0.8, "do_sample": True, "repetition_penalty": 1.1}
output = model.generate(inputs['input_ids'], **gen_kwargs)
output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
print(output)
# Machine learning is a branch of artificial intelligence (AI) that enables computers to learn from experience without being explicitly programmed. Machine learning is used in a wide range of applications, including spam filtering, image recognition, speech recognition, and computer-based medical diagnosis
```

The Mistral-SUPRA model can be used in both parallel mode and recurrent mode. If `use_cache` is set to `False` for `model.generate(...)`, the model runs in parallel mode; otherwise, it runs in recurrent mode.
Recurrent mode uses `xformers` and requires both the inputs and the model to be loaded onto the GPU.

```python
# Recurrent mode
output = model.to('cuda').generate(inputs['input_ids'].to('cuda'), use_cache=True, **gen_kwargs)

# Parallel mode
output = model.to('cuda').generate(inputs['input_ids'].to('cuda'), use_cache=False, **gen_kwargs)
```

## Performance Evaluation

Our evaluations were done using the [Eleuther LM Eval Harness](https://github.com/EleutherAI/lm-evaluation-harness) repo.

Below we report the performance of Mistral-SUPRA compared to other similarly sized models.