Text2Text Generation
Transformers
PyTorch
French
flash_t5
custom_code
bourdoiscatie committed (verified) · Commit 0743270 · 1 Parent(s): 66ded02

Add FAT5-small
README.md ADDED
@@ -0,0 +1,224 @@
1
+ ---
2
+ language: fr
3
+ datasets:
4
+ - uonlp/CulturaX
5
+ - wikimedia/wikipedia
6
+ - eckendoerffer/justice_fr
7
+ - bigcode/the-stack-dedup
8
+ metrics:
9
+ - f1
10
+ - exact_match
11
+ library_name: transformers
12
+ co2_eq_emissions: 13.5
13
+ license: apache-2.0
14
+ ---
15
+
16
+ # FAT5 (Flash Attention T5) ⚡
17
+
18
+ <br>
19
+
20
+ <div align="center" style="line-height: 1;">
21
+ <a href="https://huggingface.co/spaces/CATIE-AQ/FAT5-report" style="margin: 2px;">
22
+ <img alt="Blog post (EN)" src="https://img.shields.io/badge/📝_Blog_post-English_version-f5de53?&color=blue" style="display: inline-block; vertical-align: middle;"/>
23
+ </a>
24
+ <a href="https://huggingface.co/spaces/CATIE-AQ/FAT5-rapport" style="margin: 2px;">
25
+ <img alt="Blog post (FR)" src="https://img.shields.io/badge/📝_Blog_post-French_version-f5de53?&color=blue" style="display: inline-block; vertical-align: middle;"/>
26
+ </a>
27
+ <a href="https://huggingface.co/collections/CATIE-AQ/catie-french-fat5-ul2-677697a35feea336389d6403" target="_blank" style="margin: 2px;">
28
+ <img alt="Hugging Face collection" src="https://img.shields.io/badge/🤗_Hugging_Face-Collection-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
29
+ </a>
30
+ <a href="https://github.com/catie-aq/flashT5" target="_blank" style="margin: 2px;">
31
+ <img alt="FAT5 GitHub" src="https://img.shields.io/github/stars/catie-aq/flashT5?style=social" style="display: inline-block; vertical-align: middle;"/>
32
+ </a>
34
+ <a href="https://opensource.org/licenses/Apache-2.0" target="_blank" style="margin: 2px;">
35
+ <img alt="License: Apache 2.0" src="https://img.shields.io/badge/License-Apache--2.0-green.svg" style="display: inline-block; vertical-align: middle;"/>
36
+ </a>
37
+ </div>
38
+
39
+ ## Introduction
40
+
41
+ FAT5 (for Flash Attention T5) is an implementation of T5 in PyTorch with an [UL2](https://arxiv.org/abs/2205.05131) objective optimized for GPGPU, for both training and inference. It relies on an experimental feature that combines [Flash Attention](https://arxiv.org/abs/2307.08691) (v2) with relative position encoding biases, which allows training or finetuning the model on longer sequences than the original T5. It also supports other positional embeddings such as [RoPE](https://arxiv.org/abs/2104.09864), [ALiBi](https://arxiv.org/abs/2108.12409) or [FIRE](https://arxiv.org/abs/2310.04418).
42
+ This methodology enabled us, as a proof of concept, to efficiently pretrain a 147M-parameter T5 in French in a reasonable time (1,461 hours to see 419B tokens) and with limited resources (a single A100, i.e. a computational budget of around €2,200); its weights are available in this repo.
43
+ To achieve this, we designed CUDA/Triton kernels to make Flash Attention compatible with T5, and to provide linear inference, thus extending the context size that can be taken into account by the model.
44
+ Other optimizations have also been implemented, as detailed in a subsequent [blog post](https://huggingface.co/spaces/CATIE-AQ/FAT5-report).
45
+
46
+ <br>
47
+
48
+ ## Motivation
49
+
50
+ While a lot of effort has been focused on optimizing decoder-only models, older architectures remain useful in many practical applications.
51
+ We focus on [T5](http://jmlr.org/papers/v21/20-074.html), an encoder-decoder architecture that exhibits very decent performance for [instruction tuning](https://arxiv.org/pdf/2306.04757.pdf) and sometimes even outperforms much larger models when [finetuned](https://arxiv.org/pdf/2402.00841.pdf). Moreover, it is a natural architecture to consider for [distillation](https://arxiv.org/abs/2305.02301) of much larger models.
52
+
53
+ A critical limitation of this architecture is the sequence length these models can handle, due to the quadratic memory footprint of attention. While this quadratic term cannot be removed without switching to other forms of attention (as in [LongT5](https://arxiv.org/abs/2112.07916)), it can still be alleviated to accommodate longer sequence lengths.
54
+ Another limitation is the pre-training time, since techniques such as Flash Attention are not available for this architecture.
55
+
56
+ <br>
57
+
58
+ ## Our work
59
+
60
+ We used the [nanoT5](https://github.com/PiotrNawrot/nanoT5?tab=readme-ov-file#cite) implementation as the base for our work.
61
+
62
+ We worked on optimizing the core component of the model: the attention. We used Flash Attention (v2), which optimizes both memory usage and the use of Tensor Cores.
63
+
64
+ We support different implementations of attention biases (a plain-PyTorch sketch of what an additive attention bias means follows the list):
65
+ - Full attention biases with Flash Attention 2 using this [PR](https://github.com/Dao-AILab/flash-attention/pull/617)
66
+ - T5-like relative position encoding biases with Flash Attention 2 using this [PR](https://github.com/Dao-AILab/flash-attention/pull/956)
67
+ - Full attention biases with a [triton implementation](src/model/ops/flash_attention_v2_bias.py) of Flash Attention 2
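+
+ For reference: "full attention biases" here means an additive term applied to the pre-softmax attention scores, exactly as in the `attn_ref.py` reference of this repo. A minimal plain-PyTorch sketch (illustrative shapes, stock `scaled_dot_product_attention` rather than our Flash/Triton kernels):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # Illustrative shapes only: (batch, heads, seq_len, head_dim)
+ q = torch.randn(2, 8, 1024, 64)
+ k = torch.randn(2, 8, 1024, 64)
+ v = torch.randn(2, 8, 1024, 64)
+
+ # Additive bias (e.g. T5 relative position biases), broadcast over the batch
+ bias = torch.randn(1, 8, 1024, 1024)
+
+ # attn_mask accepts a float tensor that is added to q @ k^T / sqrt(d) before the softmax;
+ # this is the "full attention bias" that the kernels listed above fuse into Flash Attention
+ out = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)
+ ```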
68
+
69
+ <center>
70
+ <img src="https://github.com/catie-aq/flashT5/raw/main/assets/FAT5_dark.gif" alt="FAT5_dark" width="100%">
71
+ </center>
72
+
73
+ Other parts of the architecture were optimized using [ad-hoc Triton kernels](https://github.com/catie-aq/flashT5/tree/main/src/model/ops) for the cross-entropy (and z-loss) and the layernorm.
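+
+ For clarity, the z-loss mentioned here is an auxiliary term added to the cross-entropy that penalizes the squared log-partition function of the logits (we use a coefficient of 1e-4, see `z_loss` in `config.json`). An unfused, plain-PyTorch equivalent of what the Triton kernel computes (ignoring label smoothing and the in-place backward) would be:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def cross_entropy_with_z_loss(logits, labels, z_loss_coef=1e-4, ignore_index=-100):
+     # Standard CE plus a penalty on the squared logsumexp of the logits (the "z-loss"),
+     # zeroed out for ignored labels, as in cross_entropy_loss.py
+     ce = F.cross_entropy(logits, labels, ignore_index=ignore_index, reduction="none")
+     lse = torch.logsumexp(logits.float(), dim=-1)
+     z_loss = z_loss_coef * lse.square()
+     z_loss = z_loss.masked_fill(labels == ignore_index, 0.0)
+     return ce + z_loss, z_loss
+ ```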
74
+
75
+ For pretext tasks during pre-training, we use the [UL2](https://arxiv.org/abs/2205.05131v3) mixture of denoisers with the following 7 tasks:
76
+
77
+ ```python
78
+ denoiser_list=[
79
+ {"mu": 3.0, "r": 0.15, "max_spans": max_token_length, "prefix": "[R]"},
80
+ {"mu": 8.0, "r": 0.15, "max_spans": max_token_length, "prefix": "[R]"},
81
+ {"mu": 4.0, "r": 0.0, "max_spans": 1, "prefix": "[S]"},
82
+ {"mu": 3.0, "r": 0.5, "max_spans": max_token_length, "prefix": "[X]"},
83
+ {"mu": 8.0, "r": 0.15, "max_spans": max_token_length, "prefix": "[X]"},
84
+ {"mu": 64.0, "r": 0.15, "max_spans": max_token_length, "prefix": "[X]"},
85
+ {"mu": 64.0, "r": 0.5, "max_spans": max_token_length, "prefix": "[X]"}]
86
+
87
+ denoiser_proportions=[0.165, 0.165, 0.34, 0.0825, 0.0825, 0.0825, 0.0825]
88
+ ```
89
+ where `mu` is the mean span length, `r` the masking ratio (fraction of corrupted tokens) and `prefix` the type of pretext task (the meaning of the letters `[R]`, `[S]` and `[X]` is described [here](https://huggingface.co/google/ul2#mixture-of-denoisers)).
90
+
91
+ As there was no implementation available in PyTorch, we [added one](https://github.com/catie-aq/flashT5/blob/main/src/data/data_collator_ul2.py) and adapted a dynamic batching mechanism to reduce padding.
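+
+ As a purely illustrative, hypothetical sketch of what a single denoiser entry does (the actual logic lives in the data collator linked above): spans with mean length `mu` are masked until roughly a fraction `r` of the sequence is corrupted, and each contiguous masked span is replaced by a sentinel token (the sentinel id below is made up):
+
+ ```python
+ import numpy as np
+
+ def corrupt_spans(tokens, mu=3.0, r=0.15, sentinel_start=32000, seed=0):
+     # Hypothetical sketch of T5/UL2-style span corruption, not the repo's implementation
+     rng = np.random.default_rng(seed)
+     n = len(tokens)
+     masked = np.zeros(n, dtype=bool)
+     while masked.sum() < int(round(r * n)):
+         length = max(1, int(rng.poisson(mu)))            # span length with mean ~mu
+         start = int(rng.integers(0, max(1, n - length)))
+         masked[start:start + length] = True              # mask until ~r of the tokens are covered
+     inputs, sentinel, i = [], sentinel_start, 0
+     while i < n:
+         if masked[i]:
+             inputs.append(sentinel)                      # one sentinel per contiguous masked span
+             sentinel += 1
+             while i < n and masked[i]:
+                 i += 1
+         else:
+             inputs.append(tokens[i])
+             i += 1
+     return inputs
+ ```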
92
+
93
+ <br>
94
+
95
+ ## Benchmarks
96
+
97
+ #### TFLOPS
98
+
99
+ The number of TFLOPS (trillions of floating-point operations per second) is probably the most eloquent measure of the impact of the optimizations carried out.
100
+ We therefore compare four approaches:
101
+ - the SDPA (Scaled Dot Product Attention) implementation with full bias,
102
+ - the same implementation but in Triton,
103
+ - the Flash Attention RPE implementation (our kernel),
104
+ - the Flash Attention implementation, i.e. without bias. We've included it here for reference, as it's unusable in practice for a T5.
105
+
106
+ For the forward pass, we have:
107
+
108
+ <center>
109
+ <img src="https://github.com/catie-aq/flashT5/raw/main/assets/benchmarks/FWD-causal-True_dark.png" alt="FWD-causal-True_dark" width="100%">
110
+ </center>
111
+
112
+ For the forward pass, we can see that the Triton approach achieves 1.34 times more FLOPS than SDPA, and that the Flash Attention RPE approach achieves 1.99 times more FLOPS than SDPA.
113
+ We can also see that our bf16 implementation is equivalent to fp16 (doing even better at size 512).
114
+
115
+ For the backward pass, we have:
116
+
117
+ <center>
118
+ <img src="https://github.com/catie-aq/flashT5/raw/main/assets/benchmarks/BWD-causal-True_dark.png" alt="BWD-causal-True_dark" width="100%">
119
+ </center>
120
+
121
+ For the backward pass, the Triton implementation is less efficient than SDPA, with 0.71 times the FLOPS of SDPA. The Flash Attention RPE implementation is more or less equivalent to SDPA (1.018 times more FLOPS).
122
+ We can also observe that Triton in head_dim 64 is more efficient than Triton in head_dim 128.
123
+
124
+
125
+ #### Torch vs Triton
126
+
127
+ We mentioned above that we had optimized parts of the architecture using ad hoc Triton kernels, namely the cross-entropy and RMSNorm layer. The following benchmarks should illustrate why.
128
+ For the cross-entropy, we obtain a forward pass 7 to 11.4 times faster, a backward pass 3.26 to 3.75 times faster, and memory reduced by a factor of 4:
129
+
130
+ <center>
131
+ <img src="https://github.com/catie-aq/flashT5/raw/main/assets/benchmarks/CE_dark.png" alt="CE_dark" width="100%">
132
+ </center>
133
+
134
+ For the RMSNorm layer, we obtain a forward pass 3 to 5 times faster, a backward pass 2.33 to 4.33 times faster, and memory reduced by a factor of 3.2:
135
+ <center>
136
+ <img src="https://github.com/catie-aq/flashT5/raw/main/assets/benchmarks/LN_dark.png" alt="LN_dark" width="100%">
137
+ </center>
138
+
139
+ Note that all benchmark graphs can be generated automatically using the following [code](https://github.com/catie-aq/flashT5/tree/main/benchmarks).
140
+
141
+ <br>
142
+
143
+ ## Applications
144
+
145
+ ### To French
146
+ We've pretrained a small (147M parameters) FAT5-UL2 in French. This is the model you'll find in this Hugging Face repo.
147
+ The dataset we used is a mixture of [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX), [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia), [justice_fr](https://huggingface.co/datasets/eckendoerffer/justice_fr) and [The Stack](https://huggingface.co/datasets/bigcode/the-stack-dedup).
148
+ Our tokenizer of size 32,768 (8**5) is trained on CulturaX and The Stack.
149
+ Our model was pre-trained on sequences of 1,024 tokens on a single A100 for 1,461 hours (= 419.4B tokens seen) at an estimated cost of €2,200.
150
+
151
+ Carbon emissions for the pretraining of this model were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were utilized to estimate the carbon impact:
152
+ - Hardware Type: A100 PCIe 40/80GB
153
+ - Hours used: 1,461h
154
+ - Cloud Provider: Private Infrastructure
155
+ - Carbon Efficiency (kg/kWh): 0.03696 kg/kWh, estimated from [electricitymaps](https://app.electricitymaps.com/zone/FR) (average carbon intensity in France between October 18, 2024 and December 19, 2024)
156
+ - **Carbon Emitted** *(power consumption × time × carbon intensity of the power grid)*: **13.5 kg eq. CO2** (see the sanity check below).
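+
+ As a quick sanity check of the figure above, here is the arithmetic; the ~250 W average draw is our assumption, and is simply what the reported numbers imply for a single A100 PCIe:
+
+ ```python
+ hours = 1461                 # training time reported above
+ carbon_intensity = 0.03696   # kg CO2 eq. per kWh (France, Oct. 18 - Dec. 19, 2024)
+ avg_power_kw = 0.25          # assumed average GPU power draw (250 W)
+ print(avg_power_kw * hours * carbon_intensity)  # ≈ 13.5 kg CO2 eq.
+ ```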
157
+
158
+
159
+ ### To other languages
160
+
161
+ Our contribution focuses on French, with the pre-training and finetuning of models for comparison against French benchmarks. For other languages, we can't afford to do the same kind of work.
162
+
163
+ Nevertheless, to ensure that it can be used in other languages, we have developed [code](https://github.com/catie-aq/flashT5/blob/main/convert_huggingface_t5.py) for adapting already pre-trained (m)T5/FLAN-T5 weights to our method. In this way, we hope users of a specific language will be able to efficiently continue pre-training one of these models, for example to adapt it to more recent data.
164
+ Note, however, that this adaptation is limited, since the additional pre-training will have to be carried out within the precision of the original model. For example, if the model's weights are in FP32 (which is the case with the FLAN-T5), training will not be as fast as with the FAT5, which is in BF16.
165
+
166
+ For English speakers, we have already adapted the weights of the various versions of [FLAN-T5](https://arxiv.org/abs/2210.11416) to our method. All weights can be found in this Hugging Face [collection](https://huggingface.co/collections/CATIE-AQ/catie-english-fat5-flan-662b679a8e855c7c0137d69e).
167
+ To use one of these models, simply run:
168
+
169
+ ```python
170
+ from transformers import AutoModel, AutoTokenizer
171
+ model = AutoModel.from_pretrained("CATIE-AQ/FAT5-small-flan-en", trust_remote_code=True)
172
+ tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
173
+ ```
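+
+ As a more complete (hedged) sketch, the `AutoModelForSeq2SeqLM` class registered by the custom code (see `auto_map` in `config.json`) can be used for generation; the prompt and generation settings below are purely illustrative:
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+ model = AutoModelForSeq2SeqLM.from_pretrained("CATIE-AQ/FAT5-small-flan-en", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
+
+ inputs = tokenizer("Translate to German: The house is wonderful.", return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=32)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```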
174
+
175
+ <br>
176
+
177
+ ## Pretraining
178
+
179
+ If you want to pre-train your own model (to be specialized in a specific domain for example, and thus benefit from a custom tokenizer), we included a [tutorial](examples/minipile) to pretrain a small model on [minipile](https://huggingface.co/datasets/JeanKaddour/minipile) to show how it should be done.
180
+ You can find the documentation of the model configuration file [here](docs/configuration_file.md).
181
+ Note that we tested and trained the tutorial model on an A100; it may or may not work on other GPUs.
182
+
183
+ <br>
184
+
185
+ <!--
186
+ ## Finetuning
187
+
188
+ Once you've pre-trained your model, you'll want to finetune it. Because it's a custom model (hence the need for `trust_remote_code=True` to load it), it's currently causing some difficulties due to flaws in Hugging Face's Transformers library (during the `push_to_hub`, for example). This should be resolved in January 2025, as we're working on porting the FAT5 directly into Transformers with the help of their lovely team 🤗
189
+ -->
190
+
191
+ ## Roadmap
192
+
193
+ We invite you to consult the “Next stage” section of the blog post.
194
+
195
+ <br>
196
+
197
+ ## Citation
198
+ ```
199
+ The DOI will be added once the model is public.
200
+ ```
201
+
202
+ <br>
203
+
204
+ ## License
205
+ [Apache-2.0 license](https://github.com/catie-aq/flashT5/tree/main?tab=Apache-2.0-1-ov-file#readme)
206
+
207
+ <br>
208
+
209
+ ## Acknowledgments
210
+
211
+ We used the following repositories and thank their authors:
212
+ - [nanoT5](https://github.com/PiotrNawrot/nanoT5) for the simple implementation and the optimizer.
213
+ - [Flash attention](https://github.com/Dao-AILab/flash-attention) for the groundbreaking algorithm for computing attention.
214
+ - [Hugging Face](https://github.com/huggingface/transformers) for their excellent library.
215
+ - [FlagAttention](https://github.com/FlagOpen/FlagAttention) for the implementation of FA2 in Triton.
216
+ - [Unsloth](https://github.com/unslothai/unsloth) for the simple Triton kernels of the cross-entropy and layernorm that we adapted to our usage.
217
+ - [TurboT5](https://github.com/Knowledgator/TurboT5) for the improvement of the February 2024 version of our work.
218
+
219
+ This work was supported by the [Vaniila platform](http://vaniila.ai/).<br>
220
+ <div align="center">
221
+ <a href="http://vaniila.ai/" target="_blank">
222
+ <img src="https://www.vaniila.ai/wp-content/uploads/2020/02/Vaniila_bleu_horizontal.png" alt="Vaniila Logo" width="200">
223
+ </a>
224
+ </div>
adamw_scaled.py ADDED
@@ -0,0 +1,281 @@
1
+ import torch
2
+ import math
3
+
4
+ from torch.optim import Optimizer
5
+ from torch.optim.optimizer import _default_to_fused_or_foreach
6
+ from torch.utils._foreach_utils import _group_tensors_by_device_and_dtype
7
+ from typing import Iterable, Tuple
8
+ from torch import nn, Tensor
9
+
10
+ class AdamWScale(Optimizer):
11
+ """
12
+ This AdamW implementation is copied from Huggingface.
13
+ We modified it with Adagrad-style scaling by the RMS of each weight tensor.
14
+
15
+ Implements Adam algorithm with weight decay fix as introduced in [Decoupled Weight Decay
16
+ Regularization](https://arxiv.org/abs/1711.05101).
17
+
18
+ Parameters:
19
+ params (`Iterable[nn.parameter.Parameter]`):
20
+ Iterable of parameters to optimize or dictionaries defining parameter groups.
21
+ lr (`float`, *optional*, defaults to 1e-3):
22
+ The learning rate to use.
23
+ betas (`Tuple[float,float]`, *optional*, defaults to (0.9, 0.999)):
24
+ Adam's betas parameters (b1, b2).
25
+ eps (`float`, *optional*, defaults to 1e-6):
26
+ Adam's epsilon for numerical stability.
27
+ weight_decay (`float`, *optional*, defaults to 0.0):
28
+ Decoupled weight decay to apply.
29
+ kahan_sum (`bool`, *optional*, defaults to False):
30
+ Whether to use Kahan summation for updating parameters.
31
+ foreach (`bool`, *optional*, defaults to False):
32
+ Whether to use the foreach implementation.
33
+ correct_bias (`bool`, *optional*, defaults to True):
34
+ Whether to correct bias in Adam.
35
+ use_state_dtype (`torch.dtype`, *optional*, defaults to None):
36
+ The dtype to use for optimizer state. If None, use the default dtype.
37
+ """
38
+
39
+ def __init__(
40
+ self,
41
+ params: Iterable[nn.parameter.Parameter],
42
+ lr: float = 1e-3,
43
+ betas: Tuple[float, float] = (0.9, 0.999),
44
+ eps: float = 1e-6,
45
+ weight_decay: float = 0.0,
46
+ kahan_sum: bool = False,
47
+ foreach: bool = False,
48
+ correct_bias: bool = True,
49
+ use_state_dtype: torch.dtype = None
50
+ ):
51
+ if lr < 0.0:
52
+ raise ValueError(f"Invalid learning rate: {lr} - should be >= 0.0")
53
+ if not 0.0 <= betas[0] < 1.0:
54
+ raise ValueError(f"Invalid beta parameter: {betas[0]} - should be in [0.0, 1.0)")
55
+ if not 0.0 <= betas[1] < 1.0:
56
+ raise ValueError(f"Invalid beta parameter: {betas[1]} - should be in [0.0, 1.0)")
57
+ if not 0.0 <= eps:
58
+ raise ValueError(f"Invalid epsilon value: {eps} - should be >= 0.0")
59
+
60
+ assert not (foreach and use_state_dtype is not None), "foreach is not supported with use_state_dtype"
61
+
62
+ defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, foreach=foreach, \
63
+ kahan_sum=kahan_sum, correct_bias=correct_bias, use_state_dtype=use_state_dtype)
64
+
65
+ super().__init__(params, defaults)
66
+
67
+ @staticmethod
68
+ def _rms(tensor):
69
+ return tensor.norm(2) / (tensor.numel() ** 0.5)
70
+
71
+ @torch.no_grad()
72
+ def step(self, closure=None):
73
+ """
74
+ Performs a single optimization step.
75
+
76
+ Arguments:
77
+ closure (`Callable`, *optional*): A closure that reevaluates the model and returns the loss.
78
+ """
79
+ loss = None
80
+ if closure is not None:
81
+ loss = closure()
82
+
83
+ for group in self.param_groups:
84
+ params, grads, exp_avgs, exp_avg_sqs, steps, kahan_comps = [], [], [], [], [], []
85
+
86
+ # Initialization
87
+ for p in group['params']:
88
+ if p.grad is None:
89
+ continue
90
+
91
+ params.append(p)
92
+ if p.grad.is_sparse:
93
+ raise RuntimeError('AdamWScale does not support sparse gradients')
94
+ grads.append(p.grad)
95
+
96
+ state = self.state[p]
97
+
98
+ # State initialization
99
+ if "kahan_comp" not in state:
100
+ state['step'] = torch.tensor(0, dtype=torch.int32, device=p.device)
101
+
102
+ if group["use_state_dtype"] in [torch.float16, torch.bfloat16]:
103
+ state['exp_avg'] = torch.zeros_like(p, device=p.device, dtype=group["use_state_dtype"])
104
+ state['exp_avg_sq'] = torch.zeros_like(p, device=p.device, dtype=group["use_state_dtype"])
105
+ else:
106
+ state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format)
107
+ state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
108
+
109
+ if group["kahan_sum"] and p.dtype in [torch.float16, torch.bfloat16]:
110
+ state["kahan_comp"] = torch.zeros_like(p, memory_format=torch.preserve_format)
111
+ else:
112
+ state["kahan_comp"] = None
113
+ group["kahan_sum"] = False
114
+
115
+ exp_avgs.append(state['exp_avg'])
116
+ exp_avg_sqs.append(state['exp_avg_sq'])
117
+ kahan_comps.append(state["kahan_comp"])
118
+ steps.append(state["step"])
119
+
120
+ torch._foreach_add_(steps, 1)
121
+
122
+ # AdamW step
123
+ if group["foreach"] and _default_to_fused_or_foreach(params, False, False):
124
+ self._foreach_adamwscaled(params,
125
+ grads,
126
+ exp_avgs,
127
+ exp_avg_sqs,
128
+ steps,
129
+ kahan_comps,
130
+ group["lr"],
131
+ group["betas"][0],
132
+ group["betas"][1],
133
+ group["weight_decay"],
134
+ group["eps"],
135
+ group["kahan_sum"],
136
+ group["correct_bias"])
137
+ else:
138
+ self._adamwscaled(params,
139
+ grads,
140
+ exp_avgs,
141
+ exp_avg_sqs,
142
+ steps,
143
+ kahan_comps,
144
+ group["lr"],
145
+ group["betas"][0],
146
+ group["betas"][1],
147
+ group["weight_decay"],
148
+ group["eps"],
149
+ group["kahan_sum"],
150
+ group["correct_bias"])
151
+
152
+ return loss
153
+
154
+ def _adamwscaled(self,
155
+ params: list[Tensor],
156
+ grads: list[Tensor],
157
+ exp_avgs: list[Tensor],
158
+ exp_avg_sqs: list[Tensor],
159
+ steps: list[Tensor],
160
+ kahan_comps: list[Tensor],
161
+ lr: float,
162
+ beta1: float,
163
+ beta2: float,
164
+ weight_decay: float,
165
+ eps: float,
166
+ do_kahan_sum: bool,
167
+ correct_bias: bool):
168
+
169
+ for i, p in enumerate(params):
170
+
171
+ exp_avg, exp_avg_sq, grad, step, kahan_comp = exp_avgs[i], exp_avg_sqs[i], grads[i], steps[i], kahan_comps[i]
172
+
173
+ # Decay the first and second moment running average coefficient
174
+ # In-place operations to update the averages at the same time
175
+ exp_avg.mul_(beta1).add_(grad, alpha=(1.0 - beta1))
176
+ exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=(1.0 - beta2))
177
+ denom = exp_avg_sq.sqrt().add_(eps)
178
+
179
+ step_size = lr
180
+ if correct_bias: # No bias correction for Bert
181
+ bias_correction1 = 1.0 - beta1 ** step
182
+ bias_correction2 = 1.0 - beta2 ** step
183
+ step_size = step_size * math.sqrt(bias_correction2) / bias_correction1
184
+
185
+ # Adapt Step from Adafactor
186
+ step_size = step_size * max(1e-3, self._rms(p.data))
187
+
188
+ if do_kahan_sum:
189
+ # Adam step
190
+ kahan_comp.addcdiv_(exp_avg, denom, value=-step_size)
191
+
192
+ # update weights with kahan compensation using dev_grads as temp buffer
193
+ grad.copy_(p)
194
+ p.add_(kahan_comp)
195
+
196
+ # save error back to kahan compensation for next iteration
197
+ grad.sub_(p, alpha=1)
198
+ kahan_comp.add_(grad, alpha=1)
199
+ else:
200
+ p.addcdiv_(exp_avg, denom, value=-step_size)
201
+
202
+ # Just adding the square of the weights to the loss function is *not*
203
+ # the correct way of using L2 regularization/weight decay with Adam,
204
+ # since that will interact with the m and v parameters in strange ways.
205
+ #
206
+ # Instead we want to decay the weights in a manner that doesn't interact
207
+ # with the m/v parameters. This is equivalent to adding the square
208
+ # of the weights to the loss with plain (non-momentum) SGD.
209
+ # Add weight decay at the end (fixed version)
210
+ if weight_decay > 0.0:
211
+ p.add_(p, alpha=(-lr * weight_decay))
212
+
213
+ def _foreach_adamwscaled(self,
214
+ params: list[Tensor],
215
+ grads: list[Tensor],
216
+ exp_avgs: list[Tensor],
217
+ exp_avg_sqs: list[Tensor],
218
+ steps: list[Tensor],
219
+ kahan_comps: list[Tensor],
220
+ lr: float,
221
+ beta1: float,
222
+ beta2: float,
223
+ weight_decay: float,
224
+ eps: float,
225
+ do_kahan_sum: bool,
226
+ correct_bias: bool):
227
+
228
+ grouped_tensors = _group_tensors_by_device_and_dtype([params, grads, exp_avgs, exp_avg_sqs, kahan_comps])
229
+
230
+ for (_, dtype), ((dev_params, dev_grads, dev_exp_avgs, dev_exp_avg_sqs, dev_kahan_comps), _) in grouped_tensors.items():
231
+ # Foreach implementation
232
+ torch._foreach_mul_(dev_exp_avgs, beta1)
233
+ torch._foreach_add_(dev_exp_avgs, dev_grads, alpha=1 - beta1)
234
+
235
+ torch._foreach_mul_(dev_exp_avg_sqs, beta2)
236
+ torch._foreach_addcmul_(dev_exp_avg_sqs, dev_grads, dev_grads, 1 - beta2)
237
+
238
+ # Compute denominator
239
+ torch._foreach_copy_(dev_grads, dev_exp_avg_sqs)
240
+ torch._foreach_sqrt_(dev_grads)
241
+ torch._foreach_add_(dev_grads, eps)
242
+
243
+ step_size = [torch.tensor(lr, dtype=torch.float32, device=p.device) for p in dev_params]
244
+
245
+ if correct_bias:
246
+ torch._foreach_mul_(step_size,
247
+ [torch.tensor((math.sqrt(1 - beta2 ** steps[i].item()) / (1 - beta1 ** steps[i].item()) ), dtype=torch.float32, device=p.device)
248
+ for i, p in enumerate(dev_params)])
249
+
250
+ # Adapt step size using RMS of parameters
251
+ rms_p = torch._foreach_norm(dev_params)
252
+ numel = [torch.tensor(math.sqrt(p.numel())) for p in dev_params]
253
+ torch._foreach_div_(rms_p, numel)
254
+ torch._foreach_maximum_(rms_p, 1e-3)
255
+
256
+ torch._foreach_mul_(step_size, rms_p)
257
+ torch._foreach_div_(dev_grads, step_size)
258
+
259
+ # explicitly delete tensors when not used
260
+ del rms_p
261
+ del numel
262
+ del step_size
263
+
264
+ # Update parameters
265
+ if do_kahan_sum:
266
+ # Adam step
267
+ torch._foreach_addcdiv_(dev_kahan_comps, dev_exp_avgs, dev_grads, value=-1)
268
+
269
+ # update weights with kahan compensation using dev_grads as temp buffer
270
+ torch._foreach_copy_(dev_grads, dev_params)
271
+ torch._foreach_add_(dev_params, dev_kahan_comps, alpha=1)
272
+
273
+ # save error back to kahan compensation for next iteration
274
+ torch._foreach_sub_(dev_grads, dev_params, alpha=1)
275
+ torch._foreach_add_(dev_kahan_comps, dev_grads, alpha=1)
276
+ else:
277
+ torch._foreach_addcdiv_(dev_params, dev_exp_avgs, dev_grads, value=-1)
278
+
279
+ # Weight decay
280
+ if weight_decay > 0.0:
281
+ torch._foreach_add_(dev_params, dev_params, alpha=-weight_decay * lr)
attn_ref.py ADDED
@@ -0,0 +1,29 @@
1
+ import torch
2
+
3
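+ # Reference (unfused) attention used to validate the Flash/Triton kernels:
+ # computes softmax(q @ k^T * sm_scale + b) @ v, with optional causal masking and dropout.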
+ def attn_ref(q, k, v, b, sm_scale, dropout_p=0.0, causal=False, upcast=False):
4
+ if upcast:
5
+ q, k, v = q.float(), k.float(), v.float()
6
+ if b is not None:
7
+ b = b.float()
8
+
9
+ if b is not None:
10
+ if (b.shape[0] != q.shape[0]) or (b.shape[1] != q.shape[1]):
11
+ b = b.expand(q.shape[0], q.shape[1], q.shape[2], k.shape[2])
12
+
13
+ ms = torch.arange(q.shape[2], device=q.device).unsqueeze(-1)
14
+ ns = torch.arange(k.shape[2], device=q.device)
15
+
16
+ p = torch.matmul(q, k.transpose(2, 3))
17
+ p *= sm_scale
18
+ if b is not None:
19
+ p += b
20
+
21
+ if causal:
22
+ p = torch.where(ms + k.shape[2] - q.shape[2] >= ns, p, float("-inf"))
23
+
24
+ p = torch.softmax(p.float(), dim=-1).to(q.dtype)
25
+ if dropout_p > 0.0:
26
+ p = torch.dropout(p, dropout_p, train=True)
27
+
28
+ ref_out = torch.matmul(p, v)
29
+ return ref_out
config.json ADDED
@@ -0,0 +1,59 @@
1
+ {
2
+ "alibi_mode": "symetric",
3
+ "architectures": [
4
+ "FlashT5ForConditionalGeneration"
5
+ ],
6
+ "attention_dropout_rate": 0.0,
7
+ "attention_scale": 1.0,
8
+ "attention_type": "triton",
9
+ "auto_map": {
10
+ "AutoConfig": "configuration_flash_t5.FlashT5Config",
11
+ "AutoModel": "modeling_flash_t5.FlashT5EncoderModel",
12
+ "AutoModelForQuestionAnswering": "custom_heads_flash_t5.FlashT5ForQuestionAnswering",
13
+ "AutoModelForSeq2SeqLM": "modeling_flash_t5.FlashT5ForConditionalGeneration",
14
+ "AutoModelForSequenceClassification": "custom_heads_flash_t5.FlashT5ForSequenceClassification",
15
+ "AutoModelForTokenClassification": "custom_heads_flash_t5.FlashT5ForTokenClassification"
16
+ },
17
+ "classifier_dropout": 0.0,
18
+ "crossentropy_inplace_backward": false,
19
+ "d_ff": 2048,
20
+ "d_kv": 64,
21
+ "d_model": 512,
22
+ "decoder_start_token_id": 0,
23
+ "dense_act_fn": "relu",
24
+ "dropout_rate": 0.0,
25
+ "eos_token_id": 1,
26
+ "feed_forward_proj": "relu",
27
+ "fire_mlp_width": 32,
28
+ "initializer_factor": 1.0,
29
+ "is_encoder_decoder": false,
30
+ "is_gated_act": false,
31
+ "label_smoothing": 0.0,
32
+ "layer_norm_epsilon": 1e-06,
33
+ "max_sequence_length": 1024,
34
+ "model_type": "flash_t5",
35
+ "num_decoder_layers": 12,
36
+ "num_heads": 8,
37
+ "num_layers": 12,
38
+ "pad_token_id": 3,
39
+ "position_encoding_type": "t5",
40
+ "relative_attention_max_distance": 128,
41
+ "relative_attention_num_buckets": 32,
42
+ "rotary_base": 10000,
43
+ "rotary_emb_fraction": 1.0,
44
+ "rotary_interleaved": false,
45
+ "rotary_scale_base": null,
46
+ "tie_word_embeddings": false,
47
+ "torch_dtype": "float32",
48
+ "transformers_version": "4.46.0.dev0",
49
+ "use_cache": true,
50
+ "use_full_bias_size": false,
51
+ "use_gelu_act": true,
52
+ "use_glu_mlp": true,
53
+ "use_masking": false,
54
+ "use_randomized_position_encoding": false,
55
+ "use_triton_crossentropy": true,
56
+ "use_triton_layernorm": true,
57
+ "vocab_size": 32768,
58
+ "z_loss": 0.0001
59
+ }
configuration_flash_t5.py ADDED
@@ -0,0 +1,84 @@
1
+ import sys
2
+ from collections import OrderedDict
3
+ from typing import Mapping
4
+ import logging
5
+
6
+ from transformers import T5Config
7
+
8
+ AUTO_MAP = {
9
+ "AutoModel": "modeling_flash_t5.FlashT5EncoderModel",
10
+ "AutoModelForSeq2SeqLM": "modeling_flash_t5.FlashT5ForConditionalGeneration",
11
+ "AutoModelForTokenClassification": "custom_heads_flash_t5.FlashT5ForTokenClassification",
12
+ "AutoModelForQuestionAnswering": "custom_heads_flash_t5.FlashT5ForQuestionAnswering",
13
+ "AutoModelForSequenceClassification": "custom_heads_flash_t5.FlashT5ForSequenceClassification",
14
+ }
15
+
16
+ class FlashT5Config(T5Config):
17
+
18
+ model_type = "flash_t5"
19
+
20
+ def __init__(
21
+ self,
22
+ decoder_start_token_id=0,
23
+ pad_token_id=-100,
24
+ use_glu_mlp=False,
25
+ position_encoding_type="t5",
26
+ use_randomized_position_encoding=False,
27
+ label_smoothing=0.0,
28
+ z_loss=None,
29
+ attention_type="ref",
30
+ max_sequence_length=1024,
31
+ attention_dropout_rate=0.0,
32
+ alibi_mode="symetric",
33
+ use_triton_layernorm=False,
34
+ use_triton_crossentropy=False,
35
+ crossentropy_inplace_backward=False,
36
+ use_gelu_act=True,
37
+ use_full_bias_size=False,
38
+ rotary_emb_fraction=1.0,
39
+ rotary_base=10000,
40
+ rotary_interleaved=False,
41
+ rotary_scale_base=None,
42
+ fire_mlp_width=32,
43
+ use_masking=False,
44
+ attention_scale=None,
45
+ **kwargs,
46
+ ):
47
+ super().__init__(**kwargs)
48
+
49
+ self.decoder_start_token_id = decoder_start_token_id
50
+ self.pad_token_id = pad_token_id
51
+ self.use_glu_mlp = use_glu_mlp
52
+ self.position_encoding_type = position_encoding_type
53
+ self.use_randomized_position_encoding = use_randomized_position_encoding
54
+ self.label_smoothing = label_smoothing
55
+ self.z_loss = z_loss
56
+ self.attention_type = attention_type
57
+ self.max_sequence_length = max_sequence_length
58
+ self.alibi_mode = alibi_mode
59
+ self.attention_dropout_rate = attention_dropout_rate
60
+ self.use_triton_layernorm = use_triton_layernorm
61
+ self.use_triton_crossentropy = use_triton_crossentropy
62
+ self.crossentropy_inplace_backward = crossentropy_inplace_backward
63
+ self.use_gelu_act = use_gelu_act
64
+ self.use_full_bias_size = use_full_bias_size
65
+ self.rotary_base = rotary_base
66
+ self.rotary_interleaved = rotary_interleaved
67
+ self.rotary_scale_base = rotary_scale_base
68
+ self.rotary_emb_fraction = rotary_emb_fraction
69
+ self.fire_mlp_width = fire_mlp_width
70
+ self.use_masking = use_masking
71
+ self.attention_scale = attention_scale
72
+
73
+ self.auto_map = AUTO_MAP
74
+
75
+ def str_to_class(classname):
76
+ return getattr(sys.modules[__name__], classname)
77
+
78
+ # Register model in Auto API
79
+ try:
80
+ FlashT5Config.register_for_auto_class()
81
+ for key, value in AUTO_MAP.items():
82
+ str_to_class(value.split(".")[-1]).register_for_auto_class(key)
83
+ except Exception:
84
+ logging.warning("AutoRegister isn't available.")
cross_entropy_loss.py ADDED
@@ -0,0 +1,426 @@
1
+ # Copyright (c) 2023, Tri Dao.
2
+ # Copyright 2024 CATIE. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ #
16
+ # Modification to the original version from Tri Dao:
17
+ # - support for torch.compile
18
+
19
+ from typing import Tuple, Optional
20
+
21
+ import torch
22
+ import torch.nn.functional as F
23
+
24
+ import triton
25
+ import triton.language as tl
26
+
27
+ # `all_gather_into_tensor` and `reduce_scatter_tensor` are new placeholders for
28
+ # `_all_gather_base` and `_reduce_scatter_base`. They require the most recent
29
+ # version of PyTorch. The following 2 lines are for backward compatibility with
30
+ # older PyTorch.
31
+ if "all_gather_into_tensor" not in dir(torch.distributed):
32
+ torch.distributed.all_gather_into_tensor = torch.distributed._all_gather_base
33
+
34
+
35
+ @triton.heuristics(
36
+ {
37
+ "HAS_SMOOTHING": lambda args: args["smoothing"] > 0.0,
38
+ }
39
+ )
40
+ @triton.jit
41
+ def cross_entropy_fwd_kernel(
42
+ loss_ptr, # data ptrs
43
+ lse_ptr,
44
+ z_loss_ptr,
45
+ logits_ptr,
46
+ labels_ptr,
47
+ smoothing,
48
+ logit_scale,
49
+ lse_square_scale,
50
+ ignore_index,
51
+ total_classes,
52
+ class_start_idx, # Useful for tensor parallel when each rank only has a subset of classes
53
+ n_cols, # shapes
54
+ logits_row_stride, # strides
55
+ BLOCK_SIZE: tl.constexpr,
56
+ HAS_SMOOTHING: tl.constexpr,
57
+ # if SPLIT (e.g. tensor parallel), don't include the LSE in the loss since it's not the final LSE
58
+ SPLIT: tl.constexpr,
59
+ PRECOMPUTED_LSE: tl.constexpr, # If LSE is already computed (also no smoothing and logit_scale == 1.0)
60
+ ):
61
+ row_idx = tl.program_id(0)
62
+ logits_ptr = logits_ptr + row_idx * logits_row_stride.to(tl.int64)
63
+ sum_logits = 0.0 # For smoothing
64
+ if not PRECOMPUTED_LSE:
65
+ # Statistics for online softmax
66
+ m_i = -float("inf")
67
+ l_i = 0.0
68
+ for col_offset in range(0, n_cols, BLOCK_SIZE):
69
+ cols = col_offset + tl.arange(0, BLOCK_SIZE)
70
+ logits = tl.load(logits_ptr + cols, mask=cols < n_cols, other=-float("inf")).to(
71
+ tl.float32
72
+ ) * logit_scale
73
+ if HAS_SMOOTHING:
74
+ sum_logits += tl.sum(tl.where(cols < n_cols, logits, 0.0))
75
+ m_i_new = tl.maximum(m_i, tl.max(logits))
76
+ l_i = tl.exp(m_i - m_i_new) * l_i + tl.sum(tl.exp(logits - m_i_new))
77
+ m_i = m_i_new
78
+ lse = tl.log(l_i) + m_i
79
+ tl.store(lse_ptr + row_idx, lse)
80
+ else:
81
+ lse = tl.load(lse_ptr + row_idx)
82
+ label_idx = tl.load(labels_ptr + row_idx)
83
+ if label_idx == ignore_index:
84
+ loss = 0.0
85
+ z_loss = 0.0
86
+ else:
87
+ label_idx -= class_start_idx
88
+ if label_idx >= 0 and label_idx < n_cols:
89
+ logits_label = tl.load(logits_ptr + label_idx) * logit_scale
90
+ if HAS_SMOOTHING:
91
+ loss = (
92
+ (lse if not SPLIT else 0.0)
93
+ - smoothing * sum_logits / total_classes
94
+ - (1 - smoothing) * logits_label
95
+ )
96
+ else:
97
+ loss = (lse if not SPLIT else 0.0) - logits_label
98
+ else:
99
+ # If label is out of bounds, we set the CE loss to 0.0. But we still want the smoothing loss
100
+ if HAS_SMOOTHING:
101
+ loss = smoothing * ((lse if not SPLIT else 0.0) - sum_logits / total_classes)
102
+ else:
103
+ loss = 0.0
104
+ if not SPLIT:
105
+ z_loss = lse_square_scale * lse * lse
106
+ loss += z_loss
107
+ else:
108
+ z_loss = 0.0
109
+ tl.store(loss_ptr + row_idx, loss)
110
+ if not SPLIT:
111
+ tl.store(z_loss_ptr + row_idx, z_loss)
112
+
113
+
114
+ @triton.heuristics(
115
+ {
116
+ "HAS_SMOOTHING": lambda args: args["smoothing"] > 0.0,
117
+ }
118
+ )
119
+ @triton.jit
120
+ def cross_entropy_bwd_kernel(
121
+ dlogits_ptr, # data ptrs
122
+ dloss_ptr,
123
+ logits_ptr,
124
+ lse_ptr,
125
+ labels_ptr,
126
+ smoothing,
127
+ logit_scale,
128
+ lse_square_scale,
129
+ ignore_index,
130
+ total_classes,
131
+ class_start_idx, # Useful for tensor parallel when each rank only has a subset of classes
132
+ n_cols, # shapes
133
+ logits_row_stride, # strides
134
+ dlogits_row_stride,
135
+ dloss_row_stride,
136
+ BLOCK_SIZE: tl.constexpr,
137
+ HAS_SMOOTHING: tl.constexpr,
138
+ ):
139
+ row_idx = tl.program_id(0)
140
+ col_block_idx = tl.program_id(1)
141
+ logits_ptr = logits_ptr + row_idx * logits_row_stride.to(tl.int64)
142
+ dlogits_ptr = dlogits_ptr + row_idx * dlogits_row_stride.to(tl.int64)
143
+ col_offsets = col_block_idx * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
144
+ label_idx = tl.load(labels_ptr + row_idx)
145
+ if label_idx != ignore_index:
146
+ dloss = tl.load(dloss_ptr + row_idx * dloss_row_stride)
147
+ else:
148
+ dloss = 0.0
149
+ logits = tl.load(logits_ptr + col_offsets, mask=col_offsets < n_cols, other=-float("inf")).to(
150
+ tl.float32
151
+ ) * logit_scale
152
+ lse = tl.load(lse_ptr + row_idx)
153
+ probs = tl.exp(logits - lse)
154
+ probs += 2.0 * lse_square_scale * lse * probs
155
+ label_idx -= class_start_idx
156
+ if HAS_SMOOTHING:
157
+ smooth_positive = 1.0 - smoothing
158
+ smooth_negative = smoothing / total_classes
159
+ probs = tl.where(col_offsets == label_idx, probs - smooth_positive, probs) - smooth_negative
160
+ else:
161
+ probs = tl.where(col_offsets == label_idx, probs - 1.0, probs)
162
+ tl.store(dlogits_ptr + col_offsets, (dloss * logit_scale) * probs, mask=col_offsets < n_cols)
163
+
164
+ @torch.library.custom_op("flasht5::cross_entropy_triton_fwd", mutates_args=(), device_types="cuda")
165
+ def cross_entropy_triton_fwd(
166
+ logits: torch.Tensor,
167
+ labels: torch.Tensor,
168
+ precomputed_lse: torch.Tensor,
169
+ use_precomputed_lse: bool,
170
+ split: bool,
171
+ smoothing: float,
172
+ logit_scale: float,
173
+ lse_square_scale: float,
174
+ ignore_index: int,
175
+ total_classes: int,
176
+ class_start_idx: int,
177
+ n_cols: int,
178
+ n_rows: int,
179
+ BLOCK_SIZE: int,
180
+ num_warps: int
181
+ ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
182
+
183
+ if logits.stride(-1) != 1:
184
+ logits = logits.contiguous()
185
+
186
+ losses = torch.empty(n_rows, dtype=torch.float, device=logits.device)
187
+ if use_precomputed_lse:
188
+ assert precomputed_lse.shape == (n_rows,)
189
+ lse = precomputed_lse.contiguous()
190
+ else:
191
+ lse = torch.empty(n_rows, dtype=torch.float, device=logits.device)
192
+
193
+ z_losses = torch.empty(n_rows, dtype=torch.float, device=logits.device)
194
+ # Need this, otherwise Triton tries to launch from cuda:0 and we get
195
+ # ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
196
+ with torch.cuda.device(logits.device.index):
197
+ cross_entropy_fwd_kernel[(n_rows,)](
198
+ losses, # data ptrs
199
+ lse,
200
+ z_losses,
201
+ logits,
202
+ labels,
203
+ smoothing,
204
+ logit_scale,
205
+ lse_square_scale,
206
+ ignore_index,
207
+ total_classes,
208
+ class_start_idx,
209
+ n_cols, # shapes
210
+ logits.stride(0), # strides
211
+ BLOCK_SIZE=BLOCK_SIZE, # constants
212
+ SPLIT=split,
213
+ PRECOMPUTED_LSE=use_precomputed_lse,
214
+ num_warps=num_warps,
215
+ )
216
+
217
+ return losses, z_losses, lse
218
+
219
+
220
+ @torch.library.register_fake("flasht5::cross_entropy_triton_fwd")
221
+ def cross_entropy_triton_fwd_abstract(logits, labels, precomputed_lse, use_precomputed_lse, split, smoothing, logit_scale, lse_square_scale, ignore_index, total_classes, class_start_idx, n_cols, n_rows, BLOCK_SIZE, num_warps):
222
+ losses = torch.empty(n_rows, dtype=torch.float32, device=logits.device)
223
+ z_losses = torch.empty(n_rows, dtype=torch.float32, device=logits.device)
224
+ logsumexp = torch.empty(n_rows, dtype=torch.float32, device=logits.device)
225
+
226
+ return losses, z_losses, logsumexp
227
+
228
+ @torch.library.custom_op("flasht5::cross_entropy_triton_bwd", mutates_args={"logits"}, device_types="cuda")
229
+ def cross_entropy_triton_bwd(
230
+ dlosses: torch.Tensor,
231
+ logits: torch.Tensor,
232
+ lse: torch.Tensor,
233
+ labels: torch.Tensor,
234
+ inplace_backward: bool,
235
+ smoothing: float,
236
+ logit_scale: float,
237
+ lse_square_scale: float,
238
+ ignore_index: int,
239
+ total_classes: int,
240
+ class_start_idx: int,
241
+ n_cols: int,
242
+ n_rows: int,
243
+ BLOCK_SIZE: int,
244
+ num_warps: int
245
+ ) -> torch.Tensor:
246
+
247
+ dlogits = logits if inplace_backward else torch.empty_like(logits)
248
+
249
+ grid = lambda META: (n_rows, triton.cdiv(n_cols, META["BLOCK_SIZE"])) # noqa
250
+
251
+ # Need this, otherwise Triton tries to launch from cuda:0 and we get
252
+ # ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
253
+ with torch.cuda.device(logits.device.index):
254
+ cross_entropy_bwd_kernel[grid](
255
+ dlogits, # data ptrs
256
+ dlosses,
257
+ logits,
258
+ lse,
259
+ labels,
260
+ smoothing,
261
+ logit_scale,
262
+ lse_square_scale,
263
+ ignore_index,
264
+ total_classes,
265
+ class_start_idx,
266
+ n_cols, # shapes
267
+ logits.stride(0), # strides
268
+ dlogits.stride(0),
269
+ dlosses.stride(0),
270
+ BLOCK_SIZE=BLOCK_SIZE, # constants
271
+ num_warps=num_warps,
272
+ )
273
+
274
+ return dlogits if not inplace_backward else None
275
+
276
+ @torch.library.register_fake("flasht5::cross_entropy_triton_bwd")
277
+ def cross_entropy_triton_bwd_abstract(dlosses, logits, lse, labels, inplace_backward, smoothing, logit_scale, lse_square_scale, ignore_index, total_classes, class_start_idx, n_cols, n_rows, BLOCK_SIZE, num_warps):
278
+ return torch.empty_like(logits)
279
+
280
+ class CrossEntropyLoss(torch.autograd.Function):
281
+
282
+ @staticmethod
283
+ def forward(
284
+ ctx,
285
+ logits,
286
+ labels,
287
+ precomputed_lse=None,
288
+ smoothing=0.0,
289
+ logit_scale=1.0,
290
+ lse_square_scale=0.0,
291
+ ignore_index=-100,
292
+ inplace_backward=False,
293
+ process_group=None,
294
+ ):
295
+ # For some reason Triton generates wrong code when labels has dtype long and its address
296
+ # is not aligned to 16 bytes. The ld.global.b64 seems to load the wrong label index.
297
+ if labels.dtype == torch.long and labels.data_ptr() % 16 != 0:
298
+ labels = F.pad(labels, (0, 1))[..., :-1]
299
+ assert labels.data_ptr() % 16 == 0
300
+
301
+ n_rows, n_cols = logits.shape
302
+ assert labels.shape == (n_rows,)
303
+ world_size = 1 if process_group is None else torch.distributed.get_world_size(process_group)
304
+ total_classes = world_size * n_cols
305
+ rank = 0 if process_group is None else torch.distributed.get_rank(process_group)
306
+ class_start_idx = rank * n_cols
307
+ use_precomputed_lse = precomputed_lse is not None and logit_scale == 1.0 and smoothing == 0.0
308
+
309
+ MAX_BLOCK_SIZE = 16 * 1024
310
+ BLOCK_SIZE = min(triton.next_power_of_2(n_cols), MAX_BLOCK_SIZE)
311
+ num_warps = (
312
+ 4
313
+ if BLOCK_SIZE < 2048
314
+ else (8 if BLOCK_SIZE < 8192 else (16 if BLOCK_SIZE < 128 * 1024 else 32))
315
+ )
316
+
317
+ losses, z_losses, lse = torch.ops.flasht5.cross_entropy_triton_fwd(
318
+ logits, labels, precomputed_lse, use_precomputed_lse, \
319
+ world_size > 1, smoothing, logit_scale, lse_square_scale, \
320
+ ignore_index, total_classes, class_start_idx, \
321
+ n_cols, n_rows, BLOCK_SIZE, num_warps
322
+ )
323
+
324
+ if world_size > 1:
325
+ # If there's no smoothing, if labels are in the vocab of this partition, losses contains
326
+ # - predicted logit, and 0 otherwise.
327
+ # If there's smoothing=0.1, for labels in the vocab of this partition, losses contains
328
+ # -0.9 * predicted logit - 0.1 * sum logit / total_classes.
329
+ # For labels not in the vocab of this partition, losses contains
330
+ # -0.1 * sum logit / total_classes.
331
+ if world_size > 1:
332
+ lse_allgather = torch.empty(world_size, n_rows, dtype=lse.dtype, device=lse.device)
333
+ torch.distributed.all_gather_into_tensor(lse_allgather, lse, group=process_group)
334
+ handle_losses = torch.distributed.all_reduce(
335
+ losses, op=torch.distributed.ReduceOp.SUM, group=process_group, async_op=True
336
+ )
337
+ lse = torch.logsumexp(lse_allgather, dim=0)
338
+ handle_losses.wait()
339
+ # After the allreduce, if there's no smoothing, the total losses are - predicted_logit,
340
+ # we just have to add the (global) lse.
341
+ # If there's smoothing=0.1, the total losses are
342
+ # -0.9 * predicted_logit - 0.1 * sum logit / total_classes.
343
+ # Again, we just have to add the (global) lse.
344
+ losses += lse
345
+ if lse_square_scale != 0.0:
346
+ z_losses = lse_square_scale * lse.square()
347
+ z_losses.masked_fill_(labels == ignore_index, 0.0)
348
+ losses += z_losses
349
+ else:
350
+ z_losses = torch.zeros_like(losses)
351
+ losses.masked_fill_(labels == ignore_index, 0.0)
352
+
353
+ ctx.save_for_backward(logits, lse, labels)
354
+ ctx.mark_non_differentiable(z_losses)
355
+ ctx.smoothing = smoothing
356
+ ctx.logit_scale = logit_scale
357
+ ctx.lse_square_scale = lse_square_scale
358
+ ctx.ignore_index = ignore_index
359
+ ctx.total_classes = total_classes
360
+ ctx.class_start_idx = class_start_idx
361
+ ctx.inplace_backward = inplace_backward
362
+
363
+ return losses, z_losses
364
+
365
+ @staticmethod
366
+ def backward(ctx, grad_losses, grad_z_losses):
367
+ del grad_z_losses # z_losses are only for logging.
368
+
369
+ logits, lse, labels = ctx.saved_tensors
370
+
371
+ n_rows, n_cols = logits.shape
372
+ BLOCK_SIZE = min(triton.next_power_of_2(n_cols), 4 * 1024)
373
+ num_warps = 4 if BLOCK_SIZE < 2048 else (8 if BLOCK_SIZE < 8192 else 16)
374
+
375
+ dlogits = torch.ops.flasht5.cross_entropy_triton_bwd(
376
+ grad_losses, logits, lse, labels, \
377
+ ctx.inplace_backward, ctx.smoothing, ctx.logit_scale, \
378
+ ctx.lse_square_scale, ctx.ignore_index, ctx.total_classes, \
379
+ ctx.class_start_idx, n_cols, n_rows, BLOCK_SIZE, num_warps
380
+ )
381
+
382
+ if ctx.inplace_backward:
383
+ dlogits = logits
384
+
385
+ return dlogits, None, None, None, None, None, None, None, None, None
386
+
387
+
388
+ def cross_entropy_loss(
389
+ logits: torch.Tensor,
390
+ labels: torch.Tensor,
391
+ precomputed_lse: Optional[torch.Tensor] = None,
392
+ label_smoothing: float = 0.0,
393
+ logit_scale: float = 1.0,
394
+ lse_square_scale: float = 0.0,
395
+ ignore_index=-100,
396
+ inplace_backward: bool = False,
397
+ process_group=None,
398
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
399
+ """
400
+ Arguments:
401
+ logits: (batch, vocab_size)
402
+ labels: (batch,)
403
+ label_smoothing: float
404
+ logit_scale: float. Multiply logits by this scale before calculating the loss.
405
+ lse_square_scale: float. If > 0, we add lse_square_scale * lse(logits) ^ 2 to the loss.
406
+ This is also referred to as "z-loss".
407
+ ignore_index: int. If labels == ignore_index, the loss is set to 0.0.
408
+ inplace_backward: bool. If True, we do the backward pass in-place by modifying the logits.
409
+ This saves memory.
410
+ process_group: if not None, we're doing Tensor Parallel: each process is responsible for
411
+ one part of the vocab. The loss will be aggregated across processes.
412
+ Returns:
413
+ losses: (batch,), float
414
+ z_losses: (batch,), float
415
+ """
416
+ return CrossEntropyLoss.apply(
417
+ logits.view(-1, logits.shape[-1]),
418
+ labels.view(-1),
419
+ precomputed_lse,
420
+ label_smoothing,
421
+ logit_scale,
422
+ lse_square_scale,
423
+ ignore_index,
424
+ inplace_backward,
425
+ process_group,
426
+ )
custom_heads_flash_t5.py ADDED
@@ -0,0 +1,315 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
4
+ import copy
5
+ from typing import Optional, Union, Tuple, List
6
+ from transformers.modeling_outputs import (
7
+ Seq2SeqQuestionAnsweringModelOutput,
8
+ QuestionAnsweringModelOutput,
9
+ TokenClassifierOutput,
10
+ BaseModelOutput,
11
+ Seq2SeqSequenceClassifierOutput,
12
+ SequenceClassifierOutput
13
+ )
14
+
15
+ from .modeling_flash_t5 import FlashT5PreTrainedModel, FlashT5Stack, FlashT5Model
16
+ from .configuration_flash_t5 import FlashT5Config
17
+
18
+
19
+ ################## Encoder only head ##################
20
+ class FlashT5ForTokenClassification(FlashT5PreTrainedModel):
21
+
22
+ def __init__(self, config: FlashT5Config):
23
+ super().__init__(config)
24
+ self.num_labels = config.num_labels
25
+ self.shared = nn.Embedding(config.vocab_size, config.d_model)
26
+
27
+ self.encoder = FlashT5Stack(config, self.shared)
28
+ self.dropout = nn.Dropout(config.classifier_dropout)
29
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels)
30
+
31
+ # Initialize weights and apply final processing
32
+ self.post_init()
33
+
34
+ # Initialize classifier
35
+ self.classifier.weight.data.normal_(mean=0.0, std=config.initializer_factor * 1.0)
36
+ self.classifier.bias.data.zero_()
37
+
38
+ self.model_parallel = False
39
+
40
+ def forward(
41
+ self,
42
+ input_ids: Optional[torch.Tensor] = None,
43
+ attention_mask: Optional[torch.Tensor] = None,
44
+ head_mask: Optional[torch.Tensor] = None,
45
+ inputs_embeds: Optional[torch.Tensor] = None,
46
+ labels: Optional[torch.Tensor] = None,
47
+ output_attentions: Optional[bool] = None,
48
+ output_hidden_states: Optional[bool] = None,
49
+ return_dict: Optional[bool] = None,
50
+ ) -> Union[Tuple[torch.Tensor], TokenClassifierOutput]:
51
+ r"""
52
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
53
+ Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
54
+ Returns:
55
+ """
56
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
57
+
58
+ outputs = self.encoder(
59
+ input_ids=input_ids,
60
+ attention_mask=attention_mask,
61
+ inputs_embeds=inputs_embeds,
62
+ head_mask=head_mask,
63
+ output_attentions=output_attentions,
64
+ output_hidden_states=output_hidden_states,
65
+ return_dict=return_dict,
66
+ )
67
+
68
+ hidden_states = outputs[0]
69
+ hidden_states = self.dropout(hidden_states)
70
+ logits = self.classifier(hidden_states)
71
+
72
+ loss = None
73
+ if labels is not None:
74
+ loss_fct = nn.CrossEntropyLoss()
75
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
76
+
77
+ if not return_dict:
78
+ output = (logits, outputs[2:-1])
79
+ return ((loss,) + output) if loss is not None else output
80
+
81
+ return TokenClassifierOutput(
82
+ loss=loss,
83
+ logits=logits,
84
+ hidden_states=outputs.hidden_states,
85
+ attentions=outputs.attentions,
86
+ )
87
+
88
+
89
+ class FlashT5ClassificationHead(nn.Module):
90
+ """Head for sentence-level classification tasks."""
91
+
92
+ def __init__(self, config: FlashT5Config):
93
+ super().__init__()
94
+ self.dense = nn.Linear(config.d_model, config.d_model)
95
+ self.dropout = nn.Dropout(p=config.classifier_dropout)
96
+ self.out_proj = nn.Linear(config.d_model, config.num_labels)
97
+
98
+ # initialize weights
99
+ factor = config.initializer_factor
100
+ self.dense.weight.data.normal_(mean=0.0, std=factor * ((config.d_model) ** -0.5))
101
+ if hasattr(self.dense, "bias") and self.dense.bias is not None:
102
+ self.dense.bias.data.zero_()
103
+ self.out_proj.weight.data.normal_(mean=0.0, std=factor * ((config.d_model) ** -0.5))
104
+ if hasattr(self.out_proj, "bias") and self.out_proj.bias is not None:
105
+ self.out_proj.bias.data.zero_()
106
+
107
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
108
+ hidden_states = self.dropout(hidden_states)
109
+ hidden_states = self.dense(hidden_states)
110
+ hidden_states = torch.tanh(hidden_states)
111
+ hidden_states = self.dropout(hidden_states)
112
+ hidden_states = self.out_proj(hidden_states)
113
+ return hidden_states
114
+
115
+
116
+ class FlashT5ForSequenceClassification(FlashT5PreTrainedModel):
117
+ _keys_to_ignore_on_load_missing = [r"encoder.embed_tokens.weight"]
118
+
119
+ def __init__(self, config: FlashT5Config):
120
+ super().__init__(config)
121
+ self.model_dim = config.d_model
122
+ self.config.problem_type = None
123
+ self.config.is_encoder_decoder = False
124
+
125
+ self.shared = nn.Embedding(config.vocab_size, config.d_model)
126
+
127
+ encoder_config = copy.deepcopy(config)
128
+ encoder_config.is_decoder = False
129
+ encoder_config.is_encoder_decoder = False
130
+ encoder_config.use_cache = False
131
+ self.encoder = FlashT5Stack(encoder_config, self.shared)
132
+ self.classification_head = FlashT5ClassificationHead(config)
133
+
134
+ # Initialize weights and apply final processing
135
+ self.post_init()
136
+
137
+ self.model_parallel = False
138
+
139
+ def forward(
140
+ self,
141
+ input_ids: torch.LongTensor = None,
142
+ attention_mask: Optional[torch.Tensor] = None,
143
+ head_mask: Optional[torch.Tensor] = None,
144
+ cross_attn_head_mask: Optional[torch.Tensor] = None,
145
+ encoder_outputs: Optional[List[torch.FloatTensor]] = None,
146
+ inputs_embeds: Optional[torch.FloatTensor] = None,
147
+ labels: Optional[torch.LongTensor] = None,
148
+ use_cache: Optional[bool] = None,
149
+ output_attentions: Optional[bool] = None,
150
+ output_hidden_states: Optional[bool] = None,
151
+ return_dict: Optional[bool] = None,
152
+ ) -> Union[Tuple, Seq2SeqSequenceClassifierOutput]:
153
+ r"""
154
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
155
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
156
+ config.num_labels - 1]`. If `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
157
+ Returns:
158
+ """
159
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
160
+ if labels is not None:
161
+ use_cache = False
162
+
163
+ if input_ids is None and inputs_embeds is not None:
164
+ raise NotImplementedError(
165
+ f"Passing input embeddings is currently not supported for {self.__class__.__name__}"
166
+ )
167
+
168
+
169
+ outputs = self.encoder(
170
+ input_ids=input_ids,
171
+ attention_mask=attention_mask,
172
+ inputs_embeds=inputs_embeds,
173
+ head_mask=head_mask,
174
+ output_attentions=output_attentions,
175
+ output_hidden_states=output_hidden_states,
176
+ return_dict=return_dict,
177
+ )
178
+ sequence_output = outputs[0]
179
+
180
+ eos_mask = input_ids.eq(self.config.eos_token_id).to(sequence_output.device)
181
+
182
+ if len(torch.unique_consecutive(eos_mask.sum(1))) > 1:
183
+ raise ValueError("All examples must have the same number of <eos> tokens.")
184
+ batch_size, _, hidden_size = sequence_output.shape
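+ # Pool the sequence into a single vector: the last position of the encoder output is used here,
+ # while the commented-out line below is the alternative that gathers the final <eos> position.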
185
+ sentence_representation = sequence_output[:, -1, :]
186
+ # sentence_representation = sequence_output[eos_mask, :].view(batch_size, -1, hidden_size)[:, -1, :]
187
+ logits = self.classification_head(sentence_representation)
188
+
189
+ loss = None
190
+ if labels is not None:
191
+ labels = labels.to(logits.device)
192
+ if self.config.problem_type is None:
193
+ if self.config.num_labels == 1:
194
+ self.config.problem_type = "regression"
195
+ elif self.config.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
196
+ self.config.problem_type = "single_label_classification"
197
+ else:
198
+ self.config.problem_type = "multi_label_classification"
199
+
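+ # e.g. num_labels == 1 -> regression (MSE loss); integer class labels -> single-label
+ # classification (cross-entropy); float multi-hot labels -> multi-label (BCE with logits).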
200
+ if self.config.problem_type == "regression":
201
+ loss_fct = nn.MSELoss()
202
+ if self.config.num_labels == 1:
203
+ loss = loss_fct(logits.squeeze(), labels.squeeze())
204
+ else:
205
+ loss = loss_fct(logits, labels)
206
+ elif self.config.problem_type == "single_label_classification":
207
+ loss_fct = nn.CrossEntropyLoss()
208
+ loss = loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1))
209
+ elif self.config.problem_type == "multi_label_classification":
210
+ loss_fct = nn.BCEWithLogitsLoss()
211
+ loss = loss_fct(logits, labels)
212
+ if not return_dict:
213
+ output = (logits,) + outputs[1:]
214
+ return ((loss,) + output) if loss is not None else output
215
+
216
+ return SequenceClassifierOutput(
217
+ loss=loss,
218
+ logits=logits,
219
+ hidden_states=outputs.hidden_states,
220
+ attentions=outputs.attentions
221
+ )
222
+
223
+
224
+ class FlashT5ForQuestionAnswering(FlashT5PreTrainedModel):
225
+ _keys_to_ignore_on_load_missing = [r"encoder.embed_tokens.weight"]
226
+
227
+ def __init__(self, config: FlashT5Config):
228
+ super().__init__(config)
229
+ self.shared = nn.Embedding(config.vocab_size, config.d_model)
230
+
231
+ encoder_config = copy.deepcopy(config)
232
+ encoder_config.is_decoder = False
233
+ encoder_config.is_encoder_decoder = False
234
+ self.encoder = FlashT5Stack(encoder_config, self.shared)
235
+ self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
236
+
237
+ # Initialize weights and apply final processing
238
+ self.post_init()
239
+
240
+ self.qa_outputs.weight.data.normal_(mean=0.0, std=config.initializer_factor * 1.0)
241
+ self.qa_outputs.bias.data.zero_()
242
+
243
+ self.model_parallel = False
244
+
245
+ def forward(
246
+ self,
247
+ input_ids: Optional[torch.LongTensor] = None,
248
+ attention_mask: Optional[torch.FloatTensor] = None,
249
+ head_mask: Optional[torch.FloatTensor] = None,
250
+ inputs_embeds: Optional[torch.FloatTensor] = None,
251
+ start_positions: Optional[torch.LongTensor] = None,
252
+ end_positions: Optional[torch.LongTensor] = None,
253
+ output_attentions: Optional[bool] = None,
254
+ output_hidden_states: Optional[bool] = None,
255
+ return_dict: Optional[bool] = None,
256
+ ) -> Union[Tuple, QuestionAnsweringModelOutput]:
257
+ r"""
258
+ Returns:
259
+
260
+ Example:
261
+
262
+ ```python
263
+ >>> from transformers import AutoTokenizer
+
+ >>> tokenizer = AutoTokenizer.from_pretrained("CATIE-AQ/FAT5-small")
+ >>> model = FlashT5ForQuestionAnswering.from_pretrained("CATIE-AQ/FAT5-small")
267
+ >>> input_ids = tokenizer(
268
+ ... "Studies have been shown that owning a dog is good for you", return_tensors="pt"
269
+ ... ).input_ids # Batch size 1
270
+ >>> outputs = model(input_ids=input_ids)
271
+ >>> start_logits = outputs.start_logits
272
+ >>> end_logits = outputs.end_logits
273
+ ```"""
274
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
275
+
276
+ outputs = self.encoder(
277
+ input_ids,
278
+ attention_mask=attention_mask,
279
+ inputs_embeds=inputs_embeds,
280
+ )
281
+ sequence_output = outputs[0]
282
+
283
+ logits = self.qa_outputs(sequence_output)
284
+ start_logits, end_logits = logits.split(1, dim=-1)
285
+ start_logits = start_logits.squeeze(-1).contiguous()
286
+ end_logits = end_logits.squeeze(-1).contiguous()
287
+
288
+ total_loss = None
289
+ if start_positions is not None and end_positions is not None:
290
+ # If we are on multi-GPU, the split may add an extra dimension: squeeze it
291
+ if len(start_positions.size()) > 1:
292
+ start_positions = start_positions.squeeze(-1).to(start_logits.device)
293
+ if len(end_positions.size()) > 1:
294
+ end_positions = end_positions.squeeze(-1).to(end_logits.device)
295
+ # sometimes the start/end positions are outside our model inputs, we ignore these terms
296
+ ignored_index = start_logits.size(1)
297
+ start_positions = start_positions.clamp(0, ignored_index)
298
+ end_positions = end_positions.clamp(0, ignored_index)
299
+
300
+ loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
301
+ start_loss = loss_fct(start_logits, start_positions)
302
+ end_loss = loss_fct(end_logits, end_positions)
303
+ total_loss = (start_loss + end_loss) / 2
304
+
305
+ if not return_dict:
306
+ output = (start_logits, end_logits) + outputs[1:]
307
+ return ((total_loss,) + output) if total_loss is not None else output
308
+
309
+ return QuestionAnsweringModelOutput(
310
+ loss=total_loss,
311
+ start_logits=start_logits,
312
+ end_logits=end_logits,
313
+ hidden_states=outputs.hidden_states,
314
+ attentions=outputs.attentions,
315
+ )
flash_attention_v2_bias.py ADDED
@@ -0,0 +1,905 @@
1
+ # Copyright 2023 BAAI
2
+ # Copyright 2024 CATIE
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ #
16
+ # Modifications to the original file:
17
+ # - Support for biases following https://github.com/FlagOpen/FlagAttention/pull/5
18
+ # - Support for shape (1,1,q,k) biases
19
+
20
+ import math
21
+ import torch
22
+ import triton
23
+ import triton.language as tl
24
+
25
+ from typing import Tuple
26
+
27
+ @torch.library.custom_op("flasht5::flash_attn_v2_fwd", mutates_args=(), device_types="cuda")
28
+ def flash_attn_v2_fwd(
29
+ q: torch.Tensor,
30
+ k: torch.Tensor,
31
+ v: torch.Tensor,
32
+ bias: torch.Tensor,
33
+ causal: bool,
34
+ sm_scale: float,
35
+ BLOCK_M: int,
36
+ BLOCK_N: int,
37
+ num_warps: int, num_stages: int
38
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
39
+
40
+ B, H, M, D = q.shape
41
+ N = k.shape[2]
42
+ P_SEQ = N - M
43
+ larger_m = M > N
44
+
45
+ # Trick to support shape such as (1, 1, seqlen_q, seqlen_k)
46
+ bias_batch_stride = bias.stride(0) if bias is not None else 0
47
+ bias_heads_stride = bias.stride(1) if bias is not None else 0
48
+ if bias is not None:
49
+ if (bias.shape[0] != q.shape[0]) and (bias.shape[0] == 1):
50
+ bias_batch_stride = 0
51
+ if (bias.shape[1] != q.shape[1]) and (bias.shape[1] == 1):
52
+ bias_heads_stride = 0
53
+
54
+ divisible_m = M % BLOCK_M == 0
55
+ divisible_n = N % BLOCK_N == 0
56
+ # consider using 3d grid to avoid div & rem
57
+ grid = (triton.cdiv(M, BLOCK_M), H, B)
58
+ o = torch.empty_like(q)
59
+ L = torch.empty((B, H, M), device=q.device, dtype=torch.float32)
60
+
61
+ with torch.cuda.device(q.device.index):
62
+ _fwd_kernel[grid](
63
+ q, k, v, bias, sm_scale,
64
+ L, o,
65
+ q.stride(0), q.stride(1), q.stride(2), q.stride(3),
66
+ k.stride(0), k.stride(1), k.stride(2), k.stride(3),
67
+ v.stride(0), v.stride(1), v.stride(2), v.stride(3),
68
+ o.stride(0), o.stride(1), o.stride(2), o.stride(3),
69
+ bias_batch_stride, bias_heads_stride,
70
+ bias.stride(2) if bias is not None else 0,
71
+ bias.stride(3) if bias is not None else 0,
72
+ B, H, M, N, P_SEQ,
73
+ BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_DMODEL=D,
74
+ IS_CAUSAL=causal, LARGER_M=larger_m,
75
+ DIVISIBLE_M=divisible_m, DIVISIBLE_N=divisible_n,
76
+ HAS_BIAS=(bias is not None),
77
+ num_warps=num_warps, num_stages=num_stages,
78
+ )
79
+
80
+ return o, L
81
+
82
+
83
+ @torch.library.register_fake("flasht5::flash_attn_v2_fwd")
84
+ def flash_attn_v2_fwd_abstract(q, k, v, bias, causal, sm_scale, BLOCK_M, BLOCK_N, num_warps, num_stages):
85
+ B, H, M, D = q.shape
86
+ o = torch.empty_like(q)
87
+ L = torch.empty((B, H, M), dtype=torch.float32, device=q.device)
88
+
89
+ return o, L
90
+
91
+ @torch.library.custom_op("flasht5::flash_attn_v2_bwd", mutates_args=(), device_types="cuda")
92
+ def flash_attn_v2_bwd(
93
+ o: torch.Tensor,
94
+ do: torch.Tensor,
95
+ q: torch.Tensor,
96
+ k: torch.Tensor,
97
+ v: torch.Tensor,
98
+ bias: torch.Tensor,
99
+ L: torch.Tensor,
100
+ causal: bool,
101
+ sm_scale: float,
102
+ BLOCK_M: int,
103
+ BLOCK_N: int,
104
+ num_warps: int,
105
+ num_stages: int
106
+ ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
107
+
108
+ B, H, M, D = q.shape
109
+ N = k.shape[2]
110
+ P_SEQ = N - M
111
+ larger_m = M > N
112
+
113
+ divisible_m = M % BLOCK_M == 0
114
+ divisible_n = N % BLOCK_N == 0
115
+
116
+ # Trick to support shape such as (1, 1, seqlen_q, seqlen_k)
117
+ bias_batch_stride = bias.stride(0) if bias is not None else 0
118
+ bias_heads_stride = bias.stride(1) if bias is not None else 0
119
+ if bias is not None:
120
+ if (bias.shape[0] != q.shape[0]) and (bias.shape[0] == 1):
121
+ bias_batch_stride = 0
122
+ if (bias.shape[1] != q.shape[1]) and (bias.shape[1] == 1):
123
+ bias_heads_stride = 0
124
+
125
+ delta = torch.empty_like(L)
126
+ grid = (triton.cdiv(M, BLOCK_M), H, B)
127
+
128
+ with torch.cuda.device(q.device.index):
129
+ _bwd_preprocess[grid](
130
+ o, do,
131
+ delta,
132
+ o.stride(0), o.stride(1), o.stride(2), o.stride(3),
133
+ do.stride(0), do.stride(1), do.stride(2), do.stride(3),
134
+ delta.stride(0), delta.stride(1), delta.stride(2),
135
+ M,
136
+ BLOCK_M=BLOCK_M, D_HEAD=D,
137
+ DIVISIBLE_M=divisible_m,
138
+ )
139
+
140
+ dk = torch.empty_like(k)
141
+ dv = torch.empty_like(v)
142
+
143
+ HAS_BIAS = bias is not None
144
+ RETURN_DS = HAS_BIAS
145
+ IS_BATCH_REDUCED = (bias_batch_stride == 0)
146
+ #GROUP_SIZE_BIAS = min(B, 16)
147
+ GROUP_SIZE_BIAS = B
148
+
149
+ ds = None
150
+ locks = None
151
+ if RETURN_DS:
152
+ if IS_BATCH_REDUCED:
153
+ if causal:
154
+ ds = torch.zeros((GROUP_SIZE_BIAS, *bias.shape[1:]), dtype=bias.dtype, device=bias.device)
155
+ else:
156
+ ds = torch.empty((GROUP_SIZE_BIAS, *bias.shape[1:]), dtype=bias.dtype, device=bias.device)
157
+ locks = torch.zeros(2 * GROUP_SIZE_BIAS, dtype=torch.int32, device=q.device)
158
+ else:
159
+ if causal:
160
+ ds = torch.zeros_like(bias)
161
+ else:
162
+ ds = torch.empty_like(bias)
163
+
164
+ grid = (triton.cdiv(N, BLOCK_N), H, B)
165
+ with torch.cuda.device(q.device.index):
166
+ _bwd_kv_kernel[grid](
167
+ q, k, v, bias, sm_scale, do,
168
+ dk, dv, ds,
169
+ L, delta,
170
+ q.stride(0), q.stride(1), q.stride(2), q.stride(3),
171
+ k.stride(0), k.stride(1), k.stride(2), k.stride(3),
172
+ v.stride(0), v.stride(1), v.stride(2), v.stride(3),
173
+ bias.stride(0) if HAS_BIAS else 0,
174
+ bias_heads_stride,
175
+ bias.stride(2) if HAS_BIAS else 0,
176
+ bias.stride(3) if HAS_BIAS else 0,
177
+ do.stride(0), do.stride(1), do.stride(2), do.stride(3),
178
+ dk.stride(0), dk.stride(1), dk.stride(2), dk.stride(3),
179
+ dv.stride(0), dv.stride(1), dv.stride(2), dv.stride(3),
180
+ B, H, M, N, P_SEQ,
181
+ locks,
182
+ BLOCK_M=BLOCK_M, BLOCK_DMODEL=D, BLOCK_N=BLOCK_N, CAUSAL=causal,
183
+ DIVISIBLE_M=divisible_m, DIVISIBLE_N=divisible_n,
184
+ HAS_BIAS=HAS_BIAS,
185
+ RETURN_DS=RETURN_DS,
186
+ IS_BATCH_REDUCED=IS_BATCH_REDUCED,
187
+ GROUP_SIZE_BIAS=GROUP_SIZE_BIAS,
188
+ num_stages=num_stages, num_warps=num_warps,
189
+ )
190
+
191
+ dq = torch.empty_like(q)
192
+ grid = (triton.cdiv(M, BLOCK_M), H, B)
193
+ with torch.cuda.device(q.device.index):
194
+ _bwd_q_kernel[grid](
195
+ q, k, v, bias, sm_scale, do,
196
+ dq,
197
+ L, delta,
198
+ q.stride(0), q.stride(1), q.stride(2), q.stride(3),
199
+ k.stride(0), k.stride(1), k.stride(2), k.stride(3),
200
+ v.stride(0), v.stride(1), v.stride(2), v.stride(3),
201
+ bias_batch_stride, bias_heads_stride,
202
+ bias.stride(2) if HAS_BIAS else 0,
203
+ bias.stride(3) if HAS_BIAS else 0,
204
+ do.stride(0), do.stride(1), do.stride(2), do.stride(3),
205
+ dq.stride(0), dq.stride(1), dq.stride(2), dq.stride(3),
206
+ B, H, M, N, P_SEQ,
207
+ BLOCK_M=BLOCK_M, BLOCK_DMODEL=D, BLOCK_N=BLOCK_N,
208
+ CAUSAL=causal, LARGER_M=larger_m,
209
+ DIVISIBLE_M=divisible_m, DIVISIBLE_N=divisible_n,
210
+ HAS_BIAS=HAS_BIAS,
211
+ num_stages=num_stages, num_warps = num_warps,
212
+ )
213
+
214
+ if RETURN_DS and IS_BATCH_REDUCED and GROUP_SIZE_BIAS > 1:
215
+ ds = ds.sum(0, keepdim=True)
216
+
217
+ return dq, dk, dv, ds
218
+
219
+ @torch.library.register_fake("flasht5::flash_attn_v2_bwd")
220
+ def flash_attn_v2_bwd_abstract(o, do, q, k, v, bias, L, causal, sm_scale, BLOCK_M, BLOCK_N, num_warps, num_stages):
221
+ dq = torch.empty_like(q)
222
+ dk = torch.empty_like(k)
223
+ dv = torch.empty_like(v)
224
+ ds = torch.empty_like(bias) if bias is not None else None
225
+
226
+ return dq, dk, dv, ds
227
+
228
+ class FlashAttentionAdditiveBias(torch.autograd.Function):
229
+ @staticmethod
230
+ def forward(ctx, q, k, v, bias, causal, sm_scale):
231
+ Dq, Dk, Dv = q.shape[-1], k.shape[-1], v.shape[-1]
232
+
233
+ assert Dq == Dk == Dv
234
+ assert Dk in {16, 32, 64, 128}
235
+
236
+ B, H, M, D = q.shape
237
+ N = k.shape[2]
238
+
239
+ if sm_scale is None:
240
+ sm_scale = 1. / math.sqrt(D)
241
+
242
+ config = get_fwd_config(B, H, M, N, D, causal)
243
+ BLOCK_M, BLOCK_N, num_stages, num_warps = config
244
+
245
+ o, L = torch.ops.flasht5.flash_attn_v2_fwd(q, k, v, bias, causal, sm_scale, BLOCK_M, BLOCK_N, num_warps, num_stages)
246
+
247
+ # autograd context maintenance
248
+ ctx.save_for_backward(q, k, v, bias, o, L)
249
+ ctx.sm_scale = sm_scale
250
+ ctx.causal = causal
251
+
252
+ return o
253
+
254
+ @staticmethod
255
+ def backward(ctx, do, *ignored):
256
+ q, k, v, bias, o, L = ctx.saved_tensors
257
+ sm_scale = ctx.sm_scale
258
+ causal = ctx.causal
259
+
260
+ B, H, M, D = q.shape
261
+ N = k.shape[2]
262
+
263
+ if sm_scale is None:
264
+ sm_scale = 1. / math.sqrt(D)
265
+
266
+ config = get_bwd_config(B, H, M, N, D, causal)
267
+ BLOCK_M, BLOCK_N, num_stages, num_warps = config
268
+
269
+ dq, dk, dv, ds = torch.ops.flasht5.flash_attn_v2_bwd(o, do, q, k, v, bias, L, causal, sm_scale, BLOCK_M, BLOCK_N, num_warps, num_stages)
270
+
271
+ return dq, dk, dv, ds, None, None, None, None
272
+
273
+
274
+ def flash_attention_v2_bias(q, k, v, bias, causal=False, sm_scale=None):
275
+ """
276
+ An implementation of FlashAttention v2(https://arxiv.org/abs/2307.08691).
277
+
278
+ Arguments:
279
+ q(torch.Tensor): The first queries. The shape is (batch_size, nheads, seqlen_q, headdim).
280
+ k(torch.Tensor): The first keys. The shape is (batch_size, nheads, seqlen_k, headdim).
281
+ v(torch.Tensor): The values. The shape is (batch_size, nheads, seqlen_k, headdim).
282
+ causal(bool): Whether causal masking is applied to attention scores before applying softmax.
283
+ sm_scale(float): The scaling of attention scores before applying softmax.
284
+
285
+ Returns:
286
+ out(torch.Tensor): The output. The shape is (batch_size, nheads, seqlen_q, headdim).
287
+ """
288
+ return FlashAttentionAdditiveBias.apply(q, k, v, bias, causal, sm_scale)
289
+
290
+
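+ # Usage sketch; the shapes below are illustrative assumptions only. The bias is additive and
+ # may have batch and/or head dimensions of size 1, e.g. a T5 relative-position bias broadcast
+ # over the batch:
+ #
+ # q = torch.randn(2, 8, 512, 64, device="cuda", dtype=torch.float16)
+ # k = torch.randn(2, 8, 512, 64, device="cuda", dtype=torch.float16)
+ # v = torch.randn(2, 8, 512, 64, device="cuda", dtype=torch.float16)
+ # bias = torch.randn(1, 8, 512, 512, device="cuda", dtype=torch.float16)
+ # out = flash_attention_v2_bias(q, k, v, bias, causal=False, sm_scale=None)  # (2, 8, 512, 64)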
291
+ # --------------------------- Forward ---------------------------
292
+ # NOTE: this function can be overwritten at runtime to use your custom config
293
+ def get_fwd_config(B, H, M, N, D, causal):
294
+ if torch.cuda.get_device_capability() == (8, 0):
295
+ if not causal:
296
+ if D <= 64:
297
+ BLOCK_M, BLOCK_N, num_stages, num_warps = 128, 64, 3, 4
298
+ else:
299
+ if M <= 1024:
300
+ BLOCK_M, BLOCK_N, num_stages, num_warps = 128, 32, 3, 4
301
+ else:
302
+ BLOCK_M, BLOCK_N, num_stages, num_warps = 128, 128, 3, 8
303
+ else:
304
+ if D <= 64:
305
+ BLOCK_M, BLOCK_N, num_stages, num_warps = 128, 64, 3, 4
306
+ else:
307
+ if M <= 1024:
308
+ BLOCK_M, BLOCK_N, num_stages, num_warps = 128, 32, 2, 4
309
+ else:
310
+ BLOCK_M, BLOCK_N, num_stages, num_warps = 128, 128, 3, 8
311
+ elif torch.cuda.get_device_capability() == (8, 6):
312
+ if not causal:
313
+ if D <= 64:
314
+ BLOCK_M, BLOCK_N, num_stages, num_warps = 128, 64, 3, 4
315
+ else:
316
+ BLOCK_M, BLOCK_N, num_stages, num_warps = 128, 32, 2, 4
317
+ else: # causal
318
+ if D <= 64:
319
+ BLOCK_M, BLOCK_N, num_stages, num_warps = 64, 64, 3, 4
320
+ else:
321
+ BLOCK_M, BLOCK_N, num_stages, num_warps = 128, 32, 2, 4
322
+ else:
323
+ BLOCK_M, BLOCK_N, num_stages, num_warps = 32, 32, 1, 4
324
+ return (BLOCK_M, BLOCK_N, num_stages, num_warps)
325
+
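+ # As the note above says, this config function can be overridden at runtime. A minimal sketch,
+ # assuming the module is imported under the name used in this repo:
+ #
+ # from . import flash_attention_v2_bias as fa
+ # fa.get_fwd_config = lambda B, H, M, N, D, causal: (64, 64, 2, 4)  # BLOCK_M, BLOCK_N, stages, warps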
326
+
327
+ @triton.jit
328
+ def _fwd_kernel(
329
+ Q, K, V, B, sm_scale,
330
+ L, O,
331
+ stride_qz, stride_qh, stride_qm, stride_qk,
332
+ stride_kz, stride_kh, stride_kn, stride_kk,
333
+ stride_vz, stride_vh, stride_vn, stride_vk,
334
+ stride_oz, stride_oh, stride_om, stride_ok,
335
+ stride_bz, stride_bh, stride_bm, stride_bn,
336
+ Z, H, M, N, P_SEQ,
337
+ BLOCK_M: tl.constexpr, BLOCK_DMODEL: tl.constexpr, BLOCK_N: tl.constexpr,
338
+ IS_CAUSAL: tl.constexpr, LARGER_M: tl.constexpr,
339
+ DIVISIBLE_M: tl.constexpr, DIVISIBLE_N: tl.constexpr,
340
+ HAS_BIAS: tl.constexpr,
341
+ ):
342
+ input_dtype = Q.dtype.element_ty
343
+ # -- grid id --
344
+ start_m = tl.program_id(0)
345
+ off_h = tl.program_id(1)
346
+ off_z = tl.program_id(2)
347
+
348
+ # scale sm_scale by log_2(e) and use
349
+ # 2^x instead of exp in the loop because CSE and LICM
350
+ # don't work as expected with `exp` in the loop
351
+ log2e: tl.constexpr = 1.4426950408889634
352
+
353
+ # offset pointers for (batch, head)
354
+ Q += off_z * stride_qz + off_h * stride_qh
355
+ K += off_z * stride_kz + off_h * stride_kh
356
+ V += off_z * stride_vz + off_h * stride_vh
357
+ O += off_z * stride_oz + off_h * stride_oh
358
+ if HAS_BIAS:
359
+ B += off_z * stride_bz + off_h * stride_bh
360
+ L += (off_z * H + off_h) * M # l's shape is (B, H, M)
361
+
362
+ offs_m_base = tl.arange(0, BLOCK_M)
363
+ offs_m = start_m * BLOCK_M + offs_m_base
364
+ offs_n_base = tl.arange(0, BLOCK_N)
365
+ offs_k = tl.arange(0, BLOCK_DMODEL)
366
+
367
+ # initialize pointers to value-like data
368
+ q_ptrs = Q + (offs_m[:, None] * stride_qm + offs_k[None, :] * stride_qk) # (BLOCK_M, BLOCK_DMODEL)
369
+ o_ptrs = O + (offs_m[:, None] * stride_om + offs_k[None, :] * stride_ok) # (BLOCK_M, BLOCK_DMODEL)
370
+ l_ptrs = L + offs_m
371
+
372
+ # initialize pointer to m and l, fp32 for accumulators
373
+ m_i = tl.full([BLOCK_M], value=-float("inf"), dtype=tl.float32)
374
+ l_i = tl.zeros([BLOCK_M], dtype=tl.float32)
375
+ acc = tl.zeros([BLOCK_M, BLOCK_DMODEL], dtype=tl.float32)
376
+
377
+ # load q
378
+ mask_m = offs_m < M
379
+ if DIVISIBLE_M:
380
+ q = tl.load(q_ptrs, cache_modifier=".cg")
381
+ else:
382
+ q = tl.load(q_ptrs, mask=mask_m[:, None], cache_modifier=".cg")
383
+
384
+ # Dot-I trick: multiply q by the identity matrix to keep q in registers, which saves shared memory
385
+ if BLOCK_DMODEL < 128:
386
+ I = tl.where(offs_k[:, None] == offs_k,
387
+ tl.full((BLOCK_DMODEL, BLOCK_DMODEL), 1.0, dtype=input_dtype),
388
+ tl.full((BLOCK_DMODEL, BLOCK_DMODEL), 0.0, dtype=input_dtype))
389
+ q = tl.dot(q, I).to(input_dtype)
390
+ # else:
391
+ # I = tl.where(offs_m_base[:, None] == offs_m_base,
392
+ # tl.full((BLOCK_M, BLOCK_M), 1.0, dtype=input_dtype),
393
+ # tl.full((BLOCK_M, BLOCK_M), 0.0, dtype=input_dtype))
394
+ # q = tl.dot(I, q).to(input_dtype)
395
+
396
+ # NOTE: Loop-Bound-For-N
397
+ # The indices in m-dimension that this block may access is in `[start_m * BLOCK_M, (start_m + 1) * BLOCK_M)`.
398
+ # According to the rule of causal masking, then max index in n-dimension that this block may access
399
+ # is `P_SEQ + (start_m + 1) * BLOCK_M`.
400
+ # However, the upper bound of index in n-dimension should never exceed the sequence length of k/v(`P_SEQ + N_CTX`).
401
+ # `P_SEQ + (start_m + 1) * BLOCK_M` may be larger than `N`.
402
+ # At this case, there would be illegal memory access when loading k & v tiles
403
+ # if mask_n is not applied for loading(only when `DIVISIBLE_N`` is true).
404
+ # See also https://github.com/FlagOpen/FlagAttention/pull/8
405
+ if IS_CAUSAL:
406
+ hi = tl.minimum(N, P_SEQ + (start_m + 1) * BLOCK_M)
407
+ if LARGER_M:
408
+ hi = tl.maximum(0, hi)
409
+ else:
410
+ hi = N
411
+
412
+ # loop over k, v and update accumulators
413
+ offs_n_init = offs_n_base
414
+ k_ptrs = K + (offs_k[:, None] * stride_vk + offs_n_init[None, :] * stride_vn) # (BLOCK_DMODEL, BLOCK_N)
415
+ v_ptrs = V + (offs_n_init[:, None] * stride_kn + offs_k[None, :] * stride_kk) # (BLOCK_N, BLOCK_DMODEL)
416
+ if HAS_BIAS:
417
+ bias_ptrs = B + (offs_m[:, None] * stride_bm + offs_n_init[None, :] * stride_bn) # (BLOCK_M, BLOCK_N)
418
+
419
+ for start_n in range(0, hi, BLOCK_N):
420
+ start_n = tl.multiple_of(start_n, BLOCK_N)
421
+ offs_n = start_n + offs_n_base
422
+
423
+ # -- load k, v --
424
+ mask_n = offs_n < N
425
+ if DIVISIBLE_N:
426
+ k = tl.load(k_ptrs, cache_modifier=".cg")
427
+ v = tl.load(v_ptrs, cache_modifier=".cg")
428
+ else:
429
+ k = tl.load(k_ptrs, mask=mask_n[None, :], cache_modifier=".cg")
430
+ v = tl.load(v_ptrs, mask=mask_n[:, None], cache_modifier=".cg")
431
+
432
+ # -- load bias --
433
+ if HAS_BIAS:
434
+ if DIVISIBLE_M and DIVISIBLE_N:
435
+ b = tl.load(bias_ptrs)
436
+ else:
437
+ b = tl.load(bias_ptrs, mask_m[:, None] & mask_n[None, :])
438
+
439
+ # -- compute qk ---
440
+ s = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
441
+ s += tl.dot(q, k) * sm_scale
442
+ if HAS_BIAS:
443
+ s += b
444
+
445
+ if not DIVISIBLE_N:
446
+ s = tl.where(mask_n[None, :], s, float("-inf"))
447
+ if IS_CAUSAL:
448
+ causal_mask = (P_SEQ + offs_m[:, None]) >= offs_n[None, :]
449
+ s = tl.where(causal_mask, s, float("-inf"))
450
+
451
+ # -- compute scaling constant ---
452
+ m_i_new = tl.maximum(m_i, tl.max(s, 1))
453
+ alpha = tl.math.exp2((m_i - m_i_new)*log2e)
454
+ p = tl.math.exp2((s - m_i_new[:, None])*log2e)
455
+
456
+ # -- scale and update acc: acc *= alpha[:, None]--
457
+ acc *= alpha[:, None]
458
+ acc += tl.dot(p.to(input_dtype), v)
459
+
460
+ # -- update m_i and l_i --
461
+ l_i = l_i * alpha + tl.sum(p, 1)
462
+ m_i = m_i_new
463
+ # update pointers
464
+ k_ptrs += BLOCK_N * stride_kn
465
+ v_ptrs += BLOCK_N * stride_vn
466
+ if HAS_BIAS:
467
+ bias_ptrs += BLOCK_N * stride_bn
468
+
469
+ # write back l & o
470
+ if IS_CAUSAL and LARGER_M:
471
+ is_empty_line = (offs_m + P_SEQ) < 0
472
+ acc = tl.where(is_empty_line[:, None], 0.0, acc * (1.0 / l_i[:, None]))
473
+ l = tl.where(is_empty_line, float("-inf"), m_i + tl.log(l_i))
474
+ else:
475
+ acc = acc * (1.0 / l_i[:, None])
476
+ l = m_i + tl.log(l_i) # log(normalizer)
477
+
478
+ if DIVISIBLE_M:
479
+ tl.store(l_ptrs, l, cache_modifier=".cg")
480
+ tl.store(o_ptrs, acc.to(input_dtype), cache_modifier=".cg")
481
+ else:
482
+ tl.store(l_ptrs, l, mask=mask_m, cache_modifier=".cg")
483
+ tl.store(o_ptrs, acc.to(input_dtype), mask=mask_m[:, None], cache_modifier=".cg")
484
+
485
+
486
+ # --------------------------- Backward ---------------------------
487
+ # NOTE: this function can be overwritten at runtime to use your custom config
488
+ def get_bwd_config(B, H, M, N, D, causal):
489
+ if torch.cuda.get_device_capability() == (8, 0):
490
+ if not causal:
491
+ BLOCK_M = 128 if D <= 64 else 64
492
+ BLOCK_N = 64
493
+ num_stages = 2
494
+ num_warps = 4
495
+ else:
496
+ BLOCK_M = 64
497
+ BLOCK_N = 64
498
+ num_stages = 3 if D <= 64 else 2
499
+ num_warps = 4
500
+ elif torch.cuda.get_device_capability() == (8, 6): # tune for RTX-3090, device_capability(8, 6)
501
+ if not causal:
502
+ if D <= 64:
503
+ BLOCK_M, BLOCK_N, num_stages, num_warps = 64, 64, 2, 4
504
+ else:
505
+ BLOCK_M, BLOCK_N, num_stages, num_warps = 64, 64, 2, 8
506
+ else:
507
+ if D <= 64:
508
+ BLOCK_M, BLOCK_N, num_stages, num_warps = 64, 64, 2, 4
509
+ else:
510
+ BLOCK_M, BLOCK_N, num_stages, num_warps = 32, 32, 2, 4
511
+ else:
512
+ BLOCK_M, BLOCK_N, num_stages, num_warps = 32, 32, 1, 4
513
+ return (BLOCK_M, BLOCK_N, num_stages, num_warps)
514
+
515
+
516
+ @triton.jit
517
+ def _bwd_preprocess(
518
+ Out, DO,
519
+ Delta,
520
+ stride_oz, stride_oh, stride_om, stride_ok,
521
+ stride_doz, stride_doh, stride_dom, stride_dok,
522
+ stride_dz, stride_dh, stride_dm,
523
+ M,
524
+ BLOCK_M: tl.constexpr, D_HEAD: tl.constexpr,
525
+ DIVISIBLE_M: tl.constexpr,
526
+ ):
527
+ off_h = tl.program_id(1)
528
+ off_z = tl.program_id(2)
529
+ Out += off_z * stride_oz + off_h * stride_oh
530
+ DO += off_z * stride_doz + off_h * stride_doh
531
+ Delta += off_z * stride_dz + off_h * stride_dh
532
+
533
+ # compute delta = rowsum(Out * dOut) per query position (reused by the dq/dk/dv backward kernels)
534
+ off_m = tl.program_id(0) * BLOCK_M + tl.arange(0, BLOCK_M)
535
+ off_n = tl.arange(0, D_HEAD)
536
+
537
+ # load
538
+ o_ptrs = Out + off_m[:, None] * stride_om + off_n[None, :] * stride_ok
539
+ do_ptrs = DO + off_m[:, None] * stride_dom + off_n[None, :] * stride_dok
540
+
541
+ if DIVISIBLE_M:
542
+ o = tl.load(o_ptrs).to(tl.float32)
543
+ do = tl.load(do_ptrs).to(tl.float32)
544
+ else:
545
+ mask_m = off_m < M
546
+ o = tl.load(o_ptrs, mask=mask_m[:, None]).to(tl.float32)
547
+ do = tl.load(do_ptrs, mask=mask_m[:, None]).to(tl.float32)
548
+
549
+ # compute
550
+ delta = tl.sum(o * do, axis=1)
551
+ # write-back
552
+ d_ptrs = Delta + off_m * stride_dm
553
+ if DIVISIBLE_M:
554
+ tl.store(d_ptrs, delta)
555
+ else:
556
+ tl.store(d_ptrs, delta, mask=mask_m)
557
+
558
+
559
+ @triton.jit
560
+ def _bwd_kv_kernel(
561
+ Q, K, V, B, sm_scale, DO,
562
+ DK, DV, DS,
563
+ L,
564
+ D,
565
+ stride_qz, stride_qh, stride_qm, stride_qk,
566
+ stride_kz, stride_kh, stride_kn, stride_kk,
567
+ stride_vz, stride_vh, stride_vn, stride_vk,
568
+ stride_bz, stride_bh, stride_bm, stride_bn,
569
+ stride_doz, stride_doh, stride_dom, stride_dok,
570
+ stride_dkz, stride_dkh, stride_dkn, stride_dkk,
571
+ stride_dvz, stride_dvh, stride_dvn, stride_dvk,
572
+ Z, H, M, N, P_SEQ,
573
+ lock,
574
+ BLOCK_M: tl.constexpr, BLOCK_DMODEL: tl.constexpr, BLOCK_N: tl.constexpr,
575
+ CAUSAL: tl.constexpr,
576
+ DIVISIBLE_M: tl.constexpr, DIVISIBLE_N: tl.constexpr,
577
+ HAS_BIAS: tl.constexpr,
578
+ RETURN_DS: tl.constexpr,
579
+ IS_BATCH_REDUCED: tl.constexpr,
580
+ GROUP_SIZE_BIAS: tl.constexpr,
581
+ ):
582
+ input_dtype = Q.dtype.element_ty
583
+ # -- grid id --
584
+ start_n = tl.program_id(0)
585
+ off_h = tl.program_id(1)
586
+ off_z = tl.program_id(2)
587
+ log2e: tl.constexpr = 1.4426950408889634
588
+
589
+ # offset pointers for (batch, head)
590
+ Q += off_z * stride_qz + off_h * stride_qh
591
+ K += off_z * stride_kz + off_h * stride_kh
592
+ V += off_z * stride_vz + off_h * stride_vh
593
+ if HAS_BIAS:
594
+ if IS_BATCH_REDUCED:
595
+ B += off_h * stride_bh
596
+ else:
597
+ B += off_z * stride_bz + off_h * stride_bh
598
+ DO += off_z * stride_doz + off_h * stride_doh
599
+
600
+ # offset pointers for batch/head
601
+ DK += off_z * stride_dkz + off_h * stride_dkh
602
+ DV += off_z * stride_dvz + off_h * stride_dvh
603
+
604
+ # offset pointer for ds tensor and locks for the reduction
605
+ if RETURN_DS:
606
+ DS += off_z * stride_bz + off_h * stride_bh
607
+
608
+ # offset pointers for batch/head
609
+ D += (off_z * H + off_h) * M
610
+ L += (off_z * H + off_h) * M
611
+
612
+ if CAUSAL:
613
+ lo = tl.maximum(start_n * BLOCK_N - P_SEQ, 0)
614
+ lo = (lo // BLOCK_M) * BLOCK_M
615
+ else:
616
+ lo = 0
617
+
618
+ offs_m_init = lo + tl.arange(0, BLOCK_M)
619
+ offs_n = start_n * BLOCK_N + tl.arange(0, BLOCK_N)
620
+ offs_m_base = tl.arange(0, BLOCK_M)
621
+ offs_k = tl.arange(0, BLOCK_DMODEL)
622
+
623
+ # initialize pointers to value-like data
624
+ q_ptrs = Q + (offs_m_init[:, None] * stride_qm + offs_k[None, :] * stride_qk) # (BLOCK_M, BLOCK_DMODEL)
625
+ k_ptrs = K + (offs_n[:, None] * stride_kn + offs_k[None, :] * stride_kk) # (BLOCK_N, BLOCK_DMODEL)
626
+ v_ptrs = V + (offs_n[:, None] * stride_vn + offs_k[None, :] * stride_vk) # (BLOCK_N, BLOCK_DMODEL)
627
+ do_ptrs = DO + (offs_m_init[:, None] * stride_dom + offs_k[None, :] * stride_dok) # (BLOCK_M, BLOCK_DMODEL)
628
+
629
+ dv_ptrs = DV + (offs_n[:, None] * stride_dvn + offs_k[None, :] * stride_dvk) # (BLOCK_N, BLOCK_DMODEL)
630
+ dk_ptrs = DK + (offs_n[:, None] * stride_dkn + offs_k[None, :] * stride_dkk) # (BLOCK_N, BLOCK_DMODEL)
631
+
632
+ if HAS_BIAS:
633
+ bias_ptrs = B + (offs_m_init[:, None] * stride_bm + offs_n[None, :] * stride_bn)
634
+
635
+ if RETURN_DS:
636
+ ds_ptrs = DS + (offs_m_init[:, None] * stride_bm + offs_n[None, :] * stride_bn)
637
+
638
+ # k and v stay in SRAM throughout
639
+ mask_n = offs_n < N
640
+ if DIVISIBLE_N:
641
+ v = tl.load(v_ptrs)
642
+ k = tl.load(k_ptrs)
643
+ else:
644
+ v = tl.load(v_ptrs, mask=mask_n[:, None])
645
+ k = tl.load(k_ptrs, mask=mask_n[:, None])
646
+
647
+ # initialize dk and dv
648
+ dk = tl.zeros([BLOCK_N, BLOCK_DMODEL], dtype=tl.float32)
649
+ dv = tl.zeros([BLOCK_N, BLOCK_DMODEL], dtype=tl.float32)
650
+
651
+ # loop over a col
652
+ for start_m in range(lo, M, BLOCK_M):
653
+ start_m = tl.multiple_of(start_m, BLOCK_M)
654
+ offs_m = start_m + offs_m_base
655
+ causal_mask = (P_SEQ + offs_m[:, None]) >= (offs_n[None, :]) # (BLOCK_M, BLOCK_N)
656
+
657
+ # load q on-chip
658
+ mask_m = offs_m < M
659
+ if DIVISIBLE_M:
660
+ q = tl.load(q_ptrs)
661
+ else:
662
+ valid_mask = mask_m[:, None] # & mask_n
663
+ q = tl.load(q_ptrs, mask=mask_m[:, None])
664
+
665
+ # load bias
666
+ if HAS_BIAS:
667
+ if DIVISIBLE_M and DIVISIBLE_N:
668
+ b = tl.load(bias_ptrs)
669
+ else:
670
+ b = tl.load(bias_ptrs, mask=mask_m[:, None] & mask_n[None, :])
671
+
672
+ # recompute p = softmax(qk * sm_scale, dim=-1)
673
+ s = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
674
+ s += tl.dot(q, tl.trans(k)) * sm_scale
675
+
676
+ if HAS_BIAS:
677
+ s += b
678
+
679
+ # NOTE: since softmax in backward is pointwise, the normalizer was already saved in the
+ # forward pass, so masking on s is not needed here.
681
+ # s = tl.where(valid_mask, s , float("-inf"))
682
+ # if CAUSAL:
683
+ # s = tl.where(causal_mask, s, float("-inf"))
684
+
685
+ # -- recompute p ---
686
+ if DIVISIBLE_M:
687
+ l = tl.load(L + offs_m)
688
+ else:
689
+ l = tl.load(L + offs_m, mask=mask_m)
690
+ p = tl.math.exp2((s - l[:, None])*log2e) # (BLOCK_M, BLOCK_N)
691
+
692
+ if not DIVISIBLE_M:
693
+ p = tl.where(valid_mask, p, 0.0)
694
+ if CAUSAL:
695
+ p = tl.where(causal_mask, p, 0.0)
696
+
697
+ # compute dv = dot(p, do)
698
+ if DIVISIBLE_M:
699
+ do = tl.load(do_ptrs)
700
+ else:
701
+ do = tl.load(do_ptrs, mask=mask_m[:, None]) # (BLOCK_M, BLOCK_DMODEL)
702
+ dv += tl.dot(tl.trans(p.to(do.dtype)), do) # (BLOCK_N, BLOCK_DMODEL) # still correct
703
+
704
+ # compute dp = dot(v, do)
705
+ if DIVISIBLE_M:
706
+ delta = tl.load(D + offs_m)
707
+ else:
708
+ delta = tl.load(D + offs_m, mask=mask_m)
709
+ dp = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
710
+ dp += tl.dot(do, tl.trans(v))
711
+
712
+ # compute ds = p * (dp - delta[:, None])
713
+ ds = p * (dp - delta[:, None]) # (BLOCK_M, BLOCK_N)
714
+
715
+ if not DIVISIBLE_M:
716
+ ds = tl.where(valid_mask, ds, 0.0)
717
+ if CAUSAL:
718
+ ds = tl.where(causal_mask, ds, 0.0)
719
+
720
+ ds = ds.to(input_dtype)
721
+ # compute dk = dot(ds.T, q) masking
722
+ dk += tl.dot(tl.trans(ds), q)
723
+
724
+ # store ds
725
+ if RETURN_DS:
726
+ if DIVISIBLE_M and DIVISIBLE_N:
727
+ tl.store(ds_ptrs, ds)
728
+ else:
729
+ tl.store(ds_ptrs, ds, mask=mask_m[:, None] & mask_n[None, :])
730
+
731
+ # increment pointers
732
+ q_ptrs += BLOCK_M * stride_qm
733
+ do_ptrs += BLOCK_M * stride_dom
734
+ if HAS_BIAS:
735
+ bias_ptrs += BLOCK_M * stride_bm
736
+ if RETURN_DS:
737
+ ds_ptrs += BLOCK_M * stride_bm
738
+
739
+ dk *= sm_scale
740
+ if DIVISIBLE_N:
741
+ tl.store(dk_ptrs, dk.to(input_dtype)) # (BLOCK_N, BLOCK_DMODEL)
742
+ tl.store(dv_ptrs, dv.to(input_dtype)) # (BLOCK_N, BLOCK_DMODEL,)
743
+ else:
744
+ tl.store(dk_ptrs, dk.to(input_dtype), mask=mask_n[:, None]) # (BLOCK_N, BLOCK_DMODEL)
745
+ tl.store(dv_ptrs, dv.to(input_dtype), mask=mask_n[:, None]) # (BLOCK_N, BLOCK_DMODEL,)
746
+
747
+
748
+ @triton.jit
749
+ def _bwd_q_kernel(
750
+ Q, K, V, B, sm_scale, DO,
751
+ DQ,
752
+ L,
753
+ D,
754
+ stride_qz, stride_qh, stride_qm, stride_qk,
755
+ stride_kz, stride_kh, stride_kn, stride_kk,
756
+ stride_vz, stride_vh, stride_vn, stride_vk,
757
+ stride_bz, stride_bh, stride_bm, stride_bn,
758
+ stride_doz, stride_doh, stride_dom, stride_dok,
759
+ stride_dqz, stride_dqh, stride_dqm, stride_dqk,
760
+ Z, H, M, N, P_SEQ,
761
+ BLOCK_M: tl.constexpr, BLOCK_DMODEL: tl.constexpr, BLOCK_N: tl.constexpr,
762
+ CAUSAL: tl.constexpr, LARGER_M: tl.constexpr,
763
+ DIVISIBLE_M: tl.constexpr, DIVISIBLE_N: tl.constexpr,
764
+ HAS_BIAS: tl.constexpr
765
+ ):
766
+ input_dtype = Q.dtype.element_ty
767
+ # -- grid id --
768
+ start_m = tl.program_id(0)
769
+ off_h = tl.program_id(1)
770
+ off_z = tl.program_id(2)
771
+
772
+ # scale sm_scale by log_2(e) and use
773
+ # 2^x instead of exp in the loop because CSE and LICM
774
+ # don't work as expected with `exp` in the loop
775
+ log2e: tl.constexpr = 1.4426950408889634
776
+
777
+ # offset pointers for (batch, head)
778
+ Q += off_z * stride_qz + off_h * stride_qh
779
+ K += off_z * stride_kz + off_h * stride_kh
780
+ V += off_z * stride_vz + off_h * stride_vh
781
+ if HAS_BIAS:
782
+ B += off_z * stride_bz + off_h * stride_bh
783
+ DO += off_z * stride_doz + off_h * stride_doh
784
+ D += (off_z * H + off_h) * M
785
+ L += (off_z * H + off_h) * M
786
+
787
+ # offset pointers for batch/head
788
+ DQ += off_z * stride_dqz + off_h * stride_dqh
789
+
790
+ offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
791
+ offs_n_base = tl.arange(0, BLOCK_N)
792
+ offs_n_init = offs_n_base
793
+ offs_k = tl.arange(0, BLOCK_DMODEL)
794
+
795
+ # initialize pointers to value-like data
796
+ q_ptrs = Q + (offs_m[:, None] * stride_qm + offs_k[None, :] * stride_qk) # (BLOCK_M, BLOCK_DMODEL)
797
+ k_ptrs = K + (offs_n_init[:, None] * stride_kn + offs_k[None, :] * stride_kk) # (BLOCK_N, BLOCK_DMODEL)
798
+ v_ptrs = V + (offs_n_init[:, None] * stride_vn + offs_k[None, :] * stride_vk) # (BLOCK_N, BLOCK_DMODEL)
799
+
800
+ if HAS_BIAS:
801
+ bias_ptrs = B + (offs_m[:, None] * stride_bm + offs_n_init[None, :] * stride_bn)
802
+
803
+ dq_ptrs = DQ + (offs_m[:, None] * stride_dqm + offs_k[None, :] * stride_dqk) # (BLOCK_M, BLOCK_DMODEL)
804
+ do_ptrs = DO + (offs_m[:, None] * stride_dom + offs_k[None, :] * stride_dok) # (BLOCK_M, BLOCK_DMODEL)
805
+
806
+ # pointer to row-wise quantities in value-like data
807
+ d_ptrs = D + offs_m
808
+ l_ptrs = L + offs_m
809
+
810
+ # load q: it will stay in SRAM throughout
811
+ mask_m = offs_m < M
812
+ if DIVISIBLE_M:
813
+ q = tl.load(q_ptrs)
814
+ do = tl.load(do_ptrs)
815
+ delta = tl.load(d_ptrs)
816
+ l = tl.load(l_ptrs)
817
+ else:
818
+ q = tl.load(q_ptrs, mask=mask_m[:, None])
819
+ do = tl.load(do_ptrs, mask=mask_m[:, None])
820
+ delta = tl.load(d_ptrs, mask=mask_m)
821
+ l = tl.load(l_ptrs, mask=mask_m)
822
+
823
+ # initialize dq
824
+ dq = tl.zeros([BLOCK_M, BLOCK_DMODEL], dtype=tl.float32)
825
+
826
+ # loop over k, v and update accumulator
827
+ # see note "Loop-Bound-For-N"
828
+ if CAUSAL:
829
+ hi = tl.minimum(N, P_SEQ + (start_m + 1) * BLOCK_M)
830
+ if LARGER_M:
831
+ hi = tl.maximum(0, hi)
832
+ else:
833
+ hi = N
834
+
835
+ # loop over a row
836
+ for start_n in range(0, hi, BLOCK_N):
837
+ offs_n = start_n + offs_n_base
838
+
839
+ # load k and v on-chip
840
+ mask_n = offs_n < N
841
+ if DIVISIBLE_N:
842
+ v = tl.load(v_ptrs)
843
+ k = tl.load(k_ptrs)
844
+ else:
845
+ v = tl.load(v_ptrs, mask=mask_n[:, None])
846
+ k = tl.load(k_ptrs, mask=mask_n[:, None])
847
+
848
+ # load bias
849
+ if HAS_BIAS:
850
+ if DIVISIBLE_M and DIVISIBLE_N:
851
+ b = tl.load(bias_ptrs)
852
+ else:
853
+ b = tl.load(bias_ptrs, mask=mask_m[:, None] & mask_n[None, :])
854
+
855
+ # recompute p = softmax(qk * sm_scale, dim=-1)
856
+ if not DIVISIBLE_N:
857
+ valid_mask = mask_n # & mask_m[:, None]
858
+ if CAUSAL:
859
+ causal_mask = (P_SEQ + offs_m[:, None]) >= (offs_n[None, :]) # (BLOCK_M, BLOCK_N)
860
+
861
+ s = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
862
+ s += tl.dot(q, tl.trans(k)) * sm_scale
863
+ if HAS_BIAS:
864
+ s += b
865
+
866
+ # NOTE: since softmax in backward is pointwise, the normalizer was already saved in the
+ # forward pass, so masking on s is not needed here.
868
+ # if CAUSAL:
869
+ # s = tl.where(causal_mask & valid_mask, s, float("-inf"))
870
+ # else:
871
+ # s = tl.where(valid_mask, s, float("-inf"))
872
+ p = tl.math.exp2((s - l[:, None])*log2e) # (BLOCK_M, BLOCK_N)
873
+
874
+ # compute dp = dot(v, do)
875
+ dp = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
876
+ dp += tl.dot(do.to(input_dtype), tl.trans(v))
877
+ # no need to mask dp
878
+ # if CAUSAL:
879
+ # dp = tl.where(causal_mask & valid_mask, dp, 0.0)
880
+ # else:
881
+ # dp = tl.where(valid_mask, dp, 0.0)
882
+
883
+ # compute ds = p * (dp - delta[:, None])
884
+ # move scale out to dq at last
885
+ ds = p * (dp - delta[:, None]) # (BLOCK_M, BLOCK_N)
886
+
887
+ # mask ds to ensure no small values
888
+ if not DIVISIBLE_N:
889
+ ds = tl.where(valid_mask, ds, 0.0)
890
+ if CAUSAL:
891
+ ds = tl.where(causal_mask, ds, 0.0)
892
+
893
+ dq += tl.dot(ds.to(input_dtype), k)
894
+
895
+ # increment pointers
896
+ k_ptrs += BLOCK_N * stride_kn
897
+ v_ptrs += BLOCK_N * stride_vn
898
+ if HAS_BIAS:
899
+ bias_ptrs += BLOCK_N * stride_bn
900
+
901
+ dq *= sm_scale
902
+ if DIVISIBLE_M:
903
+ tl.store(dq_ptrs, dq.to(input_dtype))
904
+ else:
905
+ tl.store(dq_ptrs, dq.to(input_dtype), mask=mask_m[:, None])
generation_config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "decoder_start_token_id": 0,
4
+ "eos_token_id": 1,
5
+ "pad_token_id": 3,
6
+ "transformers_version": "4.46.0.dev0"
7
+ }
modeling_flash_t5.py ADDED
@@ -0,0 +1,790 @@
1
+ # From: https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py
2
+
3
+ import copy
4
+ import math
5
+ from typing import Optional, Tuple, Union
6
+
7
+ import torch
8
+ from torch import nn
9
+ import torch.nn.functional as F
10
+
11
+ from transformers.modeling_utils import ModuleUtilsMixin
12
+ from transformers.modeling_outputs import ModelOutput, Seq2SeqModelOutput, BaseModelOutput, Seq2SeqLMOutput
13
+ from transformers import PreTrainedModel
14
+
15
+ try:
16
+ from .rms_norm import fast_rms_layernorm
17
+ except ImportError:
18
+ fast_rms_layernorm = None
19
+
20
+ try:
21
+ from .cross_entropy_loss import cross_entropy_loss as fast_cross_entropy_loss
22
+ except ImportError:
23
+ fast_cross_entropy_loss = None
24
+
25
+ try:
26
+ from .flash_attention_v2_bias import flash_attention_v2_bias
27
+ except ImportError:
28
+ flash_attention_v2_bias = None
29
+
30
+ try:
31
+ from flash_attn import flash_attn_kvpacked_func, flash_attn_func
32
+ except ImportError:
33
+ flash_attn_kvpacked_func, flash_attn_func = None, None
34
+
35
+ from .attn_ref import attn_ref
36
+
37
+ from .configuration_flash_t5 import FlashT5Config
38
+ from .positional_encoding import ALiBiPositionalEncoding, RelativePositionalEncoding, RotaryPositionalEncoding, FIRE
39
+
40
+ class FlashT5CrossEntropyLoss(nn.Module):
41
+ def __init__(self, z_loss_factor=0.0, label_smoothing=0.0, use_triton_crossentropy=False, inplace_backward=False):
42
+
43
+ super().__init__()
44
+
45
+ if use_triton_crossentropy and fast_cross_entropy_loss is None:
46
+ raise ImportError("fast_cross_entropy_loss is not available")
47
+
48
+ self.use_triton_crossentropy = use_triton_crossentropy
49
+ self.z_loss_factor = z_loss_factor
50
+ self.label_smoothing = label_smoothing
51
+ self.inplace_backward = inplace_backward
52
+
53
+ self.cross_entropy_loss = nn.CrossEntropyLoss(label_smoothing=label_smoothing)
54
+
55
+ def compute_zloss(self, logits: torch.Tensor, z_loss: float):
56
+ logits_sum = torch.logsumexp(logits, dim=-1, keepdim=True)
57
+ log_z = torch.squeeze(logits_sum, axis=-1)
58
+ total_z_loss = z_loss * torch.square(log_z)
59
+ return total_z_loss.mean()
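+ # i.e. the auxiliary term is z_loss * log(Z)^2 with Z = sum_j exp(logit_j), which keeps the
+ # softmax normalizer close to 1 (the z-loss term used in T5x/PaLM-style training).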
60
+
61
+ def forward(self, logits, labels):
62
+
63
+ if self.use_triton_crossentropy:
64
+ return fast_cross_entropy_loss(logits, labels, \
65
+ lse_square_scale=self.z_loss_factor, \
66
+ label_smoothing=self.label_smoothing, \
67
+ inplace_backward=self.inplace_backward \
68
+ )[0].mean()
69
+
70
+ # use standard method
71
+ batch, seq_len, d = logits.shape
72
+ logits_flatten = logits.float().view(batch*seq_len, d) # Must cast to float32 for numerical stability
73
+ labels_flatten = labels.view(-1)
74
+ loss = self.cross_entropy_loss(logits_flatten, labels_flatten)
75
+ z_loss = 0.0
76
+ if self.z_loss_factor != 0.0:
77
+ z_loss = self.compute_zloss(logits_flatten[labels_flatten != -100],
78
+ z_loss=self.z_loss_factor)
79
+ return loss + z_loss
80
+
81
+ class FlashT5LayerNorm(nn.Module):
82
+ def __init__(self, hidden_size, eps=1e-6, use_triton_layernorm=False):
83
+ """
84
+ Construct a layernorm module in the T5 style. No bias and no subtraction of mean.
85
+ """
86
+ super().__init__()
87
+
88
+ if use_triton_layernorm and fast_rms_layernorm is None:
89
+ raise ImportError("fast_rms_layernorm is not available")
90
+
91
+ self.use_triton_layernorm = use_triton_layernorm
92
+ self.weight = nn.Parameter(torch.ones(hidden_size))
93
+ self.variance_epsilon = eps
94
+
95
+ def forward(self, hidden_states):
96
+
97
+ if self.use_triton_layernorm:
98
+ return fast_rms_layernorm(hidden_states, self.weight, self.variance_epsilon)
99
+
100
+ # T5 uses a layer norm which only scales and doesn't shift, also known as Root Mean Square
+ # Layer Normalization (https://arxiv.org/abs/1910.07467): the variance is computed without
+ # subtracting the mean and there is no bias. Additionally we make sure that the accumulation
+ # for half-precision inputs is done in fp32.
104
+
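+ # i.e. y = weight * x / sqrt(mean(x^2) + eps)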
105
+ variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
106
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
107
+
108
+ # convert into half-precision if necessary
109
+ if self.weight.dtype in [torch.float16, torch.bfloat16]:
110
+ hidden_states = hidden_states.to(self.weight.dtype)
111
+
112
+ return self.weight * hidden_states
113
+
114
+ class FlashT5DenseAct(nn.Module):
115
+ def __init__(self, config: FlashT5Config):
116
+ super().__init__()
117
+ self.wi = nn.Linear(config.d_model, config.d_ff, bias=False)
118
+ self.dropout = nn.Dropout(config.dropout_rate)
119
+ self.act = torch.nn.GELU(approximate='tanh') if config.use_gelu_act else torch.nn.ReLU()
120
+
121
+ def forward(self, hidden_states):
122
+ hidden_states = self.wi(hidden_states)
123
+ hidden_states = self.act(hidden_states)
124
+ hidden_states = self.dropout(hidden_states)
125
+ if (
126
+ isinstance(self.wo.weight, torch.Tensor)
127
+ and hidden_states.dtype != self.wo.weight.dtype
128
+ and self.wo.weight.dtype != torch.int8
129
+ ):
130
+ hidden_states = hidden_states.to(self.wo.weight.dtype)
131
+
132
+ return hidden_states
133
+
134
+ class FlashT5DenseGatedAct(nn.Module):
135
+ def __init__(self, config: FlashT5Config):
136
+ super().__init__()
137
+ self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
138
+ self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
139
+ self.dropout = nn.Dropout(config.dropout_rate)
140
+ self.act = torch.nn.GELU(approximate='tanh') if config.use_gelu_act else torch.nn.ReLU()
141
+
142
+ self.use_gelu_act = config.use_gelu_act
143
+
144
+ def forward(self, hidden_states):
145
+
146
+ hidden_act = self.act(self.wi_0(hidden_states))
147
+ hidden_linear = self.wi_1(hidden_states)
148
+ hidden_states = hidden_act * hidden_linear
149
+ hidden_states = self.dropout(hidden_states)
150
+
151
+ return hidden_states
152
+
153
+ class FlashT5LayerFF(nn.Module):
154
+ def __init__(self, config: FlashT5Config):
155
+ super().__init__()
156
+ if config.use_glu_mlp:
157
+ self.act = FlashT5DenseGatedAct(config)
158
+ else:
159
+ self.act = FlashT5DenseAct(config)
160
+
161
+ self.layer_norm = FlashT5LayerNorm(config.d_model, eps=config.layer_norm_epsilon, use_triton_layernorm=config.use_triton_layernorm)
162
+ self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
163
+ self.dropout = nn.Dropout(config.dropout_rate)
164
+
165
+ def forward(self, hidden_states):
166
+ forwarded_states = self.layer_norm(hidden_states).type_as(hidden_states)
167
+ forwarded_states = self.act(forwarded_states)
168
+ forwarded_states = self.wo(forwarded_states)
169
+ hidden_states = hidden_states + self.dropout(forwarded_states)
170
+ return hidden_states
171
+
172
+
173
+ class FlashT5Attention(nn.Module, ModuleUtilsMixin):
174
+ def __init__(self, config: FlashT5Config, has_positional_encoding=False, is_causal=False):
175
+ super().__init__()
176
+ self.is_decoder = config.is_decoder
177
+ self.has_positional_encoding = has_positional_encoding
178
+ self.is_causal = is_causal
179
+ self.relative_attention_num_buckets = config.relative_attention_num_buckets
180
+ self.relative_attention_max_distance = config.relative_attention_max_distance
181
+ self.d_model = config.d_model
182
+ self.key_value_proj_dim = config.d_kv
183
+ self.n_heads = config.num_heads
184
+ self.p_dropout = config.attention_dropout_rate
185
+ self.inner_dim = self.n_heads * self.key_value_proj_dim
186
+ self.attention_type = config.attention_type
187
+ self.position_encoding_type = config.position_encoding_type
188
+ self.max_sequence_length = config.max_sequence_length
189
+ self.softmax_scale = config.attention_scale if config.attention_scale is not None else 1.0/math.sqrt(self.n_heads)
190
+ self.use_full_bias_size = config.use_full_bias_size
191
+ self.use_masking = config.use_masking
192
+
193
+ if self.use_masking and not self.use_full_bias_size:
194
+ raise ValueError("Masking can only be used with full batch size.")
195
+
196
+ if self.attention_type == "triton" and flash_attention_v2_bias is None:
197
+ raise ImportError("flash_attention_triton is not available")
198
+ elif self.attention_type.startswith("fa2") and flash_attn_func is None:
199
+ raise ImportError("Flash Attention 2 is not available")
200
+
201
+ if self.attention_type == "fa2_rpe" and self.position_encoding_type != "t5":
202
+ raise ValueError("fa2_rpe is not compatible with non-T5 position encoding")
203
+
204
+ assert (self.p_dropout == 0.0) or (self.attention_type != "triton"), "Triton attention does not support dropout"
205
+
206
+ self.pe_encoding = None
207
+ if self.position_encoding_type == "ALiBi" and has_positional_encoding:
208
+ # build alibi matrix with an upper bound on seq length
209
+ self.pe_encoding = ALiBiPositionalEncoding(self.max_sequence_length,
210
+ self.n_heads,
211
+ config.alibi_mode,
212
+ randomized_position=config.use_randomized_position_encoding)
213
+ elif self.position_encoding_type == "t5" and has_positional_encoding:
214
+ self.pe_encoding = RelativePositionalEncoding(self.relative_attention_num_buckets,
215
+ self.relative_attention_max_distance,
216
+ self.n_heads,
217
+ self.max_sequence_length,
218
+ bidirectional=(not self.is_decoder),
219
+ randomized_position=config.use_randomized_position_encoding)
220
+ elif self.position_encoding_type == "RoPE":
221
+ self.pe_encoding = RotaryPositionalEncoding(int(self.key_value_proj_dim * config.rotary_emb_fraction),
222
+ self.max_sequence_length,
223
+ config.rotary_base,
224
+ config.rotary_interleaved,
225
+ config.rotary_scale_base,
226
+ randomized_position=config.use_randomized_position_encoding)
227
+ elif self.position_encoding_type == "FIRE" and has_positional_encoding:
228
+ self.pe_encoding = FIRE(num_heads=self.n_heads,
229
+ mlp_width=config.fire_mlp_width,
230
+ init_c=0.1,
231
+ init_L=self.relative_attention_max_distance)
232
+
233
+ self.Wq = nn.Linear(self.d_model, self.inner_dim, bias=False)
234
+ self.Wk = nn.Linear(self.d_model, self.inner_dim, bias=False)
235
+ self.Wv = nn.Linear(self.d_model, self.inner_dim, bias=False)
236
+ self.o = nn.Linear(self.inner_dim, self.d_model, bias=False)
237
+
238
+ def forward(
239
+ self,
240
+ hidden_states,
241
+ mask=None,
242
+ key_value_states=None,
243
+ position_bias=None,
244
+ ):
245
+ """
246
+ Self-attention (if key_value_states is None) or attention over source sentence (provided by key_value_states).
247
+ """
248
+ # Input is (batch_size, seq_length, dim)
249
+ # Mask is (batch_size, key_length) (non-causal) or (batch_size, key_length, key_length)
250
+ batch_size, seq_length = hidden_states.shape[:2]
251
+ key_length = seq_length if key_value_states is None else key_value_states.shape[1]
252
+ q = self.Wq(hidden_states)
253
+ if key_value_states is None:
254
+ k = self.Wk(hidden_states)
255
+ v = self.Wv(hidden_states)
256
+ else:
257
+ k = self.Wk(key_value_states)
258
+ v = self.Wv(key_value_states)
259
+
260
+ q = q.view(batch_size, seq_length, self.n_heads, self.key_value_proj_dim)
261
+ k = k.view(batch_size, key_length, self.n_heads, self.key_value_proj_dim)
262
+ v = v.view(batch_size, key_length, self.n_heads, self.key_value_proj_dim)
263
+
264
+ if position_bias is None and self.pe_encoding is not None and self.attention_type != "fa2_rpe":
265
+ q, k, v, position_bias = self.pe_encoding(q, k, v)
266
+
267
+ if position_bias is not None and self.use_full_bias_size:
268
+ position_bias = position_bias.expand(q.shape[0], q.shape[2], q.shape[1], k.shape[1])
269
+ if self.attention_type == "fa2_bias" or self.attention_type == "triton":
270
+ position_bias = position_bias.contiguous()
271
+
272
+ if position_bias is not None and mask is not None and self.use_masking:
273
+ mask = mask.unsqueeze(1)
274
+ if len(mask.shape) == 3:
275
+ mask = mask.unsqueeze(3)
276
+ position_bias = torch.where(mask, position_bias, torch.finfo(hidden_states.dtype).min)
277
+
278
+ if self.attention_type == "fa2_bias":
279
+ output = flash_attn_func(q, k, v, dropout_p=self.p_dropout, softmax_scale=self.softmax_scale, \
280
+ attn_bias=position_bias, causal=self.is_causal)
281
+ elif self.attention_type == "fa2_rpe":
282
+ output = flash_attn_func(q, k, v, dropout_p=self.p_dropout, softmax_scale=self.softmax_scale, \
283
+ rpe_weights=self.pe_encoding.relative_attention_bias.weight.t(), \
284
+ rpe_max_distance=self.relative_attention_max_distance, \
285
+ causal=self.is_causal)
286
+ elif self.attention_type == "triton":
287
+ q = q.permute(0, 2, 1, 3)
288
+ k = k.permute(0, 2, 1, 3)
289
+ v = v.permute(0, 2, 1, 3)
290
+ output = flash_attention_v2_bias(q, k, v, position_bias, self.is_causal, self.softmax_scale)
291
+ output = output.permute(0, 2, 1, 3)
292
+ else: # use flash attention
293
+ q = q.permute(0, 2, 1, 3)
294
+ k = k.permute(0, 2, 1, 3)
295
+ v = v.permute(0, 2, 1, 3)
296
+ output = attn_ref(q, k, v, position_bias, dropout_p=self.p_dropout, sm_scale=self.softmax_scale, causal=self.is_causal)
297
+ output = output.permute(0, 2, 1, 3)
298
+
299
+ output = self.o(output.reshape(output.shape[0], output.shape[1], self.inner_dim))
300
+ return (output, position_bias)
301
+
302
+
303
+ class FlashT5LayerSelfAttention(nn.Module):
304
+ def __init__(self, config, has_positional_encoding=False):
305
+ super().__init__()
306
+ self.self_attention = FlashT5Attention(config, has_positional_encoding=has_positional_encoding, is_causal=config.is_decoder)
307
+ self.layer_norm = FlashT5LayerNorm(config.d_model, eps=config.layer_norm_epsilon, use_triton_layernorm=config.use_triton_layernorm)
308
+ self.dropout = nn.Dropout(config.dropout_rate)
309
+
310
+ def forward(
311
+ self,
312
+ hidden_states,
313
+ attention_mask=None,
314
+ position_bias=None,
315
+ ):
316
+ normed_hidden_states = self.layer_norm(hidden_states).type_as(hidden_states)
317
+ attention_output = self.self_attention(
318
+ normed_hidden_states,
319
+ mask=attention_mask,
320
+ position_bias=position_bias,
321
+ )
322
+ hidden_states = hidden_states + self.dropout(attention_output[0])
323
+ outputs = (hidden_states,) + attention_output[1:]
324
+ return outputs
325
+
326
+
327
+ class FlashT5LayerCrossAttention(nn.Module):
328
+ def __init__(self, config):
329
+ super().__init__()
330
+ self.cross_attention = FlashT5Attention(config, has_positional_encoding=False)
331
+ self.layer_norm = FlashT5LayerNorm(config.d_model, eps=config.layer_norm_epsilon, use_triton_layernorm=config.use_triton_layernorm)
332
+ self.dropout = nn.Dropout(config.dropout_rate)
333
+
334
+ def forward(
335
+ self,
336
+ hidden_states,
337
+ key_value_states,
338
+ attention_mask=None,
339
+ position_bias=None,
340
+ ):
341
+ normed_hidden_states = self.layer_norm(hidden_states)
342
+ attention_output = self.cross_attention(
343
+ normed_hidden_states,
344
+ mask=attention_mask,
345
+ key_value_states=key_value_states,
346
+ position_bias=position_bias,
347
+ )
348
+ layer_output = hidden_states + self.dropout(attention_output[0])
349
+ outputs = (layer_output,) + attention_output[1:]
350
+ return outputs
351
+
352
+
353
+ class FlashT5Block(nn.Module):
354
+ def __init__(self, config, has_positional_encoding=False):
355
+ super().__init__()
356
+ self.is_decoder = config.is_decoder
357
+
358
+ self.self_attention_layer = FlashT5LayerSelfAttention(config, has_positional_encoding=has_positional_encoding)
359
+
360
+ if self.is_decoder:
361
+ self.cross_attention_layer = FlashT5LayerCrossAttention(config)
362
+
363
+ self.ff_layer = FlashT5LayerFF(config)
364
+
365
+ def forward(
366
+ self,
367
+ hidden_states,
368
+ attention_mask=None,
369
+ position_bias=None,
370
+ encoder_hidden_states=None,
371
+ encoder_attention_mask=None,
372
+ encoder_decoder_position_bias=None,
373
+ ):
374
+ self_attention_outputs = self.self_attention_layer(
375
+ hidden_states,
376
+ attention_mask=attention_mask,
377
+ position_bias=position_bias,
378
+ )
379
+ hidden_states = self_attention_outputs[0]
380
+ attention_outputs = self_attention_outputs[1:] # Relative position weights
381
+
382
+ if self.is_decoder and encoder_hidden_states is not None:
383
+ cross_attention_outputs = self.cross_attention_layer(
384
+ hidden_states,
385
+ key_value_states=encoder_hidden_states,
386
+ attention_mask=encoder_attention_mask,
387
+ position_bias=encoder_decoder_position_bias,
388
+ )
389
+ hidden_states = cross_attention_outputs[0]
390
+
391
+ # Keep relative position weights
392
+ attention_outputs = attention_outputs + cross_attention_outputs[1:]
393
+
394
+ # Apply Feed Forward layer
395
+ hidden_states = self.ff_layer(hidden_states)
396
+
397
+ outputs = (hidden_states,) + attention_outputs
398
+ return outputs # hidden-states, (self-attention position bias), (cross-attention position bias)
399
+
400
+
401
+ class FlashT5Stack(nn.Module, ModuleUtilsMixin):
402
+ def __init__(self, config, embed_tokens):
403
+ super().__init__()
404
+ assert embed_tokens is not None
405
+
406
+ self.config = config
407
+ self.embed_tokens = embed_tokens
408
+ self.is_decoder = config.is_decoder
409
+
410
+ self.block = nn.ModuleList(
411
+ [FlashT5Block(config, has_positional_encoding=bool(i == 0)) for i in range(config.num_layers)]
412
+ )
413
+
414
+ self.final_layer_norm = FlashT5LayerNorm(config.d_model, eps=config.layer_norm_epsilon, use_triton_layernorm=config.use_triton_layernorm)
415
+ self.dropout = nn.Dropout(config.dropout_rate)
416
+
417
+ def forward(
418
+ self,
419
+ input_ids=None,
420
+ # input_ids: Optional[torch.LongTensor] = None,
421
+ attention_mask=None,
422
+ encoder_hidden_states=None,
423
+ encoder_attention_mask=None,
424
+ inputs_embeds=None,
425
+ head_mask=None,
426
+ cross_attn_head_mask=None,
427
+ past_key_values=None,
428
+ use_cache=None,
429
+ output_attentions=None,
430
+ output_hidden_states=None,
431
+ return_dict=None,
432
+ ) -> BaseModelOutput:
433
+ if inputs_embeds is None:
+ inputs_embeds = self.embed_tokens(input_ids)
+
+ # Derive the shape from the embeddings so that passing inputs_embeds without input_ids also works
+ batch_size, seq_length = inputs_embeds.shape[:2]
+
+ if torch.is_autocast_enabled() and inputs_embeds.device.type == 'cuda':
+ inputs_embeds = inputs_embeds.to(torch.get_autocast_gpu_dtype())
441
+
442
+ # Masking
443
+ if attention_mask is None:
444
+ attention_mask = torch.ones(batch_size, seq_length, device=inputs_embeds.device, dtype=torch.bool)
445
+
446
+ if self.is_decoder and encoder_attention_mask is None and encoder_hidden_states is not None:
447
+ encoder_seq_length = encoder_hidden_states.shape[1]
448
+ encoder_attention_mask = torch.ones(
449
+ batch_size, encoder_seq_length, device=inputs_embeds.device, dtype=torch.bool
450
+ )
451
+
452
+ position_bias = None
453
+ encoder_decoder_position_bias = None
454
+
455
+ hidden_states = self.dropout(inputs_embeds)
456
+
457
+ for _, layer_module in enumerate(self.block):
458
+ layer_outputs = layer_module(
459
+ hidden_states,
460
+ attention_mask=attention_mask,
461
+ position_bias=position_bias,
462
+ encoder_hidden_states=encoder_hidden_states,
463
+ encoder_attention_mask=encoder_attention_mask,
464
+ encoder_decoder_position_bias=encoder_decoder_position_bias,
465
+ )
466
+
467
+ # We share the position biases between the layers - the first layer stores them
468
+ position_bias = layer_outputs[1]
469
+ if self.is_decoder and encoder_hidden_states is not None:
470
+ encoder_decoder_position_bias = layer_outputs[2]
471
+
472
+ hidden_states = layer_outputs[0]
473
+
474
+ hidden_states = self.final_layer_norm(hidden_states).type_as(hidden_states)
475
+ hidden_states = self.dropout(hidden_states)
476
+
477
+ return BaseModelOutput(
478
+ last_hidden_state=hidden_states
479
+ )
480
+
481
+
482
+
483
+ class FlashT5PreTrainedModel(PreTrainedModel):
484
+ """
485
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
486
+ models.
487
+ """
488
+
489
+ config_class = FlashT5Config
490
+ base_model_prefix = "transformer"
491
+ is_parallelizable = False
492
+ supports_gradient_checkpointing = True
493
+ _no_split_modules = ["FlashT5Block"]
494
+ _keep_in_fp32_modules = []
495
+
496
+ def _init_weights(self, module):
497
+ factor = self.config.initializer_factor # Used for testing weights initialization
498
+ if isinstance(module, FlashT5LayerNorm):
499
+ module.weight.data.fill_(factor * 1.0)
500
+ elif isinstance(module, (FlashT5ForConditionalGeneration)):
501
+ module.shared.weight.data.normal_(mean=0.0, std=factor * 1.0)
502
+ if hasattr(module, "lm_head") and not self.config.tie_word_embeddings:
503
+ module.lm_head.weight.data.normal_(mean=0.0, std=factor * self.config.d_model ** -0.5)
504
+ elif isinstance(module, FlashT5DenseGatedAct):
505
+ d_ff, d_model = module.wi_0.weight.data.size()
506
+ module.wi_0.weight.data.normal_(mean=0.0, std=factor * ((d_model) ** -0.5))
507
+ module.wi_1.weight.data.normal_(mean=0.0, std=factor * ((d_model) ** -0.5))
508
+ elif isinstance(module, FlashT5LayerFF):
509
+ d_ff, d_model = module.wo.weight.data.size()
510
+ module.wo.weight.data.normal_(mean=0.0, std=factor * ((d_ff) ** -0.5))
511
+ elif isinstance(module, FlashT5Attention):
512
+ d_model = self.config.d_model
513
+ key_value_proj_dim = self.config.d_kv
514
+ n_heads = self.config.num_heads
515
+ module.Wq.weight.data.normal_(mean=0.0, std=factor * ((d_model * key_value_proj_dim) ** -0.5))
516
+ module.Wk.weight.data.normal_(mean=0.0, std=factor * (d_model**-0.5))
517
+ module.Wv.weight.data.normal_(mean=0.0, std=factor * (d_model**-0.5))
518
+ module.o.weight.data.normal_(mean=0.0, std=factor * ((n_heads * key_value_proj_dim) ** -0.5))
519
+ if module.has_positional_encoding:
520
+ if hasattr(module.pe_encoding, "relative_attention_bias"):
521
+ module.pe_encoding.relative_attention_bias.weight.data.normal_(mean=0.0, std=factor * ((d_model) ** -0.5))
522
+
523
+ def _shift_right(self, input_ids):
524
+ decoder_start_token_id = self.config.decoder_start_token_id
525
+ pad_token_id = self.config.pad_token_id
526
+
527
+ shifted_input_ids = input_ids.new_zeros(input_ids.shape)
528
+ shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
529
+ shifted_input_ids[..., 0] = decoder_start_token_id
530
+
531
+ # replace possible -100 values in labels by `pad_token_id`
532
+ shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)
533
+
534
+ return shifted_input_ids
535
+
536
+
537
+ class FlashT5Model(FlashT5PreTrainedModel):
538
+
539
+ def __init__(self, config: FlashT5Config):
540
+ super().__init__(config)
541
+ self.shared = nn.Embedding(config.vocab_size, config.d_model)
542
+
543
+ encoder_config = copy.deepcopy(config)
544
+ encoder_config.is_decoder = False
545
+ encoder_config.use_cache = False
546
+ encoder_config.is_encoder_decoder = False
547
+ self.encoder = FlashT5Stack(encoder_config, self.shared)
548
+
549
+ decoder_config = copy.deepcopy(config)
550
+ decoder_config.is_decoder = True
551
+ decoder_config.is_encoder_decoder = False
552
+ decoder_config.num_layers = config.num_decoder_layers
553
+ self.decoder = FlashT5Stack(decoder_config, self.shared)
554
+
555
+ # Initialize weights and apply final processing
556
+ self.post_init()
557
+
558
+ # Model parallel
559
+ self.model_parallel = False
560
+ self.device_map = None
561
+
562
+ def get_input_embeddings(self):
563
+ return self.shared
564
+
565
+ def set_input_embeddings(self, new_embeddings):
566
+ self.shared = new_embeddings
567
+ self.encoder.set_input_embeddings(new_embeddings)
568
+ self.decoder.set_input_embeddings(new_embeddings)
569
+
570
+ def get_encoder(self):
571
+ return self.encoder
572
+
573
+ def get_decoder(self):
574
+ return self.decoder
575
+
576
+ def forward(
+ self,
+ input_ids=None,
+ attention_mask=None,
+ decoder_input_ids=None,
+ decoder_attention_mask=None,
+ head_mask=None,
+ cross_attn_head_mask=None,
+ encoder_outputs=None,
+ past_key_values=None,
+ inputs_embeds=None,
+ decoder_inputs_embeds=None,
+ use_cache=None,
+ output_attentions=None,
+ output_hidden_states=None,
+ return_dict=None,
+ ) -> Union[Tuple[torch.FloatTensor], Seq2SeqModelOutput]:
592
+
593
+ # Encode if needed (training, first prediction pass)
594
+ if encoder_outputs is None:
595
+ encoder_outputs = self.encoder(
596
+ input_ids=input_ids,
597
+ attention_mask=attention_mask,
598
+ inputs_embeds=inputs_embeds
599
+ )
600
+
601
+ hidden_states = encoder_outputs[0]
602
+
603
+ # Decode
604
+ decoder_outputs = self.decoder(
605
+ input_ids=decoder_input_ids,
606
+ attention_mask=decoder_attention_mask,
607
+ inputs_embeds=decoder_inputs_embeds,
608
+ encoder_hidden_states=hidden_states,
609
+ encoder_attention_mask=attention_mask
610
+ )
611
+
612
+ return Seq2SeqModelOutput(
613
+ last_hidden_state=decoder_outputs.last_hidden_state,
614
+ decoder_hidden_states=decoder_outputs.hidden_states,
615
+ encoder_last_hidden_state=encoder_outputs.last_hidden_state,
616
+ encoder_hidden_states=encoder_outputs.hidden_states,
617
+ )
618
+
619
+ class FlashT5ForConditionalGeneration(FlashT5PreTrainedModel):
620
+
621
+ def __init__(self, config: FlashT5Config):
622
+ super().__init__(config)
623
+ config.is_encoder_decoder = False
624
+ assert not config.tie_word_embeddings
625
+
626
+ self.config = config
627
+ self.model_dim = config.d_model
628
+ self.shared = nn.Embedding(config.vocab_size, config.d_model)
629
+
630
+ encoder_config = copy.deepcopy(config)
631
+ encoder_config.is_decoder = False
632
+ self.encoder = FlashT5Stack(encoder_config, self.shared)
633
+
634
+ decoder_config = copy.deepcopy(config)
635
+ decoder_config.is_decoder = True
636
+ decoder_config.num_layers = config.num_decoder_layers
637
+ self.decoder = FlashT5Stack(decoder_config, self.shared)
638
+
639
+ self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
640
+
641
+ self.loss_fct = FlashT5CrossEntropyLoss(z_loss_factor=config.z_loss,
642
+ label_smoothing=config.label_smoothing,
643
+ use_triton_crossentropy=config.use_triton_crossentropy,
644
+ inplace_backward=config.crossentropy_inplace_backward)
645
+
646
+ # Initialize weights and apply final processing
647
+ self.post_init()
648
+
649
+ def prepare_inputs_for_generation(
650
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
651
+ ):
652
+ # do nothing
653
+ model_inputs = {"input_ids": input_ids, "attention_mask": attention_mask}
654
+
655
+ return model_inputs
656
+
657
+ def get_input_embeddings(self):
658
+ return self.shared
659
+
660
+ def set_input_embeddings(self, value):
661
+ self.shared = value
662
+
663
+ def generate(
664
+ self,
665
+ input_ids: Optional[torch.LongTensor] = None,
666
+ attention_mask: Optional[torch.FloatTensor] = None,
667
+ max_length = 32,
668
+ **kwargs,
669
+ ) -> torch.LongTensor:
670
+ """
671
+ input_ids: B x L_encoder, int64
672
+ attention_mask: B x L_encoder, int64
673
+ 1 for tokens to attend to, 0 for tokens to ignore
674
+
675
+ Generation:
676
+ Starts with 0, ends with 1, padding is 0
677
+
678
+ # For 20 input/outputs, the diff between my implementation and HF is 9.8s vs 11.4s
679
+ """
680
+ B, _ = input_ids.size()
681
+ labels = torch.zeros(B, 1, dtype=torch.long, device=input_ids.device)
682
+ encoder_hidden_states = None
683
+
684
+ for _ in range(max_length):
685
+ out = self.forward(
686
+ input_ids=input_ids,
687
+ attention_mask=attention_mask,
688
+ decoder_input_ids=labels,
689
+ encoder_hidden_states=encoder_hidden_states,
690
+ )
691
+ encoder_hidden_states = out.encoder_hidden_states
692
+ top_labels = out.logits[:, -1].argmax(-1).unsqueeze(-1)
693
+ labels = torch.cat([labels, top_labels], dim=-1)
694
+
695
+ if (labels == 1).sum(-1).clamp(min=0, max=1).sum().item() == B:
696
+ break
697
+
698
+ labels[:, -1] = 1
699
+
700
+ # Mask out the padding, i.e., all positions after the first 1 with 0
701
+ B, L = labels.size()
702
+ mask = torch.arange(L, device=labels.device).unsqueeze(0) <= (labels == 1).long().argmax(-1).unsqueeze(-1)
703
+ labels = labels.masked_fill(~mask, 0)
704
+
705
+ return labels
706
+
707
+ def forward(
708
+ self,
709
+ input_ids: Optional[torch.LongTensor] = None,
710
+ attention_mask: Optional[torch.FloatTensor] = None,
711
+ decoder_input_ids: Optional[torch.LongTensor] = None,
712
+ decoder_attention_mask: Optional[torch.BoolTensor] = None,
713
+ labels: Optional[torch.LongTensor] = None,
714
+ encoder_hidden_states: Optional[torch.FloatTensor] = None,
715
+ ) -> Seq2SeqLMOutput:
716
+ """
717
+ input_ids: B x L_encoder, int64
718
+ attention_mask: B x L_encoder, int64
719
+ 1 for tokens to attend to, 0 for tokens to ignore
720
+ labels: B x L_decoder, int64
721
+ """
722
+ if encoder_hidden_states is None:
723
+ encoder_hidden_states = self.encoder(
724
+ input_ids=input_ids,
725
+ attention_mask=attention_mask,
726
+ )[0]
727
+
728
+ hidden_states = encoder_hidden_states
729
+
730
+ if labels is not None and decoder_input_ids is None:
731
+ decoder_input_ids = self._shift_right(labels)
732
+
733
+ decoder_outputs = self.decoder(
734
+ input_ids=decoder_input_ids,
735
+ attention_mask=decoder_attention_mask,
736
+ encoder_hidden_states=hidden_states,
737
+ encoder_attention_mask=attention_mask,
738
+ )
739
+
740
+ sequence_output = decoder_outputs[0]
741
+ lm_logits = self.lm_head(sequence_output)
742
+
743
+ loss = None
744
+ if labels is not None:
745
+ loss = self.loss_fct(lm_logits, labels)
746
+
747
+ return Seq2SeqLMOutput(
748
+ loss=loss,
749
+ logits=lm_logits,
750
+ encoder_hidden_states=encoder_hidden_states
751
+ )
752
+
753
+
754
+ class FlashT5EncoderModel(FlashT5PreTrainedModel):
755
+ def __init__(self, config: FlashT5Config):
756
+ super().__init__(config)
757
+ self.shared = nn.Embedding(config.vocab_size, config.d_model)
758
+ encoder_config = copy.deepcopy(config)
759
+ encoder_config.use_cache = False
760
+ encoder_config.is_encoder_decoder = False
761
+ self.encoder = FlashT5Stack(encoder_config, self.shared)
762
+ # Initialize weights and apply final processing
763
+ self.post_init()
764
+ # Model parallel
765
+ self.model_parallel = False
766
+ self.device_map = None
767
+ def get_input_embeddings(self):
768
+ return self.shared
769
+ def set_input_embeddings(self, new_embeddings):
770
+ self.shared = new_embeddings
771
+ self.encoder.set_input_embeddings(new_embeddings)
772
+ def get_encoder(self):
773
+ return self.encoder
774
+ def forward(
775
+ self,
776
+ input_ids: Optional[torch.LongTensor] = None,
777
+ attention_mask: Optional[torch.FloatTensor] = None,
778
+ head_mask: Optional[torch.FloatTensor] = None,
779
+ inputs_embeds: Optional[torch.FloatTensor] = None,
780
+ output_attentions: Optional[bool] = None,
781
+ output_hidden_states: Optional[bool] = None,
782
+ return_dict: Optional[bool] = None,
783
+ token_type_ids: Optional[bool] = None,
784
+ ) -> Union[Tuple[torch.FloatTensor], BaseModelOutput]:
785
+ encoder_outputs = self.encoder(
786
+ input_ids=input_ids,
787
+ attention_mask=attention_mask,
788
+ inputs_embeds=inputs_embeds
789
+ )
790
+ return encoder_outputs
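
As a side note, here is a minimal, hypothetical sketch of how this custom code is typically loaded through `transformers`; the repository id and the `auto_map` wiring are assumptions, and since the checkpoint is only pretrained (UL2 objective), the snippet illustrates the API rather than useful generations.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical repository id; adjust to the actual Hub path of this model.
model_id = "CATIE-AQ/FAT5-small"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code=True is required because the architecture lives in the repo's .py files.
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, trust_remote_code=True)

# UL2/T5-style span-corruption prompt (illustrative only).
input_ids = tokenizer("Le FAT5 est un T5 optimisé <extra_id_0> Flash Attention.",
                      return_tensors="pt").input_ids
generated = model.generate(input_ids, max_length=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```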
optimizer.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1a61ade6d36e7273c43a001cc4e665c4bae25570aecff45ed23f189b8a2b8687
3
+ size 1174905530
positional_encoding.py ADDED
@@ -0,0 +1,417 @@
1
+ import math
2
+ import torch
3
+ import torch.nn as nn
4
+ from einops import rearrange, repeat
5
+ try:
6
+ from flash_attn.layers.rotary import apply_rotary_emb_qkv_, apply_rotary_emb_func, apply_rotary_emb_kv_
7
+ except:
8
+ apply_rotary_emb_qkv_, apply_rotary_emb_func, apply_rotary_emb_kv_ = None, None, None
9
+
10
+ class RelativePositionalEncoding(nn.Module):
11
+
12
+ def __init__(self, relative_attention_num_buckets, relative_attention_max_distance, n_heads, max_sequence_length, bidirectional=True, randomized_position=False):
13
+
14
+ super().__init__()
15
+
16
+ self.relative_attention_num_buckets = relative_attention_num_buckets
17
+ self.relative_attention_max_distance = relative_attention_max_distance
18
+ self.n_heads = n_heads
19
+ self.max_sequence_length = max_sequence_length
20
+ self.bidirectional = bidirectional
21
+ self.randomized_position = randomized_position
22
+
23
+ self.relative_attention_bias = nn.Embedding(self.relative_attention_num_buckets, self.n_heads)
24
+
25
+ @staticmethod
26
+ def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
27
+ """
28
+ Adapted from Mesh Tensorflow:
29
+ https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593
30
+
31
+ Translate relative position to a bucket number for relative attention. The relative position is defined as
32
+ memory_position - query_position, i.e. the distance in tokens from the attending position to the attended-to
33
+ position. If bidirectional=False, then positive relative positions are invalid. We use smaller buckets for
34
+ small absolute relative_position and larger buckets for larger absolute relative_positions. All relative
35
+ positions >=max_distance map to the same bucket. All relative positions <=-max_distance map to the same bucket.
36
+ This should allow for more graceful generalization to longer sequences than the model has been trained on
37
+
38
+ Args:
39
+ relative_position: an int32 Tensor
40
+ bidirectional: a boolean - whether the attention is bidirectional
41
+ num_buckets: an integer
42
+ max_distance: an integer
43
+
44
+ Returns:
45
+ a Tensor with the same shape as relative_position, containing int32 values in the range [0, num_buckets)
46
+ """
47
+ relative_buckets = 0
48
+ if bidirectional:
49
+ num_buckets //= 2
50
+ relative_buckets += (relative_position > 0).to(torch.long) * num_buckets
51
+ relative_position = torch.abs(relative_position)
52
+ else:
53
+ relative_position = -torch.min(relative_position, torch.zeros_like(relative_position))
54
+ # now relative_position is in the range [0, inf)
55
+
56
+ # half of the buckets are for exact increments in positions
57
+ max_exact = num_buckets // 2
58
+ is_small = relative_position < max_exact
59
+
60
+ # The other half of the buckets are for logarithmically bigger bins in positions up to max_distance
61
+ relative_position_if_large = max_exact + (
62
+ torch.log(relative_position.float() / max_exact)
63
+ / torch.log(torch.tensor(max_distance / max_exact))
64
+ * (num_buckets - max_exact)
65
+ ).to(torch.long)
66
+ relative_position_if_large = torch.min(
67
+ relative_position_if_large, torch.full_like(relative_position_if_large, num_buckets - 1)
68
+ )
69
+
70
+ relative_buckets += torch.where(is_small, relative_position, relative_position_if_large)
71
+ return relative_buckets
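+ # Worked example with bidirectional=True, num_buckets=32, max_distance=128: a relative
+ # position of +3 falls in bucket 16 + 3 = 19, -3 falls in bucket 3, and any distance of
+ # 128 or more saturates to bucket 31 (positive side) or 15 (negative side).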
72
+
73
+ def compute_bias(self, query_length, key_length, device=None):
74
+ """Compute binned relative position bias"""
75
+ if device is None:
76
+ device = self.relative_attention_bias.weight.device
77
+
78
+ if self.randomized_position:
79
+ context_position = torch.arange(self.max_sequence_length, dtype=torch.long, device=device)
80
+ context_indices_rand, _ = torch.sort(torch.randperm(self.max_sequence_length)[:query_length])
81
+ context_indices_rand[0] = 0 # root the first element of the sequence
82
+ context_position = context_position[context_indices_rand][:, None]
83
+
84
+ memory_position = torch.arange(self.max_sequence_length, dtype=torch.long, device=device)
85
+ memory_indices_rand, _ = torch.sort(torch.randperm(self.max_sequence_length)[:key_length])
86
+ memory_indices_rand[0] = 0 # root the first element of the sequence
87
+ memory_position = memory_position[memory_indices_rand][None, :]
88
+ else:
89
+ context_position = torch.arange(query_length, dtype=torch.long, device=device)[:, None]
90
+ memory_position = torch.arange(key_length, dtype=torch.long, device=device)[None, :]
91
+
92
+ relative_position = memory_position - context_position # shape (query_length, key_length)
93
+
94
+ relative_position_bucket = self._relative_position_bucket(
95
+ relative_position, # shape (query_length, key_length)
96
+ bidirectional=self.bidirectional,
97
+ num_buckets=self.relative_attention_num_buckets,
98
+ max_distance=self.relative_attention_max_distance,
99
+ )
100
+ values = self.relative_attention_bias(relative_position_bucket) # shape (query_length, key_length, num_heads)
101
+ values = values.permute([2, 0, 1]).unsqueeze(0) # shape (1, num_heads, query_length, key_length)
102
+ return values
103
+
104
+ def forward(self, q, k=None, v=None):
105
+
106
+ query_length = q.shape[1]
107
+ key_length = k.shape[1] if k is not None else query_length
108
+ bias = self.compute_bias(query_length, key_length, device=q.device).contiguous().to(q.dtype)
109
+
110
+ return q, k, v, bias
111
+
112
+
113
+ class ALiBiPositionalEncoding(nn.Module):
114
+
115
+ def __init__(self, max_sequence_length, num_heads, mode='symetric', randomized_position=False):
116
+
117
+ super().__init__()
118
+
119
+ self.max_sequence_length = max_sequence_length
120
+ self.num_heads = num_heads
121
+ self.mode = mode
122
+ self.randomized_position = randomized_position
123
+
124
+ self.alibi_bias = self.build_alibi_bias_matrix(num_heads, max_sequence_length, mode)
125
+
126
+ @staticmethod
127
+ def fill_with_neg_inf(t):
128
+ """FP16-compatible function that fills a tensor with -inf."""
129
+ return t.float().fill_(float("-inf")).type_as(t)
130
+
131
+ def get_slopes(self, n):
132
+
133
+ def get_slopes_power_of_2(n):
134
+ start = (2**(-2**-(math.log2(n)-3)))
135
+ ratio = start
136
+ return [start*ratio**i for i in range(n)]
137
+
138
+ if math.log2(n).is_integer():
139
+ return get_slopes_power_of_2(n) #In the paper, we only train models that have 2^a heads for some a. This function has
140
+ else: #some good properties that only occur when the input is a power of 2. To maintain that even
141
+ closest_power_of_2 = 2**math.floor(math.log2(n)) #when the number of heads is not a power of 2, we use this workaround.
142
+ return get_slopes_power_of_2(closest_power_of_2) + self.get_slopes(2*closest_power_of_2)[0::2][:n-closest_power_of_2]
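+ # Example: for 8 heads this returns [1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256],
+ # the geometric slope sequence from the ALiBi paper.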
143
+
144
+ def build_symetric_alibi_bias_matrix(self, num_heads, maxpos):
145
+
146
+ context_position = torch.arange(maxpos)[:, None]
147
+ memory_position = torch.arange(maxpos)[None, :]
148
+
149
+ relative_position = memory_position - context_position
150
+ relative_position = torch.abs(relative_position).unsqueeze(0).expand(num_heads, -1,-1)
151
+
152
+ slopes = torch.Tensor(self.get_slopes(num_heads)) * -1
153
+ alibi = slopes.unsqueeze(1).unsqueeze(1) * relative_position
154
+ return alibi.view(1, num_heads, maxpos, maxpos)
155
+
156
+ def build_asymetric_alibi_bias_matrix(self, num_heads, maxpos):
157
+ _future_mask_right = torch.triu(self.fill_with_neg_inf(torch.zeros([maxpos, maxpos])), 1).unsqueeze(0).repeat(num_heads // 2, 1, 1)
158
+ _future_mask_left = torch.tril(self.fill_with_neg_inf(torch.zeros([maxpos, maxpos])), -1).unsqueeze(0).repeat(num_heads // 2, 1, 1)
159
+
160
+ nonsym_mask = torch.cat((_future_mask_right, _future_mask_left), dim = 0).unsqueeze(0)
161
+ slopes = torch.Tensor(self.get_slopes(num_heads // 2)) * -1
162
+
163
+ context_position = torch.arange(maxpos)[:, None]
164
+ memory_position = torch.arange(maxpos)[None, :]
165
+
166
+ relative_position = memory_position - context_position
167
+ relative_position = torch.abs(relative_position).unsqueeze(0).expand(num_heads // 2, -1,-1)
168
+
169
+ alibi = slopes.unsqueeze(1).unsqueeze(1) * relative_position
170
+ alibi = alibi.view(1, num_heads // 2, maxpos, maxpos)
171
+ alibi = alibi.repeat(1, 2, 1, 1)
172
+
173
+ return alibi.view(1, num_heads, maxpos, maxpos) + nonsym_mask.view(1, num_heads, maxpos, maxpos)
174
+
175
+
176
+ def build_alibi_bias_matrix(self, num_heads, maxpos, mode='symetric'):
177
+ if mode == 'symetric':
178
+ return self.build_symetric_alibi_bias_matrix(num_heads, maxpos)
179
+ elif mode == 'asymetric':
180
+ return self.build_asymetric_alibi_bias_matrix(num_heads, maxpos)
181
+ else:
182
+ raise ValueError("ALiBi mode " + mode + " is not implemented.")
183
+
184
+ def forward(self, q, k=None, v=None):
185
+
186
+ query_length = q.shape[1]
187
+ key_length = k.shape[1] if k is not None else query_length
188
+ assert (self.alibi_bias.shape[-1] >= query_length) and (self.alibi_bias.shape[-1] >= key_length), "Sequence length larger than allowed alibi bound"
189
+
190
+ if self.randomized_position:
191
+ query_indices_rand, _ = torch.sort(torch.randperm(self.max_sequence_length)[:query_length])
192
+ key_indices_rand, _ = torch.sort(torch.randperm(self.max_sequence_length)[:key_length])
193
+
194
+ # ground sequences
195
+ query_indices_rand[0] = 0
196
+ key_indices_rand[0] = 0
197
+
198
+ bias = self.alibi_bias[:, :, query_indices_rand, key_indices_rand].to(q.device)
199
+
200
+ else:
201
+ bias = self.alibi_bias[:, :, :query_length, :key_length].to(q.device)
202
+
203
+ return q, k, v, bias.to(q.dtype).contiguous()
204
+
205
+ class RotaryPositionalEncoding(nn.Module):
206
+
207
+ def __init__(self, dim,
208
+ max_sequence_length,
209
+ base=10000.0,
210
+ interleaved=False,
211
+ scale_base=None,
212
+ randomized_position=False):
213
+
214
+ super().__init__()
215
+
216
+ self.max_sequence_length = max_sequence_length
217
+ self.randomized_position = randomized_position
218
+
219
+ self.dim = dim
220
+ self.base = base
221
+ self.interleaved = interleaved
222
+ self.scale_base = scale_base
223
+
224
+ inv_freq = self._compute_inv_freq()
225
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
226
+
227
+ scale = (
228
+ (torch.arange(0, dim, 2, dtype=torch.float32) + 0.4 * dim) / (1.4 * dim)
229
+ if scale_base is not None
230
+ else None
231
+ )
232
+ self.register_buffer("scale", scale, persistent=False)
233
+
234
+ self._cos_cached = None
235
+ self._sin_cached = None
236
+ self._cos_k_cached = None
237
+ self._sin_k_cached = None
238
+
239
+ def _compute_inv_freq(self, device=None):
240
+ return 1.0 / (
241
+ self.base
242
+ ** (torch.arange(0, self.dim, 2, device=device, dtype=torch.float32) / self.dim)
243
+ )
244
+
245
+ def _update_cos_sin_cache(self, seqlen, device=None, dtype=None):
246
+ # Reset the tables if the sequence length has changed,
247
+ # if we're on a new device (possibly due to tracing for instance),
248
+ # or if we're switching from inference mode to training
249
+ if (
250
+ self._cos_cached is None
251
+ or self._cos_cached.device != device
252
+ or self._cos_cached.dtype != dtype
253
+ or (self.training and self._cos_cached.is_inference())
254
+ ):
255
+ # We want fp32 here, not self.inv_freq.dtype, since the model could be loaded in bf16
256
+ # And the output of arange can be quite large, so bf16 would lose a lot of precision.
257
+ # However, for compatibility reason, we add an option to use the dtype of self.inv_freq.
258
+ inv_freq = self._compute_inv_freq(device=device)
259
+
260
+ # Don't do einsum, it converts fp32 to fp16 under AMP
261
+ # freqs = torch.einsum("i,j->ij", t, self.inv_freq)
262
+ t = torch.arange(seqlen, device=device, dtype=dtype)
263
+ freqs = torch.outer(t, inv_freq)
264
+ if self.scale is None:
265
+ self._cos_cached = torch.cos(freqs).to(dtype)
266
+ self._sin_cached = torch.sin(freqs).to(dtype)
267
+ self._cos_k_cached = None
268
+ self._sin_k_cached = None
269
+ else:
270
+ power = (
271
+ torch.arange(seqlen, dtype=self.scale.dtype, device=self.scale.device)
272
+ - seqlen // 2
273
+ ) / self.scale_base
274
+ scale = self.scale.to(device=power.device) ** rearrange(power, "s -> s 1")
275
+ # We want the multiplication by scale to happen in fp32
276
+ self._cos_cached = (torch.cos(freqs) * scale).to(dtype)
277
+ self._sin_cached = (torch.sin(freqs) * scale).to(dtype)
278
+ self._cos_k_cached = (torch.cos(freqs) / scale).to(dtype)
279
+ self._sin_k_cached = (torch.sin(freqs) / scale).to(dtype)
280
+
281
+ def forward(self, q, k=None, v=None):
282
+
283
+ if self._cos_cached is None:
284
+ self._update_cos_sin_cache(self.max_sequence_length, device=q.device, dtype=q.dtype)
285
+
286
+ if k is None and v is None:
287
+ q = apply_rotary_emb_qkv_(
288
+ q,
289
+ self._cos_cached,
290
+ self._sin_cached,
291
+ self._cos_k_cached,
292
+ self._sin_k_cached,
293
+ interleaved=self.interleaved,
294
+ seqlen_offsets=0
295
+ )
296
+ elif v is None and k is not None:
297
+ q = apply_rotary_emb_func(
298
+ q,
299
+ self._cos_cached,
300
+ self._sin_cached,
301
+ interleaved=self.interleaved,
302
+ inplace=True,
303
+ seqlen_offsets=0
304
+ )
305
+
306
+ k = apply_rotary_emb_kv_(
307
+ k,
308
+ self._cos_cached if self._cos_k_cached is None else self._cos_k_cached,
309
+ self._sin_cached if self._sin_k_cached is None else self._sin_k_cached,
310
+ interleaved=self.interleaved,
311
+ seqlen_offsets=0,
312
+ )
313
+ else:
314
+ q = apply_rotary_emb_func(
315
+ q,
316
+ self._cos_cached,
317
+ self._sin_cached,
318
+ interleaved=self.interleaved,
319
+ inplace=True,
320
+ seqlen_offsets=0
321
+ )
322
+
323
+ k = apply_rotary_emb_func(
324
+ k,
325
+ self._cos_cached if self._cos_k_cached is None else self._cos_k_cached,
326
+ self._sin_cached if self._sin_k_cached is None else self._sin_k_cached,
327
+ interleaved=self.interleaved,
328
+ seqlen_offsets=0,
329
+ )
330
+
331
+ v = apply_rotary_emb_func(
332
+ v,
333
+ self._cos_cached if self._cos_k_cached is None else self._cos_k_cached,
334
+ self._sin_cached if self._sin_k_cached is None else self._sin_k_cached,
335
+ interleaved=self.interleaved,
336
+ seqlen_offsets=0,
337
+ )
338
+
339
+ return q, k, v, None
340
+
341
+ class FIRE(nn.Module):
342
+
343
+ def __init__(self, num_heads=12, mlp_width=32, init_c=0.1, init_L=512., eps=1e-6):
344
+ """
345
+ FIRE attention bias module.
346
+
347
+ Args:
348
+ num_heads: number of attention heads.
349
+ mlp_width: Width of MLP.
350
+ init_c: initial value of log transformation parameter
351
+ init_L: initial value of thresholding parameter
352
+ eps: small constant for numerical stability
353
+ """
354
+
355
+ super(FIRE, self).__init__()
356
+
357
+ # Define the MLP layers
358
+ self.mlp = nn.Sequential(
359
+ nn.Linear(1, mlp_width),
360
+ nn.ReLU(),
361
+ nn.Linear(mlp_width, num_heads)
362
+ )
363
+
364
+ # Initialize c (log transformation parameter)
365
+ self.c = nn.Parameter(torch.tensor(init_c))
366
+
367
+
368
+ # Initialize L (threshold)
369
+ self.init_L = nn.Parameter(torch.tensor(init_L),
370
+ requires_grad=False)
371
+ # Learn a multiplier to L
372
+ self.L_multiplier = nn.Parameter(torch.tensor(1.0))
373
+ self.eps = eps
374
+
375
+ def apply_fire(self, seq_length, device):
376
+ """
377
+ Compute FIRE attention bias.
378
+
379
+ Args:
380
+ x: input sequence,
381
+ shape [bsz, seq_len, num_heads, hidden_dim]
382
+
383
+ Returns:
384
+ attention bias,
385
+ shape [1, num_heads, seq_len, seq_len]
386
+ """
387
+ positions = torch.arange(seq_length,
388
+ dtype=torch.float32,
389
+ device=device)
390
+
391
+ rel_distance = positions[:, None] - positions[None, :]
392
+
393
+ # Thresholding the normalizer
394
+ threshold = torch.abs(self.L_multiplier * self.init_L)
395
+ pos_normalizer = torch.max(positions, threshold)
396
+ pos_normalizer = pos_normalizer[:, None]
397
+
398
+ # Amplifying differences among local positions
399
+ # with log transform
400
+ rel_distance = torch.sign(rel_distance) * torch.log(
401
+ torch.abs(self.c * rel_distance) + 1
402
+ )
403
+ pos_normalizer = torch.log(
404
+ torch.abs(self.c * pos_normalizer) + 1
405
+ ) + self.eps
406
+
407
+ # Progressive interpolation
408
+ normalized_distance = rel_distance / pos_normalizer
409
+ fire_bias = self.mlp(normalized_distance.unsqueeze(-1))
410
+ fire_bias = fire_bias.unsqueeze(0).permute(0, 3, 1, 2)
411
+ return fire_bias
412
+
413
+ def forward(self, q, k=None, v=None):
414
+
415
+ bias = self.apply_fire(q.shape[1], device=q.device).contiguous().to(q.dtype)
416
+
417
+ return q, k, v, bias
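
A minimal, hypothetical shape check of the `RelativePositionalEncoding` module defined above (tensor sizes are arbitrary):

```python
import torch

# Arbitrary sizes: batch=2, seq_len=16, 8 heads, head_dim=64.
pe = RelativePositionalEncoding(relative_attention_num_buckets=32,
                                relative_attention_max_distance=128,
                                n_heads=8,
                                max_sequence_length=1024)
q = torch.randn(2, 16, 8, 64)
q, k, v, bias = pe(q)
print(bias.shape)  # torch.Size([1, 8, 16, 16]): a (1, n_heads, q_len, k_len) additive bias
```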
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:39cac982d9842a1d653966fb42b2fa3df4077334030c76cd8fa165b5aea244ea
3
+ size 587431346
rms_norm.py ADDED
@@ -0,0 +1,287 @@
1
+ # Copyright (c) 2023, Tri Dao.
2
+ # Copyright 2024 CATIE. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ #
16
+ # Modifications to the original file
17
+ # - support for torch.compile
18
+
19
+ import triton
20
+ import triton.language as tl
21
+ import torch
22
+ import math
23
+ from typing import Tuple
24
+
25
+ @triton.jit
26
+ def _rmsnorm_fwd_kernel(
27
+ X, # pointer to the input
28
+ Y, # pointer to the output
29
+ W, # pointer to the weights
30
+ Rstd, # pointer to the 1/std
31
+ stride_x_row, # how much to increase the pointer when moving by 1 row
32
+ stride_y_row,
33
+ N, # number of columns in X
34
+ eps, # epsilon to avoid division by zero
35
+ BLOCK_N: tl.constexpr,
36
+ IS_EVEN_N: tl.constexpr
37
+ ):
38
+
39
+ row = tl.program_id(0)
40
+ X += row * stride_x_row
41
+ Y += row * stride_y_row
42
+
43
+ # Compute mean and variance
44
+ cols = tl.arange(0, BLOCK_N)
45
+ x = tl.load(X + cols, mask=cols < N, other=0.0).to(tl.float32)
46
+
47
+ xbar = tl.where(cols < N, x, 0.0)
48
+ var = tl.sum(xbar * xbar, axis=0) / N
49
+ rstd = 1 / tl.sqrt(var + eps)
50
+ tl.store(Rstd + row, rstd)
51
+
52
+ # Normalize and apply linear transformation
53
+ mask = cols < N
54
+ if IS_EVEN_N:
55
+ w = tl.load(W + cols).to(tl.float32)
56
+ else:
57
+ w = tl.load(W + cols, mask=mask).to(tl.float32)
58
+
59
+ x_hat = x * rstd
60
+ y = x_hat * w
61
+
62
+ # Write output
63
+ if IS_EVEN_N:
64
+ tl.store(Y + cols, y)
65
+ else:
66
+ tl.store(Y + cols, y, mask=mask)
67
+
68
+ @triton.jit
69
+ def _rmsnorm_bwd_kernel(
70
+ X, # pointer to the input
71
+ W, # pointer to the weights
72
+ DY, # pointer to the output gradient
73
+ DX, # pointer to the input gradient
74
+ DW, # pointer to the partial sum of weights gradient
75
+ Rstd, # pointer to the 1/std
76
+ stride_x_row, # how much to increase the pointer when moving by 1 row
77
+ stride_dy_row,
78
+ stride_dx_row,
79
+ M, # number of rows in X
80
+ N, # number of columns in X
81
+ eps, # epsilon to avoid division by zero
82
+ rows_per_program,
83
+ BLOCK_N: tl.constexpr,
84
+ IS_EVEN_N: tl.constexpr
85
+ ):
86
+ # Map the program id to the elements of X, DX, and DY it should compute.
87
+ row_block_id = tl.program_id(0)
88
+ row_start = row_block_id * rows_per_program
89
+ cols = tl.arange(0, BLOCK_N)
90
+ mask = cols < N
91
+ X += row_start * stride_x_row
92
+
93
+ DY += row_start * stride_dy_row
94
+ DX += row_start * stride_dx_row
95
+
96
+ w = tl.load(W + cols, mask=mask).to(tl.float32)
97
+
98
+ dw = tl.zeros((BLOCK_N,), dtype=tl.float32)
99
+
100
+ row_end = min((row_block_id + 1) * rows_per_program, M)
101
+
102
+ for row in range(row_start, row_end):
103
+ # Load data to SRAM
104
+ if IS_EVEN_N:
105
+ x = tl.load(X + cols).to(tl.float32)
106
+ dy = tl.load(DY + cols).to(tl.float32)
107
+ else:
108
+ x = tl.load(X + cols, mask=mask, other=0).to(tl.float32)
109
+ dy = tl.load(DY + cols, mask=mask, other=0).to(tl.float32)
110
+
111
+ rstd = tl.load(Rstd + row)
112
+
113
+ # Compute dx
114
+ xhat = x * rstd
115
+ if not IS_EVEN_N:
116
+ xhat = tl.where(mask, xhat, 0.0)
117
+
118
+ wdy = w * dy
119
+ dw += dy * xhat
120
+
121
+ c1 = tl.sum(xhat * wdy, axis=0) / N
122
+ dx = (wdy - xhat * c1) * rstd
123
+
124
+ tl.store(DX + cols, dx, mask=mask)
125
+
126
+ X += stride_x_row
127
+
128
+ DY += stride_dy_row
129
+ DX += stride_dx_row
130
+
131
+ tl.store(DW + row_block_id * N + cols, dw, mask=mask)
132
+
133
+
134
+ @torch.library.custom_op("flasht5::rmsnorm_triton_fwd", mutates_args=(), device_types="cuda")
135
+ def rmsnorm_triton_fwd(
136
+ X: torch.Tensor,
137
+ weight: torch.Tensor,
138
+ eps: float
139
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
140
+
141
+ M, N = X.shape
142
+
143
+ assert X.stride(-1) == 1
144
+
145
+ assert weight.shape == (N,)
146
+ assert weight.stride(-1) == 1
147
+
148
+ # allocate output
149
+ Y = torch.empty_like(X)
150
+ assert Y.stride(-1) == 1
151
+
152
+ rstd = torch.empty((M,), dtype=torch.float32, device=X.device)
153
+
154
+ # Less than 64KB per feature: enqueue fused kernel
155
+ MAX_FUSED_SIZE = 65536 // X.element_size()
156
+ BLOCK_N = min(MAX_FUSED_SIZE, triton.next_power_of_2(N))
157
+ assert N <= BLOCK_N
158
+
159
+ # heuristics for number of warps
160
+ with torch.cuda.device(X.device.index):
161
+ _rmsnorm_fwd_kernel[(M,)](
162
+ X,
163
+ Y,
164
+ weight,
165
+ rstd,
166
+ X.stride(0),
167
+ Y.stride(0),
168
+ N,
169
+ eps,
170
+ BLOCK_N,
171
+ (N % BLOCK_N == 0)
172
+ )
173
+
174
+ return Y, rstd
175
+
176
+
177
+ @torch.library.register_fake("flasht5::rmsnorm_triton_fwd")
178
+ def rmsnorm_triton_fwd_abstract(X, weight, eps):
179
+ M, N = X.shape
180
+
181
+ Y = torch.empty_like(X)
182
+ rstd = torch.empty((M,), dtype=torch.float32, device=X.device)
183
+
184
+ return Y, rstd
185
+
186
+ @torch.library.custom_op("flasht5::rmsnorm_triton_bwd", mutates_args=(), device_types="cuda")
187
+ def rmsnorm_triton_bwd(
188
+ dy: torch.Tensor,
189
+ x: torch.Tensor,
190
+ weight: torch.Tensor,
191
+ rstd: torch.Tensor,
192
+ eps: float
193
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
194
+ M, N = x.shape
195
+ assert x.stride(-1) == 1
196
+ assert dy.stride(-1) == 1
197
+ assert dy.shape == (M, N)
198
+
199
+ assert weight.shape == (N,)
200
+ assert weight.stride(-1) == 1
201
+
202
+ # allocate output
203
+ dx = torch.empty_like(x)
204
+
205
+ # Less than 64KB per feature: enqueue fused kernel
206
+ MAX_FUSED_SIZE = 65536 // x.element_size()
207
+ BLOCK_N = min(MAX_FUSED_SIZE, triton.next_power_of_2(N))
208
+
209
+ assert N <= BLOCK_N
210
+
211
+ sm_count = torch.cuda.get_device_properties(x.device).multi_processor_count
212
+ _dw = torch.empty((sm_count, N), dtype=torch.float32, device=weight.device)
213
+
214
+ rows_per_program = math.ceil(M / sm_count)
215
+ grid = (sm_count,)
216
+ with torch.cuda.device(x.device.index):
217
+ _rmsnorm_bwd_kernel[grid](
218
+ x,
219
+ weight,
220
+ dy,
221
+ dx,
222
+ _dw,
223
+ rstd,
224
+ x.stride(0),
225
+ dy.stride(0),
226
+ dx.stride(0),
227
+ M,
228
+ N,
229
+ eps,
230
+ rows_per_program,
231
+ BLOCK_N,
232
+ (N % BLOCK_N == 0)
233
+ )
234
+ dw = _dw.sum(0).to(weight.dtype)
235
+
236
+ return dx, dw
237
+
238
+
239
+ @torch.library.register_fake("flasht5::rmsnorm_triton_bwd")
240
+ def rmsnorm_triton_bwd_abstract(dy, x, weight, rstd, eps):
241
+
242
+ M, N = x.shape
243
+ dx = torch.empty_like(x)
244
+ dw = torch.empty((1, N), dtype=torch.float32, device=weight.device)
245
+
246
+
247
+ return dx, dw
248
+
249
+
250
+ class Fast_RMS_Layernorm(torch.autograd.Function):
251
+ @staticmethod
252
+ def forward(ctx, X, W, eps=1e-6):
253
+
254
+ X_orig_shape = X.shape
255
+ X = X.reshape(-1, X.shape[-1])
256
+
257
+ y, rstd, = torch.ops.flasht5.rmsnorm_triton_fwd(X, W, eps)
258
+
259
+ y = y.reshape(X_orig_shape)
260
+
261
+ # We don't store y: the backward pass only needs X, W and rstd (x_hat is recomputed), which saves memory
262
+ ctx.save_for_backward(X, W, rstd)
263
+ ctx.x_shape_og = X_orig_shape
264
+ ctx.eps = eps
265
+
266
+ return y
267
+
268
+ @staticmethod
269
+ def backward(ctx, dY):
270
+ X, weight, rstd = ctx.saved_tensors
271
+ dY = dY.reshape(-1, dY.shape[-1])
272
+
273
+ assert dY.shape == X.shape
274
+
275
+ dx, dw = torch.ops.flasht5.rmsnorm_triton_bwd(
276
+ dY,
277
+ X,
278
+ weight,
279
+ rstd,
280
+ ctx.eps
281
+ )
282
+
283
+ return dx.reshape(ctx.x_shape_og), dw, None
284
+
285
+ def fast_rms_layernorm(X, W, eps):
286
+ out = Fast_RMS_Layernorm.apply(X, W, eps)
287
+ return out
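
A small, hypothetical sanity check of the fused kernel against a plain PyTorch RMSNorm (assumes a CUDA device with Triton installed and that this module is importable):

```python
import torch

# Arbitrary shapes; the wrapper flattens every leading dimension before calling the kernel.
x = torch.randn(4, 128, 512, device="cuda", dtype=torch.float32, requires_grad=True)
w = torch.ones(512, device="cuda", requires_grad=True)

y = fast_rms_layernorm(x, w, eps=1e-6)

# Reference RMSNorm: x / sqrt(mean(x^2) + eps) * w
ref = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6) * w
print(torch.allclose(y, ref, atol=1e-4))
```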
rng_state.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:672546ccb6198cb6029856dffdff7fa8bbc52726b212606155673ca847d54511
3
+ size 14244
scheduler.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:605bdfbc75175c1891038202d54ff9a471d27b66aed2deedcdaf8b1fa6ca2185
3
+ size 1256
special_tokens_map.json ADDED
@@ -0,0 +1,266 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>",
103
+ "<extra_id_100>",
104
+ "<extra_id_101>",
105
+ "<extra_id_102>",
106
+ "<extra_id_103>",
107
+ "<extra_id_104>",
108
+ "<extra_id_105>",
109
+ "<extra_id_106>",
110
+ "<extra_id_107>",
111
+ "<extra_id_108>",
112
+ "<extra_id_109>",
113
+ "<extra_id_110>",
114
+ "<extra_id_111>",
115
+ "<extra_id_112>",
116
+ "<extra_id_113>",
117
+ "<extra_id_114>",
118
+ "<extra_id_115>",
119
+ "<extra_id_116>",
120
+ "<extra_id_117>",
121
+ "<extra_id_118>",
122
+ "<extra_id_119>",
123
+ "<extra_id_120>",
124
+ "<extra_id_121>",
125
+ "<extra_id_122>",
126
+ "<extra_id_123>",
127
+ "<extra_id_124>",
128
+ "<extra_id_125>",
129
+ "<extra_id_126>",
130
+ "<extra_id_127>",
131
+ "<extra_id_128>",
132
+ "<extra_id_129>",
133
+ "<extra_id_130>",
134
+ "<extra_id_131>",
135
+ "<extra_id_132>",
136
+ "<extra_id_133>",
137
+ "<extra_id_134>",
138
+ "<extra_id_135>",
139
+ "<extra_id_136>",
140
+ "<extra_id_137>",
141
+ "<extra_id_138>",
142
+ "<extra_id_139>",
143
+ "<extra_id_140>",
144
+ "<extra_id_141>",
145
+ "<extra_id_142>",
146
+ "<extra_id_143>",
147
+ "<extra_id_144>",
148
+ "<extra_id_145>",
149
+ "<extra_id_146>",
150
+ "<extra_id_147>",
151
+ "<extra_id_148>",
152
+ "<extra_id_149>",
153
+ "<extra_id_150>",
154
+ "<extra_id_151>",
155
+ "<extra_id_152>",
156
+ "<extra_id_153>",
157
+ "<extra_id_154>",
158
+ "<extra_id_155>",
159
+ "<extra_id_156>",
160
+ "<extra_id_157>",
161
+ "<extra_id_158>",
162
+ "<extra_id_159>",
163
+ "<extra_id_160>",
164
+ "<extra_id_161>",
165
+ "<extra_id_162>",
166
+ "<extra_id_163>",
167
+ "<extra_id_164>",
168
+ "<extra_id_165>",
169
+ "<extra_id_166>",
170
+ "<extra_id_167>",
171
+ "<extra_id_168>",
172
+ "<extra_id_169>",
173
+ "<extra_id_170>",
174
+ "<extra_id_171>",
175
+ "<extra_id_172>",
176
+ "<extra_id_173>",
177
+ "<extra_id_174>",
178
+ "<extra_id_175>",
179
+ "<extra_id_176>",
180
+ "<extra_id_177>",
181
+ "<extra_id_178>",
182
+ "<extra_id_179>",
183
+ "<extra_id_180>",
184
+ "<extra_id_181>",
185
+ "<extra_id_182>",
186
+ "<extra_id_183>",
187
+ "<extra_id_184>",
188
+ "<extra_id_185>",
189
+ "<extra_id_186>",
190
+ "<extra_id_187>",
191
+ "<extra_id_188>",
192
+ "<extra_id_189>",
193
+ "<extra_id_190>",
194
+ "<extra_id_191>",
195
+ "<extra_id_192>",
196
+ "<extra_id_193>",
197
+ "<extra_id_194>",
198
+ "<extra_id_195>",
199
+ "<extra_id_196>",
200
+ "<extra_id_197>",
201
+ "<extra_id_198>",
202
+ "<extra_id_199>",
203
+ "<extra_id_200>",
204
+ "<extra_id_201>",
205
+ "<extra_id_202>",
206
+ "<extra_id_203>",
207
+ "<extra_id_204>",
208
+ "<extra_id_205>",
209
+ "<extra_id_206>",
210
+ "<extra_id_207>",
211
+ "<extra_id_208>",
212
+ "<extra_id_209>",
213
+ "<extra_id_210>",
214
+ "<extra_id_211>",
215
+ "<extra_id_212>",
216
+ "<extra_id_213>",
217
+ "<extra_id_214>",
218
+ "<extra_id_215>",
219
+ "<extra_id_216>",
220
+ "<extra_id_217>",
221
+ "<extra_id_218>",
222
+ "<extra_id_219>",
223
+ "<extra_id_220>",
224
+ "<extra_id_221>",
225
+ "<extra_id_222>",
226
+ "<extra_id_223>",
227
+ "<extra_id_224>",
228
+ "<extra_id_225>",
229
+ "<extra_id_226>",
230
+ "<extra_id_227>",
231
+ "<extra_id_228>",
232
+ "<extra_id_229>",
233
+ "<extra_id_230>",
234
+ "<extra_id_231>",
235
+ "<extra_id_232>",
236
+ "<extra_id_233>",
237
+ "<extra_id_234>",
238
+ "<extra_id_235>",
239
+ "<extra_id_236>",
240
+ "<extra_id_237>",
241
+ "<extra_id_238>",
242
+ "<extra_id_239>",
243
+ "<extra_id_240>",
244
+ "<extra_id_241>",
245
+ "<extra_id_242>",
246
+ "<extra_id_243>",
247
+ "<extra_id_244>",
248
+ "<extra_id_245>",
249
+ "<extra_id_246>",
250
+ "<extra_id_247>",
251
+ "<extra_id_248>",
252
+ "<extra_id_249>",
253
+ "<extra_id_250>",
254
+ "<extra_id_251>",
255
+ "<extra_id_252>",
256
+ "<extra_id_253>",
257
+ "<extra_id_254>",
258
+ "<extra_id_255>"
259
+ ],
260
+ "cls_token": "<cls>",
261
+ "eos_token": "</s>",
262
+ "mask_token": "<mask>",
263
+ "pad_token": "<pad>",
264
+ "sep_token": "<sep>",
265
+ "unk_token": "<unk>"
266
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,2367 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "<cls>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "</s>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "<mask>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "<pad>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "4": {
36
+ "content": "<sep>",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "5": {
44
+ "content": "<unk>",
45
+ "lstrip": false,
46
+ "normalized": false,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ },
51
+ "6": {
52
+ "content": "<extra_id_0>",
53
+ "lstrip": false,
54
+ "normalized": false,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": true
58
+ },
59
+ "7": {
60
+ "content": "<extra_id_1>",
61
+ "lstrip": false,
62
+ "normalized": false,
63
+ "rstrip": false,
64
+ "single_word": false,
65
+ "special": true
66
+ },
67
+ "8": {
68
+ "content": "<extra_id_2>",
69
+ "lstrip": false,
70
+ "normalized": false,
71
+ "rstrip": false,
72
+ "single_word": false,
73
+ "special": true
74
+ },
75
+ "9": {
76
+ "content": "<extra_id_3>",
77
+ "lstrip": false,
78
+ "normalized": false,
79
+ "rstrip": false,
80
+ "single_word": false,
81
+ "special": true
82
+ },
83
+ "10": {
84
+ "content": "<extra_id_4>",
85
+ "lstrip": false,
86
+ "normalized": false,
87
+ "rstrip": false,
88
+ "single_word": false,
89
+ "special": true
90
+ },
91
+ "11": {
92
+ "content": "<extra_id_5>",
93
+ "lstrip": false,
94
+ "normalized": false,
95
+ "rstrip": false,
96
+ "single_word": false,
97
+ "special": true
98
+ },
99
+ "12": {
100
+ "content": "<extra_id_6>",
101
+ "lstrip": false,
102
+ "normalized": false,
103
+ "rstrip": false,
104
+ "single_word": false,
105
+ "special": true
106
+ },
107
+ "13": {
108
+ "content": "<extra_id_7>",
109
+ "lstrip": false,
110
+ "normalized": false,
111
+ "rstrip": false,
112
+ "single_word": false,
113
+ "special": true
114
+ },
115
+ "14": {
116
+ "content": "<extra_id_8>",
117
+ "lstrip": false,
118
+ "normalized": false,
119
+ "rstrip": false,
120
+ "single_word": false,
121
+ "special": true
122
+ },
123
+ "15": {
124
+ "content": "<extra_id_9>",
125
+ "lstrip": false,
126
+ "normalized": false,
127
+ "rstrip": false,
128
+ "single_word": false,
129
+ "special": true
130
+ },
131
+ "16": {
132
+ "content": "<extra_id_10>",
133
+ "lstrip": false,
134
+ "normalized": false,
135
+ "rstrip": false,
136
+ "single_word": false,
137
+ "special": true
138
+ },
139
+ "17": {
140
+ "content": "<extra_id_11>",
141
+ "lstrip": false,
142
+ "normalized": false,
143
+ "rstrip": false,
144
+ "single_word": false,
145
+ "special": true
146
+ },
147
+ "18": {
148
+ "content": "<extra_id_12>",
149
+ "lstrip": false,
150
+ "normalized": false,
151
+ "rstrip": false,
152
+ "single_word": false,
153
+ "special": true
154
+ },
155
+ "19": {
156
+ "content": "<extra_id_13>",
157
+ "lstrip": false,
158
+ "normalized": false,
159
+ "rstrip": false,
160
+ "single_word": false,
161
+ "special": true
162
+ },
163
+ "20": {
164
+ "content": "<extra_id_14>",
165
+ "lstrip": false,
166
+ "normalized": false,
167
+ "rstrip": false,
168
+ "single_word": false,
169
+ "special": true
170
+ },
171
+ "21": {
172
+ "content": "<extra_id_15>",
173
+ "lstrip": false,
174
+ "normalized": false,
175
+ "rstrip": false,
176
+ "single_word": false,
177
+ "special": true
178
+ },
179
+ "22": {
180
+ "content": "<extra_id_16>",
181
+ "lstrip": false,
182
+ "normalized": false,
183
+ "rstrip": false,
184
+ "single_word": false,
185
+ "special": true
186
+ },
187
+ "23": {
188
+ "content": "<extra_id_17>",
189
+ "lstrip": false,
190
+ "normalized": false,
191
+ "rstrip": false,
192
+ "single_word": false,
193
+ "special": true
194
+ },
195
+ "24": {
196
+ "content": "<extra_id_18>",
197
+ "lstrip": false,
198
+ "normalized": false,
199
+ "rstrip": false,
200
+ "single_word": false,
201
+ "special": true
202
+ },
203
+ "25": {
204
+ "content": "<extra_id_19>",
205
+ "lstrip": false,
206
+ "normalized": false,
207
+ "rstrip": false,
208
+ "single_word": false,
209
+ "special": true
210
+ },
211
+ "26": {
212
+ "content": "<extra_id_20>",
213
+ "lstrip": false,
214
+ "normalized": false,
215
+ "rstrip": false,
216
+ "single_word": false,
217
+ "special": true
218
+ },
219
+ "27": {
220
+ "content": "<extra_id_21>",
221
+ "lstrip": false,
222
+ "normalized": false,
223
+ "rstrip": false,
224
+ "single_word": false,
225
+ "special": true
226
+ },
227
+ "28": {
228
+ "content": "<extra_id_22>",
229
+ "lstrip": false,
230
+ "normalized": false,
231
+ "rstrip": false,
232
+ "single_word": false,
233
+ "special": true
234
+ },
235
+ "29": {
236
+ "content": "<extra_id_23>",
237
+ "lstrip": false,
238
+ "normalized": false,
239
+ "rstrip": false,
240
+ "single_word": false,
241
+ "special": true
242
+ },
243
+ "30": {
244
+ "content": "<extra_id_24>",
245
+ "lstrip": false,
246
+ "normalized": false,
247
+ "rstrip": false,
248
+ "single_word": false,
249
+ "special": true
250
+ },
251
+ "31": {
252
+ "content": "<extra_id_25>",
253
+ "lstrip": false,
254
+ "normalized": false,
255
+ "rstrip": false,
256
+ "single_word": false,
257
+ "special": true
258
+ },
259
+ "32": {
260
+ "content": "<extra_id_26>",
261
+ "lstrip": false,
262
+ "normalized": false,
263
+ "rstrip": false,
264
+ "single_word": false,
265
+ "special": true
266
+ },
267
+ "33": {
268
+ "content": "<extra_id_27>",
269
+ "lstrip": false,
270
+ "normalized": false,
271
+ "rstrip": false,
272
+ "single_word": false,
273
+ "special": true
274
+ },
275
+ "34": {
276
+ "content": "<extra_id_28>",
277
+ "lstrip": false,
278
+ "normalized": false,
279
+ "rstrip": false,
280
+ "single_word": false,
281
+ "special": true
282
+ },
283
+ "35": {
284
+ "content": "<extra_id_29>",
285
+ "lstrip": false,
286
+ "normalized": false,
287
+ "rstrip": false,
288
+ "single_word": false,
289
+ "special": true
290
+ },
291
+ "36": {
292
+ "content": "<extra_id_30>",
293
+ "lstrip": false,
294
+ "normalized": false,
295
+ "rstrip": false,
296
+ "single_word": false,
297
+ "special": true
298
+ },
299
+ "37": {
300
+ "content": "<extra_id_31>",
301
+ "lstrip": false,
302
+ "normalized": false,
303
+ "rstrip": false,
304
+ "single_word": false,
305
+ "special": true
306
+ },
307
+ "38": {
308
+ "content": "<extra_id_32>",
309
+ "lstrip": false,
310
+ "normalized": false,
311
+ "rstrip": false,
312
+ "single_word": false,
313
+ "special": true
314
+ },
315
+ "39": {
316
+ "content": "<extra_id_33>",
317
+ "lstrip": false,
318
+ "normalized": false,
319
+ "rstrip": false,
320
+ "single_word": false,
321
+ "special": true
322
+ },
323
+ "40": {
324
+ "content": "<extra_id_34>",
325
+ "lstrip": false,
326
+ "normalized": false,
327
+ "rstrip": false,
328
+ "single_word": false,
329
+ "special": true
330
+ },
331
+ "41": {
332
+ "content": "<extra_id_35>",
333
+ "lstrip": false,
334
+ "normalized": false,
335
+ "rstrip": false,
336
+ "single_word": false,
337
+ "special": true
338
+ },
339
+ "42": {
340
+ "content": "<extra_id_36>",
341
+ "lstrip": false,
342
+ "normalized": false,
343
+ "rstrip": false,
344
+ "single_word": false,
345
+ "special": true
346
+ },
347
+ "43": {
348
+ "content": "<extra_id_37>",
349
+ "lstrip": false,
350
+ "normalized": false,
351
+ "rstrip": false,
352
+ "single_word": false,
353
+ "special": true
354
+ },
355
+ "44": {
356
+ "content": "<extra_id_38>",
357
+ "lstrip": false,
358
+ "normalized": false,
359
+ "rstrip": false,
360
+ "single_word": false,
361
+ "special": true
362
+ },
363
+ "45": {
364
+ "content": "<extra_id_39>",
365
+ "lstrip": false,
366
+ "normalized": false,
367
+ "rstrip": false,
368
+ "single_word": false,
369
+ "special": true
370
+ },
371
+ "46": {
372
+ "content": "<extra_id_40>",
373
+ "lstrip": false,
374
+ "normalized": false,
375
+ "rstrip": false,
376
+ "single_word": false,
377
+ "special": true
378
+ },
379
+ "47": {
380
+ "content": "<extra_id_41>",
381
+ "lstrip": false,
382
+ "normalized": false,
383
+ "rstrip": false,
384
+ "single_word": false,
385
+ "special": true
386
+ },
387
+ "48": {
388
+ "content": "<extra_id_42>",
389
+ "lstrip": false,
390
+ "normalized": false,
391
+ "rstrip": false,
392
+ "single_word": false,
393
+ "special": true
394
+ },
395
+ "49": {
396
+ "content": "<extra_id_43>",
397
+ "lstrip": false,
398
+ "normalized": false,
399
+ "rstrip": false,
400
+ "single_word": false,
401
+ "special": true
402
+ },
403
+ "50": {
404
+ "content": "<extra_id_44>",
405
+ "lstrip": false,
406
+ "normalized": false,
407
+ "rstrip": false,
408
+ "single_word": false,
409
+ "special": true
410
+ },
411
+ "51": {
412
+ "content": "<extra_id_45>",
413
+ "lstrip": false,
414
+ "normalized": false,
415
+ "rstrip": false,
416
+ "single_word": false,
417
+ "special": true
418
+ },
419
+ "52": {
420
+ "content": "<extra_id_46>",
421
+ "lstrip": false,
422
+ "normalized": false,
423
+ "rstrip": false,
424
+ "single_word": false,
425
+ "special": true
426
+ },
427
+ "53": {
428
+ "content": "<extra_id_47>",
429
+ "lstrip": false,
430
+ "normalized": false,
431
+ "rstrip": false,
432
+ "single_word": false,
433
+ "special": true
434
+ },
435
+ "54": {
436
+ "content": "<extra_id_48>",
437
+ "lstrip": false,
438
+ "normalized": false,
439
+ "rstrip": false,
440
+ "single_word": false,
441
+ "special": true
442
+ },
443
+ "55": {
444
+ "content": "<extra_id_49>",
445
+ "lstrip": false,
446
+ "normalized": false,
447
+ "rstrip": false,
448
+ "single_word": false,
449
+ "special": true
450
+ },
451
+ "56": {
452
+ "content": "<extra_id_50>",
453
+ "lstrip": false,
454
+ "normalized": false,
455
+ "rstrip": false,
456
+ "single_word": false,
457
+ "special": true
458
+ },
459
+ "57": {
460
+ "content": "<extra_id_51>",
461
+ "lstrip": false,
462
+ "normalized": false,
463
+ "rstrip": false,
464
+ "single_word": false,
465
+ "special": true
466
+ },
467
+ "58": {
468
+ "content": "<extra_id_52>",
469
+ "lstrip": false,
470
+ "normalized": false,
471
+ "rstrip": false,
472
+ "single_word": false,
473
+ "special": true
474
+ },
475
+ "59": {
476
+ "content": "<extra_id_53>",
477
+ "lstrip": false,
478
+ "normalized": false,
479
+ "rstrip": false,
480
+ "single_word": false,
481
+ "special": true
482
+ },
483
+ "60": {
484
+ "content": "<extra_id_54>",
485
+ "lstrip": false,
486
+ "normalized": false,
487
+ "rstrip": false,
488
+ "single_word": false,
489
+ "special": true
490
+ },
491
+ "61": {
492
+ "content": "<extra_id_55>",
493
+ "lstrip": false,
494
+ "normalized": false,
495
+ "rstrip": false,
496
+ "single_word": false,
497
+ "special": true
498
+ },
499
+ "62": {
500
+ "content": "<extra_id_56>",
501
+ "lstrip": false,
502
+ "normalized": false,
503
+ "rstrip": false,
504
+ "single_word": false,
505
+ "special": true
506
+ },
507
+ "63": {
508
+ "content": "<extra_id_57>",
509
+ "lstrip": false,
510
+ "normalized": false,
511
+ "rstrip": false,
512
+ "single_word": false,
513
+ "special": true
514
+ },
515
+ "64": {
516
+ "content": "<extra_id_58>",
517
+ "lstrip": false,
518
+ "normalized": false,
519
+ "rstrip": false,
520
+ "single_word": false,
521
+ "special": true
522
+ },
523
+ "65": {
524
+ "content": "<extra_id_59>",
525
+ "lstrip": false,
526
+ "normalized": false,
527
+ "rstrip": false,
528
+ "single_word": false,
529
+ "special": true
530
+ },
531
+ "66": {
532
+ "content": "<extra_id_60>",
533
+ "lstrip": false,
534
+ "normalized": false,
535
+ "rstrip": false,
536
+ "single_word": false,
537
+ "special": true
538
+ },
539
+ "67": {
540
+ "content": "<extra_id_61>",
541
+ "lstrip": false,
542
+ "normalized": false,
543
+ "rstrip": false,
544
+ "single_word": false,
545
+ "special": true
546
+ },
547
+ "68": {
548
+ "content": "<extra_id_62>",
549
+ "lstrip": false,
550
+ "normalized": false,
551
+ "rstrip": false,
552
+ "single_word": false,
553
+ "special": true
554
+ },
555
+ "69": {
556
+ "content": "<extra_id_63>",
557
+ "lstrip": false,
558
+ "normalized": false,
559
+ "rstrip": false,
560
+ "single_word": false,
561
+ "special": true
562
+ },
563
+ "70": {
564
+ "content": "<extra_id_64>",
565
+ "lstrip": false,
566
+ "normalized": false,
567
+ "rstrip": false,
568
+ "single_word": false,
569
+ "special": true
570
+ },
571
+ "71": {
572
+ "content": "<extra_id_65>",
573
+ "lstrip": false,
574
+ "normalized": false,
575
+ "rstrip": false,
576
+ "single_word": false,
577
+ "special": true
578
+ },
579
+ "72": {
580
+ "content": "<extra_id_66>",
581
+ "lstrip": false,
582
+ "normalized": false,
583
+ "rstrip": false,
584
+ "single_word": false,
585
+ "special": true
586
+ },
587
+ "73": {
588
+ "content": "<extra_id_67>",
589
+ "lstrip": false,
590
+ "normalized": false,
591
+ "rstrip": false,
592
+ "single_word": false,
593
+ "special": true
594
+ },
595
+ "74": {
596
+ "content": "<extra_id_68>",
597
+ "lstrip": false,
598
+ "normalized": false,
599
+ "rstrip": false,
600
+ "single_word": false,
601
+ "special": true
602
+ },
603
+ "75": {
604
+ "content": "<extra_id_69>",
605
+ "lstrip": false,
606
+ "normalized": false,
607
+ "rstrip": false,
608
+ "single_word": false,
609
+ "special": true
610
+ },
611
+ "76": {
612
+ "content": "<extra_id_70>",
613
+ "lstrip": false,
614
+ "normalized": false,
615
+ "rstrip": false,
616
+ "single_word": false,
617
+ "special": true
618
+ },
619
+ "77": {
620
+ "content": "<extra_id_71>",
621
+ "lstrip": false,
622
+ "normalized": false,
623
+ "rstrip": false,
624
+ "single_word": false,
625
+ "special": true
626
+ },
627
+ "78": {
628
+ "content": "<extra_id_72>",
629
+ "lstrip": false,
630
+ "normalized": false,
631
+ "rstrip": false,
632
+ "single_word": false,
633
+ "special": true
634
+ },
635
+ "79": {
636
+ "content": "<extra_id_73>",
637
+ "lstrip": false,
638
+ "normalized": false,
639
+ "rstrip": false,
640
+ "single_word": false,
641
+ "special": true
642
+ },
643
+ "80": {
644
+ "content": "<extra_id_74>",
645
+ "lstrip": false,
646
+ "normalized": false,
647
+ "rstrip": false,
648
+ "single_word": false,
649
+ "special": true
650
+ },
651
+ "81": {
652
+ "content": "<extra_id_75>",
653
+ "lstrip": false,
654
+ "normalized": false,
655
+ "rstrip": false,
656
+ "single_word": false,
657
+ "special": true
658
+ },
659
+ "82": {
660
+ "content": "<extra_id_76>",
661
+ "lstrip": false,
662
+ "normalized": false,
663
+ "rstrip": false,
664
+ "single_word": false,
665
+ "special": true
666
+ },
667
+ "83": {
668
+ "content": "<extra_id_77>",
669
+ "lstrip": false,
670
+ "normalized": false,
671
+ "rstrip": false,
672
+ "single_word": false,
673
+ "special": true
674
+ },
675
+ "84": {
676
+ "content": "<extra_id_78>",
677
+ "lstrip": false,
678
+ "normalized": false,
679
+ "rstrip": false,
680
+ "single_word": false,
681
+ "special": true
682
+ },
683
+ "85": {
684
+ "content": "<extra_id_79>",
685
+ "lstrip": false,
686
+ "normalized": false,
687
+ "rstrip": false,
688
+ "single_word": false,
689
+ "special": true
690
+ },
691
+ "86": {
692
+ "content": "<extra_id_80>",
693
+ "lstrip": false,
694
+ "normalized": false,
695
+ "rstrip": false,
696
+ "single_word": false,
697
+ "special": true
698
+ },
699
+ "87": {
700
+ "content": "<extra_id_81>",
701
+ "lstrip": false,
702
+ "normalized": false,
703
+ "rstrip": false,
704
+ "single_word": false,
705
+ "special": true
706
+ },
707
+ "88": {
708
+ "content": "<extra_id_82>",
709
+ "lstrip": false,
710
+ "normalized": false,
711
+ "rstrip": false,
712
+ "single_word": false,
713
+ "special": true
714
+ },
715
+ "89": {
716
+ "content": "<extra_id_83>",
717
+ "lstrip": false,
718
+ "normalized": false,
719
+ "rstrip": false,
720
+ "single_word": false,
721
+ "special": true
722
+ },
723
+ "90": {
724
+ "content": "<extra_id_84>",
725
+ "lstrip": false,
726
+ "normalized": false,
727
+ "rstrip": false,
728
+ "single_word": false,
729
+ "special": true
730
+ },
731
+ "91": {
732
+ "content": "<extra_id_85>",
733
+ "lstrip": false,
734
+ "normalized": false,
735
+ "rstrip": false,
736
+ "single_word": false,
737
+ "special": true
738
+ },
739
+ "92": {
740
+ "content": "<extra_id_86>",
741
+ "lstrip": false,
742
+ "normalized": false,
743
+ "rstrip": false,
744
+ "single_word": false,
745
+ "special": true
746
+ },
747
+ "93": {
748
+ "content": "<extra_id_87>",
749
+ "lstrip": false,
750
+ "normalized": false,
751
+ "rstrip": false,
752
+ "single_word": false,
753
+ "special": true
754
+ },
755
+ "94": {
756
+ "content": "<extra_id_88>",
757
+ "lstrip": false,
758
+ "normalized": false,
759
+ "rstrip": false,
760
+ "single_word": false,
761
+ "special": true
762
+ },
763
+ "95": {
764
+ "content": "<extra_id_89>",
765
+ "lstrip": false,
766
+ "normalized": false,
767
+ "rstrip": false,
768
+ "single_word": false,
769
+ "special": true
770
+ },
771
+ "96": {
772
+ "content": "<extra_id_90>",
773
+ "lstrip": false,
774
+ "normalized": false,
775
+ "rstrip": false,
776
+ "single_word": false,
777
+ "special": true
778
+ },
779
+ "97": {
780
+ "content": "<extra_id_91>",
781
+ "lstrip": false,
782
+ "normalized": false,
783
+ "rstrip": false,
784
+ "single_word": false,
785
+ "special": true
786
+ },
787
+ "98": {
788
+ "content": "<extra_id_92>",
789
+ "lstrip": false,
790
+ "normalized": false,
791
+ "rstrip": false,
792
+ "single_word": false,
793
+ "special": true
794
+ },
795
+ "99": {
796
+ "content": "<extra_id_93>",
797
+ "lstrip": false,
798
+ "normalized": false,
799
+ "rstrip": false,
800
+ "single_word": false,
801
+ "special": true
802
+ },
803
+ "100": {
804
+ "content": "<extra_id_94>",
805
+ "lstrip": false,
806
+ "normalized": false,
807
+ "rstrip": false,
808
+ "single_word": false,
809
+ "special": true
810
+ },
811
+ "101": {
812
+ "content": "<extra_id_95>",
813
+ "lstrip": false,
814
+ "normalized": false,
815
+ "rstrip": false,
816
+ "single_word": false,
817
+ "special": true
818
+ },
819
+ "102": {
820
+ "content": "<extra_id_96>",
821
+ "lstrip": false,
822
+ "normalized": false,
823
+ "rstrip": false,
824
+ "single_word": false,
825
+ "special": true
826
+ },
827
+ "103": {
828
+ "content": "<extra_id_97>",
829
+ "lstrip": false,
830
+ "normalized": false,
831
+ "rstrip": false,
832
+ "single_word": false,
833
+ "special": true
834
+ },
835
+ "104": {
836
+ "content": "<extra_id_98>",
837
+ "lstrip": false,
838
+ "normalized": false,
839
+ "rstrip": false,
840
+ "single_word": false,
841
+ "special": true
842
+ },
843
+ "105": {
844
+ "content": "<extra_id_99>",
845
+ "lstrip": false,
846
+ "normalized": false,
847
+ "rstrip": false,
848
+ "single_word": false,
849
+ "special": true
850
+ },
851
+ "106": {
852
+ "content": "<extra_id_100>",
853
+ "lstrip": false,
854
+ "normalized": false,
855
+ "rstrip": false,
856
+ "single_word": false,
857
+ "special": true
858
+ },
859
+ "107": {
860
+ "content": "<extra_id_101>",
861
+ "lstrip": false,
862
+ "normalized": false,
863
+ "rstrip": false,
864
+ "single_word": false,
865
+ "special": true
866
+ },
867
+ "108": {
868
+ "content": "<extra_id_102>",
869
+ "lstrip": false,
870
+ "normalized": false,
871
+ "rstrip": false,
872
+ "single_word": false,
873
+ "special": true
874
+ },
875
+ "109": {
876
+ "content": "<extra_id_103>",
877
+ "lstrip": false,
878
+ "normalized": false,
879
+ "rstrip": false,
880
+ "single_word": false,
881
+ "special": true
882
+ },
883
+ "110": {
884
+ "content": "<extra_id_104>",
885
+ "lstrip": false,
886
+ "normalized": false,
887
+ "rstrip": false,
888
+ "single_word": false,
889
+ "special": true
890
+ },
891
+ "111": {
892
+ "content": "<extra_id_105>",
893
+ "lstrip": false,
894
+ "normalized": false,
895
+ "rstrip": false,
896
+ "single_word": false,
897
+ "special": true
898
+ },
899
+ "112": {
900
+ "content": "<extra_id_106>",
901
+ "lstrip": false,
902
+ "normalized": false,
903
+ "rstrip": false,
904
+ "single_word": false,
905
+ "special": true
906
+ },
907
+ "113": {
908
+ "content": "<extra_id_107>",
909
+ "lstrip": false,
910
+ "normalized": false,
911
+ "rstrip": false,
912
+ "single_word": false,
913
+ "special": true
914
+ },
915
+ "114": {
916
+ "content": "<extra_id_108>",
917
+ "lstrip": false,
918
+ "normalized": false,
919
+ "rstrip": false,
920
+ "single_word": false,
921
+ "special": true
922
+ },
923
+ "115": {
924
+ "content": "<extra_id_109>",
925
+ "lstrip": false,
926
+ "normalized": false,
927
+ "rstrip": false,
928
+ "single_word": false,
929
+ "special": true
930
+ },
931
+ "116": {
932
+ "content": "<extra_id_110>",
933
+ "lstrip": false,
934
+ "normalized": false,
935
+ "rstrip": false,
936
+ "single_word": false,
937
+ "special": true
938
+ },
939
+ "117": {
940
+ "content": "<extra_id_111>",
941
+ "lstrip": false,
942
+ "normalized": false,
943
+ "rstrip": false,
944
+ "single_word": false,
945
+ "special": true
946
+ },
947
+ "118": {
948
+ "content": "<extra_id_112>",
949
+ "lstrip": false,
950
+ "normalized": false,
951
+ "rstrip": false,
952
+ "single_word": false,
953
+ "special": true
954
+ },
955
+ "119": {
956
+ "content": "<extra_id_113>",
957
+ "lstrip": false,
958
+ "normalized": false,
959
+ "rstrip": false,
960
+ "single_word": false,
961
+ "special": true
962
+ },
963
+ "120": {
964
+ "content": "<extra_id_114>",
965
+ "lstrip": false,
966
+ "normalized": false,
967
+ "rstrip": false,
968
+ "single_word": false,
969
+ "special": true
970
+ },
971
+ "121": {
972
+ "content": "<extra_id_115>",
973
+ "lstrip": false,
974
+ "normalized": false,
975
+ "rstrip": false,
976
+ "single_word": false,
977
+ "special": true
978
+ },
979
+ "122": {
980
+ "content": "<extra_id_116>",
981
+ "lstrip": false,
982
+ "normalized": false,
983
+ "rstrip": false,
984
+ "single_word": false,
985
+ "special": true
986
+ },
987
+ "123": {
988
+ "content": "<extra_id_117>",
989
+ "lstrip": false,
990
+ "normalized": false,
991
+ "rstrip": false,
992
+ "single_word": false,
993
+ "special": true
994
+ },
995
+ "124": {
996
+ "content": "<extra_id_118>",
997
+ "lstrip": false,
998
+ "normalized": false,
999
+ "rstrip": false,
1000
+ "single_word": false,
1001
+ "special": true
1002
+ },
1003
+ "125": {
1004
+ "content": "<extra_id_119>",
1005
+ "lstrip": false,
1006
+ "normalized": false,
1007
+ "rstrip": false,
1008
+ "single_word": false,
1009
+ "special": true
1010
+ },
1011
+ "126": {
1012
+ "content": "<extra_id_120>",
1013
+ "lstrip": false,
1014
+ "normalized": false,
1015
+ "rstrip": false,
1016
+ "single_word": false,
1017
+ "special": true
1018
+ },
1019
+ "127": {
1020
+ "content": "<extra_id_121>",
1021
+ "lstrip": false,
1022
+ "normalized": false,
1023
+ "rstrip": false,
1024
+ "single_word": false,
1025
+ "special": true
1026
+ },
1027
+ "128": {
1028
+ "content": "<extra_id_122>",
1029
+ "lstrip": false,
1030
+ "normalized": false,
1031
+ "rstrip": false,
1032
+ "single_word": false,
1033
+ "special": true
1034
+ },
1035
+ "129": {
1036
+ "content": "<extra_id_123>",
1037
+ "lstrip": false,
1038
+ "normalized": false,
1039
+ "rstrip": false,
1040
+ "single_word": false,
1041
+ "special": true
1042
+ },
1043
+ "130": {
1044
+ "content": "<extra_id_124>",
1045
+ "lstrip": false,
1046
+ "normalized": false,
1047
+ "rstrip": false,
1048
+ "single_word": false,
1049
+ "special": true
1050
+ },
1051
+ "131": {
1052
+ "content": "<extra_id_125>",
1053
+ "lstrip": false,
1054
+ "normalized": false,
1055
+ "rstrip": false,
1056
+ "single_word": false,
1057
+ "special": true
1058
+ },
1059
+ "132": {
1060
+ "content": "<extra_id_126>",
1061
+ "lstrip": false,
1062
+ "normalized": false,
1063
+ "rstrip": false,
1064
+ "single_word": false,
1065
+ "special": true
1066
+ },
1067
+ "133": {
1068
+ "content": "<extra_id_127>",
1069
+ "lstrip": false,
1070
+ "normalized": false,
1071
+ "rstrip": false,
1072
+ "single_word": false,
1073
+ "special": true
1074
+ },
1075
+ "134": {
1076
+ "content": "<extra_id_128>",
1077
+ "lstrip": false,
1078
+ "normalized": false,
1079
+ "rstrip": false,
1080
+ "single_word": false,
1081
+ "special": true
1082
+ },
1083
+ "135": {
1084
+ "content": "<extra_id_129>",
1085
+ "lstrip": false,
1086
+ "normalized": false,
1087
+ "rstrip": false,
1088
+ "single_word": false,
1089
+ "special": true
1090
+ },
1091
+ "136": {
1092
+ "content": "<extra_id_130>",
1093
+ "lstrip": false,
1094
+ "normalized": false,
1095
+ "rstrip": false,
1096
+ "single_word": false,
1097
+ "special": true
1098
+ },
1099
+ "137": {
1100
+ "content": "<extra_id_131>",
1101
+ "lstrip": false,
1102
+ "normalized": false,
1103
+ "rstrip": false,
1104
+ "single_word": false,
1105
+ "special": true
1106
+ },
1107
+ "138": {
1108
+ "content": "<extra_id_132>",
1109
+ "lstrip": false,
1110
+ "normalized": false,
1111
+ "rstrip": false,
1112
+ "single_word": false,
1113
+ "special": true
1114
+ },
1115
+ "139": {
1116
+ "content": "<extra_id_133>",
1117
+ "lstrip": false,
1118
+ "normalized": false,
1119
+ "rstrip": false,
1120
+ "single_word": false,
1121
+ "special": true
1122
+ },
1123
+ "140": {
1124
+ "content": "<extra_id_134>",
1125
+ "lstrip": false,
1126
+ "normalized": false,
1127
+ "rstrip": false,
1128
+ "single_word": false,
1129
+ "special": true
1130
+ },
1131
+ "141": {
1132
+ "content": "<extra_id_135>",
1133
+ "lstrip": false,
1134
+ "normalized": false,
1135
+ "rstrip": false,
1136
+ "single_word": false,
1137
+ "special": true
1138
+ },
1139
+ "142": {
1140
+ "content": "<extra_id_136>",
1141
+ "lstrip": false,
1142
+ "normalized": false,
1143
+ "rstrip": false,
1144
+ "single_word": false,
1145
+ "special": true
1146
+ },
1147
+ "143": {
1148
+ "content": "<extra_id_137>",
1149
+ "lstrip": false,
1150
+ "normalized": false,
1151
+ "rstrip": false,
1152
+ "single_word": false,
1153
+ "special": true
1154
+ },
1155
+ "144": {
1156
+ "content": "<extra_id_138>",
1157
+ "lstrip": false,
1158
+ "normalized": false,
1159
+ "rstrip": false,
1160
+ "single_word": false,
1161
+ "special": true
1162
+ },
1163
+ "145": {
1164
+ "content": "<extra_id_139>",
1165
+ "lstrip": false,
1166
+ "normalized": false,
1167
+ "rstrip": false,
1168
+ "single_word": false,
1169
+ "special": true
1170
+ },
1171
+ "146": {
1172
+ "content": "<extra_id_140>",
1173
+ "lstrip": false,
1174
+ "normalized": false,
1175
+ "rstrip": false,
1176
+ "single_word": false,
1177
+ "special": true
1178
+ },
1179
+ "147": {
1180
+ "content": "<extra_id_141>",
1181
+ "lstrip": false,
1182
+ "normalized": false,
1183
+ "rstrip": false,
1184
+ "single_word": false,
1185
+ "special": true
1186
+ },
1187
+ "148": {
1188
+ "content": "<extra_id_142>",
1189
+ "lstrip": false,
1190
+ "normalized": false,
1191
+ "rstrip": false,
1192
+ "single_word": false,
1193
+ "special": true
1194
+ },
1195
+ "149": {
1196
+ "content": "<extra_id_143>",
1197
+ "lstrip": false,
1198
+ "normalized": false,
1199
+ "rstrip": false,
1200
+ "single_word": false,
1201
+ "special": true
1202
+ },
1203
+ "150": {
1204
+ "content": "<extra_id_144>",
1205
+ "lstrip": false,
1206
+ "normalized": false,
1207
+ "rstrip": false,
1208
+ "single_word": false,
1209
+ "special": true
1210
+ },
1211
+ "151": {
1212
+ "content": "<extra_id_145>",
1213
+ "lstrip": false,
1214
+ "normalized": false,
1215
+ "rstrip": false,
1216
+ "single_word": false,
1217
+ "special": true
1218
+ },
1219
+ "152": {
1220
+ "content": "<extra_id_146>",
1221
+ "lstrip": false,
1222
+ "normalized": false,
1223
+ "rstrip": false,
1224
+ "single_word": false,
1225
+ "special": true
1226
+ },
1227
+ "153": {
1228
+ "content": "<extra_id_147>",
1229
+ "lstrip": false,
1230
+ "normalized": false,
1231
+ "rstrip": false,
1232
+ "single_word": false,
1233
+ "special": true
1234
+ },
1235
+ "154": {
1236
+ "content": "<extra_id_148>",
1237
+ "lstrip": false,
1238
+ "normalized": false,
1239
+ "rstrip": false,
1240
+ "single_word": false,
1241
+ "special": true
1242
+ },
1243
+ "155": {
1244
+ "content": "<extra_id_149>",
1245
+ "lstrip": false,
1246
+ "normalized": false,
1247
+ "rstrip": false,
1248
+ "single_word": false,
1249
+ "special": true
1250
+ },
1251
+ "156": {
1252
+ "content": "<extra_id_150>",
1253
+ "lstrip": false,
1254
+ "normalized": false,
1255
+ "rstrip": false,
1256
+ "single_word": false,
1257
+ "special": true
1258
+ },
1259
+ "157": {
1260
+ "content": "<extra_id_151>",
1261
+ "lstrip": false,
1262
+ "normalized": false,
1263
+ "rstrip": false,
1264
+ "single_word": false,
1265
+ "special": true
1266
+ },
1267
+ "158": {
1268
+ "content": "<extra_id_152>",
1269
+ "lstrip": false,
1270
+ "normalized": false,
1271
+ "rstrip": false,
1272
+ "single_word": false,
1273
+ "special": true
1274
+ },
1275
+ "159": {
1276
+ "content": "<extra_id_153>",
1277
+ "lstrip": false,
1278
+ "normalized": false,
1279
+ "rstrip": false,
1280
+ "single_word": false,
1281
+ "special": true
1282
+ },
1283
+ "160": {
1284
+ "content": "<extra_id_154>",
1285
+ "lstrip": false,
1286
+ "normalized": false,
1287
+ "rstrip": false,
1288
+ "single_word": false,
1289
+ "special": true
1290
+ },
1291
+ "161": {
1292
+ "content": "<extra_id_155>",
1293
+ "lstrip": false,
1294
+ "normalized": false,
1295
+ "rstrip": false,
1296
+ "single_word": false,
1297
+ "special": true
1298
+ },
1299
+ "162": {
1300
+ "content": "<extra_id_156>",
1301
+ "lstrip": false,
1302
+ "normalized": false,
1303
+ "rstrip": false,
1304
+ "single_word": false,
1305
+ "special": true
1306
+ },
1307
+ "163": {
1308
+ "content": "<extra_id_157>",
1309
+ "lstrip": false,
1310
+ "normalized": false,
1311
+ "rstrip": false,
1312
+ "single_word": false,
1313
+ "special": true
1314
+ },
1315
+ "164": {
1316
+ "content": "<extra_id_158>",
1317
+ "lstrip": false,
1318
+ "normalized": false,
1319
+ "rstrip": false,
1320
+ "single_word": false,
1321
+ "special": true
1322
+ },
1323
+ "165": {
1324
+ "content": "<extra_id_159>",
1325
+ "lstrip": false,
1326
+ "normalized": false,
1327
+ "rstrip": false,
1328
+ "single_word": false,
1329
+ "special": true
1330
+ },
1331
+ "166": {
1332
+ "content": "<extra_id_160>",
1333
+ "lstrip": false,
1334
+ "normalized": false,
1335
+ "rstrip": false,
1336
+ "single_word": false,
1337
+ "special": true
1338
+ },
1339
+ "167": {
1340
+ "content": "<extra_id_161>",
1341
+ "lstrip": false,
1342
+ "normalized": false,
1343
+ "rstrip": false,
1344
+ "single_word": false,
1345
+ "special": true
1346
+ },
1347
+ "168": {
1348
+ "content": "<extra_id_162>",
1349
+ "lstrip": false,
1350
+ "normalized": false,
1351
+ "rstrip": false,
1352
+ "single_word": false,
1353
+ "special": true
1354
+ },
1355
+ "169": {
1356
+ "content": "<extra_id_163>",
1357
+ "lstrip": false,
1358
+ "normalized": false,
1359
+ "rstrip": false,
1360
+ "single_word": false,
1361
+ "special": true
1362
+ },
1363
+ "170": {
1364
+ "content": "<extra_id_164>",
1365
+ "lstrip": false,
1366
+ "normalized": false,
1367
+ "rstrip": false,
1368
+ "single_word": false,
1369
+ "special": true
1370
+ },
1371
+ "171": {
1372
+ "content": "<extra_id_165>",
1373
+ "lstrip": false,
1374
+ "normalized": false,
1375
+ "rstrip": false,
1376
+ "single_word": false,
1377
+ "special": true
1378
+ },
1379
+ "172": {
1380
+ "content": "<extra_id_166>",
1381
+ "lstrip": false,
1382
+ "normalized": false,
1383
+ "rstrip": false,
1384
+ "single_word": false,
1385
+ "special": true
1386
+ },
1387
+ "173": {
1388
+ "content": "<extra_id_167>",
1389
+ "lstrip": false,
1390
+ "normalized": false,
1391
+ "rstrip": false,
1392
+ "single_word": false,
1393
+ "special": true
1394
+ },
1395
+ "174": {
1396
+ "content": "<extra_id_168>",
1397
+ "lstrip": false,
1398
+ "normalized": false,
1399
+ "rstrip": false,
1400
+ "single_word": false,
1401
+ "special": true
1402
+ },
1403
+ "175": {
1404
+ "content": "<extra_id_169>",
1405
+ "lstrip": false,
1406
+ "normalized": false,
1407
+ "rstrip": false,
1408
+ "single_word": false,
1409
+ "special": true
1410
+ },
1411
+ "176": {
1412
+ "content": "<extra_id_170>",
1413
+ "lstrip": false,
1414
+ "normalized": false,
1415
+ "rstrip": false,
1416
+ "single_word": false,
1417
+ "special": true
1418
+ },
1419
+ "177": {
1420
+ "content": "<extra_id_171>",
1421
+ "lstrip": false,
1422
+ "normalized": false,
1423
+ "rstrip": false,
1424
+ "single_word": false,
1425
+ "special": true
1426
+ },
1427
+ "178": {
1428
+ "content": "<extra_id_172>",
1429
+ "lstrip": false,
1430
+ "normalized": false,
1431
+ "rstrip": false,
1432
+ "single_word": false,
1433
+ "special": true
1434
+ },
1435
+ "179": {
1436
+ "content": "<extra_id_173>",
1437
+ "lstrip": false,
1438
+ "normalized": false,
1439
+ "rstrip": false,
1440
+ "single_word": false,
1441
+ "special": true
1442
+ },
1443
+ "180": {
1444
+ "content": "<extra_id_174>",
1445
+ "lstrip": false,
1446
+ "normalized": false,
1447
+ "rstrip": false,
1448
+ "single_word": false,
1449
+ "special": true
1450
+ },
1451
+ "181": {
1452
+ "content": "<extra_id_175>",
1453
+ "lstrip": false,
1454
+ "normalized": false,
1455
+ "rstrip": false,
1456
+ "single_word": false,
1457
+ "special": true
1458
+ },
1459
+ "182": {
1460
+ "content": "<extra_id_176>",
1461
+ "lstrip": false,
1462
+ "normalized": false,
1463
+ "rstrip": false,
1464
+ "single_word": false,
1465
+ "special": true
1466
+ },
1467
+ "183": {
1468
+ "content": "<extra_id_177>",
1469
+ "lstrip": false,
1470
+ "normalized": false,
1471
+ "rstrip": false,
1472
+ "single_word": false,
1473
+ "special": true
1474
+ },
1475
+ "184": {
1476
+ "content": "<extra_id_178>",
1477
+ "lstrip": false,
1478
+ "normalized": false,
1479
+ "rstrip": false,
1480
+ "single_word": false,
1481
+ "special": true
1482
+ },
1483
+ "185": {
1484
+ "content": "<extra_id_179>",
1485
+ "lstrip": false,
1486
+ "normalized": false,
1487
+ "rstrip": false,
1488
+ "single_word": false,
1489
+ "special": true
1490
+ },
1491
+ "186": {
1492
+ "content": "<extra_id_180>",
1493
+ "lstrip": false,
1494
+ "normalized": false,
1495
+ "rstrip": false,
1496
+ "single_word": false,
1497
+ "special": true
1498
+ },
1499
+ "187": {
1500
+ "content": "<extra_id_181>",
1501
+ "lstrip": false,
1502
+ "normalized": false,
1503
+ "rstrip": false,
1504
+ "single_word": false,
1505
+ "special": true
1506
+ },
1507
+ "188": {
1508
+ "content": "<extra_id_182>",
1509
+ "lstrip": false,
1510
+ "normalized": false,
1511
+ "rstrip": false,
1512
+ "single_word": false,
1513
+ "special": true
1514
+ },
1515
+ "189": {
1516
+ "content": "<extra_id_183>",
1517
+ "lstrip": false,
1518
+ "normalized": false,
1519
+ "rstrip": false,
1520
+ "single_word": false,
1521
+ "special": true
1522
+ },
1523
+ "190": {
1524
+ "content": "<extra_id_184>",
1525
+ "lstrip": false,
1526
+ "normalized": false,
1527
+ "rstrip": false,
1528
+ "single_word": false,
1529
+ "special": true
1530
+ },
1531
+ "191": {
1532
+ "content": "<extra_id_185>",
1533
+ "lstrip": false,
1534
+ "normalized": false,
1535
+ "rstrip": false,
1536
+ "single_word": false,
1537
+ "special": true
1538
+ },
1539
+ "192": {
1540
+ "content": "<extra_id_186>",
1541
+ "lstrip": false,
1542
+ "normalized": false,
1543
+ "rstrip": false,
1544
+ "single_word": false,
1545
+ "special": true
1546
+ },
1547
+ "193": {
1548
+ "content": "<extra_id_187>",
1549
+ "lstrip": false,
1550
+ "normalized": false,
1551
+ "rstrip": false,
1552
+ "single_word": false,
1553
+ "special": true
1554
+ },
1555
+ "194": {
1556
+ "content": "<extra_id_188>",
1557
+ "lstrip": false,
1558
+ "normalized": false,
1559
+ "rstrip": false,
1560
+ "single_word": false,
1561
+ "special": true
1562
+ },
1563
+ "195": {
1564
+ "content": "<extra_id_189>",
1565
+ "lstrip": false,
1566
+ "normalized": false,
1567
+ "rstrip": false,
1568
+ "single_word": false,
1569
+ "special": true
1570
+ },
1571
+ "196": {
1572
+ "content": "<extra_id_190>",
1573
+ "lstrip": false,
1574
+ "normalized": false,
1575
+ "rstrip": false,
1576
+ "single_word": false,
1577
+ "special": true
1578
+ },
1579
+ "197": {
1580
+ "content": "<extra_id_191>",
1581
+ "lstrip": false,
1582
+ "normalized": false,
1583
+ "rstrip": false,
1584
+ "single_word": false,
1585
+ "special": true
1586
+ },
1587
+ "198": {
1588
+ "content": "<extra_id_192>",
1589
+ "lstrip": false,
1590
+ "normalized": false,
1591
+ "rstrip": false,
1592
+ "single_word": false,
1593
+ "special": true
1594
+ },
1595
+ "199": {
1596
+ "content": "<extra_id_193>",
1597
+ "lstrip": false,
1598
+ "normalized": false,
1599
+ "rstrip": false,
1600
+ "single_word": false,
1601
+ "special": true
1602
+ },
1603
+ "200": {
1604
+ "content": "<extra_id_194>",
1605
+ "lstrip": false,
1606
+ "normalized": false,
1607
+ "rstrip": false,
1608
+ "single_word": false,
1609
+ "special": true
1610
+ },
1611
+ "201": {
1612
+ "content": "<extra_id_195>",
1613
+ "lstrip": false,
1614
+ "normalized": false,
1615
+ "rstrip": false,
1616
+ "single_word": false,
1617
+ "special": true
1618
+ },
1619
+ "202": {
1620
+ "content": "<extra_id_196>",
1621
+ "lstrip": false,
1622
+ "normalized": false,
1623
+ "rstrip": false,
1624
+ "single_word": false,
1625
+ "special": true
1626
+ },
1627
+ "203": {
1628
+ "content": "<extra_id_197>",
1629
+ "lstrip": false,
1630
+ "normalized": false,
1631
+ "rstrip": false,
1632
+ "single_word": false,
1633
+ "special": true
1634
+ },
1635
+ "204": {
1636
+ "content": "<extra_id_198>",
1637
+ "lstrip": false,
1638
+ "normalized": false,
1639
+ "rstrip": false,
1640
+ "single_word": false,
1641
+ "special": true
1642
+ },
1643
+ "205": {
1644
+ "content": "<extra_id_199>",
1645
+ "lstrip": false,
1646
+ "normalized": false,
1647
+ "rstrip": false,
1648
+ "single_word": false,
1649
+ "special": true
1650
+ },
1651
+ "206": {
1652
+ "content": "<extra_id_200>",
1653
+ "lstrip": false,
1654
+ "normalized": false,
1655
+ "rstrip": false,
1656
+ "single_word": false,
1657
+ "special": true
1658
+ },
1659
+ "207": {
1660
+ "content": "<extra_id_201>",
1661
+ "lstrip": false,
1662
+ "normalized": false,
1663
+ "rstrip": false,
1664
+ "single_word": false,
1665
+ "special": true
1666
+ },
1667
+ "208": {
1668
+ "content": "<extra_id_202>",
1669
+ "lstrip": false,
1670
+ "normalized": false,
1671
+ "rstrip": false,
1672
+ "single_word": false,
1673
+ "special": true
1674
+ },
1675
+ "209": {
1676
+ "content": "<extra_id_203>",
1677
+ "lstrip": false,
1678
+ "normalized": false,
1679
+ "rstrip": false,
1680
+ "single_word": false,
1681
+ "special": true
1682
+ },
1683
+ "210": {
1684
+ "content": "<extra_id_204>",
1685
+ "lstrip": false,
1686
+ "normalized": false,
1687
+ "rstrip": false,
1688
+ "single_word": false,
1689
+ "special": true
1690
+ },
1691
+ "211": {
1692
+ "content": "<extra_id_205>",
1693
+ "lstrip": false,
1694
+ "normalized": false,
1695
+ "rstrip": false,
1696
+ "single_word": false,
1697
+ "special": true
1698
+ },
1699
+ "212": {
1700
+ "content": "<extra_id_206>",
1701
+ "lstrip": false,
1702
+ "normalized": false,
1703
+ "rstrip": false,
1704
+ "single_word": false,
1705
+ "special": true
1706
+ },
1707
+ "213": {
1708
+ "content": "<extra_id_207>",
1709
+ "lstrip": false,
1710
+ "normalized": false,
1711
+ "rstrip": false,
1712
+ "single_word": false,
1713
+ "special": true
1714
+ },
1715
+ "214": {
1716
+ "content": "<extra_id_208>",
1717
+ "lstrip": false,
1718
+ "normalized": false,
1719
+ "rstrip": false,
1720
+ "single_word": false,
1721
+ "special": true
1722
+ },
1723
+ "215": {
1724
+ "content": "<extra_id_209>",
1725
+ "lstrip": false,
1726
+ "normalized": false,
1727
+ "rstrip": false,
1728
+ "single_word": false,
1729
+ "special": true
1730
+ },
1731
+ "216": {
1732
+ "content": "<extra_id_210>",
1733
+ "lstrip": false,
1734
+ "normalized": false,
1735
+ "rstrip": false,
1736
+ "single_word": false,
1737
+ "special": true
1738
+ },
1739
+ "217": {
1740
+ "content": "<extra_id_211>",
1741
+ "lstrip": false,
1742
+ "normalized": false,
1743
+ "rstrip": false,
1744
+ "single_word": false,
1745
+ "special": true
1746
+ },
1747
+ "218": {
1748
+ "content": "<extra_id_212>",
1749
+ "lstrip": false,
1750
+ "normalized": false,
1751
+ "rstrip": false,
1752
+ "single_word": false,
1753
+ "special": true
1754
+ },
1755
+ "219": {
1756
+ "content": "<extra_id_213>",
1757
+ "lstrip": false,
1758
+ "normalized": false,
1759
+ "rstrip": false,
1760
+ "single_word": false,
1761
+ "special": true
1762
+ },
1763
+ "220": {
1764
+ "content": "<extra_id_214>",
1765
+ "lstrip": false,
1766
+ "normalized": false,
1767
+ "rstrip": false,
1768
+ "single_word": false,
1769
+ "special": true
1770
+ },
1771
+ "221": {
1772
+ "content": "<extra_id_215>",
1773
+ "lstrip": false,
1774
+ "normalized": false,
1775
+ "rstrip": false,
1776
+ "single_word": false,
1777
+ "special": true
1778
+ },
1779
+ "222": {
1780
+ "content": "<extra_id_216>",
1781
+ "lstrip": false,
1782
+ "normalized": false,
1783
+ "rstrip": false,
1784
+ "single_word": false,
1785
+ "special": true
1786
+ },
1787
+ "223": {
1788
+ "content": "<extra_id_217>",
1789
+ "lstrip": false,
1790
+ "normalized": false,
1791
+ "rstrip": false,
1792
+ "single_word": false,
1793
+ "special": true
1794
+ },
1795
+ "224": {
1796
+ "content": "<extra_id_218>",
1797
+ "lstrip": false,
1798
+ "normalized": false,
1799
+ "rstrip": false,
1800
+ "single_word": false,
1801
+ "special": true
1802
+ },
1803
+ "225": {
1804
+ "content": "<extra_id_219>",
1805
+ "lstrip": false,
1806
+ "normalized": false,
1807
+ "rstrip": false,
1808
+ "single_word": false,
1809
+ "special": true
1810
+ },
1811
+ "226": {
1812
+ "content": "<extra_id_220>",
1813
+ "lstrip": false,
1814
+ "normalized": false,
1815
+ "rstrip": false,
1816
+ "single_word": false,
1817
+ "special": true
1818
+ },
1819
+ "227": {
1820
+ "content": "<extra_id_221>",
1821
+ "lstrip": false,
1822
+ "normalized": false,
1823
+ "rstrip": false,
1824
+ "single_word": false,
1825
+ "special": true
1826
+ },
1827
+ "228": {
1828
+ "content": "<extra_id_222>",
1829
+ "lstrip": false,
1830
+ "normalized": false,
1831
+ "rstrip": false,
1832
+ "single_word": false,
1833
+ "special": true
1834
+ },
1835
+ "229": {
1836
+ "content": "<extra_id_223>",
1837
+ "lstrip": false,
1838
+ "normalized": false,
1839
+ "rstrip": false,
1840
+ "single_word": false,
1841
+ "special": true
1842
+ },
1843
+ "230": {
1844
+ "content": "<extra_id_224>",
1845
+ "lstrip": false,
1846
+ "normalized": false,
1847
+ "rstrip": false,
1848
+ "single_word": false,
1849
+ "special": true
1850
+ },
1851
+ "231": {
1852
+ "content": "<extra_id_225>",
1853
+ "lstrip": false,
1854
+ "normalized": false,
1855
+ "rstrip": false,
1856
+ "single_word": false,
1857
+ "special": true
1858
+ },
1859
+ "232": {
1860
+ "content": "<extra_id_226>",
1861
+ "lstrip": false,
1862
+ "normalized": false,
1863
+ "rstrip": false,
1864
+ "single_word": false,
1865
+ "special": true
1866
+ },
1867
+ "233": {
1868
+ "content": "<extra_id_227>",
1869
+ "lstrip": false,
1870
+ "normalized": false,
1871
+ "rstrip": false,
1872
+ "single_word": false,
1873
+ "special": true
1874
+ },
1875
+ "234": {
1876
+ "content": "<extra_id_228>",
1877
+ "lstrip": false,
1878
+ "normalized": false,
1879
+ "rstrip": false,
1880
+ "single_word": false,
1881
+ "special": true
1882
+ },
1883
+ "235": {
1884
+ "content": "<extra_id_229>",
1885
+ "lstrip": false,
1886
+ "normalized": false,
1887
+ "rstrip": false,
1888
+ "single_word": false,
1889
+ "special": true
1890
+ },
1891
+ "236": {
1892
+ "content": "<extra_id_230>",
1893
+ "lstrip": false,
1894
+ "normalized": false,
1895
+ "rstrip": false,
1896
+ "single_word": false,
1897
+ "special": true
1898
+ },
1899
+ "237": {
1900
+ "content": "<extra_id_231>",
1901
+ "lstrip": false,
1902
+ "normalized": false,
1903
+ "rstrip": false,
1904
+ "single_word": false,
1905
+ "special": true
1906
+ },
1907
+ "238": {
1908
+ "content": "<extra_id_232>",
1909
+ "lstrip": false,
1910
+ "normalized": false,
1911
+ "rstrip": false,
1912
+ "single_word": false,
1913
+ "special": true
1914
+ },
1915
+ "239": {
1916
+ "content": "<extra_id_233>",
1917
+ "lstrip": false,
1918
+ "normalized": false,
1919
+ "rstrip": false,
1920
+ "single_word": false,
1921
+ "special": true
1922
+ },
1923
+ "240": {
1924
+ "content": "<extra_id_234>",
1925
+ "lstrip": false,
1926
+ "normalized": false,
1927
+ "rstrip": false,
1928
+ "single_word": false,
1929
+ "special": true
1930
+ },
1931
+ "241": {
1932
+ "content": "<extra_id_235>",
1933
+ "lstrip": false,
1934
+ "normalized": false,
1935
+ "rstrip": false,
1936
+ "single_word": false,
1937
+ "special": true
1938
+ },
1939
+ "242": {
1940
+ "content": "<extra_id_236>",
1941
+ "lstrip": false,
1942
+ "normalized": false,
1943
+ "rstrip": false,
1944
+ "single_word": false,
1945
+ "special": true
1946
+ },
1947
+ "243": {
1948
+ "content": "<extra_id_237>",
1949
+ "lstrip": false,
1950
+ "normalized": false,
1951
+ "rstrip": false,
1952
+ "single_word": false,
1953
+ "special": true
1954
+ },
1955
+ "244": {
1956
+ "content": "<extra_id_238>",
1957
+ "lstrip": false,
1958
+ "normalized": false,
1959
+ "rstrip": false,
1960
+ "single_word": false,
1961
+ "special": true
1962
+ },
1963
+ "245": {
1964
+ "content": "<extra_id_239>",
1965
+ "lstrip": false,
1966
+ "normalized": false,
1967
+ "rstrip": false,
1968
+ "single_word": false,
1969
+ "special": true
1970
+ },
1971
+ "246": {
1972
+ "content": "<extra_id_240>",
1973
+ "lstrip": false,
1974
+ "normalized": false,
1975
+ "rstrip": false,
1976
+ "single_word": false,
1977
+ "special": true
1978
+ },
1979
+ "247": {
1980
+ "content": "<extra_id_241>",
1981
+ "lstrip": false,
1982
+ "normalized": false,
1983
+ "rstrip": false,
1984
+ "single_word": false,
1985
+ "special": true
1986
+ },
1987
+ "248": {
1988
+ "content": "<extra_id_242>",
1989
+ "lstrip": false,
1990
+ "normalized": false,
1991
+ "rstrip": false,
1992
+ "single_word": false,
1993
+ "special": true
1994
+ },
1995
+ "249": {
1996
+ "content": "<extra_id_243>",
1997
+ "lstrip": false,
1998
+ "normalized": false,
1999
+ "rstrip": false,
2000
+ "single_word": false,
2001
+ "special": true
2002
+ },
2003
+ "250": {
2004
+ "content": "<extra_id_244>",
2005
+ "lstrip": false,
2006
+ "normalized": false,
2007
+ "rstrip": false,
2008
+ "single_word": false,
2009
+ "special": true
2010
+ },
2011
+ "251": {
2012
+ "content": "<extra_id_245>",
2013
+ "lstrip": false,
2014
+ "normalized": false,
2015
+ "rstrip": false,
2016
+ "single_word": false,
2017
+ "special": true
2018
+ },
2019
+ "252": {
2020
+ "content": "<extra_id_246>",
2021
+ "lstrip": false,
2022
+ "normalized": false,
2023
+ "rstrip": false,
2024
+ "single_word": false,
2025
+ "special": true
2026
+ },
2027
+ "253": {
2028
+ "content": "<extra_id_247>",
2029
+ "lstrip": false,
2030
+ "normalized": false,
2031
+ "rstrip": false,
2032
+ "single_word": false,
2033
+ "special": true
2034
+ },
2035
+ "254": {
2036
+ "content": "<extra_id_248>",
2037
+ "lstrip": false,
2038
+ "normalized": false,
2039
+ "rstrip": false,
2040
+ "single_word": false,
2041
+ "special": true
2042
+ },
2043
+ "255": {
2044
+ "content": "<extra_id_249>",
2045
+ "lstrip": false,
2046
+ "normalized": false,
2047
+ "rstrip": false,
2048
+ "single_word": false,
2049
+ "special": true
2050
+ },
2051
+ "256": {
2052
+ "content": "<extra_id_250>",
2053
+ "lstrip": false,
2054
+ "normalized": false,
2055
+ "rstrip": false,
2056
+ "single_word": false,
2057
+ "special": true
2058
+ },
2059
+ "257": {
2060
+ "content": "<extra_id_251>",
2061
+ "lstrip": false,
2062
+ "normalized": false,
2063
+ "rstrip": false,
2064
+ "single_word": false,
2065
+ "special": true
2066
+ },
2067
+ "258": {
2068
+ "content": "<extra_id_252>",
2069
+ "lstrip": false,
2070
+ "normalized": false,
2071
+ "rstrip": false,
2072
+ "single_word": false,
2073
+ "special": true
2074
+ },
2075
+ "259": {
2076
+ "content": "<extra_id_253>",
2077
+ "lstrip": false,
2078
+ "normalized": false,
2079
+ "rstrip": false,
2080
+ "single_word": false,
2081
+ "special": true
2082
+ },
2083
+ "260": {
2084
+ "content": "<extra_id_254>",
2085
+ "lstrip": false,
2086
+ "normalized": false,
2087
+ "rstrip": false,
2088
+ "single_word": false,
2089
+ "special": true
2090
+ },
2091
+ "261": {
2092
+ "content": "<extra_id_255>",
2093
+ "lstrip": false,
2094
+ "normalized": false,
2095
+ "rstrip": false,
2096
+ "single_word": false,
2097
+ "special": true
2098
+ }
2099
+ },
2100
+ "additional_special_tokens": [
2101
+ "<extra_id_0>",
2102
+ "<extra_id_1>",
2103
+ "<extra_id_2>",
2104
+ "<extra_id_3>",
2105
+ "<extra_id_4>",
2106
+ "<extra_id_5>",
2107
+ "<extra_id_6>",
2108
+ "<extra_id_7>",
2109
+ "<extra_id_8>",
2110
+ "<extra_id_9>",
2111
+ "<extra_id_10>",
2112
+ "<extra_id_11>",
2113
+ "<extra_id_12>",
2114
+ "<extra_id_13>",
2115
+ "<extra_id_14>",
2116
+ "<extra_id_15>",
2117
+ "<extra_id_16>",
2118
+ "<extra_id_17>",
2119
+ "<extra_id_18>",
2120
+ "<extra_id_19>",
2121
+ "<extra_id_20>",
2122
+ "<extra_id_21>",
2123
+ "<extra_id_22>",
2124
+ "<extra_id_23>",
2125
+ "<extra_id_24>",
2126
+ "<extra_id_25>",
2127
+ "<extra_id_26>",
2128
+ "<extra_id_27>",
2129
+ "<extra_id_28>",
2130
+ "<extra_id_29>",
2131
+ "<extra_id_30>",
2132
+ "<extra_id_31>",
2133
+ "<extra_id_32>",
2134
+ "<extra_id_33>",
2135
+ "<extra_id_34>",
2136
+ "<extra_id_35>",
2137
+ "<extra_id_36>",
2138
+ "<extra_id_37>",
2139
+ "<extra_id_38>",
2140
+ "<extra_id_39>",
2141
+ "<extra_id_40>",
2142
+ "<extra_id_41>",
2143
+ "<extra_id_42>",
2144
+ "<extra_id_43>",
2145
+ "<extra_id_44>",
2146
+ "<extra_id_45>",
2147
+ "<extra_id_46>",
2148
+ "<extra_id_47>",
2149
+ "<extra_id_48>",
2150
+ "<extra_id_49>",
2151
+ "<extra_id_50>",
2152
+ "<extra_id_51>",
2153
+ "<extra_id_52>",
2154
+ "<extra_id_53>",
2155
+ "<extra_id_54>",
2156
+ "<extra_id_55>",
2157
+ "<extra_id_56>",
2158
+ "<extra_id_57>",
2159
+ "<extra_id_58>",
2160
+ "<extra_id_59>",
2161
+ "<extra_id_60>",
2162
+ "<extra_id_61>",
2163
+ "<extra_id_62>",
2164
+ "<extra_id_63>",
2165
+ "<extra_id_64>",
2166
+ "<extra_id_65>",
2167
+ "<extra_id_66>",
2168
+ "<extra_id_67>",
2169
+ "<extra_id_68>",
2170
+ "<extra_id_69>",
2171
+ "<extra_id_70>",
2172
+ "<extra_id_71>",
2173
+ "<extra_id_72>",
2174
+ "<extra_id_73>",
2175
+ "<extra_id_74>",
2176
+ "<extra_id_75>",
2177
+ "<extra_id_76>",
2178
+ "<extra_id_77>",
2179
+ "<extra_id_78>",
2180
+ "<extra_id_79>",
2181
+ "<extra_id_80>",
2182
+ "<extra_id_81>",
2183
+ "<extra_id_82>",
2184
+ "<extra_id_83>",
2185
+ "<extra_id_84>",
2186
+ "<extra_id_85>",
2187
+ "<extra_id_86>",
2188
+ "<extra_id_87>",
2189
+ "<extra_id_88>",
2190
+ "<extra_id_89>",
2191
+ "<extra_id_90>",
2192
+ "<extra_id_91>",
2193
+ "<extra_id_92>",
2194
+ "<extra_id_93>",
2195
+ "<extra_id_94>",
2196
+ "<extra_id_95>",
2197
+ "<extra_id_96>",
2198
+ "<extra_id_97>",
2199
+ "<extra_id_98>",
2200
+ "<extra_id_99>",
2201
+ "<extra_id_100>",
2202
+ "<extra_id_101>",
2203
+ "<extra_id_102>",
2204
+ "<extra_id_103>",
2205
+ "<extra_id_104>",
2206
+ "<extra_id_105>",
2207
+ "<extra_id_106>",
2208
+ "<extra_id_107>",
2209
+ "<extra_id_108>",
2210
+ "<extra_id_109>",
2211
+ "<extra_id_110>",
2212
+ "<extra_id_111>",
2213
+ "<extra_id_112>",
2214
+ "<extra_id_113>",
2215
+ "<extra_id_114>",
2216
+ "<extra_id_115>",
2217
+ "<extra_id_116>",
2218
+ "<extra_id_117>",
2219
+ "<extra_id_118>",
2220
+ "<extra_id_119>",
2221
+ "<extra_id_120>",
2222
+ "<extra_id_121>",
2223
+ "<extra_id_122>",
2224
+ "<extra_id_123>",
2225
+ "<extra_id_124>",
2226
+ "<extra_id_125>",
2227
+ "<extra_id_126>",
2228
+ "<extra_id_127>",
2229
+ "<extra_id_128>",
2230
+ "<extra_id_129>",
2231
+ "<extra_id_130>",
2232
+ "<extra_id_131>",
2233
+ "<extra_id_132>",
2234
+ "<extra_id_133>",
2235
+ "<extra_id_134>",
2236
+ "<extra_id_135>",
2237
+ "<extra_id_136>",
2238
+ "<extra_id_137>",
2239
+ "<extra_id_138>",
2240
+ "<extra_id_139>",
2241
+ "<extra_id_140>",
2242
+ "<extra_id_141>",
2243
+ "<extra_id_142>",
2244
+ "<extra_id_143>",
2245
+ "<extra_id_144>",
2246
+ "<extra_id_145>",
2247
+ "<extra_id_146>",
2248
+ "<extra_id_147>",
2249
+ "<extra_id_148>",
2250
+ "<extra_id_149>",
2251
+ "<extra_id_150>",
2252
+ "<extra_id_151>",
2253
+ "<extra_id_152>",
2254
+ "<extra_id_153>",
2255
+ "<extra_id_154>",
2256
+ "<extra_id_155>",
2257
+ "<extra_id_156>",
2258
+ "<extra_id_157>",
2259
+ "<extra_id_158>",
2260
+ "<extra_id_159>",
2261
+ "<extra_id_160>",
2262
+ "<extra_id_161>",
2263
+ "<extra_id_162>",
2264
+ "<extra_id_163>",
2265
+ "<extra_id_164>",
2266
+ "<extra_id_165>",
2267
+ "<extra_id_166>",
2268
+ "<extra_id_167>",
2269
+ "<extra_id_168>",
2270
+ "<extra_id_169>",
2271
+ "<extra_id_170>",
2272
+ "<extra_id_171>",
2273
+ "<extra_id_172>",
2274
+ "<extra_id_173>",
2275
+ "<extra_id_174>",
2276
+ "<extra_id_175>",
2277
+ "<extra_id_176>",
2278
+ "<extra_id_177>",
2279
+ "<extra_id_178>",
2280
+ "<extra_id_179>",
2281
+ "<extra_id_180>",
2282
+ "<extra_id_181>",
2283
+ "<extra_id_182>",
2284
+ "<extra_id_183>",
2285
+ "<extra_id_184>",
2286
+ "<extra_id_185>",
2287
+ "<extra_id_186>",
2288
+ "<extra_id_187>",
2289
+ "<extra_id_188>",
2290
+ "<extra_id_189>",
2291
+ "<extra_id_190>",
2292
+ "<extra_id_191>",
2293
+ "<extra_id_192>",
2294
+ "<extra_id_193>",
2295
+ "<extra_id_194>",
2296
+ "<extra_id_195>",
2297
+ "<extra_id_196>",
2298
+ "<extra_id_197>",
2299
+ "<extra_id_198>",
2300
+ "<extra_id_199>",
2301
+ "<extra_id_200>",
2302
+ "<extra_id_201>",
2303
+ "<extra_id_202>",
2304
+ "<extra_id_203>",
2305
+ "<extra_id_204>",
2306
+ "<extra_id_205>",
2307
+ "<extra_id_206>",
2308
+ "<extra_id_207>",
2309
+ "<extra_id_208>",
2310
+ "<extra_id_209>",
2311
+ "<extra_id_210>",
2312
+ "<extra_id_211>",
2313
+ "<extra_id_212>",
2314
+ "<extra_id_213>",
2315
+ "<extra_id_214>",
2316
+ "<extra_id_215>",
2317
+ "<extra_id_216>",
2318
+ "<extra_id_217>",
2319
+ "<extra_id_218>",
2320
+ "<extra_id_219>",
2321
+ "<extra_id_220>",
2322
+ "<extra_id_221>",
2323
+ "<extra_id_222>",
2324
+ "<extra_id_223>",
2325
+ "<extra_id_224>",
2326
+ "<extra_id_225>",
2327
+ "<extra_id_226>",
2328
+ "<extra_id_227>",
2329
+ "<extra_id_228>",
2330
+ "<extra_id_229>",
2331
+ "<extra_id_230>",
2332
+ "<extra_id_231>",
2333
+ "<extra_id_232>",
2334
+ "<extra_id_233>",
2335
+ "<extra_id_234>",
2336
+ "<extra_id_235>",
2337
+ "<extra_id_236>",
2338
+ "<extra_id_237>",
2339
+ "<extra_id_238>",
2340
+ "<extra_id_239>",
2341
+ "<extra_id_240>",
2342
+ "<extra_id_241>",
2343
+ "<extra_id_242>",
2344
+ "<extra_id_243>",
2345
+ "<extra_id_244>",
2346
+ "<extra_id_245>",
2347
+ "<extra_id_246>",
2348
+ "<extra_id_247>",
2349
+ "<extra_id_248>",
2350
+ "<extra_id_249>",
2351
+ "<extra_id_250>",
2352
+ "<extra_id_251>",
2353
+ "<extra_id_252>",
2354
+ "<extra_id_253>",
2355
+ "<extra_id_254>",
2356
+ "<extra_id_255>"
2357
+ ],
+ "clean_up_tokenization_spaces": true,
+ "cls_token": "<cls>",
+ "eos_token": "</s>",
+ "mask_token": "<mask>",
+ "model_max_length": 1024,
+ "pad_token": "<pad>",
+ "sep_token": "<sep>",
+ "tokenizer_class": "PreTrainedTokenizerFast",
+ "unk_token": "<unk>"
+ }
trainer_state.json ADDED
The diff for this file is too large to render. See raw diff
 
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:48108a519b2d546b6b73832c5c3752b2c0920e3ce76ab654621c6c98f2de2ef0
+ size 5240