zhiqu22 committed
Commit · 40dedde
1 Parent(s): e57eb93
updates

Browse files:
- README.md +12 -6
- modeling_mitre.py +27 -26

README.md
CHANGED

@@ -32,13 +32,13 @@ pipeline_tag: translation
# MITRE 913M

## Description
MITRE (Multilingual Translation with Registers) is a multilingual, decoder-only model designed for many-to-many translation tasks.
The technology, i.e., registering, is introduced in our [paper](url_placeholder).
This repository allows you to employ our pre-trained model for inference. If you want to reproduce the data mining and training, please refer to this [repository](url_placeholder).

The model supports direct translation across 552 directions for 24 languages spanning 5 language families.
You can use our models directly via the `transformers` library.
An alternative version of MITRE with 466M parameters is also available in this [repository](https://huggingface.co/naist-nlp/mitre_466m).

## Usages

@@ -47,7 +47,7 @@ You can simply call the tokenizer and the model by
```python
from transformers import AutoModel, AutoTokenizer

# you can switch the name to "naist-nlp/mitre_466m"
tokenizer = AutoTokenizer.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True)
```
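
A minimal setup sketch, assuming a CUDA device may be available; the `.to()` call is standard PyTorch rather than anything MITRE-specific:

```python
import torch

# Optional: move the model to a GPU if one is available.
# Inputs must then be moved to the same device before calling the model.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```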

@@ -65,6 +65,7 @@ After getting the objects of the model and the tokenizer, we can do translation.
```python
english_text = "I have a red apple."
chinese_text = "我有一个红苹果。"
model.half() # recommended
model.eval()

# Translating from one or several sentences to a sole language

@@ -83,11 +84,16 @@ print(results)
# 1. The difference between tgt_tokens and labels is that the eos_tokens are moved to the right side.
# 2. We recommend using 'tokenizer.encode_target_tokens_to_labels' instead of modifying tgt_tokens,
#    because 'tokenizer.encode_target_tokens_to_input_ids' has pads.
# 3. You can refer to our code for detailed implementation.
# tgt_tokens = tokenizer.encode_target_tokens_to_input_ids(chinese_text)
# labels = tokenizer.encode_target_tokens_to_labels(chinese_text)
```
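
To see the difference described in these comments, the two helpers can be called on the same target sentence and compared directly; this sketch only reuses the calls documented above and simply prints whatever the MITRE tokenizer returns:

```python
# Compare the two target encodings of the same sentence.
tgt_tokens = tokenizer.encode_target_tokens_to_input_ids(chinese_text)  # padded target ids
labels = tokenizer.encode_target_tokens_to_labels(chinese_text)         # eos tokens moved to the right side

print(tgt_tokens)
print(labels)
```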

## Notes
We basically follow the style of [M2M](https://huggingface.co/facebook/m2m100_418M); however, we make some necessary improvements to reduce the cost of generation.
You can refer to the code of 'generate()' in [modeling_mitre.py](https://huggingface.co/naist-nlp/mitre_466m/blob/main/modeling_mitre.py) for more details.
Moreover, we plan to implement FlashAttention V2 to further speed up our models; this will be added as soon as possible.

## Languages covered
Germanic: English (en), German (de), Dutch; Flemish (nl), Swedish (sv), Danish (da), Afrikaans (af)
Romance: French (fr), Spanish (es), Italian (it), Portuguese (pt), Romanian; Moldavian; Moldovan (ro)

modeling_mitre.py
CHANGED

@@ -74,11 +74,11 @@ class MitreSdpaAttention(nn.Module):
        attention_mask: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        """
        1. MitreModel uses MitreSdpaAttention, which is modified from M2M100SdpaAttention.
           Notably, neither of them supports 'output_attentions=True' or 'layer_head_mask is not None',
           meaning that attn_weights are not included in the output.
           Improving this feature is currently a low priority, and we leave this functionality for users to customize.
        2. We plan to enhance this code with Flash Attention v2 in the future.
        """
        bsz, tgt_len, _ = hidden_states.size()
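
The limitation noted in this docstring follows from PyTorch's `torch.nn.functional.scaled_dot_product_attention`, which computes softmax(QK^T)V in a fused kernel and never materializes the attention-probability matrix, so there are no attn_weights to return. A minimal standalone illustration (an editor's sketch, not code from this repository):

```python
import torch
import torch.nn.functional as F

# SDPA returns only the attention output; the (batch, heads, tgt_len, src_len)
# probability matrix is computed inside a fused kernel and is never exposed.
q = torch.randn(2, 8, 16, 64)  # (batch, heads, tgt_len, head_dim)
k = torch.randn(2, 8, 16, 64)
v = torch.randn(2, 8, 16, 64)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 16, 64]); only the attention output is available
```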

@@ -777,32 +777,33 @@ class MitreForConditionalGeneration(MitrePreTrainedModel, GenerationMixin):
    ):
        """
        Inference with beam search.
        This code is an improved version of transformers.generation.utils.GenerationMixin.generate.
        There are two main improvements:
        1. 'soft early_stop' in beam search.
           a) problem in the vanilla version.
              In multilingual translation models such as NLLB and M2M, the vanilla early stop in BeamSearchScorer
              (the official implementation by HuggingFace) marks ended sequences with pad(1). However, these ended
              sequences are still fed into the model, leading to significant memory waste.
           b) our improvement.
              We implement a "soft early stop" to address this issue. Instead of modifying BeamSearchScorer
              (to maintain code flexibility), we remove ended sequences from the input. Since this changes the
              shape of the output hidden states, we insert placeholders to maintain compatibility with
              BeamSearchScorer's state shapes.
              Based on our tests, this improvement reduces memory usage by half.
        2. mask reusing.
           a) problem:
              Registers require attention masks at each step.
              A sequence may consist of four parts: padding, source tokens, registers, and target tokens.
              During training, we mask all tokens before registers for target token generation. During generation,
              we cannot allow target tokens to "see" padding tokens, requiring masks at every step.
              This leads to computational inefficiency.
           b) our improvement.
              First, we truncate the source tokens and their representations to reduce cost.
              Second, for source tokens acting as placeholders, we modify the mask generation logic compared to
              our Fairseq implementation.
              Third, to avoid regenerating masks at each step, we cache the mask in 'registering_cache', where the
              cached mask is managed like the key-value cache in beam search. Then, at every step, we add a column
              of zeros to maintain alignment.
        """
        if generation_config != None:
            assert type(generation_config) is GenerationConfig
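
To make the two improvements described in the docstring above easier to picture, here is a small self-contained sketch of the ideas with hypothetical names and shapes (an editor's illustration, not the repository's actual `generate()` implementation):

```python
import torch

num_beams, hidden_size = 8, 16
hidden_states = torch.randn(num_beams, hidden_size)
finished = torch.tensor([False, True, False, False, True, False, False, False])

# 1) "Soft early stop": feed only unfinished beams through the model, then
#    re-insert placeholders so beam-search bookkeeping still sees the full beam dimension.
active_idx = (~finished).nonzero(as_tuple=True)[0]
active_out = hidden_states[active_idx] * 2.0           # stand-in for the decoder forward pass
full_out = hidden_states.new_zeros(num_beams, hidden_size)
full_out[active_idx] = active_out                       # zeros act as placeholders for ended beams

# 2) "Mask reusing": cache the mask over [padding | source | registers] once,
#    then append a zero column per generated target token instead of rebuilding the mask.
cached_mask = torch.zeros(num_beams, 1, 5)              # cached at the first decoding step
new_column = cached_mask.new_zeros(num_beams, 1, 1)
cached_mask = torch.cat([cached_mask, new_column], dim=-1)

print(full_out.shape, cached_mask.shape)                # torch.Size([8, 16]) torch.Size([8, 1, 6])
```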