|
--- |
|
license: mit |
|
datasets: |
|
- bentrevett/multi30k |
|
language: |
|
- en |
|
library_name: transformers |
|
pipeline_tag: translation |
|
--- |
|
|
|
The translator app: |
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b2665fee3f66b2b0f7b765/nVheCVJjZiCK3cvof6x84.png) |
|
|
|
# Model Name |
|
German to English Translator |
|
|
|
# Model Description |
|
This model translates German text into English. It was trained as a sequence-to-sequence Transformer (Seq2SeqTransformer).
|
|
|
|
|
- **Developed by:** Neelima Monjusha Preeti |
|
- **Model type:** Seq2SeqTransformer |
|
- **Language(s):** German (source), English (target)
|
- **License:** MIT |
|
- **Contact:** [email protected] |
|
|
|
# Task Description |
|
This app translates German sentences into English. The input text is tokenized and passed through the encoder and decoder of a Seq2SeqTransformer trained on the Multi30k dataset.

The output is the translated English text.
|
|
|
# Data Processing |
|
|
|
First the source and target languages are defined and tokenization is set up. Tokenizers for German and English are initialized using spaCy.

The get_tokenizer function from torchtext is used to obtain a spaCy tokenizer for each language.
|
A function yield_tokens is defined to tokenize sentences from the data iterator for both source and target languages. |
|
Special symbols and indices: |
|
|
|
Special indices are defined for unknown words (UNK_IDX), padding (PAD_IDX), beginning of sequence (BOS_IDX), and end of sequence (EOS_IDX). |
|
Special symbols are defined as `['<unk>', '<pad>', '<bos>', '<eos>']`.
|
|
|
Then the vocabulary is built. For each language (source and target), the code iterates over the training data and builds a vocabulary using the build_vocab_from_iterator function.
|
It uses the tokenization function defined earlier to tokenize the data. |
|
The vocabulary is built with a minimum frequency of 1 (including all tokens) and special symbols are added first. |
|
For each language's vocabulary, the default index for unknown tokens (UNK_IDX) is set. |
|
|
|
```python
from typing import Iterable, List

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import Multi30k

SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

token_transform = {}
vocab_transform = {}

token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')


def yield_tokens(data_iter: Iterable, language: str) -> List[str]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])


# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data iterator
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))

    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=1,
                                                    specials=special_symbols,
                                                    special_first=True)

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Tokens not found in the vocabulary are mapped to <unk>
    vocab_transform[ln].set_default_index(UNK_IDX)
```
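
As a quick, purely illustrative sanity check (the sentence below is hypothetical, not taken from the dataset), the tokenizer and vocabulary can be composed like this:

```python
sample = "Ein kleines Mädchen klettert auf einen Baum."

tokens = token_transform[SRC_LANGUAGE](sample)   # spaCy tokenization
ids = vocab_transform[SRC_LANGUAGE](tokens)      # token -> index lookup (unknowns map to UNK_IDX)

print(tokens)
print(ids)
```
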
|
# Model Architecture |
|
|
|
For machine translation I used Seq2SeqTransformer. |
|
class PositionalEncoding(nn.Module) adds positional encodings to the token embeddings, while class TokenEmbedding(nn.Module) converts token indices into dense embeddings using an embedding layer; both are sketched after the parameter list below.
|
The parameters defined and initialized for the model are: |
|
|
|
- **num_encoder_layers**: number of layers in the encoder stack -- 3.

- **num_decoder_layers**: number of layers in the decoder stack -- 3.

- **emb_size**: dimensionality of the token embeddings -- 512.

- **nhead**: number of attention heads in the multi-head attention mechanism -- 512.

- **src_vocab_size**: vocabulary size of the source language.

- **tgt_vocab_size**: vocabulary size of the target language.

- **dim_feedforward**: dimensionality of the feedforward network (default 512).

- **dropout**: dropout probability (default 0.1).
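
Neither helper module is reproduced in this card, so the snippet below is only a minimal sketch, assuming the standard sinusoidal positional-encoding formulation and a scaled nn.Embedding lookup:

```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal positional encodings to the token embeddings."""
    def __init__(self, emb_size: int, dropout: float, maxlen: int = 5000):
        super().__init__()
        den = torch.exp(-torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: torch.Tensor):
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])


class TokenEmbedding(nn.Module):
    """Converts token indices into dense embeddings, scaled by sqrt(emb_size)."""
    def __init__(self, vocab_size: int, emb_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: torch.Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
```
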
|
|
|
The loss function and optimizer are defined as follows:
|
|
|
```python
|
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX) |
|
optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9) |
|
``` |
|
The input batches are then passed through the model's encoder and decoder layers.
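
The Seq2SeqTransformer class itself lives in germantoenglish.py and is not shown in this card; a minimal sketch wrapping PyTorch's nn.Transformer (an assumption about the implementation, reusing the TokenEmbedding and PositionalEncoding sketches above) would look like this:

```python
class Seq2SeqTransformer(nn.Module):
    def __init__(self, num_encoder_layers, num_decoder_layers, emb_size, nhead,
                 src_vocab_size, tgt_vocab_size, dim_feedforward=512, dropout=0.1):
        super().__init__()
        self.transformer = nn.Transformer(d_model=emb_size,
                                          nhead=nhead,
                                          num_encoder_layers=num_encoder_layers,
                                          num_decoder_layers=num_decoder_layers,
                                          dim_feedforward=dim_feedforward,
                                          dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)  # projects to the target vocabulary
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(emb_size, dropout=dropout)

    def forward(self, src, trg, src_mask, tgt_mask,
                src_padding_mask, tgt_padding_mask, memory_key_padding_mask):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)

    def encode(self, src, src_mask):
        return self.transformer.encoder(self.positional_encoding(self.src_tok_emb(src)), src_mask)

    def decode(self, tgt, memory, tgt_mask):
        return self.transformer.decoder(self.positional_encoding(self.tgt_tok_emb(tgt)), memory, tgt_mask)
```
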
|
|
|
The helper functions and the text transformation dictionary are:
|
|
|
```python
|
sequential_transforms(*transforms) |
|
tensor_transform(token_ids: List[int]) |
|
collate_fn(batch) |
|
text_transform = {} |
|
``` |
|
|
|
These utility functions and transformations handle the preprocessing of text data, including tokenization, numericalization, adding special tokens, and collating samples into batch tensors suitable for training a sequence-to-sequence transformer model. |
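
These helpers are only listed by name above; a minimal sketch of how they are typically implemented in this setup (an assumption, not the exact code from germantoenglish.py) is:

```python
from torch.nn.utils.rnn import pad_sequence


def sequential_transforms(*transforms):
    """Chains several transforms (tokenization -> numericalization -> tensor conversion)."""
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func


def tensor_transform(token_ids: List[int]):
    """Adds BOS/EOS indices and converts the id list to a tensor."""
    return torch.cat((torch.tensor([BOS_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([EOS_IDX])))


# Per-language text transform: tokenize, numericalize, add special tokens
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln],   # tokenization
                                               vocab_transform[ln],   # numericalization
                                               tensor_transform)      # BOS/EOS + tensor


def collate_fn(batch):
    """Collates raw (src, tgt) string pairs into padded batch tensors."""
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))

    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch
```
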
|
|
|
The Seq2SeqTransformer model is then trained, and evaluated with the evaluate(model) function.
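
The training loop is not reproduced in this card; the sketch below is an assumption about its shape, including a hypothetical create_mask helper for the causal and padding masks. evaluate(model) follows the same pattern on the validation split, with model.eval() and without the backward pass.

```python
from torch.utils.data import DataLoader

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 128  # hypothetical batch size


def create_mask(src, tgt):
    """Builds the causal target mask and the padding masks for one batch."""
    src_seq_len, tgt_seq_len = src.shape[0], tgt.shape[0]

    tgt_mask = torch.triu(torch.ones(tgt_seq_len, tgt_seq_len, device=DEVICE), diagonal=1).bool()
    src_mask = torch.zeros((src_seq_len, src_seq_len), device=DEVICE).type(torch.bool)

    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask


def train_epoch(model, optimizer):
    """Runs one pass over the Multi30k training split and returns the mean loss."""
    model.train()
    losses, num_batches = 0.0, 0

    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in train_dataloader:
        src, tgt = src.to(DEVICE), tgt.to(DEVICE)
        tgt_input = tgt[:-1, :]   # decoder input: target shifted right (teacher forcing)

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
        logits = model(src, tgt_input, src_mask, tgt_mask,
                       src_padding_mask, tgt_padding_mask, src_padding_mask)

        optimizer.zero_grad()
        tgt_out = tgt[1:, :]      # training objective: predict the next token
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()
        optimizer.step()

        losses += loss.item()
        num_batches += 1

    return losses / num_batches
```
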
|
|
|
# Result Analysis |
|
greedy_decode() takes the following parameters:

- **model**: the sequence-to-sequence transformer model.

- **src**: the source sequence tensor.

- **src_mask**: the mask for the source sequence.

- **max_len**: the maximum length of the output sequence.

- **start_symbol**: the index of the start symbol in the target vocabulary.

It returns the generated target sequence tensor ys, which contains the complete translation.
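
The decoder itself is defined in germantoenglish.py; a minimal greedy-decoding sketch in the same style (an assumption, not the exact code) looks like this:

```python
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    """Generates the target sequence token by token, always picking the most likely word."""
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)

    for _ in range(max_len - 1):
        memory = memory.to(DEVICE)
        # Causal mask over the tokens generated so far
        tgt_mask = torch.triu(torch.ones(ys.size(0), ys.size(0), device=DEVICE), diagonal=1).bool()

        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys
```
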
|
|
|
## Test input: |
|
|
|
The function for translating German to English is translate().
|
```python
def translate(src_sentence: str):
    model = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE, NHEAD,
                               SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

    model.load_state_dict(torch.load('./transformer_model.pth'))
    model.to(DEVICE)
    model.eval()

    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")
```
|
This function first loads the saved model weights, then applies the text transform to the source sentence, runs greedy_decode to generate the translation, and finally returns the English output with the special tokens removed.
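
A hypothetical call, purely for illustration (the sentence is not a logged example from the app):

```python
print(translate("Eine Gruppe von Menschen steht vor einem Iglu."))
```
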
|
|
|
# Hugging Face Interface: |
|
|
|
To create the interface, gradio and torch are imported, along with the Seq2SeqTransformer class and the translate and greedy_decode functions from the germantoenglish.py file.
|
```python
|
import gradio as gr |
|
import torch |
|
from germantoenglish import Seq2SeqTransformer, translate, greedy_decode |
|
``` |
|
The app takes a German sentence as input and shows the translated English text as output.
|
```python
if __name__ == "__main__":
    iface = gr.Interface(
        fn=translate,
        inputs=[
            gr.components.Textbox(label="Text")
        ],
        outputs=["text"],
        cache_examples=False,
        title="GermanToEnglish",
    )
    iface.launch(share=True)
```
|
The app interface looks like this: |
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b2665fee3f66b2b0f7b765/J_Q4eqXiN7cNuhOM3NbjR.png) |
|
|
|
# Project Structure |
|
```bash |
|
|--- Readme.md
|
|--- germantoenglish.py - the full code for data processing, training and evaluation
|
|--- app.py - creates the app interface
|
|--- Modeltensors - tensor file needed for loading the app
|
|--- requirements.txt - necessary packages and dataset that need to be downloaded for the app to work
|
|--- translate_model.pth - the model file that is loaded by the app
|
|
|
``` |
|
|
|
# How to Run |
|
|
|
```bash |
|
|
|
git clone https://huggingface.co/spaces/neelimapreeti297/GermanToEnglish |
|
|
|
cd GermanToEnglish |
|
|
|
pip install -r requirements.txt |
|
|
|
python app.py |
|
``` |
|
|
|
|
|
# License |
|
This project is licensed under the MIT License. |
|
|
|
# Contributor |
|
Neelima Monjusha Preeti - [email protected] |
|
|
|
App link: https://huggingface.co/spaces/neelimapreeti297/GermanToEnglish |
|
|
|
|