InLegalTrans
This is the model card of the InLegalTrans-En2Indic-1B translation model, a fine-tuned version of the IndicTrans2 model tailored specifically for translating Indian legal text from English to Indian languages.
Training Data
We use the MILPaC (Multilingual Indian Legal Parallel Corpus) corpus for fine-tuning. It is the first high-quality Indian legal parallel corpus, containing parallel aligned text units in English (EN) and nine Indian (IN) languages -- Bengali (BN), Hindi (HI), Marathi (MR), Tamil (TA), Telugu (TE), Malayalam (ML), Panjabi (PA), Gujarati (GU), and Oriya (OR). Please refer to the paper for more details about this corpus.
For fine-tuning, we randomly split MILPaC language-wise in an 80% (train) - 10% (validation) - 10% (test) ratio. We use the 80% train split (the train portions of all English-to-Indic language pairs, combined) to fine-tune the IndicTrans2 model, and the 10% validation split (the validation portions of all language pairs, combined) to select the best checkpoint and to prevent overfitting.
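The exact splitting script is not released with this model card; the following is a minimal illustrative sketch of a language-wise 80-10-10 split (the function name, seed, and data layout are our own assumptions, not part of the original pipeline):

import random
from typing import List, Tuple

def split_language_wise(pairs: List[Tuple[str, str]], seed: int = 42):
    # `pairs` holds the aligned (English, Indic) text units of ONE language pair.
    # The seed and the 80/10/10 boundaries below only illustrate the idea;
    # the model card does not state the actual script or seed used.
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = int(0.8 * len(pairs))
    n_valid = int(0.1 * len(pairs))
    return (
        pairs[:n_train],                   # train split of this language pair
        pairs[n_train:n_train + n_valid],  # validation split
        pairs[n_train + n_valid:],         # test split
    )

The per-language train splits are then pooled into a single fine-tuning set, and the per-language validation splits into a single validation set, as described above.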
Model Overview and Usage Instructions
This InLegalTrans model uses the same tokenizer as the IndicTrans2 model and has the same architecture with ~1.12B parameters.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor # Install IndicTransToolkit from https://github.com/VarunGumma/IndicTransToolkit
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
src_lang, tgt_lang = "eng_Latn", "ben_Beng" # BCP-47-style language codes as used by the FLORES-200 dataset; codes for the other MILPaC target languages are listed after this example
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-1B", trust_remote_code=True) # Use the IndicTrans2 tokenizer so that its custom tokenization code is loaded
# Load the fine-tuned InLegalTrans model
model = AutoModelForSeq2SeqLM.from_pretrained(
    "law-ai/InLegalTrans-En2Indic-1B",
    trust_remote_code=True,
    attn_implementation="eager",
    low_cpu_mem_usage=True,
).to(device)
ip = IndicProcessor(inference=True)
input_sentences = [
    "(7) Any such allowance for the maintenance and expenses for proceeding shall be payable from the date of the order, or, if so ordered, from the date of the application for maintenance or expenses of proceeding, as the case may be.",
    "(2) Where it appears to the Tribunal that, in consequence of any decision of a competent Civil Court, any order made under section 9 should be cancelled or varied, it shall cancel the order or, as the case may be, vary the same accordingly.",
]
batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)
input_text_encoding = tokenizer(
    batch,
    max_length=256,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(device)
# Generate translations with beam search
generated_tokens = model.generate(
    **input_text_encoding,
    max_length=256,
    do_sample=True,
    num_beams=4,
    num_return_sequences=1,
    early_stopping=False,
    use_cache=True,
)
# Decode the generated token ids into target-language text
with tokenizer.as_target_tokenizer():
    generated_tokens = tokenizer.batch_decode(
        generated_tokens.detach().cpu().tolist(),
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)
for input_sentence, translation in zip(input_sentences, translations):
    print(f"Sentence in {src_lang} language: {input_sentence}")
    print(f"Translated Sentence in {tgt_lang} language: {translation}")
Fine-tuning Results
The following table reports the performance of the InLegalTrans model compared to the IndicTrans2 model on the 10% test split of MILPaC, evaluated using the BLEU, GLEU, and chrF++ metrics (a minimal sketch of how such scores can be computed follows the table). For all English-to-Indic language pairs, InLegalTrans shows a substantial improvement over IndicTrans2, consistently across all three metrics.
| EN-to-IN | Model | BLEU | GLEU | chrF++ |
|---|---|---|---|---|
| EN-to-BN | IndicTrans2 | 25.4 | 28.8 | 53.7 |
| | InLegalTrans | 45.8 | 47.6 | 70.9 |
| EN-to-HI | IndicTrans2 | 41.0 | 42.5 | 59.9 |
| | InLegalTrans | 56.9 | 57.1 | 73.8 |
| EN-to-MR | IndicTrans2 | 25.2 | 28.7 | 55.4 |
| | InLegalTrans | 44.4 | 46.0 | 68.9 |
| EN-to-TA | IndicTrans2 | 32.8 | 35.3 | 62.3 |
| | InLegalTrans | 40.0 | 42.5 | 69.9 |
| EN-to-TE | IndicTrans2 | 10.7 | 14.2 | 37.9 |
| | InLegalTrans | 31.3 | 31.6 | 58.5 |
| EN-to-ML | IndicTrans2 | 21.9 | 25.8 | 52.9 |
| | InLegalTrans | 37.4 | 40.3 | 69.7 |
| EN-to-PA | IndicTrans2 | 27.8 | 31.6 | 51.5 |
| | InLegalTrans | 44.3 | 45.6 | 65.5 |
| EN-to-GU | IndicTrans2 | 27.5 | 31.1 | 55.7 |
| | InLegalTrans | 42.8 | 45.2 | 68.8 |
| EN-to-OR | IndicTrans2 | 6.6 | 12.6 | 37.1 |
| | InLegalTrans | 14.2 | 19.9 | 47.5 |
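The exact evaluation scripts are not part of this model card; a minimal sketch of how such corpus-level scores can be computed, assuming a recent sacrebleu (for BLEU and chrF++) and NLTK (for GLEU) with plain whitespace tokenization, is:

import sacrebleu
from nltk.translate.gleu_score import corpus_gleu

def score_translations(hypotheses, references):
    # hypotheses: list of system translations; references: one reference string per sentence
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    chrf_pp = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2).score  # word_order=2 gives chrF++
    gleu = 100 * corpus_gleu(
        [[ref.split()] for ref in references],  # NLTK expects tokenized references ...
        [hyp.split() for hyp in hypotheses],    # ... and tokenized hypotheses
    )
    return {"BLEU": bleu, "GLEU": gleu, "chrF++": chrf_pp}

Tokenization choices (especially for Indic scripts) can shift these scores, so numbers obtained with this sketch may not exactly match the table above.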
Citation
If you use this InLegalTrans translation model or the MILPaC corpus, please cite the following paper:
@article{mahapatra2024milpacnovelbenchmarkevaluating,
title = {MILPaC: A Novel Benchmark for Evaluating Translation of Legal Text to Indian Languages},
author = {Sayan Mahapatra and Debtanu Datta and Shubham Soni and Adrijit Goswami and Saptarshi Ghosh},
year = {2024},
journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
publisher = {Association for Computing Machinery},
}
About Us
We are a group of Natural Language Processing (NLP) researchers from the Indian Institute of Technology (IIT) Kharagpur. Our research interests are primarily ML, DL, and NLP applications for the legal domain, with a special focus on the challenges and opportunities of the Indian legal scenario. Our current and past projects include:
- Legal Statute Identification
- Semantic segmentation of legal documents
- Monolingual (e.g., English-to-English) and Cross-lingual (e.g., English-to-Hindi) Summarization of legal documents
- Translation in the Indian legal domain
- Court Judgment Prediction
- Legal Document Matching
Explore our publicly available code and datasets at: Law and AI, IIT Kharagpur.