InLegalTrans
This is the model card of the InLegalTrans-En2Indic-1B translation model, a fine-tuned version of the IndicTrans2 model tailored specifically for translating Indian legal text from English to Indian languages.
Training Data
We use the MILPaC (Multilingual Indian Legal Parallel Corpus) corpus for fine-tuning. It is the first high-quality Indian legal parallel corpus, containing parallel aligned text units in English (EN) and nine Indian (IN) languages -- Bengali (BN), Hindi (HI), Marathi (MR), Tamil (TA), Telugu (TE), Malayalam (ML), Panjabi (PA), Gujarati (GU), and Oriya (OR). Please refer to the paper for more details about this corpus.
For fine-tuning, we randomly split MILPaC language-wise in an 80% (train) - 10% (validation) - 10% (test) ratio. We use the 80% train split (the train portions of all English-to-Indic language pairs, combined) to fine-tune the IndicTrans2 model, and the 10% validation split (the validation portions of all language pairs, combined) to select the best checkpoint and to prevent overfitting.
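The exact splitting script is not released with this model card; the following is a minimal illustrative sketch of a language-wise 80-10-10 split (the function name, seed, and data layout are our own assumptions, not part of the original pipeline):

import random
from typing import List, Tuple

def split_language_wise(pairs: List[Tuple[str, str]], seed: int = 42):
    # `pairs` holds the aligned (English, Indic) text units of ONE language pair.
    # The seed and the 80/10/10 boundaries below only illustrate the idea;
    # the model card does not state the actual script or seed used.
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = int(0.8 * len(pairs))
    n_valid = int(0.1 * len(pairs))
    return (
        pairs[:n_train],                   # train split of this language pair
        pairs[n_train:n_train + n_valid],  # validation split
        pairs[n_train + n_valid:],         # test split
    )

The per-language train splits are then pooled into a single fine-tuning set, and the per-language validation splits into a single validation set, as described above.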
Model Overview and Usage Instructions
This InLegalTrans model uses the same tokenizer as the IndicTrans2 model and has the same architecture with ~1.12B parameters.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor # Install IndicTransToolkit from https://github.com/VarunGumma/IndicTransToolkit
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
src_lang, tgt_lang = "eng_Latn", "ben_Beng" # BCP-47-style language codes as used by the FLORES-200 dataset; codes for the other MILPaC target languages are listed after this example
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-1B", trust_remote_code=True) # Use the IndicTrans2 tokenizer so that its custom tokenization code is loaded
# Load the fine-tuned InLegalTrans model
model = AutoModelForSeq2SeqLM.from_pretrained(
    "law-ai/InLegalTrans-En2Indic-1B",
    trust_remote_code=True,
    attn_implementation="eager",
    low_cpu_mem_usage=True,
).to(device)
ip = IndicProcessor(inference=True)
input_sentences = [
    "(7) Any such allowance for the maintenance and expenses for proceeding shall be payable from the date of the order, or, if so ordered, from the date of the application for maintenance or expenses of proceeding, as the case may be.",
    "(2) Where it appears to the Tribunal that, in consequence of any decision of a competent Civil Court, any order made under section 9 should be cancelled or varied, it shall cancel the order or, as the case may be, vary the same accordingly.",
]
batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)
input_text_encoding = tokenizer(
    batch,
    max_length=256,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(device)
# Generate translations with beam search
generated_tokens = model.generate(
    **input_text_encoding,
    max_length=256,
    do_sample=True,
    num_beams=4,
    num_return_sequences=1,
    early_stopping=False,
    use_cache=True,
)
# Decode the generated token ids into target-language text
with tokenizer.as_target_tokenizer():
    generated_tokens = tokenizer.batch_decode(
        generated_tokens.detach().cpu().tolist(),
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)
for input_sentence, translation in zip(input_sentences, translations):
    print(f"Sentence in {src_lang} language: {input_sentence}")
    print(f"Translated Sentence in {tgt_lang} language: {translation}")
Fine-tuning Results
The following table reports the performance of the InLegalTrans model compared to the IndicTrans2 model on the 10% test split of MILPaC, evaluated using the BLEU, GLEU, and chrF++ metrics (a minimal sketch of how such scores can be computed follows the table). For all English-to-Indic language pairs, InLegalTrans shows a substantial improvement over IndicTrans2, consistently across all three metrics.
| EN-to-IN | Model | BLEU | GLEU | chrF++ |
|---|---|---|---|---|
| EN-to-BN | IndicTrans2 | 25.4 | 28.8 | 53.7 |
| | InLegalTrans | 45.8 | 47.6 | 70.9 |
| EN-to-HI | IndicTrans2 | 41.0 | 42.5 | 59.9 |
| | InLegalTrans | 56.9 | 57.1 | 73.8 |
| EN-to-MR | IndicTrans2 | 25.2 | 28.7 | 55.4 |
| | InLegalTrans | 44.4 | 46.0 | 68.9 |
| EN-to-TA | IndicTrans2 | 32.8 | 35.3 | 62.3 |
| | InLegalTrans | 40.0 | 42.5 | 69.9 |
| EN-to-TE | IndicTrans2 | 10.7 | 14.2 | 37.9 |
| | InLegalTrans | 31.3 | 31.6 | 58.5 |
| EN-to-ML | IndicTrans2 | 21.9 | 25.8 | 52.9 |
| | InLegalTrans | 37.4 | 40.3 | 69.7 |
| EN-to-PA | IndicTrans2 | 27.8 | 31.6 | 51.5 |
| | InLegalTrans | 44.3 | 45.6 | 65.5 |
| EN-to-GU | IndicTrans2 | 27.5 | 31.1 | 55.7 |
| | InLegalTrans | 42.8 | 45.2 | 68.8 |
| EN-to-OR | IndicTrans2 | 6.6 | 12.6 | 37.1 |
| | InLegalTrans | 14.2 | 19.9 | 47.5 |
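The exact evaluation scripts are not part of this model card; a minimal sketch of how such corpus-level scores can be computed, assuming a recent sacrebleu (for BLEU and chrF++) and NLTK (for GLEU) with plain whitespace tokenization, is:

import sacrebleu
from nltk.translate.gleu_score import corpus_gleu

def score_translations(hypotheses, references):
    # hypotheses: list of system translations; references: one reference string per sentence
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    chrf_pp = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2).score  # word_order=2 gives chrF++
    gleu = 100 * corpus_gleu(
        [[ref.split()] for ref in references],  # NLTK expects tokenized references ...
        [hyp.split() for hyp in hypotheses],    # ... and tokenized hypotheses
    )
    return {"BLEU": bleu, "GLEU": gleu, "chrF++": chrf_pp}

Tokenization choices (especially for Indic scripts) can shift these scores, so numbers obtained with this sketch may not exactly match the table above.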
Citation
If you use this InLegalTrans translation model or the MILPaC corpus, please cite the following paper:
@article{mahapatra2024milpacnovelbenchmarkevaluating,
title = {MILPaC: A Novel Benchmark for Evaluating Translation of Legal Text to Indian Languages},
author = {Sayan Mahapatra and Debtanu Datta and Shubham Soni and Adrijit Goswami and Saptarshi Ghosh},
year = {2024},
journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
publisher = {Association for Computing Machinery},
}
About Us
We are a group of Natural Language Processing (NLP) researchers from the Indian Institute of Technology (IIT) Kharagpur. Our research interests are primarily ML, DL, and NLP applications for the legal domain, with a special focus on the challenges and opportunities of the Indian legal scenario. Our current and past projects include:
- Legal Statute Identification
- Semantic segmentation of legal documents
- Monolingual (e.g., English-to-English) and Cross-lingual (e.g., English-to-Hindi) Summarization of legal documents
- Translation in the Indian legal domain
- Court Judgment Prediction
- Legal Document Matching
Explore our publicly available code and datasets at: Law and AI, IIT Kharagpur.