InLegalTrans

This is the model card for the InLegalTrans-En2Indic-1B translation model, a fine-tuned version of the IndicTrans2 model tailored for translating Indian legal texts from English into Indian languages.

Training Data

We fine-tune on MILPaC (Multilingual Indian Legal Parallel Corpus), the first high-quality Indian legal parallel corpus. It contains parallel aligned text units in English (EN) and nine Indian (IN) languages: Bengali (BN), Hindi (HI), Marathi (MR), Tamil (TA), Telugu (TE), Malayalam (ML), Panjabi (PA), Gujarati (GU), and Oriya (OR). Please refer to the MILPaC paper cited below for more details about this corpus.

For fine-tuning, we randomly split MILPaC language-wise in an 80 (train) - 10 (validation) - 10 (test) ratio. We fine-tune the IndicTrans2 model on the train split (the combined 80% of each English-to-Indic language pair) and use the validation split (the combined 10% of each English-to-Indic language pair) to select the best checkpoint and to prevent overfitting.
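The language-wise split described above can be sketched as follows. This is a minimal illustration on toy data, not the authors' actual splitting script: the function names `split_pairs` and `split_corpus`, the seed, and the toy corpus are all assumptions made for the example.

```python
import random

def split_pairs(pairs, seed=0):
    """Shuffle one language pair's aligned units and split them 80/10/10.

    Hypothetical helper: integer arithmetic avoids float rounding surprises.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 80 // 100
    n_val = len(shuffled) * 10 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test split
    return train, val, test

def split_corpus(corpus, seed=0):
    """Split each language pair separately, then pool the per-language splits."""
    combined = {"train": [], "validation": [], "test": []}
    for lang in sorted(corpus):
        train, val, test = split_pairs(corpus[lang], seed=seed)
        combined["train"] += [(lang, unit) for unit in train]
        combined["validation"] += [(lang, unit) for unit in val]
        combined["test"] += [(lang, unit) for unit in test]
    return combined

# Toy stand-in for the corpus: target language code -> list of (English, Indic) units.
toy_corpus = {
    "ben_Beng": [(f"en sentence {i}", f"bn sentence {i}") for i in range(50)],
    "hin_Deva": [(f"en sentence {i}", f"hi sentence {i}") for i in range(30)],
}

splits = split_corpus(toy_corpus)
split_sizes = {name: len(units) for name, units in splits.items()}
print(split_sizes)
```

Splitting each language pair on its own before pooling keeps every language represented in the same proportion in all three splits, which a single shuffle over the pooled corpus would not guarantee.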

Model Overview and Usage Instructions

This InLegalTrans model uses the same tokenizer as the IndicTrans2 model and has the same architecture with ~1.12B parameters.

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor # Install IndicTransToolkit from https://github.com/VarunGumma/IndicTransToolkit

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
src_lang, tgt_lang = "eng_Latn", "ben_Beng"  # FLORES-200 style codes: ISO 639-3 language + ISO 15924 script
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-1B", trust_remote_code=True) # Use IndicTrans2 tokenizer to enable their custom tokenization script to be run
model = AutoModelForSeq2SeqLM.from_pretrained(
    "law-ai/InLegalTrans-En2Indic-1B",
    trust_remote_code=True,
    attn_implementation="eager",
    low_cpu_mem_usage=True,
).to(device)
ip = IndicProcessor(inference=True)

input_sentences = [
    "(7) Any such allowance for the maintenance and expenses for proceeding shall be payable from the date of the order, or, if so ordered, from the date of the application for maintenance or expenses of proceeding, as the case may be.",
    "(2) Where it appears to the Tribunal that, in consequence of any decision of a competent Civil Court, any order made under section 9 should be cancelled or varied, it shall cancel the order or, as the case may be, vary the same accordingly.",
]

batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)

input_text_encoding = tokenizer(
    batch,
    max_length=256,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(device)

generated_tokens = model.generate(
    **input_text_encoding,
    max_length=256,
    do_sample=True,  # with num_beams > 1 this is beam-search multinomial sampling, so outputs can vary between runs
    num_beams=4,
    num_return_sequences=1,
    early_stopping=False,
    use_cache=True,
)

with tokenizer.as_target_tokenizer():
    generated_tokens = tokenizer.batch_decode(
        generated_tokens.detach().cpu().tolist(),
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )

translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"Sentence in {src_lang} language: {input_sentence}") 
    print(f"Translated Sentence in {tgt_lang} language: {translation}") 
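To translate into one of the other eight languages, `tgt_lang` in the snippet above needs the corresponding language code. The lookup below is a small convenience sketch: the dictionary name `FLORES_CODES` and the helper `target_code` are hypothetical, and the codes follow the FLORES-200 convention as we understand IndicTrans2 to use it, so check them against the IndicTrans2 documentation before relying on them.

```python
# Abbreviations used in this model card -> FLORES-200 style language codes.
# Assumption: these match the codes accepted by IndicTrans2 / IndicProcessor.
FLORES_CODES = {
    "BN": "ben_Beng",  # Bengali
    "HI": "hin_Deva",  # Hindi
    "MR": "mar_Deva",  # Marathi
    "TA": "tam_Taml",  # Tamil
    "TE": "tel_Telu",  # Telugu
    "ML": "mal_Mlym",  # Malayalam
    "PA": "pan_Guru",  # Panjabi
    "GU": "guj_Gujr",  # Gujarati
    "OR": "ory_Orya",  # Oriya
}

def target_code(abbrev):
    """Return the language code for a two-letter abbreviation (case-insensitive)."""
    key = abbrev.upper()
    if key not in FLORES_CODES:
        raise ValueError(f"Unsupported target language: {abbrev!r}")
    return FLORES_CODES[key]

tgt_lang = target_code("hi")
print(tgt_lang)
```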

Fine-tuning Results

The following table compares InLegalTrans with IndicTrans2 on the 10% test split of MILPaC, using the BLEU, GLEU, and chrF++ metrics. InLegalTrans outperforms IndicTrans2 on every metric for all nine English-to-Indic language pairs, with BLEU gains ranging from 7.2 points (EN-to-TA) to 20.6 points (EN-to-TE).

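| Language pair | Model | BLEU | GLEU | chrF++ |
|---|---|---|---|---|
| EN-to-BN | IndicTrans2 | 25.4 | 28.8 | 53.7 |
| EN-to-BN | InLegalTrans | 45.8 | 47.6 | 70.9 |
| EN-to-HI | IndicTrans2 | 41.0 | 42.5 | 59.9 |
| EN-to-HI | InLegalTrans | 56.9 | 57.1 | 73.8 |
| EN-to-MR | IndicTrans2 | 25.2 | 28.7 | 55.4 |
| EN-to-MR | InLegalTrans | 44.4 | 46.0 | 68.9 |
| EN-to-TA | IndicTrans2 | 32.8 | 35.3 | 62.3 |
| EN-to-TA | InLegalTrans | 40.0 | 42.5 | 69.9 |
| EN-to-TE | IndicTrans2 | 10.7 | 14.2 | 37.9 |
| EN-to-TE | InLegalTrans | 31.3 | 31.6 | 58.5 |
| EN-to-ML | IndicTrans2 | 21.9 | 25.8 | 52.9 |
| EN-to-ML | InLegalTrans | 37.4 | 40.3 | 69.7 |
| EN-to-PA | IndicTrans2 | 27.8 | 31.6 | 51.5 |
| EN-to-PA | InLegalTrans | 44.3 | 45.6 | 65.5 |
| EN-to-GU | IndicTrans2 | 27.5 | 31.1 | 55.7 |
| EN-to-GU | InLegalTrans | 42.8 | 45.2 | 68.8 |
| EN-to-OR | IndicTrans2 | 6.6 | 12.6 | 37.1 |
| EN-to-OR | InLegalTrans | 14.2 | 19.9 | 47.5 |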
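The size of the improvement can be read directly off the scores reported above. The short script below is only an arithmetic illustration: it copies those scores into a dictionary (the variable names are ours) and computes the absolute gain of InLegalTrans over IndicTrans2 per language pair and metric.

```python
METRICS = ("BLEU", "GLEU", "chrF++")

# Scores copied from the results reported above: (BLEU, GLEU, chrF++).
scores = {
    "EN-to-BN": {"IndicTrans2": (25.4, 28.8, 53.7), "InLegalTrans": (45.8, 47.6, 70.9)},
    "EN-to-HI": {"IndicTrans2": (41.0, 42.5, 59.9), "InLegalTrans": (56.9, 57.1, 73.8)},
    "EN-to-MR": {"IndicTrans2": (25.2, 28.7, 55.4), "InLegalTrans": (44.4, 46.0, 68.9)},
    "EN-to-TA": {"IndicTrans2": (32.8, 35.3, 62.3), "InLegalTrans": (40.0, 42.5, 69.9)},
    "EN-to-TE": {"IndicTrans2": (10.7, 14.2, 37.9), "InLegalTrans": (31.3, 31.6, 58.5)},
    "EN-to-ML": {"IndicTrans2": (21.9, 25.8, 52.9), "InLegalTrans": (37.4, 40.3, 69.7)},
    "EN-to-PA": {"IndicTrans2": (27.8, 31.6, 51.5), "InLegalTrans": (44.3, 45.6, 65.5)},
    "EN-to-GU": {"IndicTrans2": (27.5, 31.1, 55.7), "InLegalTrans": (42.8, 45.2, 68.8)},
    "EN-to-OR": {"IndicTrans2": (6.6, 12.6, 37.1), "InLegalTrans": (14.2, 19.9, 47.5)},
}

# Absolute gain per language pair and metric, rounded to one decimal place.
gains = {
    pair: {
        metric: round(models["InLegalTrans"][i] - models["IndicTrans2"][i], 1)
        for i, metric in enumerate(METRICS)
    }
    for pair, models in scores.items()
}

smallest = min(gains, key=lambda pair: gains[pair]["BLEU"])
largest = max(gains, key=lambda pair: gains[pair]["BLEU"])

for pair, gain in gains.items():
    print(pair, gain)
print("Smallest BLEU gain:", smallest, gains[smallest]["BLEU"])
print("Largest BLEU gain:", largest, gains[largest]["BLEU"])
```

These are differences in reported scores only; they say nothing about statistical significance, which would require the per-sentence outputs.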

Citation

If you use this InLegalTrans translation model or the MILPaC corpus, please cite the following paper:

@article{mahapatra2024milpacnovelbenchmarkevaluating,
      title = {MILPaC: A Novel Benchmark for Evaluating Translation of Legal Text to Indian Languages}, 
      author = {Sayan Mahapatra and Debtanu Datta and Shubham Soni and Adrijit Goswami and Saptarshi Ghosh},
      year = {2024},
      journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
      publisher = {Association for Computing Machinery},
}

About Us

We are a group of Natural Language Processing (NLP) researchers from the Indian Institute of Technology (IIT) Kharagpur. Our research interests are primarily ML, DL, and NLP applications for the legal domain, with a special focus on the challenges and opportunities in the Indian legal scenario. Our current and past projects include:

  • Legal Statute Identification
  • Semantic segmentation of legal documents
  • Monolingual (e.g., English-to-English) and Cross-lingual (e.g., English-to-Hindi) Summarization of legal documents
  • Translation in the Indian legal domain
  • Court Judgment Prediction
  • Legal Document Matching

Explore our publicly available code and datasets at: Law and AI, IIT Kharagpur.
