---
language:
- en
tags:
- colbertx
- plaidx
- xlm-roberta-large
datasets:
- ms_marco
- hltcoe/tdist-msmarco-scores
task_categories:
- text-retrieval
- information-retrieval
task_ids:
- passage-retrieval
license: mit
---
# ColBERT-X for English MonoLingual Retrieval using Translate-Distill
## CLIR Model Setting
- Query language: English
- Query length: 32 tokens max
- Document language: English
- Document length: 180 tokens max (please use MaxP to aggregate passage scores if documents are longer)
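The MaxP aggregation mentioned above can be sketched as follows. This is a minimal illustration, not part of PLAID-X: the helper names `split_into_passages` and `maxp_aggregate`, the 90-token stride, and the `(doc_id, score)` hit format are all assumptions made for the example. A long document is cut into overlapping windows of at most 180 tokens, each window is scored as its own passage, and the document takes the score of its best passage.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def split_into_passages(tokens: Sequence[str], max_len: int = 180,
                        stride: int = 90) -> List[List[str]]:
    """Cut a token sequence into overlapping windows of at most max_len tokens.

    The stride of 90 is an assumption for illustration only.
    """
    if not tokens:
        return []
    passages = []
    start = 0
    while True:
        passages.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break
        start += stride
    return passages


def maxp_aggregate(passage_hits: Iterable[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Keep the maximum passage score per document and rank documents by it."""
    doc_scores: Dict[str, float] = {}
    for doc_id, score in passage_hits:
        if doc_id not in doc_scores or score > doc_scores[doc_id]:
            doc_scores[doc_id] = score
    return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical passage-level scores for two documents.
hits = [("d1", 0.2), ("d2", 0.9), ("d1", 0.7), ("d2", 0.1)]
ranking = maxp_aggregate(hits)
```

The passage scores themselves would come from the retrieval model; only the windowing and aggregation steps are shown here.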
## Model Description
Translate-Distill is a training technique that produces state-of-the-art CLIR dense retrieval models through translation and distillation.
`plaidx-large-eng-tdist-mt5xxl-engeng` is trained with a KL-divergence loss against scores from the mt5xxl MonoT5 reranker, run on
English MS MARCO training queries and English passages.
Despite using a multilingual language model as its backbone, this model is trained only on English text and is
designed for English monolingual retrieval.
However, it can be applied zero-shot to any language setup.
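The KL-divergence objective described above can be illustrated with a small sketch. It assumes the loss is computed per query over the sampled passages, with the teacher and student scores each turned into a distribution by a softmax, and that the divergence is taken from the teacher to the student; the function names are hypothetical and this is not the PLAID-X training code.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one query's passage scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kl_distill_loss(teacher_scores: Sequence[float],
                    student_scores: Sequence[float]) -> float:
    """KL(teacher || student) over the passages scored for one query."""
    t_logp = log_softmax(teacher_scores)
    s_logp = log_softmax(student_scores)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))


# Hypothetical scores for one query and six passages.
teacher = [4.0, 1.5, 0.3, -0.2, -1.0, -2.5]
student = [2.0, 2.2, 0.1, 0.0, -0.5, -1.0]
loss = kl_distill_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution over the passages and grows as the two rankings diverge.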
### Teacher Models
- `t53b`: [`castorini/monot5-3b-msmarco-10k`](https://huggingface.co/castorini/monot5-3b-msmarco-10k)
- `mt5xxl`: [`unicamp-dl/mt5-13b-mmarco-100k`](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k)
### Training Parameters
- learning rate: 5e-6
- update steps: 200,000
- nway (number of passages per query): 6 (randomly selected from 50)
- per device batch size (number of query-passage sets): 8
- training GPUs: 8 NVIDIA V100 with 32 GB memory each
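To make the batch arithmetic concrete, the sketch below draws 6 passages from a pool of 50 and counts the passages processed per update step. It assumes uniform sampling without replacement, no gradient accumulation, and that all 8 GPUs contribute to each step; none of these details are stated above, and the helper names are hypothetical.

```python
import random

NWAY = 6                   # passages per query
CANDIDATES_PER_QUERY = 50  # pool the passages are drawn from
PER_DEVICE_BATCH = 8       # query-passage sets per GPU
NUM_GPUS = 8


def sample_nway(candidates, nway=NWAY, rng=None):
    """Pick nway distinct passages from a query's candidate pool (assumed uniform)."""
    rng = rng or random.Random()
    return rng.sample(list(candidates), nway)


def passages_per_step(num_gpus=NUM_GPUS, per_device_batch=PER_DEVICE_BATCH, nway=NWAY):
    """Passages encoded per update step, assuming no gradient accumulation."""
    return num_gpus * per_device_batch * nway


picked = sample_nway(range(CANDIDATES_PER_QUERY), rng=random.Random(0))
total = passages_per_step()  # 8 GPUs * 8 sets * 6 passages = 384
```

Under these assumptions each update step covers 64 queries and 384 passages.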
## Usage
To properly load ColBERT-X models from the Hugging Face Hub, please use the following version of PLAID-X.
```bash
pip install PLAID-X==0.3.1
```
The following code snippet loads the model through the Hugging Face API.
```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

# Fetch the model from the Hugging Face Hub and keep a handle to it.
checkpoint = Checkpoint('hltcoe/plaidx-large-eng-tdist-mt5xxl-engeng', colbert_config=ColBERTConfig())
```
For a full tutorial, please refer to the [PLAID-X Jupyter Notebook](https://colab.research.google.com/github/hltcoe/clir-tutorial/blob/main/notebooks/clir_tutorial_plaidx.ipynb),
which is part of the [SIGIR 2023 CLIR Tutorial](https://github.com/hltcoe/clir-tutorial).
## BibTeX entry and Citation Info
Please cite the following two papers if you use the model.
```bibtex
@inproceedings{colbert-x,
author = {Suraj Nair and Eugene Yang and Dawn Lawrie and Kevin Duh and Paul McNamee and Kenton Murray and James Mayfield and Douglas W. Oard},
title = {Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models},
booktitle = {Proceedings of the 44th European Conference on Information Retrieval (ECIR)},
year = {2022},
url = {https://arxiv.org/abs/2201.08471}
}
```
```bibtex
@inproceedings{translate-distill,
author = {Eugene Yang and Dawn Lawrie and James Mayfield and Douglas W. Oard and Scott Miller},
title = {Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation},
booktitle = {Proceedings of the 46th European Conference on Information Retrieval (ECIR)},
year = {2024},
url = {https://arxiv.org/abs/2401.04810}
}
```