---
license: apache-2.0
datasets:
- castorini/mr-tydi
- AmazonScience/tydi-as2
language:
- bn
base_model:
- colbert-ir/colbertv2.0
tags:
- ColBERT
- bert
---

[ColBERT (v2)](https://huggingface.co/colbert-ir/colbertv2.0) fine-tuned for Bengali document retrieval using [RAGatouille](https://github.com/AnswerDotAI/RAGatouille).


### Datasets used for fine-tuning:
Bengali train subsets of [castorini/mr-tydi](https://huggingface.co/datasets/castorini/mr-tydi) and [AmazonScience/tydi-as2](https://huggingface.co/datasets/AmazonScience/tydi-as2).

### Required packages:

```python
!pip install ragatouille

# Optional: swap faiss-cpu for faiss-gpu to index on GPU. Skip the next two lines to index on CPU (slower).
!pip uninstall faiss-cpu -y
!pip install faiss-gpu
```
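To confirm the installation before indexing, a minimal sketch like the one below can help. It only checks whether the modules can be found and does not import them. The helper name `check_packages` is hypothetical; the module name `faiss` is the one both `faiss-cpu` and `faiss-gpu` provide.

```python
import importlib.util


def check_packages(names=("ragatouille", "faiss")):
    """Return {module_name: bool} telling whether each module can be located.

    `check_packages` is an illustrative helper, not part of RAGatouille.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages()
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If `faiss` is reported as missing, indexing with `use_faiss=True` is unlikely to work until one of the two FAISS packages is installed.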

### Example of a basic indexing and retrieval task:
```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("turjo4nis/colbertv2.0-bn")

# define your desired documents as a list of strings.
my_documents = [
    "উইকিপিডিয়া হলো সম্মিলিতভাবে সম্পাদিত, বহুভাষিক, মুক্ত প্রবেশাধিকার, মুক্ত.....",
    "বিষয়বস্তু সংযুক্ত অনলাইন বিশ্বকোষ যা উইকিপিডিয়ান বলে.....",
    "পরিচিত স্বেচ্ছাসেবক সম্প্রদায় কর্তৃক লিখিত এবং রক্ষণাবেক্ষণকৃত। স্বেচ্ছাসেবকেরা.....",
    "মিডিয়াউইকি নামে একটি উইকি -ভিত্তিক সম্পাদনা ব্যবস্থা ব্যবহার করে সম্পাদনা করেন।.....",
]

# OPTIONAL: define document IDs as a list of strings (one per document)
docid_list = ['1', '2', '3', '4']

RAG.index(
    index_name="my_index",  # saved locally under '.ragatouille/colbert/indexes/my_index'
    collection=my_documents,
    document_ids=docid_list,  # OPTIONAL
    split_documents=False,  # if True, documents are chunked to at most max_document_length tokens
    # max_document_length=512,  # uncomment if split_documents is True
    use_faiss=True,
)

query = "উইকিপিডিয়া কি?"
RAG.search(query)
```
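Before calling `RAG.index`, it can be worth checking that the documents and the optional IDs line up. The sketch below is a plain-Python sanity check. It assumes IDs should be unique with exactly one per document. The helper `validate_collection` and its positional-ID fallback are hypothetical and not part of RAGatouille.

```python
def validate_collection(documents, document_ids=None):
    """Check documents and optional ids; return the list of ids to use.

    Hypothetical helper: raises ValueError on inconsistent input.
    """
    if not documents:
        raise ValueError("collection is empty")
    if any(not isinstance(doc, str) or not doc.strip() for doc in documents):
        raise ValueError("every document must be a non-empty string")
    if document_ids is None:
        # Assumed fallback for illustration: positional ids '1', '2', ...
        return [str(i) for i in range(1, len(documents) + 1)]
    if len(document_ids) != len(documents):
        raise ValueError("need exactly one id per document")
    if len(set(document_ids)) != len(document_ids):
        raise ValueError("document ids must be unique")
    return list(document_ids)


# Placeholder inputs for illustration only.
docs = ["first document", "second document"]
ids = validate_collection(docs, ["1", "2"])
print(ids)
```

The returned list can then be passed as `document_ids` alongside `collection`.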

### Load a saved index:
```python
from ragatouille import RAGPretrainedModel

path_to_index = ".ragatouille/colbert/indexes/my_index"
RAG = RAGPretrainedModel.from_index(path_to_index)

query = "উইকিপিডিয়া কি?"
results = RAG.search(query, k=2) # k = number of top-ranked documents to be retrieved

results
```

### Output:
![image](https://github.com/user-attachments/assets/21fbf1bd-123b-46a6-bd13-0168846d7c32)
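
The result list shown above can be turned into compact text lines with a small helper. This is a sketch that assumes each hit is a dict with `content`, `score`, `rank`, and `document_id` keys, as recent RAGatouille releases appear to return. The helper `summarize_results` is hypothetical, and `sample_results` holds made-up values, not real model output.

```python
def summarize_results(results, max_chars=60):
    """Format each hit as one line: rank, document id, score, text snippet."""
    lines = []
    for hit in results:
        content = hit.get("content", "")
        # Truncate long passages so each hit stays on one line.
        snippet = content if len(content) <= max_chars else content[:max_chars] + "..."
        lines.append(
            f"#{hit.get('rank')} [doc {hit.get('document_id')}] "
            f"score={hit.get('score', 0.0):.2f} {snippet}"
        )
    return lines


# Made-up stand-in for the list returned by RAG.search (illustrative values only).
sample_results = [
    {"content": "example passage A", "score": 12.5, "rank": 1, "document_id": "1"},
    {"content": "example passage B", "score": 9.25, "rank": 2, "document_id": "3"},
]

lines = summarize_results(sample_results)
for line in lines:
    print(line)
```

With a real index, `summarize_results(RAG.search(query, k=2))` would be the equivalent call, provided the assumed keys are present.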