---
library_name: transformers
tags:
- Persian
- Named Entity Recognition
- NER
- Albert
---

# Model Card for Behpoyan-NER

Behpoyan-NER is an ALBERT model fine-tuned for Named Entity Recognition (NER) in Persian. It is based on the `HooshvareLab/albert-fa-zwnj-base-v2-ner` model and identifies ten entity types: Date (DAT), Event (EVE), Facility (FAC), Location (LOC), Money (MON), Organization (ORG), Percent (PCT), Person (PER), Product (PRO), and Time (TIM).

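The ten entity codes combine with the `B-`/`I-` prefixes that the usage code in this card checks for. As a minimal sketch, assuming a standard BIO scheme with a single `O` (outside) tag, the full tag set can be enumerated as follows. `build_bio_labels` is a hypothetical helper, and the ordering is illustrative; it does not necessarily match the model's own `id2label` mapping, which should be read from the model config.

```python
# Entity type codes as listed in this model card.
ENTITY_TYPES = ["DAT", "EVE", "FAC", "LOC", "MON", "ORG", "PCT", "PER", "PRO", "TIM"]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given types (assumed BIO scheme)."""
    labels = ["O"]  # tag for tokens outside any entity
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")  # first token of an entity
        labels.append(f"I-{entity_type}")  # continuation token of an entity
    return labels


labels = build_bio_labels(ENTITY_TYPES)
print(len(labels))   # 21: ten types x two prefixes, plus "O"
print(labels[:5])    # ['O', 'B-DAT', 'I-DAT', 'B-EVE', 'I-EVE']
```
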
## Model Details

### Model Description

Behpoyan-NER is designed to recognize named entities in Persian text, improving on its base model, `HooshvareLab/albert-fa-zwnj-base-v2-ner`. It was fine-tuned on a combination of the ARMAN, PEYMA, and WikiANN datasets, which are widely used for Persian NER.

- **Developed by:** Behpoyan
- **Model type:** ALBERT for token classification
- **Language(s) (NLP):** Persian (fa)
- **License:** MIT

### Model Sources

- **Repository:** [Behpoyan/Behpoyan-NER](https://huggingface.co/Behpoyan/Behpoyan-NER)
- **Base Model Repository:** [HooshvareLab/albert-fa-zwnj-base-v2-ner](https://huggingface.co/HooshvareLab/albert-fa-zwnj-base-v2-ner)

### Direct Use

This model can be used directly for Named Entity Recognition in Persian text. Example applications include text analysis, information extraction, and other Persian-language NLP pipelines.

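For information extraction, NER output is often regrouped by entity type. The sketch below assumes each result is a dict with `word` and `entity` keys, the shape produced by the merging code in the getting-started section of this card. `group_by_type` and the sample records are hypothetical; the sample is made up for illustration and is not real model output.

```python
from collections import defaultdict


def group_by_type(entities):
    """Collect entity surface forms under their entity type."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["entity"]].append(entity["word"])
    return dict(grouped)


# Made-up sample in the shape of merged NER results (not real model output).
sample = [
    {"word": "تهران", "entity": "LOC"},
    {"word": "بانک ملت", "entity": "ORG"},
    {"word": "اصفهان", "entity": "LOC"},
]

grouped = group_by_type(sample)
for entity_type, words in grouped.items():
    print(entity_type, words)
```
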
### Downstream Use

The model can be fine-tuned further for domain-specific NER tasks or combined with other models in larger NLP pipelines.

### Out-of-Scope Use

The model is not designed for languages other than Persian or for tasks other than token classification. Misuse for generating biased or harmful content is discouraged.

### Recommendations

While the model performs well on general-purpose Persian NER, users should validate its performance on their own datasets. Be cautious of biases in the training data, especially for less-represented entity types.

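One way to validate the model on your own data is to compare predicted entity spans with gold annotations at the entity level. The sketch below is a minimal illustration under stated assumptions: `span_prf` is a hypothetical helper (not part of `transformers`), spans are `(start, end, type)` tuples, and a prediction counts as correct only when both boundaries and the type match exactly. The sample spans are made up.

```python
def span_prf(gold_spans, predicted_spans):
    """Exact-match precision, recall, and F1 over (start, end, type) spans."""
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    true_positives = len(gold & predicted)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up spans for illustration: the third prediction has the wrong type.
gold = [(0, 8, "DAT"), (15, 23, "ORG"), (40, 48, "ORG"), (90, 95, "LOC")]
predicted = [(0, 8, "DAT"), (15, 23, "ORG"), (40, 48, "LOC"), (90, 95, "LOC")]

precision, recall, f1 = span_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Merged pipeline results can be converted to this form with `(e["start"], e["end"], e["entity"])`. Running the same comparison separately for each entity type helps reveal weak spots on less-represented types.
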
## How to Get Started with the Model

The example below loads the model, runs the NER pipeline on a Persian passage, and merges subword predictions into whole entities:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("Behpouyan/Behpouyan-NER")
model = AutoModelForTokenClassification.from_pretrained("Behpouyan/Behpouyan-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)

# Input example
example = '''
"در سال ۱۴۰۱، شرکت علیبابا اعلام کرد که با همکاری بانک ملت، یک پروژه بزرگ برای توسعه زیرساختهای تجارت الکترونیک در ایران آغاز خواهد کرد.
این پروژه در تهران و اصفهان اجرا میشود و پیشبینی میشود تا پایان سال ۱۴۰۲ تکمیل شود."
'''

# Get NER results (one prediction per subword token)
ner_results = nlp(example)


# Function to merge subword entities
def merge_entities(entities):
    merged_results = []
    current_entity = None

    for entity in entities:
        # If the tokenizer marks word starts with "▁" (SentencePiece convention),
        # turn the marker into a space so multi-word entities keep their spacing
        word = entity['word'].replace("▁", " ")

        if entity['entity'].startswith("B-") or current_entity is None:
            # Start a new entity
            if current_entity:
                merged_results.append(current_entity)
            current_entity = {
                "word": word.strip(),
                "entity": entity['entity'][2:],  # Remove the "B-"/"I-" prefix
                "score": entity['score'],
                "start": entity['start'],
                "end": entity['end'],
            }
        elif entity['entity'].startswith("I-") and current_entity:
            # Continue the current entity
            current_entity['word'] += word
            current_entity['score'] = min(current_entity['score'], entity['score'])  # Use the lowest score
            current_entity['end'] = entity['end']

    # Add the last entity if any
    if current_entity:
        merged_results.append(current_entity)

    return merged_results


# Merge the entities
merged_results = merge_entities(ner_results)

# Display the merged results
print("Named Entity Recognition Results:")
for entity in merged_results:
    print(f"- Entity: {entity['word']}")
    print(f"  Type: {entity['entity']}")
    print(f"  Score: {entity['score']:.2f}")
    print(f"  Start: {entity['start']}, End: {entity['end']}")
    print("-" * 40)
```