---
license: mit
language:
- ko
base_model:
- klue/bert-base
pipeline_tag: feature-extraction
tags:
- medical
---

# Korean Medical DPR (Dense Passage Retrieval)

## 1. Intro

A Bi-Encoder retrieval model for the **medical domain**.
To handle medical records written in mixed Korean and English, it uses **SapBERT-KO-EN** as the base model.
Queries are encoded with the Question Encoder, and passages with the Context Encoder.

- Question Encoder : [https://huggingface.co/snumin44/medical-biencoder-ko-bert-question](https://huggingface.co/snumin44/medical-biencoder-ko-bert-question)

(※ This model was trained on the [Ultra-Large AI Healthcare Question-Answering Data](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71762) from AI Hub.)

## 2. Model

**(1) Self Alignment Pretraining (SAP)**

Korean medical records mix Korean and English, so the model has to recognize English terminology as well.
Using Multi Similarity Loss, the base model was trained so that **terms sharing the same concept code** receive high similarity scores, for example:

```
e.g.) C3843080 || 고혈압 질환
      C3843080 || Hypertension
      C3843080 || High Blood Pressure
      C3843080 || HTN
      C3843080 || HBP
```

- SapBERT-KO-EN : [https://huggingface.co/snumin44/sap-bert-ko-en](https://huggingface.co/snumin44/sap-bert-ko-en)
- Github : [https://github.com/snumin44/SapBERT-KO-EN](https://github.com/snumin44/SapBERT-KO-EN)
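
To make the SAP objective concrete, the sketch below scores a tiny batch of terms with `klue/bert-base` and applies the Multi Similarity Loss implementation from the `pytorch-metric-learning` package. This is an illustrative assumption, not the actual training script (which lives in the SapBERT-KO-EN repository above); the key point is that terms sharing a concept code get the same label and are therefore treated as positive pairs.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from pytorch_metric_learning import losses

# Terms that share a concept code get the same label; the loss treats
# same-label pairs as positives and different-label pairs as negatives.
terms = ['고혈압 질환', 'Hypertension', 'HTN', '위궤양', 'Gastric ulcer']
labels = torch.tensor([0, 0, 0, 1, 1])  # first three rows share code C3843080

tokenizer = AutoTokenizer.from_pretrained('klue/bert-base')
model = AutoModel.from_pretrained('klue/bert-base')

features = tokenizer(terms, padding=True, truncation=True, max_length=64, return_tensors='pt')
embeddings = model(**features).last_hidden_state[:, 0]  # 'cls' pooling, as in the card

# Pull same-code terms together and push different codes apart.
loss = losses.MultiSimilarityLoss()(embeddings, labels)
loss.backward()
```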

**(2) Dense Passage Retrieval (DPR)**

Turning SapBERT-KO-EN into a retrieval model requires additional fine-tuning.
It was fine-tuned in the DPR fashion, with a Bi-Encoder that computes the similarity between a query and a passage.
The training set was the original dataset **augmented with Korean-English code-mixed samples**, as in the example below.

```
e.g.) Korean disease name : 고혈압
      English disease name: Hypertension

      Query (original) : 아버지가 고혈압인데 그게 뭔지 모르겠어. 고혈압이 뭔지 설명 좀 해줘.
      Query (augmented): 아버지가 Hypertension 인데 그게 뭔지 모르겠어. Hypertension 이 뭔지 설명 좀 해줘.
      (≈ "My father has hypertension, but I don't know what that is. Please explain what it is.")
```

- Github : [https://github.com/snumin44/DPR-KO](https://github.com/snumin44/DPR-KO)
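
For reference, the DPR objective itself is a cross-entropy over query-passage dot products, where the other passages in the batch serve as negatives (in-batch negatives). The snippet below is a minimal sketch of that loss under this assumption, not the training code from the DPR-KO repository.

```python
import torch
import torch.nn.functional as F

def dpr_in_batch_loss(q_emb: torch.Tensor, c_emb: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood with in-batch negatives.

    q_emb: (B, H) query embeddings; c_emb: (B, H) embeddings of the positive
    passage for each query. Passage j != i acts as a negative for query i.
    """
    scores = q_emb @ c_emb.T               # (B, B) similarity matrix
    targets = torch.arange(q_emb.size(0))  # positives sit on the diagonal
    return F.cross_entropy(scores, targets)

# e.g. with random embeddings of hidden size 768 and batch size 4:
loss = dpr_in_batch_loss(torch.randn(4, 768), torch.randn(4, 768))
```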

## 3. Training

**(1) Self Alignment Pretraining (SAP)**

The base model and hyperparameters used to train SapBERT-KO-EN are as follows.
**KOSTOM**, a medical terminology dictionary that lists Korean terms together with their English counterparts, was used as the training data.

- Model : klue/bert-base
- Dataset : **KOSTOM**
- Epochs : 1
- Batch Size : 64
- Max Length : 64
- Dropout : 0.1
- Pooler : 'cls'
- Eval Step : 100
- Threshold : 0.8
- Scale Positive Sample : 1
- Scale Negative Sample : 60

**(2) Dense Passage Retrieval (DPR)**

The base model and hyperparameters used for fine-tuning are as follows.

- Model : SapBERT-KO-EN (klue/bert-base)
- Dataset : **Ultra-Large AI Healthcare Question-Answering Data (AI Hub)**
- Epochs : 10
- Batch Size : 64
- Dropout : 0.1
- Pooler : 'cls'

## 4. Example

This model encodes queries and must be used together with the Context model.
The output below confirms that a query and a clinical note about the same disease receive a noticeably higher similarity score.

(※ The clinical notes in the code below were generated with ChatGPT.)
(※ Owing to the nature of the training data, the model works better on clean, well-structured text such as these examples.)

```python
import numpy as np
from transformers import AutoModel, AutoTokenizer

# Question Model
q_model_path = 'snumin44/medical-biencoder-ko-bert-question'
q_model = AutoModel.from_pretrained(q_model_path)
q_tokenizer = AutoTokenizer.from_pretrained(q_model_path)

# Context Model
c_model_path = 'snumin44/medical-biencoder-ko-bert-context'
c_model = AutoModel.from_pretrained(c_model_path)
c_tokenizer = AutoTokenizer.from_pretrained(c_model_path)


# Code-mixed query: "prescription cases for high blood pressure"
query = 'high blood pressure 처방 사례'

targets = [
    # Hypertension: lifestyle counseling, follow-up visit, Amlodipine prescription
    """고혈압 진단.
    환자 상담 및 생활습관 교정 권고. 저염식, 규칙적인 운동, 금연, 금주 지시.
    환자 재방문. 혈압: 150/95mmHg. 약물치료 시작. Amlodipine 5mg 1일 1회 처방.""",

    # Gastric ulcer: emergency endoscopy, epinephrine injection, hemoclip hemostasis
    """응급실 도착 후 위 내시경 진행.
    소견: Gastric ulcer에서 Forrest IIb 관찰됨. 출혈은 소량의 삼출성 출혈 형태.
    처치: 에피네프린 주사로 출혈 감소 확인. Hemoclip 2개로 출혈 부위 클리핑하여 지혈 완료.""",

    # Fatty liver, gallstones, and a likely benign renal cyst
    """혈중 높은 지방 수치 및 지방간 소견.
    다발성 gallstones 확인. 증상 없을 경우 경과 관찰 권장.
    우측 renal cyst, 양성 가능성 높으며 추가적인 처치 불필요 함."""
]

# Encode the query with the question encoder ('cls' pooling via pooler_output).
query_feature = q_tokenizer(query, return_tensors='pt')
query_outputs = q_model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()

def cos_sim(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Encode each clinical note with the context encoder and compare it with the query.
for idx, target in enumerate(targets):
    target_feature = c_tokenizer(target, return_tensors='pt')
    target_outputs = c_model(**target_feature, return_dict=True)
    target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")
```

```
Similarity between query and target 0: 0.2674
Similarity between query and target 1: 0.0416
Similarity between query and target 2: 0.0476
```
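
In a real retrieval setting you would encode every passage once with the context encoder and search an index rather than loop over passages. The sketch below uses FAISS purely as an illustrative choice (it is not a dependency of this model) and reuses `c_model`, `c_tokenizer`, `targets`, and `query_embeddings` from the example above; DPR scores by inner product, hence the flat inner-product index.

```python
import faiss
import numpy as np

# Encode every passage once with the context encoder.
ctx_embeddings = []
for target in targets:
    feature = c_tokenizer(target, return_tensors='pt', truncation=True, max_length=512)
    output = c_model(**feature, return_dict=True)
    ctx_embeddings.append(output.pooler_output.detach().numpy().squeeze())
ctx_embeddings = np.stack(ctx_embeddings).astype('float32')

# Build a flat inner-product index and retrieve the top passages for the query.
index = faiss.IndexFlatIP(ctx_embeddings.shape[1])
index.add(ctx_embeddings)

scores, ids = index.search(query_embeddings.astype('float32').reshape(1, -1), 3)
print(ids[0], scores[0])  # passage indices ranked by score for the query
```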

## Citing

```
@inproceedings{liu2021self,
  title={Self-Alignment Pretraining for Biomedical Entity Representations},
  author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
  booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  pages={4228--4238},
  month=jun,
  year={2021}
}
@article{karpukhin2020dense,
  title={Dense Passage Retrieval for Open-Domain Question Answering},
  author={Karpukhin, Vladimir and Oğuz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau},
  journal={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2020}
}
```