|
--- |
|
license: apache-2.0 |
|
base_model: |
|
- mistralai/Mistral-Nemo-Base-2407 |
|
language: |
|
- en |
|
- ko |
|
- ja |
|
- zh |
|
datasets: |
|
- 4DR1455/finance_questions |
|
- Aratako/Synthetic-JP-Conversations-Magpie-Nemotron-4-10k |
|
- Aratako/Synthetic-JP-EN-Coding-Dataset-Magpie-69k |
|
- Aratako/Synthetic-Japanese-Roleplay-NSFW-Claude-3.5s-10.5k-formatted |
|
- BCCard/BCCard-Finance-Kor-QnA |
|
- CarrotAI/ko-code-alpaca-QA |
|
- ChuGyouk/AI_healthcare_QA_samples_Sonnet3.5 |
|
- DavidLanz/medical_instruction |
|
- Dusker/lawyer-llama |
|
- Gryphe/Sonnet3.5-Charcard-Roleplay |
|
- HAERAE-HUB/qarv-instruct-ko |
|
- HachiML/alpaca_jp_math |
|
- Magpie-Align/Magpie-Llama-3.1-Pro-MT-300K-v0.1 |
|
- Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese |
|
- beomi/KoAlpaca-v1.1a |
|
- codefuse-ai/Evol-instruction-66k |
|
- frankminors123/belle-math-zh |
|
- gbharti/wealth-alpaca_lora |
|
- iam-ajaymeena/Self-Instruct-Japanese-Elzya-13B |
|
- jihye-moon/LawQA-Ko |
|
- jondurbin/gutenberg-dpo-v0.1 |
|
- junyeong-nero/kin_med_100K_edited |
|
- kyujinpy/KOR-OpenOrca-Platypus-v3 |
|
- lavita/medical-qa-datasets |
|
- microsoft/orca-math-word-problems-200k |
|
- neural-bridge/rag-dataset-12000 |
|
- p1atdev/ichikara-instruction |
|
- qiaojin/PubMedQA |
|
- shibing624/roleplay-zh-sharegpt-gpt4-data |
|
- team-hatakeyama-phase2/AutoMultiTurnByCalm3-22B-Corrected-reformatted |
|
- ymoslem/Law-StackExchange |
|
- zzunyang/LawQA_LawSee |
|
--- |
|
# Mistral-Nemo-NT-Ko-12B-sft |
|
|
|
## Description |
|
|
|
**Mistral-Nemo-NT-Ko-12B-sft** is an instruction-tuned version of [*mistralai/Mistral-Nemo-Base-2407*](https://huggingface.co/mistralai/Mistral-Nemo-Base-2407), fine-tuned across four languages: English, Korean, Chinese, and Japanese. |
|
|
|
The primary goals of this model are **language alignment**, **cross-lingual knowledge transfer**, and **ChatML formatting**. This is an intermediate version, as preference optimization has not yet been applied. |
|
|
|
|
|
## Features |
|
|
|
- The base model supports a context length of 128K, while I fine-tuned this model with an 8K context size. |
|
|
|
- The model responds in the input language unless the user explicitly specifies an output language (if the language is set via the system role, it may be ignored). |
|
|
|
- Answer length tends to vary by language: English responses are generally longer than average, while Korean responses tend to be shorter. The behavior for Japanese and Chinese is still under observation. |
|
|
|
- Recommended temperature settings: 0.3 to 0.7. |
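
A minimal inference sketch tying these settings together, assuming the standard `transformers` text-generation API. The repo id `werty1248/Mistral-Nemo-NT-Ko-12B-sft` and all generation parameters other than the temperature range are illustrative assumptions, not values stated in this card:

```python
# Illustrative sketch only: repo id and settings other than temperature are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "werty1248/Mistral-Nemo-NT-Ko-12B-sft"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# ChatML-style conversation; the system turn is optional (see the chat template below).
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "한국어로 간단히 자기소개를 해 주세요."},  # reply should follow the input language
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Recommended sampling range from this card: temperature 0.3 to 0.7.
output = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.5, top_p=0.9)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```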
|
|
|
|
|
# Evaluation |
|
|
|
## LogicKor |
|
|
|
| Model | Method | Reasoning | Math | Writing | Coding | Understanding | Grammar | Single-turn | Multi-turn | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mistral-Nemo-NT-Ko-12B-sft | cot-1-shot | 7.36 | 6.57 | 8.71 | 8.57 | 9.57 | 6.43 | 7.81 | 7.93 | **7.87** |
| Mistral-Nemo-NT-Ko-12B-sft | 1-shot | 9.00 | 5.71 | 7.93 | 8.29 | 7.93 | 5.21 | 7.29 | 7.40 | 7.35 |
| Mistral Nemo | 1-shot | 5.00 | 6.50 | 6.86 | 8.07 | 7.64 | 8.43 | 7.60 | 6.57 | 7.08 |
| Mistral Nemo | cot-1-shot | 5.43 | 6.86 | 6.07 | 7.57 | 5.86 | 7.57 | 7.50 | 5.62 | 6.56 |
| Mistral-Nemo-NT-Ko-12B-sft | default | 6.00 | 4.93 | 5.43 | 7.14 | 9.71 | 4.00 | 6.45 | 5.95 | 6.20 |
| Mistral Nemo | default | 0.43 | 7.64 | 6.21 | 7.14 | 6.79 | 7.21 | 6.26 | 5.55 | 5.90 |
|
|
|
## MT-Bench |
|
|
|
| Model | First Turn | Second Turn | Average |
| --- | --- | --- | --- |
| Mistral-Nemo-NT-Ko-12B-sft | 8.39 | 7.99 | 8.19 |

\* `judge model: GPT-4`
|
|
|
## Language Confusion (Korean only) |
|
|
|
| Model | Monolingual-LPR | Monolingual-WPR | Crosslingual-LPR | Crosslingual-WPR |
| --- | --- | --- | --- | --- |
| Mistral-Nemo-NT-Ko-12B-sft | 100.00% | 99.00% | 87.51% | 96.96% |
| Mistral-Nemo-Instruct-2407 | 90.72% | 93.18% | 46.75% | 92.84% |
| Meta-Llama-3.1-8B-Instruct | 99.00% | 96.97% | 91.45% | 93.01% |
| gemma-2-9b-it | 100.00% | 98.00% | 87.93% | 95.58% |
|
|
|
|
|
# Chat Template

Example (ChatML format): |
|
|
|
``` |
|
<|im_start|>system |
|
You are a helpful AI assistant.<|im_end|> |
|
<|im_start|>user |
|
{prompt}<|im_end|> |
|
<|im_start|>assistant |
|
``` |
|
|
|
*I trained Mistral-Nemo-NT-Ko-12B with a variety of system prompts from dozens of datasets, so you can chat with or without a system prompt of your own.* |
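
As a small sketch of how this template is applied in practice, assuming the tokenizer's bundled chat template matches the ChatML format shown above (the repo id is again an assumption), the rendered prompt with and without a system turn can be inspected like this:

```python
# Sketch: render the ChatML prompt string with and without a system turn.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("werty1248/Mistral-Nemo-NT-Ko-12B-sft")  # assumed repo id

with_system = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "안녕하세요!"},
]
without_system = [{"role": "user", "content": "안녕하세요!"}]

# add_generation_prompt=True appends the trailing "<|im_start|>assistant" line shown above.
print(tokenizer.apply_chat_template(with_system, tokenize=False, add_generation_prompt=True))
print(tokenizer.apply_chat_template(without_system, tokenize=False, add_generation_prompt=True))
```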
|
|
|
|
|
# Dataset |
|
|
|
[werty1248/multilingual-instruct-balanced](https://huggingface.co/datasets/werty1248/multilingual-instruct-balanced) |
|
|
|
# Training Details |
|
|
|
- GPU: 8xA40 |
|
- epoch: 3 |
|
- total batch size: 8 |
|
- learning rate: 7e-6 |
|
- weight decay: 0.01 |
|
|
|
[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl) |
|
<details><summary>See axolotl config</summary> |
|
|
|
axolotl version: `0.4.1` |
|
```yaml |
|
base_model: mistralai/Mistral-Nemo-Base-2407 |
|
model_type: MistralForCausalLM |
|
tokenizer_config: nothingiisreal/MN-12B-Celeste-V1.9 ## axolotl-ai-co/Mistral-Nemo-Base-2407-chatml caused an error (reason unknown) |
|
tokenizer_type: AutoTokenizer |
|
|
|
load_in_8bit: false |
|
load_in_4bit: false |
|
strict: false |
|
|
|
chat_template: chatml |
|
datasets: |
|
- path: werty1248/multilingual-instruct-balanced |
|
type: sharegpt |
|
chat_template: chatml |
|
|
|
dataset_prepared_path: ./data_preparation |
|
output_dir: /workspace/data |
|
|
|
hf_use_auth_token: true |
|
|
|
sequence_len: 8192 |
|
sample_packing: true |
|
pad_to_sequence_len: true |
|
|
|
wandb_project: |
|
#wandb_entity: |
|
#wandb_watch: |
|
wandb_name: |
|
#wandb_log_model: |
|
|
|
gradient_accumulation_steps: 1 ## total_batch = 8 |
|
micro_batch_size: 1 |
|
num_epochs: 3 |
|
optimizer: paged_adamw_32bit |
|
lr_scheduler: cosine |
|
learning_rate: 0.000007 |
|
|
|
train_on_inputs: false |
|
group_by_length: false |
|
bf16: auto |
|
fp16: |
|
tf32: false |
|
|
|
gradient_checkpointing: true |
|
early_stopping_patience: |
|
resume_from_checkpoint: |
|
local_rank: |
|
logging_steps: 1 |
|
xformers_attention: |
|
flash_attention: true |
|
|
|
warmup_steps: 1000 |
|
evals_per_epoch: 1 |
|
eval_table_size: |
|
save_steps: 1000 |
|
debug: |
|
deepspeed: deepspeed_configs/zero3_bf16.json |
|
weight_decay: 0.01 |
|
special_tokens: |
|
pad_token: <pad> |
|
``` |
|
|
|
</details><br> |
|
|
|
|
|
- Training loss |
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6629154d55d7c289634b8c5d/Xcat10ejYX1nU4cH94vJF.png) |