---
language:
- en
license: mit
datasets:
- anon8231489123/ShareGPT_Vicuna_unfiltered
model-index:
- name: yi6B_Vicuna
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 46.16
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 69.3
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 58.43
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 48.11
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 65.67
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 18.42
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
---

**Bug**: There is a known issue with the tokenizer that I am still investigating. In the meantime, you can use the original Yi tokenizer configuration.

This model reproduces Vicuna, but uses Yi-6B as the base model.

The training data was ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json.

The training framework was https://github.com/shibing624/MedicalGPT, with the following training script:

```
CUDA_VISIBLE_DEVICES=0,1,2,3,5 torchrun --nproc_per_node 5 ../supervised_finetuning.py \
    --model_type auto \
    --model_name_or_path /data/llm/models/Pretrained/yi-6B/01ai/Yi-6B \
    --tokenizer_name_or_path /data/llm/models/Pretrained/yi-6B/01ai/Yi-6B \
    --train_file_dir ../data/finetune/vicuna/ \
    --per_device_train_batch_size 2 \
    --do_train \
    --max_train_samples -1 \
    --num_train_epochs 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --bf16 \
    --use_peft False \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy epoch \
    --save_total_limit 5 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 8 \
    --output_dir ../outputs/20240106_yi6B_vicuna \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --torch_dtype bfloat16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache \
    --model_max_length 4096 \
    --deepspeed ../deepspeed_zero_stage2_config_no16.json \
    --template_name yi
```

Training ran on 5 × A800 GPUs for 3 epochs:

```
***** train metrics *****
  epoch                    =                3.0
  train_loss               =             0.3785
  train_runtime            = 1 day, 10:01:13.95
  train_samples            =              93204
  train_samples_per_second =               2.24
  train_steps_per_second   =              0.224
```

Post-training inference also uses this repository:

```
CUDA_VISIBLE_DEVICES=4 python gradio_demo.py --model_type auto --base_model /data/mn/shibing624/MedicalGPT-1.6.3-231215/outputs/20240106_yi6B_vicuna --tokenizer_path /data/mn/shibing624/MedicalGPT-1.6.3-231215/outputs/20240106_yi6B_vicuna --template_name yi --gpus 4
CUDA_VISIBLE_DEVICES=6 python inference.py --model_type auto --base_model /data/mn/shibing624/MedicalGPT-1.6.3-231215/outputs/20240106_yi6B_vicuna --template_name yi --gpus 6 --interactive --tokenizer_path /data/llm/models/Pretrained/yi-6B/01ai/Yi-6B
```

Preliminary results show that the conversation is natural and informative (unsurprisingly).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/WfQYyyLxtXA2KlePmIPQJ.png)

We also observe that the unfiltering seems to be working!

**Heads up:** some examples are unsafe and inappropriate. They are shown purely for research purposes, to test how alignment-filtered SFT data affects an LLM's final output.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/pklSsljCRN34QuL2ZF2zU.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/22pTSVkBCVlQ5N8A8JBkF.png)

**Update:** Evaluation on the Open LLM Leaderboard:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/Xp11HLQqwh0HMSJgpr19n.png)

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_lorinma__yi6B_Vicuna)

|             Metric              |Value|
|---------------------------------|----:|
|Avg.                             |51.02|
|AI2 Reasoning Challenge (25-Shot)|46.16|
|HellaSwag (10-Shot)              |69.30|
|MMLU (5-Shot)                    |58.43|
|TruthfulQA (0-shot)              |48.11|
|Winogrande (5-shot)              |65.67|
|GSM8k (5-shot)                   |18.42|
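
As a quick sanity check on the numbers reported in this card, the sketch below recomputes the leaderboard average from the six benchmark scores and the global batch size implied by the training command (per-device batch size × number of GPUs × gradient accumulation steps). All values are copied from this card; the variable names are illustrative and do not come from MedicalGPT or the leaderboard code.

```python
# Values copied from this model card; variable names are illustrative only.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 46.16,
    "HellaSwag (10-Shot)": 69.30,
    "MMLU (5-Shot)": 58.43,
    "TruthfulQA (0-shot)": 48.11,
    "Winogrande (5-shot)": 65.67,
    "GSM8k (5-shot)": 18.42,
}
reported_average = 51.02

# Unweighted mean of the six benchmark scores.
average = sum(scores.values()) / len(scores)
print(f"computed average: {average:.3f} (reported: {reported_average})")

# Global batch size implied by the training command:
# --per_device_train_batch_size 2, --nproc_per_node 5, --gradient_accumulation_steps 1
per_device_train_batch_size = 2
num_gpus = 5
gradient_accumulation_steps = 1
effective_batch_size = (
    per_device_train_batch_size * num_gpus * gradient_accumulation_steps
)
print(f"effective batch size: {effective_batch_size}")

# Cross-check against the logged throughput: samples per optimizer step
# should match the effective batch size.
train_samples_per_second = 2.24
train_steps_per_second = 0.224
samples_per_step = train_samples_per_second / train_steps_per_second
print(f"samples per step from logs: {samples_per_step:.1f}")
```

The ratio of the logged samples-per-second to steps-per-second agrees with the batch size derived from the command-line flags, and the recomputed mean agrees with the reported average to within rounding.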