metadata
license: apache-2.0
datasets:
  - ehartford/dolphin
  - LinhDuong/chatdoctor-200k
  - sahil2801/code_instructions_120k
  - medalpaca/medical_meadow_mediqa
  - kaiokendev/SuperCOT-dataset
  - tiiuae/falcon-refinedweb
  - bigcode/starcoderdata
  - togethercomputer/RedPajama-Data-1T
language:
  - en
library_name: transformers
pipeline_tag: text-generation
tags:
  - medical
  - code

Model Card for yhyhy3/open_llama_7b_v2_med_dolphin_qlora_merged

This model is an instruction-tuned OpenLLaMA v2 model with 7B parameters that specializes in medical QA and code instruction.

Model Details

How to Get Started with the Model

Use the code below to get started with the model.

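import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

model_path = 'yhyhy3/open_llama_7b_v2_med_dolphin_qlora_merged'

# Load the tokenizer and the merged model in half precision;
# device_map='auto' places the weights on the available device(s).
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map='auto',
)

# Prompt template: instruction, input, then an empty response slot.
prompt = '''### Instruction: Answer the following question.

### Input: What is the capital of New Jersey?

### Response:'''
# Move the input IDs to the model's device before generating.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

generation_output = model.generate(
    input_ids=input_ids, max_new_tokens=32
)
# The decoded text contains the prompt followed by the generated response.
print(tokenizer.decode(generation_output[0]))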
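The prompt above follows a fixed instruction/input/response template. As a minimal sketch, the template can be wrapped in small helpers; `build_prompt` and `extract_response` are hypothetical names introduced here for illustration and are not part of the model or of transformers.

```python
RESPONSE_MARKER = "### Response:"


def build_prompt(instruction: str, input_text: str) -> str:
    """Assemble a prompt in the same layout as the example above."""
    return (
        f"### Instruction: {instruction}\n\n"
        f"### Input: {input_text}\n\n"
        f"{RESPONSE_MARKER}"
    )


def extract_response(decoded: str) -> str:
    """Return the text after the first response marker.

    Assumes `decoded` is the full decoded sequence (prompt plus completion).
    If the marker is missing, the whole string is returned stripped.
    """
    _, _, tail = decoded.partition(RESPONSE_MARKER)
    return (tail if tail else decoded).strip()


prompt = build_prompt(
    "Answer the following question.",
    "What is the capital of New Jersey?",
)
```

Passing `prompt` to the tokenizer and then calling `extract_response` on the decoded output would leave only the generated answer, assuming the model does not emit a second response marker.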

Training Details

Training Data

The following datasets were converted to the Alpaca instruction format.

  1. ehartford/dolphin
  2. LinhDuong/chatdoctor-200k
     • Refined dataset sourced from the icliniq medical QA forum
  3. sahil2801/code_instructions_120k
     • Code instruction dataset generously created by Sahil Chaudhary from ThreeSixty AI
  4. medalpaca/medical_meadow_mediqa
     • MEDIQA is a dataset of manually generated, question-driven summaries of multi- and single-document answers to consumer health questions, from the medalpaca group.
  5. kaiokendev/SuperCOT-dataset
     • Code instruction dataset generously created by Kaio Ken
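For readers unfamiliar with the Alpaca format, each record is a JSON object with `instruction`, `input`, and `output` fields. The sketch below shows one plausible conversion of a question/answer row into that shape. The source field names (`question`, `answer`), the default instruction text, and the `to_alpaca` helper are assumptions for illustration; the actual conversion scripts used for this model are not shown in this card.

```python
import json


def to_alpaca(question: str, answer: str,
              instruction: str = "Answer the following question.") -> dict:
    """Map a QA pair onto the three Alpaca fields."""
    return {"instruction": instruction, "input": question, "output": answer}


# Hypothetical raw rows; real datasets use their own column names.
raw_rows = [
    {"question": "What is the capital of New Jersey?", "answer": "Trenton"},
]

records = [to_alpaca(row["question"], row["answer"]) for row in raw_rows]

# One JSON object per line, the layout a .jsonl data file uses.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```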

Training Procedure

Trained with axolotl using QLoRA on 8x A6000 GPUs (RunPod Community Cloud) for 3 epochs (about 14 hours, roughly $70).
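As a rough back-of-the-envelope check, the figures above imply the following GPU-hours and per-GPU hourly rate. Both inputs are the approximate values quoted in this card, so the results are estimates only.

```python
num_gpus = 8          # 8x A6000, from the card
hours = 14            # approximate wall-clock training time
total_cost_usd = 70   # approximate total cost

gpu_hours = num_gpus * hours
cost_per_gpu_hour = total_cost_usd / gpu_hours

print(f"{gpu_hours} GPU-hours, about ${cost_per_gpu_hour:.3f} per GPU-hour")
```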

axolotl training config:
base_model: openlm-research/open_llama_7b_v2
base_model_config: openlm-research/open_llama_7b_v2
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
load_in_8bit: false
load_in_4bit: true
strict: false

push_dataset_to_hub:
hub_model_id:
hf_use_auth_token:

datasets:
  - path: json
    type: alpaca
    data_files: /disk/flan1m-alpaca-uncensored.jsonl
    shards: 8
  - path: sahil2801/code_instructions_120k
    type: alpaca
  - path: LinhDuong/chatdoctor-200k
    type: alpaca
    shards: 2
  - path: kaiokendev/SuperCOT-dataset
    type: alpaca
  - path: medalpaca/medical_meadow_mediqa
    type: alpaca

dataset_prepared_path: last_run_prepared
val_set_size: 0.01
adapter: qlora
lora_model_dir:
sequence_len: 2048
max_packed_sequence_len: 2048
lora_r: 8
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_mode: true
wandb_project:
wandb_watch:
wandb_run_id:
wandb_log_model: 'openllama_checkpoint'
output_dir: /disk/open_llama_7b_v2_dolphin_qlora
gradient_accumulation_steps: 2
micro_batch_size: 16
num_epochs: 3
optimizer: paged_adamw_32bit
torchdistx_path:
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: true
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 1000
eval_steps: 5000
save_steps:
debug:
deepspeed:
weight_decay: 0.0000001
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
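
A few derived quantities can help when reading the config above. The sketch below assumes plain data parallelism across all 8 GPUs (the config itself does not state the GPU count) and the standard LoRA scaling factor of `lora_alpha / lora_r`. The token count is an upper bound that holds only if every packed sequence is full.

```python
# Values copied from the training config above.
micro_batch_size = 16
gradient_accumulation_steps = 2
sequence_len = 2048
lora_r = 8
lora_alpha = 32

# Assumption: one data-parallel replica per GPU on the 8x A6000 pod.
num_gpus = 8

# Sequences contributing to each optimizer step.
sequences_per_step = micro_batch_size * gradient_accumulation_steps * num_gpus

# Upper bound on tokens per optimizer step (all sequences fully packed).
max_tokens_per_step = sequences_per_step * sequence_len

# Standard LoRA scaling applied to the low-rank update.
lora_scale = lora_alpha / lora_r

print(sequences_per_step, max_tokens_per_step, lora_scale)
```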